Sketch engine presentation

Introduction to Sketch Engine
http://www.sketchengine.co.uk/

Outline
• Basic Terminology
• Introduction
• How to Use Sketch Engine?
• Research Issues

Basic Terminology
English Term
• Corpus (plural: Corpora); not the same as a blog
• Parallel Corpora
• Comparable Corpus
• Written Corpora
• Spoken Corpora

Basic Terminology
English Term
• Collocation
• Concordances
• Lemma

Basic Terminology
English Term
• Part-of-Speech (PoS) Tagging (code, tag)
• Thesaurus

What is Sketch Engine?
• It is a corpus query tool that takes as input a corpus of any language and a corresponding set of grammar patterns, and generates, among other things, word sketches for the words of that language.
• The Sketch Engine is designed for anyone who wants to research how words behave.
Corpus → SkE → Word Sketches

What is Sketch Engine?
• Upload your own corpus
• Access to public corpora
• Advanced search options

Sketch Engine Features
1. Web-based tool: no installation needed
2. Supports Arabic corpora
3. The Concordancer, with advanced options
4. The Word Sketches
5. The Thesaurus (finds similar words)
6. Support for parallel corpora and virtual sub- and super-corpora
7. Full regular-expression searching using CQL
8. Corpus Architect: user corpora, uploaded by users or created by WebBootCaT

Who Uses Sketch Engine?
• Language learners
• Writers
• Linguists
• Researchers

Sketch Engine usage:
• Common words/collocations
• Synonyms
• Grammar
• Word behaviour

Available corpora
200+ corpora in 60+ languages

How to create your corpus using SKE?

Steps to create a Corpus in SKE
Raw text → Tokenization → Lemmatization → POS tagging → Sketch Grammar → SKE Features
SKE Features: Word Sketches, Sketch Diff, Thesaurus

1- Upload your text:
- Sketch Engine accepts file types such as .xml, .doc, .docx, .htm, .html, .pdf, .txt, …

2- Tokenization:
- The process of splitting the text into words (tokens) and adding structure tags (<s>, <doc>, <p>).
- The output is a vertical file (one token or structure tag per line).
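The tokenization step can be sketched in a few lines of Python. This is only an illustration, not Sketch Engine's own tokenizer: the function name `to_vertical`, the regular expressions, and the naive sentence splitting are all assumptions made for this example.

```python
import re

# Assumption: a naive tokenizer (runs of word characters, or single punctuation marks).
# Real corpora, and Arabic in particular, need a language-specific tokenizer.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_vertical(doc_id, paragraphs):
    """Return vertical-format text: one token or structure tag per line."""
    lines = [f'<doc id="{doc_id}">']
    for paragraph in paragraphs:
        lines.append("<p>")
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if not sentence:
                continue
            lines.append("<s>")
            lines.extend(TOKEN_RE.findall(sentence))
            lines.append("</s>")
        lines.append("</p>")
    lines.append("</doc>")
    return "\n".join(lines)


vertical = to_vertical("demo", ["Words behave. Corpora show how!"])
print(vertical)
```

Lemmatization and POS tagging would then add further tab-separated columns (lemma, tag) to each token line.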

3- Lemmatization (optional):
- The process of annotating each word with its lemma.

4- POS tagging (mandatory for word sketches):
- The process of annotating each word with its part-of-speech tag.
- An SKE tagger for Arabic is not available.
- Example tags: V, PN, N

5- Uploading the Sketch Grammar:
- A file describing the grammatical relations in a language.
Example: 1:"V" "(DET|NUM|ADJ|ADV|N)"* 2:"N"
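To see what the example pattern captures, here is a minimal Python sketch that applies the same idea (a verb, then any run of DET/NUM/ADJ/ADV/N tokens, then a noun) to a POS-tagged sentence. It is an approximation for illustration only: the function name and the toy tagset are assumptions, and the real engine evaluates sketch grammars with its own query machinery.

```python
# Tags allowed between the verb (label 1) and the noun (label 2) in the example rule.
FILLER_TAGS = {"DET", "NUM", "ADJ", "ADV", "N"}


def verb_noun_pairs(tagged):
    """tagged: list of (word, tag). Return (verb, noun) pairs for V filler* N."""
    pairs = []
    for i, (verb, tag) in enumerate(tagged):
        if tag != "V":
            continue
        j = i + 1
        while j < len(tagged):
            word, next_tag = tagged[j]
            if next_tag == "N":
                pairs.append((verb, word))  # a noun reachable through filler tokens
            if next_tag not in FILLER_TAGS:
                break  # anything else (e.g. a preposition) ends the pattern
            j += 1
    return pairs


sentence = [
    ("read", "V"), ("the", "DET"), ("old", "ADJ"), ("book", "N"),
    ("quickly", "ADV"), ("in", "PREP"), ("library", "N"),
]
pairs = verb_noun_pairs(sentence)
print(pairs)
```

Here "library" is not paired with "read" because the preposition breaks the allowed run of tags.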

Vertical file with annotations

Adding data to the corpus by uploading a file

Adding data to the corpus using WebBootCaT
Seeds/URLs → WebBootCaT → Your corpus

How to Use Sketch Engine?
• As a Corpus User (Querying Corpora):
  Concordance, Word Lists, Word Sketches, Sketch Diff, Thesaurus

Concordance
What is a Concordancer?
A concordancer looks through the whole corpus and finds every example of a particular word or phrase, then displays it with its immediate context.
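A concordancer of this kind can be approximated with a short keyword-in-context (KWIC) function. This is a minimal sketch, assuming a whitespace-tokenized text and a fixed context window; the function name `concordance` and the `width` parameter are hypothetical.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


text = "the cat sat on the mat and the dog sat by the door"
rows = concordance(text.split(), "sat", width=2)
for left, node, right in rows:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>12} | {node} | {right}")
```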

Query Types
Context
Text Types

Concordance
Query Types
• Simple: matches the lemma (the base form) as well as the word form; also works for phrases.
• Lemma: matches any lemma; you can also select the PoS (not for Arabic corpora). This option does not work for phrases.
• Phrase: matches a phrase and any capitalized variant (not for Arabic corpora), but does not match the lemma.
• Word: matches a word form exactly; you can also select the PoS and "match case" (neither for Arabic corpora).
• Character: matches a character string.
• CQL: for entering complex queries in Corpus Query Language.
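The difference between the single-token query types can be illustrated on a toy annotated corpus. The sketch below is an assumption-laden simplification (the function `find`, the token dictionaries, and the exact case handling are made up for this example) and leaves out Phrase and CQL queries.

```python
# Toy corpus: each token carries its word form and its lemma.
CORPUS = [
    {"word": "he", "lemma": "he"},
    {"word": "runs", "lemma": "run"},
    {"word": "Run", "lemma": "run"},
    {"word": "running", "lemma": "run"},
    {"word": "runner", "lemma": "runner"},
]


def find(corpus, query, query_type, match_case=True):
    """Return the positions of tokens matched by a single-token query."""
    hits = []
    for i, tok in enumerate(corpus):
        word, lemma = tok["word"], tok["lemma"]
        if query_type == "simple":       # word form OR lemma
            ok = query in (word, lemma)
        elif query_type == "lemma":      # lemma only
            ok = query == lemma
        elif query_type == "word":       # exact word form, optionally ignoring case
            ok = query == word if match_case else query.lower() == word.lower()
        elif query_type == "character":  # character string anywhere in the word form
            ok = query in word
        else:
            raise ValueError(f"unsupported query type: {query_type}")
        if ok:
            hits.append(i)
    return hits


print(find(CORPUS, "run", "lemma"))
print(find(CORPUS, "run", "word", match_case=False))
print(find(CORPUS, "run", "character"))
```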
Concordance
Corpus Query Language (Basics)
• The general form is: [attr="value"]
• "Match any sequence of characters" inside a value: .*
• Or, And operators: | , &
• "Match any token" operator: []
• Number-of-tokens operator: {}, e.g. []{0,3} matches 0 to 3 tokens
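The basic CQL operators can be mimicked with a small backtracking matcher. This is a toy model for building intuition, not the engine's implementation: the pattern representation (a list of `(constraints, min, max)` tuples) and the function names are assumptions, and only attribute tests, the any-token element, and repetition counts are covered.

```python
import re


def token_matches(token, constraints):
    # An empty constraint dict plays the role of []: it matches any token.
    # Values are regular expressions that must match the whole attribute.
    return all(re.fullmatch(rx, token.get(attr, "")) for attr, rx in constraints.items())


def match_at(tokens, pos, pattern):
    """Return the end position of a match starting at pos, or None."""
    if not pattern:
        return pos
    constraints, lo, hi = pattern[0]
    count = 0  # how many consecutive tokens (at most hi) satisfy this element
    while (count < hi and pos + count < len(tokens)
           and token_matches(tokens[pos + count], constraints)):
        count += 1
    for n in range(count, lo - 1, -1):  # longest repetition first, then backtrack
        end = match_at(tokens, pos + n, pattern[1:])
        if end is not None:
            return end
    return None


def search(tokens, pattern):
    """Return (start, end) spans of all matches, one per start position."""
    spans = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, pattern)
        if end is not None:
            spans.append((start, end))
    return spans


TOKENS = [
    {"word": "make", "lemma": "make"}, {"word": "a", "lemma": "a"},
    {"word": "very", "lemma": "very"}, {"word": "big", "lemma": "big"},
    {"word": "decision", "lemma": "decision"}, {"word": "and", "lemma": "and"},
    {"word": "take", "lemma": "take"}, {"word": "decisions", "lemma": "decision"},
]

# Rough equivalent of: [lemma="make|take"] []{0,3} [lemma="decision"]
PATTERN = [({"lemma": "make|take"}, 1, 1), ({}, 0, 3), ({"lemma": "decision"}, 1, 1)]
spans = search(TOKENS, PATTERN)
print(spans)
```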

Concordance
Exercises (CQL)
• Ex1: "…" [] "…"
• Ex2: "…" []{0,3} "…|…"
Concordance
Context
• Here you can specify criteria on the context for your query.


Concordance
Text Types
• Here you can:
  - Select a sub-corpus, or
  - Create a new sub-corpus from a subset of the current corpus
• You can also set constraints on the text types of the documents that will be searched for your query.

Concordance
Concordance Menu Options
• Save
• View Options
• Sort
• Sample
• Filter
• Frequency
• Collocations
• ConcDesc
• Visualize

Concordance
Exercises
• Ex1: Filter
  Concordance → Make Concordance → Filter → select "negative", Simple query
• Ex2: Collocation
  Collocation → Attribute: word → Make Candidate List
• Ex3: Frequency - Node Tags
  Frequency → click Node Tags
• Ex4: CQL - Frequency - Node Forms
  Concordance → CQL: "…" "…|…"

Word List
What is the Word List?
• Word List: for obtaining word lists ranked by frequency for an entire corpus or a specified sub-corpus.
• It can be useful for investigating, for instance, whether a word is used more frequently in its verb or noun form.

Word List
Input: RE pattern over any attribute (word, tag, lemma, …)
Output: filtered list of lemmas and/or words with frequencies
Word List
Exercises
• Ex1:
  1. Choose "lemma" as the Search attribute.
  2. Type the lemma into the RE pattern box.
  3. Tick the box that says "change output attribute(s)".
  4. In the first two levels, select "lemma" and "tag".

Word Sketch
What is a Word Sketch?
• Word Sketch: this allows you to explore the grammatical and collocational behaviour of a word.
• The Word Sketch function doesn't just tell you which words are commonly found in the company of your search word, but also tells you their grammatical relationship to the search word.

Word Sketch
Input: lemma
Output: collocations grouped by grammatical relation

Thesaurus
What is the Thesaurus?
• Thesaurus: this allows you to find other words that have similar grammatical and collocational behaviour to a given word.
• Note that this thesaurus is produced automatically from statistics on word co-occurrences.
• It is not a manually constructed thesaurus; for each entry it lists words that are distributionally related but not necessarily synonyms.

Thesaurus
Input: lemma + POS tag
Output: similar lemmas
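The idea of a distributional thesaurus can be sketched by comparing the contexts in which lemmas occur. The example below uses Jaccard overlap of (relation, collocate) sets as a stand-in similarity measure; that measure, the function name, and the toy data are assumptions for illustration, not the statistic Sketch Engine actually computes.

```python
# Assumed input: for each lemma, the set of (relation, collocate) contexts it occurs in.
SKETCHES = {
    "tea": {("object_of", "drink"), ("modifier", "hot"), ("object_of", "brew")},
    "coffee": {("object_of", "drink"), ("modifier", "hot"), ("object_of", "grind")},
    "book": {("object_of", "read"), ("modifier", "old")},
}


def similar_lemmas(sketches, lemma):
    """Rank other lemmas by how many contexts they share with the given lemma."""
    target = sketches[lemma]
    scores = []
    for other, contexts in sketches.items():
        if other == lemma:
            continue
        union = target | contexts
        score = len(target & contexts) / len(union) if union else 0.0
        scores.append((other, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = similar_lemmas(SKETCHES, "tea")
print(ranking)
```

"coffee" ranks first because it shares contexts with "tea", even though the two are not synonyms, which is the sense of "distributionally related".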

Sketch-Diff
What is Word Sketch Difference?
• Sketch-Diff: this allows you to compare the behaviour of two words.
• This function is also very useful for comparing or deciding between two possible translations of an item.

Sketch-Diff
Input: two words or lemmas
Output: the distinct and common collocations of the two lemmas
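At its simplest, the Sketch-Diff output is a three-way split of two collocate sets. The sketch below assumes the collocates of each lemma are already known and ignores frequencies and grammatical relations; the function name and the example words are made up for illustration.

```python
def sketch_diff(collocates_a, collocates_b):
    """Split two collocate sets into those unique to each lemma and those shared."""
    return {
        "only_first": sorted(collocates_a - collocates_b),
        "common": sorted(collocates_a & collocates_b),
        "only_second": sorted(collocates_b - collocates_a),
    }


strong = {"tea", "wind", "argument", "man"}
powerful = {"computer", "engine", "argument", "man"}
diff = sketch_diff(strong, powerful)
print(diff)
```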


Research Issues!
Please visit: http://goo.gl/HqhUir
Limitations!
Usage!

References
• http://www.sketchengine.co.uk/
• http://lisan1.com/wordpress/?p=146
• Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine. ITRI-04-08. Information Technology, 105, 116.
