6. What is Sketch Engine?
It is a corpus query tool that takes as input a corpus of any
language and the corresponding grammar patterns, and
generates, amongst other things, word sketches for the words
of that language.
The Sketch Engine is designed for anyone wanting to research
how words behave.
[Diagram: Corpus → SkE → Word Sketches]
7. What is Sketch Engine?
• Upload your own corpus
• Access to public corpora
• Advanced search options
8. Sketch Engine Features
• Web-based tool – no installation
• Supports Arabic corpora
• The Concordancer with advanced options
• The Word Sketches
9. Sketch Engine Features
• The Thesaurus (find similar words)
• Support for parallel corpora, virtual subcorpora and supercorpora
• Full regular-expression searching using CQL
• Corpus Architect: user corpora, uploaded by users or created by WebBootCaT
10. Who Uses Sketch Engine?
• Language learners
• Writers
• Linguists
• Researchers
15. Steps to create a Corpus in SKE
[Diagram: Raw text → Tokenization → Lemmatization → POS tagging → Sketch Grammar → SKE features (Word Sketches, Sketch Diff, Thesaurus)]
16.
1- Upload your text:
- Sketch Engine accepts file types such as .xml, .doc, .docx, .htm,
.html, .pdf, .txt, …
17.
2- Tokenization:
- The process of splitting text into tokens (words) and adding structure tags
(<s>, <doc>, <p>).
- The output is a vertical file (one token per line).
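The tokenization step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name to_vertical and the doc id are hypothetical, the sentence and word splitting is naive regex-based, and Sketch Engine's own tokenizers are language-specific and more careful.

```python
import re

def to_vertical(text, doc_id="doc1"):
    """Return a vertical-format string: one token per line, structures as tags."""
    lines = [f'<doc id="{doc_id}">']
    # Paragraphs are separated by blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        lines.append("<p>")
        # Naive sentence split: after ., ! or ? followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            if not sent:
                continue
            lines.append("<s>")
            # Runs of word characters and single punctuation marks become tokens.
            lines.extend(re.findall(r"\w+|[^\w\s]", sent))
            lines.append("</s>")
        lines.append("</p>")
    lines.append("</doc>")
    return "\n".join(lines)

vertical = to_vertical("The cat sat. It slept.")
print(vertical)
```

Each word and punctuation mark ends up on its own line, wrapped in <s>, <p> and <doc> structure tags.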
19.
4- POS tagging (mandatory for word sketches):
- The process of attaching a part-of-speech tag to each word.
- An SKE Arabic tagger is not available.
- Example tags: V, PN, N
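To make the idea of attaching tags concrete, here is a toy sketch in Python. It assumes a tiny hand-made lexicon (LEXICON is hypothetical and stands in for a real trained tagger) and writes the result as tab-separated word and tag columns, which is how attributes are laid out in a vertical file.

```python
# Hypothetical toy lexicon; a real tagger is a trained model, not a lookup table.
LEXICON = {"Ali": "PN", "reads": "V", "books": "N"}

def tag_tokens(tokens, lexicon=LEXICON, default="N"):
    """Pair each token with a tag from the lexicon, falling back to `default`."""
    return [(tok, lexicon.get(tok, default)) for tok in tokens]

def to_tagged_vertical(tagged):
    """Render (word, tag) pairs as tab-separated vertical lines."""
    return "\n".join(f"{word}\t{tag}" for word, tag in tagged)

tagged = tag_tokens(["Ali", "reads", "books"])
print(to_tagged_vertical(tagged))
```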
20.
5- Uploading the Sketch Grammar:
- A file describing the grammatical relations in a language.
Example: 1:"V" "(DET|NUM|ADJ|ADV|N)"* 2:"N"
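The example pattern reads as: a verb (labelled 1), then any number of tokens tagged DET, NUM, ADJ, ADV or N, then a noun (labelled 2). The Python sketch below shows how such a pattern could pick out verb and noun pairs from a tagged sentence. The function name verb_object_pairs is hypothetical, and this is not how Sketch Engine evaluates grammars internally; it simply enumerates every pair the pattern could match.

```python
MODIFIERS = {"DET", "NUM", "ADJ", "ADV", "N"}

def verb_object_pairs(tagged):
    """Return (verb, noun) pairs matching V (DET|NUM|ADJ|ADV|N)* N."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "V":
            continue
        j = i + 1
        # Walk over the run of allowed tags that follows the verb.
        while j < len(tagged) and tagged[j][1] in MODIFIERS:
            # Any N in the run can be token 2, because every token
            # between the verb and that N satisfies the starred group.
            if tagged[j][1] == "N":
                pairs.append((word, tagged[j][0]))
            j += 1
    return pairs

sentence = [("read", "V"), ("the", "DET"), ("old", "ADJ"),
            ("book", "N"), ("quickly", "ADV")]
print(verb_object_pairs(sentence))
```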
26. Concordance
What is a Concordancer?
A concordancer looks through the
whole corpus and finds every
example of a particular word or
phrase, then displays it with its
immediate context.
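A minimal keyword-in-context (KWIC) concordancer can be sketched as follows. It assumes the corpus is already a flat list of tokens; the function name concordance and the window parameter are illustrative and not part of Sketch Engine.

```python
def concordance(tokens, query, window=3):
    """Return (left, node, right) context strings for each hit of `query`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == query:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokens = "the cat sat on the mat and the dog sat too".split()
hits = concordance(tokens, "sat", window=2)
for left, node, right in hits:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>12} | {node} | {right}")
```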
33. Concordance
Query Types
Word: matches any word form exactly.
+ you can select the PoS (not for Arabic corpora)
+ you can select "match case" (not for Arabic corpora)
36. Concordance: Corpus Query Language (Basics)
The general form is: [attr="value"]
"Match any sequence of characters" operator: .* (the value is a regular expression)
Or, And operators: | , &
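The [attr="value"] form can be mimicked in Python to see what a single-token query does. The sketch below assumes each token is a dict of attributes and that the value is a regular expression matched against the whole attribute; the helper names matches and matches_any are hypothetical.

```python
import re

def matches(token, constraints):
    """True if every attr=regex pair fully matches the token (AND)."""
    return all(re.fullmatch(rx, token.get(attr, ""))
               for attr, rx in constraints.items())

def matches_any(token, alternatives):
    """True if at least one constraint set matches the token (OR)."""
    return any(matches(token, c) for c in alternatives)

token = {"word": "books", "lemma": "book", "tag": "N"}

print(matches(token, {"lemma": "book"}))               # like [lemma="book"]
print(matches(token, {"word": "boo.*"}))               # like [word="boo.*"]
print(matches(token, {"lemma": "book", "tag": "V"}))   # like [lemma="book" & tag="V"]
print(matches_any(token, [{"tag": "V"}, {"tag": "N"}]))  # like [tag="V" | tag="N"]
```

Because the whole value must match, {"word": "book"} does not match the word form "books", while {"word": "boo.*"} does.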
37. Concordance: Corpus Query Language (Basics)
"Match any token" operator: []
"Number of tokens" operator: {}
e.g. []{0,3} matches 0-3 tokens
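The behaviour of [] and {} can be illustrated with a small backtracking matcher in Python. It assumes a query such as [lemma="make"] []{0,3} [lemma="mistake"] has already been turned into a list of (constraints, min, max) triples; that representation and the function names are assumptions for this sketch, not Sketch Engine's implementation.

```python
import re

def token_ok(token, constraints):
    """An empty constraint dict plays the role of [] and matches any token."""
    return all(re.fullmatch(rx, token.get(a, "")) for a, rx in constraints.items())

def match_at(tokens, pos, pattern):
    """Return the end index of a match of `pattern` starting at `pos`, or None."""
    if not pattern:
        return pos
    constraints, lo, hi = pattern[0]
    # Count how many consecutive tokens (at most hi) satisfy this element.
    n = 0
    while n < hi and pos + n < len(tokens) and token_ok(tokens[pos + n], constraints):
        n += 1
    # Try each admissible repetition count, shortest first (backtracking).
    for k in range(lo, n + 1):
        end = match_at(tokens, pos + k, pattern[1:])
        if end is not None:
            return end
    return None

def find_all(tokens, pattern):
    """Return (start, end) spans of all non-empty matches."""
    spans = []
    for i in range(len(tokens)):
        end = match_at(tokens, i, pattern)
        if end is not None and end > i:
            spans.append((i, end))
    return spans

tokens = [{"lemma": w} for w in "make a very big mistake".split()]
query = [({"lemma": "make"}, 1, 1), ({}, 0, 3), ({"lemma": "mistake"}, 1, 1)]
print(find_all(tokens, query))
```

With {0,3} the three intervening tokens are allowed and the whole phrase matches; narrowing the range to {0,2} leaves no match.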
45. Concordance: Text Types
Here you can:
• Select a sub-corpus, or
• Create a new sub-corpus from a subset
of the current corpus
You can also set constraints on the
text types of the documents that will be
searched for your query.
51. Word List
What is the Word List?
Word List: for obtaining word lists ranked by
frequency for an entire corpus, or a
specified sub-corpus.
It can be useful for investigating whether a
word is used most frequently in its verb or
noun form, for instance.
52. Word List
Input: RE pattern or any
attribute (word, tag, lemma…)
Output:
Filtered list of lemmas and/or
words with frequencies
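The input and output of a word list can be sketched in Python with a frequency counter. The sketch assumes tokens are dicts with word, lemma and tag attributes; the function name word_list and the sample corpus are made up for illustration.

```python
import re
from collections import Counter

def word_list(tokens, attr="word", pattern=".*"):
    """Count values of `attr` that fully match `pattern`, most frequent first."""
    rx = re.compile(pattern)
    counts = Counter(t[attr] for t in tokens if rx.fullmatch(t[attr]))
    return counts.most_common()

corpus = [
    {"word": "walks", "lemma": "walk", "tag": "V"},
    {"word": "walked", "lemma": "walk", "tag": "V"},
    {"word": "talks", "lemma": "talk", "tag": "V"},
    {"word": "walks", "lemma": "walk", "tag": "N"},
]

print(word_list(corpus, attr="lemma"))
print(word_list(corpus, attr="word", pattern="walk.*"))
```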
55. Choose lemma at Search attribute.
Type the lemma (e.g. حار) into
the RE pattern box.
Tick the box that says change
output attribute(s).
In the first two levels, select
"lemma" and "Tag".
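Selecting "lemma" and "Tag" as the two output levels amounts to counting (lemma, tag) pairs for the lemmas that match the pattern. A minimal Python sketch, with a hypothetical function name and an invented English sample corpus:

```python
import re
from collections import Counter

def lemma_tag_list(tokens, lemma_pattern):
    """Count (lemma, tag) pairs for lemmas that fully match `lemma_pattern`."""
    rx = re.compile(lemma_pattern)
    counts = Counter((t["lemma"], t["tag"])
                     for t in tokens if rx.fullmatch(t["lemma"]))
    return counts.most_common()

corpus = [
    {"word": "walks", "lemma": "walk", "tag": "V"},
    {"word": "walk", "lemma": "walk", "tag": "N"},
    {"word": "walked", "lemma": "walk", "tag": "V"},
    {"word": "talks", "lemma": "talk", "tag": "V"},
]

result = lemma_tag_list(corpus, "walk")
print(result)
```

In this toy corpus the lemma "walk" occurs twice as a verb and once as a noun, which is the kind of verb-versus-noun comparison the word list is meant to support.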
61. Word Sketch
What is a Word Sketch?
Word Sketch: this allows you to explore the
grammatical and collocational behaviour of
a word.
The Word Sketch function doesn’t just tell
you what words are commonly found in the
company of your search word, but also tells
you what their grammatical relationship is
to the search word.
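The idea of grouping collocates by grammatical relation can be sketched in Python. The sketch assumes the sketch grammar has already produced (relation, headword, collocate) triples; the triples and relation names below are invented, and raw counts stand in for the salience score a real word sketch would use for ranking.

```python
from collections import Counter, defaultdict

def word_sketch(triples, headword):
    """Group the collocates of `headword` by grammatical relation."""
    sketch = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            sketch[relation][collocate] += 1
    return {rel: counts.most_common() for rel, counts in sketch.items()}

triples = [
    ("object", "drink", "tea"),
    ("object", "drink", "tea"),
    ("object", "drink", "water"),
    ("subject", "drink", "child"),
    ("object", "eat", "bread"),
]

print(word_sketch(triples, "drink"))
```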
67. Thesaurus
What is the Thesaurus?
Thesaurus: this allows you to find other
words that have similar grammatical and
collocational behaviour to a given word.
Note that this thesaurus is produced
automatically from statistics on word co-
occurrences.
It is not a manually constructed thesaurus and
will list words for each entry which are
distributionally related but not necessarily
synonyms.
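A distributional thesaurus of this kind can be approximated in Python by comparing the sets of (relation, collocate) contexts that two words share. The sketch uses plain Jaccard overlap as a stand-in similarity measure, which is an assumption of this example and not the formula Sketch Engine uses; the triples are invented.

```python
def context_set(triples, word):
    """All (relation, collocate) contexts observed for `word`."""
    return {(rel, coll) for rel, head, coll in triples if head == word}

def similarity(triples, w1, w2):
    """Jaccard overlap of the two words' context sets."""
    a, b = context_set(triples, w1), context_set(triples, w2)
    return len(a & b) / len(a | b) if a | b else 0.0

def thesaurus(triples, word):
    """Other words ranked by similarity to `word`, dropping zero scores."""
    others = {head for _, head, _ in triples if head != word}
    scored = [(similarity(triples, word, o), o) for o in others]
    ranked = sorted(scored, key=lambda x: (-x[0], x[1]))
    return [(o, s) for s, o in ranked if s > 0]

triples = [
    ("object_of", "tea", "drink"),
    ("object_of", "coffee", "drink"),
    ("modifier", "tea", "hot"),
    ("modifier", "coffee", "hot"),
    ("modifier", "coffee", "black"),
    ("object_of", "book", "read"),
]

print(thesaurus(triples, "tea"))
```

Here "coffee" comes out as related to "tea" because both are drunk and both are hot, even though the two words are not synonyms.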
73. Sketch-Diff
What is Word Sketch Difference?
Sketch-Diff: this allows you to compare the
behaviour of two words.
This function is also very useful for
comparing/deciding between two possible
translations of an item.
74. Sketch-Diff
Input: two words or
lemmas
Output: the different and
common collocations of
the two lemmas.
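The input and output of a sketch difference can be sketched in Python with set operations over (relation, collocate) contexts. The function name sketch_diff, the relation name and the sample triples are illustrative assumptions.

```python
def sketch_diff(triples, w1, w2):
    """Split the collocations of two words into shared and distinct sets."""
    c1 = {(rel, coll) for rel, head, coll in triples if head == w1}
    c2 = {(rel, coll) for rel, head, coll in triples if head == w2}
    return {
        "common": sorted(c1 & c2),
        "only_first": sorted(c1 - c2),
        "only_second": sorted(c2 - c1),
    }

triples = [
    ("modifies", "strong", "tea"),
    ("modifies", "strong", "wind"),
    ("modifies", "powerful", "computer"),
    ("modifies", "powerful", "wind"),
]

diff = sketch_diff(triples, "strong", "powerful")
print(diff)
```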
It is a corpus query tool that takes as input a corpus of any language (with an appropriate level of linguistic mark-up) and the corresponding grammar patterns, and generates, amongst other things, word sketches for the words of that language. Those other things include a corpus-based thesaurus and ‘sketch differences’, which specify, for two semantically related words, what behaviour they share and how they differ. We anticipate that sketch differences will be particularly useful for lexicographers interested in near-synonym differentiation.
Word sketches were first used in the production of the Macmillan English Dictionary (Rundell 2002) and were presented at Euralex 2002 (Kilgarriff and Rundell 2002). Following that presentation, the most-asked question was “can I have them for my language?” In response, we have now developed the Sketch Engine.
The Sketch Engine has a number of language-analysis functions, the core ones being:
• The Concordancer: a program which displays all occurrences from the corpus for a given query. The program is very powerful, with a wide variety of query types and many different ways of displaying and organising the results (concordancing, sorting, sampling, wordlists, collocation lists).
• The Word Sketch program: this program provides a corpus-based summary of a word's grammatical and collocational behaviour.
With Corpus Architect, you can build your own corpora from documents in various formats: TXT, PDF, PS, DOC, HTML, VERT. Once processed, you can search and query them within Sketch Engine.
• Concordance: for querying a corpus and obtaining concordances which you can then further refine, filter and use for generating frequency information and collocation lists.
• Word List: for obtaining word lists for an entire corpus, or a specified subcorpus.
• Word Sketch: this allows you to explore the grammatical and collocational behaviour of a word.
• Thesaurus: this allows you to find other words that have similar grammatical and collocational behaviour to a given word. Note that this thesaurus is produced automatically from statistics on word co-occurrences. It is not a manually constructed thesaurus and will list words for each entry which are distributionally related but not necessarily synonyms.
• Sketch-Diff: this allows you to compare the behaviour of two words.
Main Sketch Engine Links: https://www.sketchengine.co.uk/documentation/wiki/SkE/Help/MainLinkHelp
Concordance Query: https://www.sketchengine.co.uk/documentation/wiki/SkE/Help/PageSpecificHelp/ConcordanceQuery
• Query Types: using Query Type, you can refine the type of query you wish to make in the main panel.
• Context: if Context is selected in the left-hand side menu, you can specify criteria on the context for your query in the main panel. You can choose to specify the context in terms of surrounding lemma(s) and/or PoS tag(s).
• Text Types: here you can select a subcorpus or create a new subcorpus from a subset of the current corpus. You can also stipulate constraints on the text types for documents that will be searched for your query.
Ex1: Lemma filter: Window: right, 1 token; Lemma(s): عن; none
Concordance Menu options: https://www.sketchengine.co.uk/documentation/wiki/SkE/Help/PageSpecificHelp/Concordance
Menu options
Note that the options in the left-hand side panel are all available when you are viewing the concordance. Some of the options will not be shown if you have already selected from this menu. If so, you can click View Concordance to get back to the concordance.
View Options: lets you alter how the concordance looks and select which attributes of the words in the concordance you see.
KWIC/Sentence: toggles between
- the KWIC mode, where the queried text (node) is in a central column and context is displayed on either side
- the Sentence mode, where the queried text (node) is shown in the context of the sentence in which it occurs
Save: shows options for saving the concordance in the main panel (or the frequency list or collocation candidates).
Sort: shows complex sorting options. If the concordance is sorted by context, an option to "Jump to" a page with context starting with a certain letter appears. Alternatively, you can click on
- Left (Right): to sort by the text left (right) of the node
- Node: to sort by the text in the central column (referred to as the node or KWIC)
- References: to sort by the document references at the left-hand side of the concordance
- Shuffle: the concordance will be jumbled to avoid bias from a user only looking at the first portion
Sample: selects a random sample of the concordance lines.
Filter: lets you further specify contextual features to filter the concordance, for example by words to the left or right of the node word, or by text type.
Frequency: shows a variety of complex methods for obtaining frequency lists. Alternatively, you can click on
- Node tags: to get a frequency list over the part-of-speech tags of the node word(s) in the central column
- Node forms: to get a frequency list over the node word forms in the central column
- Doc IDs: to get a frequency list over the Doc IDs for the node word(s) in the central column
- Text Types: to get a frequency list over all the text types of the node word(s) in the central column
Collocations: lets you specify criteria and build collocation lists for the node word(s) in the central column.
ConcDesc: shows the query in detail (for technical people) and lets you go back in the history if the query consists of several subsequent actions.
Visualize: shows the distributional graph of the concordance within the corpus. The x-axis shows concordance positions (by default 100 columns for 100 slices of the corpus; you may change the granularity with the slider and then click the Redraw button). The y-axis shows the relative frequency of the query hits within a concordance part (= column). Columns are clickable: clicking on a column filters the concordance so that you see only the corresponding concordance part.
Word List Options:
Left-hand side options:
- Select All words to generate a list of words in the corpus ranked by frequency.
- Select All lemmas to generate a list of lemmas in the corpus ranked by frequency. A lemma is the base form of a word.
In the main panel of the interface you have further options:
- Subcorpus: where you can specify a subcorpus for the source data, or create a new one.
- Search Attribute: you can specify word, lemma, tag (part-of-speech tag) etc., depending on the attributes defined for the corpus, or you can specify one of the text types defined for the corpus. The default attribute is word.
- Filter Options: you can either produce the list for all words (or lemmas, or whichever attribute you specify) or you can filter the list.
- Output Options: you can select different types of output list.
Choose a corpus and click on Word List in the left-hand side menu.
Choose lemma at Search attribute.
Type the lemma (e.g. حار) into the RE pattern box.
Tick the box that says change output attribute(s).
In the first two levels, select "lemma" and "Tag".
Click on Make Word List.