Lexicographic Evidence
This part explains how to design, acquire, and process a collection of linguistic data that will form the raw material for a dictionary.
Comprehension Questions (1)
1. What is a reliable dictionary?
2. What is subjective evidence, and what are its limits?
3. What is a citation?
4. What should be the basic steps in setting up a reading programme?
5. What are the advantages and disadvantages of citations?
Comprehension Questions (2)
6. What is a corpus?
7. What are the points that should be considered in designing a corpus?
8. How large should a corpus be?
9. How do we decide what kinds of written or spoken material our corpus should include?
10. Can a corpus be representative?
Comprehension Questions (3)
11. What is ‘skewing’?
12. What are the questions that should be answered before starting to form the corpus?
13. What is linguistic annotation?
A ‘Reliable’ Dictionary
A reliable dictionary is one whose generalizations about word behavior approximate closely to the ways in which people normally use language when engaging in real communicative acts. Yet it is difficult to determine how people normally use words. There is a need for evidence.
Subjective Evidence and Its Limits
Introspection, that is, consulting your own mental lexicon, is a form of evidence, but it cannot alone form the basis of a reliable dictionary, since one individual’s store of linguistic knowledge is inevitably incomplete and idiosyncratic.

Informant-testing, in which speakers of a language are questioned about their use of words, is also of limited value for mainstream lexicography, for similar reasons.
Both are essentially subjective forms of evidence.
Creating a reliable dictionary involves a number of challenging tasks, but the observation of language in use is certainly the indispensable first stage in the process.
Citations
A citation is a short extract from a text which provides evidence for a word, phrase, usage, or meaning in authentic use.
Until the late twentieth century, the OED’s citations were written in longhand on index cards known as slips. These were filed alphabetically according to the keyword of the citation.
DNA

If a blog has a common ancestor with the diary, one can say that it has a DNA.
E.g. ‘MySpace’ shares at least some of its DNA with the ‘scrapbook’.
Setting up a Reading Programme
Some dictionary publishers provide online forms to enable members of the public to contribute citations. Most of these publishers get unusable citations because their programmes are not well planned. A well-planned reading programme, on the other hand, will often have great value.
Setting up a Reading Programme
At least four main data fields are needed:
1- Keyword or phrase: the usage that the citation illustrates, filed under the headword to which it relates.
2- The citation itself: usually a single sentence is adequate, but there may be more than one.
3- Information about the source of the citation: the date, title, and author’s name are all important; additional information (such as the page number) may be useful for specialized or historical dictionaries.
4- A comment field: this gives readers the option of adding a note to clarify the citation; it may, for example, be a new meaning that needs explaining, or it may be characteristic of one particular dialect.
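The four data fields above could be captured in a small record type. The following is only a minimal sketch: the class name `Citation`, the field names, and the `missing_fields` check are illustrative assumptions, not part of any publisher's actual submission form, and the source details in the example are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    """One contributed citation, mirroring the four data fields listed above."""

    keyword: str                 # field 1: keyword or phrase being illustrated
    text: str                    # field 2: the citation itself
    source_date: str             # field 3: source details (date, title, author)
    source_title: str
    source_author: str
    page: Optional[int] = None   # extra source detail, e.g. for historical dictionaries
    comment: str = ""            # field 4: optional clarifying note from the reader

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that were left blank."""
        required = {
            "keyword": self.keyword,
            "text": self.text,
            "source_date": self.source_date,
            "source_title": self.source_title,
            "source_author": self.source_author,
        }
        return [name for name, value in required.items() if not value.strip()]


# Hypothetical submission: date and title are placeholders, author left blank.
slip = Citation(
    keyword="DNA",
    text="'MySpace' shares at least some of its DNA with the 'scrapbook'.",
    source_date="2007",
    source_title="Example Magazine",
    source_author="",
    comment="Figurative use: shared origins rather than genetics.",
)
print(slip.missing_fields())  # ['source_author']
```

Rejecting or flagging incomplete records at submission time is one way a reading programme might reduce the share of unusable citations.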
Advantages of Citations
1- They are helpful for monitoring language change.
2- They give information about the terminology of a specific subject field or a particular variety or dialect.
3- They are helpful in training lexicographers.
Disadvantages of Citations
1- Collecting data in this way is labour-intensive, so volumes will always be low.
2- Although instances of usage are authentic, there is a big subjective element in their selection.
The Central Role of the Corpus
A citation bank alone, even the largest one, cannot usually supply language data in the required volumes, so the case for a large corpus is clear.
A “corpus” is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005).
Some Inescapable Truths
There is no such thing as a perfect corpus for lexicography.
First of all, the corpus is a sample. It is not possible to examine every extant example of usage for a language. To create a sample that fairly reflects the wider population, carefully selected criteria are needed.
Secondly, selecting texts on the basis of their ‘quality’, and excluding those which fail this test, is fundamentally at odds with the descriptive ethos of corpus linguistics. Who is to judge which texts are ‘good’, and on what basis? It is clear that a lexicographic corpus must be a genuine, and inclusive, snapshot of a language, not a set of texts that have been specially chosen to advance someone’s notion of what constitutes ‘good’ usage.
Corpora: Design Issues
Designing a corpus means making decisions about:
1- how large it will be.
2- which broad categories of text it will include.
3- what proportions of each category it will include.
4- which individual texts it will include.
Size: How large is large enough?
It is certain that the more data we have, the more we learn. Yet there are also some hypotheses about the size of the corpus. Zipf’s Law predicts that the tenth most frequent word in a corpus will occur twice as often as the 20th most frequent word, ten times as often as the 100th most frequent word, and 100 times as often as the 1000th most frequent word. Thus, it can be said that in a corpus of 100 million words, a simple right- or left-sorted concordance clearly shows most of the normal patterns of usage for all words except the very rare.
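The ratios quoted for Zipf's Law follow from the assumption that a word's frequency is inversely proportional to its rank. The sketch below checks that arithmetic; the function names are illustrative, and the count of 6,000,000 for the most frequent word is an assumed figure used purely for illustration, not a measurement from any corpus.

```python
def frequency_ratio(rank_a: int, rank_b: int, exponent: float = 1.0) -> float:
    """How many times more often the word at rank_a occurs than the word at rank_b,
    assuming frequency is proportional to 1 / rank**exponent."""
    if rank_a < 1 or rank_b < 1:
        raise ValueError("ranks must be >= 1")
    return (rank_b / rank_a) ** exponent


def expected_count(rank: int, top_count: float, exponent: float = 1.0) -> float:
    """Expected occurrences of the word at `rank`, given the count of the rank-1 word."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return top_count / rank ** exponent


print(frequency_ratio(10, 20))    # 2.0
print(frequency_ratio(10, 100))   # 10.0
print(frequency_ratio(10, 1000))  # 100.0

# Assumed, illustrative figure: the top word occurs 6,000,000 times.
print(expected_count(1000, 6_000_000))  # 6000.0
```

Under these assumptions, counts fall away quickly with rank, which is why very rare words remain thinly attested even in a large corpus.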
Different texts, different styles
However large the corpus may be, if the words are taken from only a limited area (for instance, from newspapers), they cannot represent all aspects of the language, and the results may be misleading. For instance, in such a corpus the word party will most frequently occur in the sense of a political organization rather than a social event. A corpus consisting of a single type of text will reflect only the stylistic and subject-matter features of that particular genre. It will be, as corpus linguists say, a ‘skewed’ corpus. Therefore, the corpus should include different texts and different styles.
Can a Corpus be Representative?
The standard way of avoiding bias is to collect a ‘random sample’. Yet random sampling may not represent the language well. One partial solution is to apply stratified sampling. This involves breaking up the total population into a number of subcategories or types, then creating independent random samples from each of these groupings. But this immediately raises two questions:
1- How do we define these subcategories?
2- How do we decide what proportions of each subcategory the corpus should include?
It is almost impossible to define the population that the corpus should be representative of, and since the population is unlimited, it is logically impossible to establish ‘correct’ proportions of each component. An achievable objective should be “a balanced corpus”.
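Stratified sampling as described above can be sketched in a few lines. This is an illustrative sketch only: the category names and the proportions are assumptions chosen for the example, which is exactly the part the two questions above say cannot be settled objectively.

```python
import random
from collections import defaultdict


def stratified_sample(texts, proportions, total, seed=0):
    """Draw an independent random sample from each category.

    texts:       list of (category, text_id) pairs
    proportions: dict mapping category -> share of the sample (shares sum to 1)
    total:       total number of texts wanted across all categories
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_category = defaultdict(list)
    for category, text_id in texts:
        by_category[category].append(text_id)

    sample = {}
    for category, share in proportions.items():
        quota = round(total * share)
        pool = by_category[category]
        if quota > len(pool):
            raise ValueError(f"not enough texts in category {category!r}")
        sample[category] = rng.sample(pool, quota)
    return sample


# Hypothetical population: 50 texts in each of three assumed categories.
population = [(cat, f"{cat}-{i}") for cat in ("news", "fiction", "spoken") for i in range(50)]
chosen = stratified_sample(population, {"news": 0.5, "fiction": 0.3, "spoken": 0.2}, total=20)
print({cat: len(ids) for cat, ids in chosen.items()})
```

Changing the assumed proportions changes the resulting corpus, which illustrates why a 'balanced' corpus is a design decision rather than a statistical fact.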
Selecting Texts
Corpus collection is usually recursive.
First, some texts from a range of sources are gathered.
Next, the texts are analyzed to identify recurring clusters of linguistic features.
This enables us to establish provisional categories of texts, grouped on the basis of shared linguistic features.
Then more texts are collected to reflect these feature distributions.
Then the analysis is repeated on the enlarged corpus.
The process thus proceeds in a cyclical fashion until we have collected a large corpus whose contents reflect the proportions in which the various key features are observable in large bodies of text.
Spoken Data: A Special Case
With a corpus of spoken language, there are no obvious objective measures that can be used to define the target population. The spoken data should represent variables such as gender, social class, age, and religion. The conversations that form the corpus should reflect the diversity of the spoken language.
A Note on ‘Skewing’
Skewing refers to a form of bias in data whereby a particular feature is either over- or under-represented to a degree that distorts the general picture. As corpora grow larger, problems with skewing usually recede.
Some questions should be answered before starting to form the corpus.
Language: Will the corpus be monolingual, bilingual, or multilingual?
Time: Will the corpus be synchronic or diachronic? In a synchronic corpus, the constituent texts come from one specific period of time, whereas the texts making up a diachronic corpus come from an extended period.
Mode: Will the corpus include written texts, spoken texts, or both? The status of chat room conversations, which have characteristics of both spoken and written texts, is another point that requires attention in corpus formation.
Medium
Medium refers to the channel in which the text appears. A simple classification here would distinguish print media and spoken media. The former include books, newspapers, magazines, journals, dissertations, movie scripts, government documents, and legal statutes. Spoken media include face-to-face conversations, broadcasts and podcasts, public meetings, and educational settings. Once again, traditional categories become blurred when we add the web to the mix. Some ‘new’ text types (blogs and social networking sites, for example) are exclusive to the web, but many documents exist in both print and electronic media.
Dealing with Sublanguages
When we think about the vocabulary of a language, it is useful to make a broad distinction between core usages and sublanguages. The word deuce is part of a sublanguage: it belongs to the vocabulary of tennis. A word like important, on the other hand, belongs to the core vocabulary of English. The following question arises at this point: will we include the sublanguages?
Collecting Written Data
In the past, the work of lexicographers was not so easy. Earlier corpora made extensive use of scanning and keyboarding, which were both slow and labour-intensive processes. Today it is possible to find various texts in digital form.
Collecting Spoken Data
Traditionally, spoken data has been difficult and expensive to collect. Consequently, although the majority of communicative events in a language occur in spoken mode, few corpora include high proportions of spoken material. For instance, only 10 per cent of the BNC is spoken. Nowadays, web-derived spoken data, which offers up-to-date material in large quantities and at low cost, begins to look like an attractive alternative.
Collecting Data from the Web
The question of whether ‘the web is a corpus’ is a hotly debated topic in language engineering circles. For lexicography, it is better to see the web as a source of texts from which a lexicographic corpus can be assembled.
Sample Size
There are arguments for using complete texts rather than extracts. In many registers, the discourse structure and rhetorical features of a text may vary as it proceeds from its opening paragraphs, through its central sections, to the concluding chapters. The BNC’s solution to this was to ensure that 40,000-word samples were taken variously from the beginning, middle, and end of its source documents.
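Taking a fixed-size extract from different positions in a document can be sketched as follows. This is not the BNC's actual procedure, only an illustration of the idea; the function name `extract_sample` and the whitespace-free list-of-words input are assumptions made for the example.

```python
def extract_sample(words, size, position):
    """Return `size` consecutive words from the 'beginning', 'middle', or 'end'.

    If the document is shorter than `size`, the whole document is returned.
    """
    if size >= len(words):
        return list(words)
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(words) - size) // 2
    elif position == "end":
        start = len(words) - size
    else:
        raise ValueError("position must be 'beginning', 'middle', or 'end'")
    return words[start:start + size]


# Toy document of 100 numbered "words"; a real sample would be far larger.
document = [f"w{i}" for i in range(100)]
print(extract_sample(document, 10, "middle")[0])  # w45
```

Rotating the position from one source document to the next is one way to keep openings, central sections, and conclusions all represented in the corpus.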
Copyright and Permissions
Unless a corpus is made up of much older texts, most of its source material is likely to be protected by copyright. So corpus-builders should get permission from the copyright owners to include the documents in their corpus. This is not an easy task. It is one of the most time-consuming aspects of the project. It is recommended that corpus-builders should never offer to pay for permission to include a text. Once money starts changing hands, a precedent would be established that could have fatal consequences for corpus-creation efforts worldwide.
Processing and Annotating the Data
To give the corpus its final form from its raw state, some operations are carried out.
Clean-up, standardization, and text encoding
This is essentially the process of taking a heterogeneous collection of input documents and converting them all to a standard, usable form. For instance, non-linguistic sounds in spoken data (like erm, ooh, mhm) and unusable material in written data (like indexes, tables, diagrams) are not included in the corpus.
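One small part of clean-up, dropping non-linguistic sounds from transcribed speech, might look like the sketch below. The filler list contains only the three examples mentioned above, and the function name `strip_fillers` and the punctuation handling are assumptions; a real project would need a fuller list and a proper tokenizer.

```python
# Only the three examples from the text above; a real list would be longer.
FILLERS = {"erm", "ooh", "mhm"}


def strip_fillers(utterance: str) -> str:
    """Remove filler tokens from a whitespace-separated transcript line."""
    kept = []
    for token in utterance.split():
        # Compare without surrounding punctuation and without regard to case.
        if token.strip(".,!?").lower() not in FILLERS:
            kept.append(token)
    return " ".join(kept)


print(strip_fillers("Erm, I think, mhm, it was ooh really late"))
# I think, it was really late
```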
Documentation
This means providing each input text with a unique ‘header document’ which records its essential features. Headers typically give bibliographic information (title, author’s name, date and place of publication, and the like) and precisely locate each text in whatever typology is being used.
Linguistic Annotation
This means enriching raw text by adding grammatical information which will enable corpus users to frame sophisticated queries and extract maximum benefit from the data. For instance, She is tagged as a personal pronoun, and Really is tagged as a general adverb. A well-tagged corpus allows us to focus on each pattern in turn and view a manageable number of examples.
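To show what tagged text makes possible, here is a toy sketch. It is not a real tagger: the lookup table, the tag names, and the helper functions are all invented for illustration, and real corpora use trained part-of-speech taggers with established tagsets.

```python
# Invented mini-lexicon and tag names, for illustration only.
TOY_LEXICON = {
    "she": "PERSONAL_PRONOUN",
    "really": "GENERAL_ADVERB",
    "likes": "VERB",
    "tea": "NOUN",
}


def tag(sentence: str) -> list[tuple[str, str]]:
    """Attach a tag to every whitespace-separated word; unknown words get 'UNKNOWN'."""
    return [(word, TOY_LEXICON.get(word.lower(), "UNKNOWN")) for word in sentence.split()]


def find_by_tag(tagged_sentences, wanted_tag: str) -> list[str]:
    """Collect every word carrying `wanted_tag` across the tagged sentences."""
    return [word for sentence in tagged_sentences for word, t in sentence if t == wanted_tag]


corpus = [tag("She really likes tea"), tag("Really she likes coffee")]
print(find_by_tag(corpus, "GENERAL_ADVERB"))  # ['really', 'Really']
```

Querying by tag rather than by word form is what lets a user pull out one grammatical pattern at a time instead of reading every occurrence.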
Final Thoughts
In this part, a methodology for building a corpus for use in lexicography has been outlined. This is certainly a difficult task, and there is no perfect corpus, since language is diverse and dynamic. The aim is to form a balanced, standardized, well-tagged corpus. For many kinds of research, a corpus with meticulously detailed headers and finely grained linguistic annotation is precisely what is needed.
Summary (translated from Turkish): Lexicographic Evidence
This part has explained how the data that will serve as the source for a planned dictionary are designed, collected, and processed. First of all, the importance of the personal vocabulary of those who will prepare the dictionary should be emphasized. It is certain, however, that an individual's vocabulary, however extensive, is not sufficient for such a work. In the past, the most frequently used method of data collection was the gathering of citations. Although this method has lost some of its former importance, it is still in use today. Indeed, special programmes have been developed to collect data from volunteers over the internet. With the development of computer and internet technologies, the corpus has become the most widely used method of data collection. Throughout the part, it has been emphasized that those who prepare a corpus face a great many questions and that their work is very difficult. Language is diverse and dynamic, so a perfect corpus cannot be expected. The aim of lexicographers should be to create a corpus that is balanced, that best represents the language in use, that takes into account the variables arising from the people who use the language and the settings in which it is used, and that is organized in a way that is useful for preparing a dictionary.

Real Time Object Detection Using Open CV
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

lexicographic evidence

  • 1. Lexicographic Evidence: This part explains how to design, acquire, and process a collection of linguistic data that will form the raw material for a dictionary.
  • 2. Comprehension Questions (1): 1. What is a reliable dictionary? 2. What is subjective evidence, and what are its limits? 3. What is a citation? 4. What should be the basic steps in setting up a reading programme? 5. What are the advantages and disadvantages of citations?
  • 3. Comprehension Questions (2): 6. What is a corpus? 7. What are the points that should be considered in designing a corpus? 8. How large should a corpus be? 9. How do we decide what kinds of written or spoken material our corpus should include? 10. Can a corpus be representative?
  • 4. Comprehension Questions (3): 11. What is ‘skewing’? 12. What are the questions that should be answered before starting to form the corpus? 13. What is linguistic annotation?
  • 5. A ‘Reliable’ Dictionary: A reliable dictionary is one whose generalizations about word behavior approximate closely to the ways in which people normally use language when engaging in real communicative acts. Yet it is difficult to determine how people normally use words. There is a need for evidence.
  • 6. Subjective Evidence and Its Limits: Introspection, that is, consulting your own mental lexicon, is a form of evidence, but it cannot form the basis of a reliable dictionary on its own, since one individual’s store of linguistic knowledge is inevitably incomplete and idiosyncratic. Informant-testing, in which speakers of a language are questioned about their use of words, is also of limited value for mainstream lexicography for similar reasons. Both are essentially subjective forms of evidence. Creating a reliable dictionary involves a number of challenging tasks, but the observation of language in use is certainly the indispensable first stage in the process.
  • 7. Citations: A citation is a short extract from a text which provides evidence for a word, phrase, usage, or meaning in authentic use. Until the late twentieth century, the OED’s citations were written in longhand on index cards known as slips. These were filed alphabetically according to the keyword of the citation.
  • 8. DNA: If a blog has a common ancestor with the diary, one can say that it has a DNA. E.g. ‘MySpace’ shares at least some of its DNA with the ‘scrapbook’.
  • 9. Setting up a Reading Programme: Some dictionary publishers provide online forms to enable members of the public to contribute citations. Most of these publishers receive unusable citations because their programmes are not well planned. A good reading programme, on the other hand, will often have great value.
  • 10. Setting up a Reading Programme: There is a need for at least four main data fields: 1- Keyword or phrase: the usage that the citation illustrates, filed under the headword to which it relates. 2- The citation itself: usually a single sentence is adequate, but there may be more than one. 3- Information about the source of the citation: the date, title, and author’s name are all important; additional information (such as the page number) may be useful for specialized or historical dictionaries. 4- A comment field: this gives readers the option of adding a note to clarify the citation; it may, for example, be a new meaning that needs explaining, or it may be characteristic of one particular dialect.
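The four data fields above could be captured in a small record type so that incomplete submissions are caught before they are filed. This is only a sketch: the class name `Citation`, its attribute names, and the sample source details are all hypothetical and do not come from any publisher's actual submission form.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Citation:
    """One citation slip, modelled on the four data fields listed above."""
    keyword: str                 # field 1: the usage being illustrated
    text: str                    # field 2: the citation itself
    source_title: str            # field 3: source information
    source_author: str
    source_date: str
    page: Optional[str] = None   # optional extra source detail
    comment: str = ""            # field 4: optional clarifying note

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that were left blank."""
        required = {
            "keyword": self.keyword,
            "text": self.text,
            "source_title": self.source_title,
            "source_author": self.source_author,
            "source_date": self.source_date,
        }
        return [name for name, value in required.items() if not value.strip()]


# The source details below are invented placeholders, not a real reference.
complete = Citation(
    keyword="DNA",
    text="MySpace shares at least some of its DNA with the scrapbook.",
    source_title="(hypothetical title)",
    source_author="(hypothetical author)",
    source_date="(hypothetical date)",
    comment="Figurative sense: shared origins or characteristics.",
)
incomplete = Citation(
    keyword="DNA",
    text="MySpace shares at least some of its DNA with the scrapbook.",
    source_title="(hypothetical title)",
    source_author="(hypothetical author)",
    source_date="",
)
```

A reading programme could reject or flag any slip for which `missing_fields()` is not empty, which addresses the problem of unusable contributions mentioned earlier.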
  • 11. Advantages of Citations: 1- They are helpful for monitoring language change. 2- They give information about the terminology of a specific subject field or a particular variety or dialect. 3- They are helpful in training lexicographers.
  • 12. Disadvantages of Citations: 1- Collecting data in this way is labour-intensive, so volumes will always be low. 2- Although instances of usage are authentic, there is a big subjective element in their selection.
  • 13. The Central Role of the Corpus: A citation bank alone, even the largest one, cannot usually supply language data in the required volumes, so the case for a large corpus is clear. A “corpus” is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005).
  • 14. Some Inescapable Truths: There is no such thing as a perfect corpus for lexicography. First of all, the corpus is a sample. It is not possible to examine every extant example of usage for a language. To create a sample that fairly reflects the wider population, there is a need for carefully selected criteria. Secondly, selecting texts on the basis of their ‘quality’, and excluding those which fail this test, is fundamentally at odds with the descriptive ethos of corpus linguistics. Who is to judge which texts are ‘good’, and on what basis? It is clear that a lexicographic corpus must be a genuine, and inclusive, snapshot of a language, not a set of texts that have been specially chosen to advance someone’s notion of what constitutes ‘good’ usage.
  • 15. Corpora: Design Issues: Designing a corpus means making decisions about: 1- how large it will be; 2- which broad categories of text it will include; 3- what proportions of each category it will include; 4- which individual texts it will include.
  • 16. Size: How large is large enough? It is certain that the more data we have, the more we learn. Yet there are also some hypotheses about the size of the corpus. Zipf’s Law predicts that the tenth most frequent word in a corpus will occur twice as often as the 20th most frequent word, ten times as often as the 100th most frequent word, and 100 times as often as the 1000th most frequent word. Thus, it can be said that in a corpus of 100 million words, a simple right- or left-sorted corpus clearly shows most of the normal patterns of usage for all words except the very rare.
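The ratios quoted for Zipf's Law follow from the idealized form of the law, in which a word's frequency is inversely proportional to its rank. The sketch below checks that arithmetic. The function names and the figure of 1,000,000 occurrences for the top-ranked word are assumptions chosen for illustration; real corpora only approximate this curve.

```python
def zipf_frequency(rank: int, top_frequency: float) -> float:
    """Idealized Zipf's Law: frequency is inversely proportional to rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_frequency / rank


def frequency_ratio(rank_a: int, rank_b: int, top_frequency: float = 1_000_000) -> float:
    """How many times more often the word at rank_a occurs than the word at rank_b."""
    return zipf_frequency(rank_a, top_frequency) / zipf_frequency(rank_b, top_frequency)


# Compare the 10th most frequent word with the 20th, 100th, and 1000th.
for other_rank in (20, 100, 1000):
    print(other_rank, frequency_ratio(10, other_rank))
```

Because frequency falls off so steeply with rank, most distinct words in any corpus are rare, which is why a very large corpus is needed before the less common words show enough examples to reveal their patterns.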
  • 17. Different texts, different styles: However large a corpus may be, if its words are taken from only a limited area (for instance, from newspapers), they cannot represent all aspects of the language, and the results may be misleading. (For instance, the word party will most frequently occur in the sense of a political organization rather than a social event.) A corpus consisting of a single type of text will reflect only the stylistic and subject-matter features of that particular genre. It will be, as corpus linguists say, a ‘skewed’ corpus. Therefore, the corpus should include different texts and different styles.
  • 18. Can a Corpus be Representative? The standard way of avoiding bias is to collect a ‘random sample’. Yet random sampling may not represent the language well. One partial solution is to apply stratified sampling. This involves breaking up the total population into a number of subcategories or types, then creating independent random samples from each of these groupings. But this immediately raises two questions: 1- How do we define these subcategories? 2- How do we decide what proportions of each subcategory the corpus should include?
  • 19. It is almost impossible to define the population that the corpus should be representative of, and since the population is unlimited, it is logically impossible to establish ‘correct’ proportions of each component. An achievable objective should be “a balanced corpus”.
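Stratified sampling as described above can be sketched in a few lines: group the candidate texts by subcategory, then draw an independent random sample from each group. The category names and the proportions used here are arbitrary assumptions for illustration; as the slides point out, there is no way to establish ‘correct’ proportions, so these numbers are a design decision, not a fact about the language.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_sample(
    texts: List[Tuple[str, str]],     # (text_id, category) pairs
    proportions: Dict[str, float],    # category -> share of the sample
    total: int,                       # number of texts wanted overall
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Draw an independent random sample from each category."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    # Break the population up into its subcategories.
    by_category: Dict[str, List[str]] = defaultdict(list)
    for text_id, category in texts:
        by_category[category].append(text_id)

    sample: Dict[str, List[str]] = {}
    for category, share in proportions.items():
        quota = round(total * share)
        pool = by_category[category]
        if quota > len(pool):
            raise ValueError(f"not enough texts in category {category!r}")
        sample[category] = rng.sample(pool, quota)
    return sample


# Hypothetical population: 50 candidate texts in each of three categories.
population = [
    (f"{category}-{i}", category)
    for category in ("news", "fiction", "spoken")
    for i in range(50)
]
chosen = stratified_sample(
    population, {"news": 0.5, "fiction": 0.3, "spoken": 0.2}, total=20
)
```

Changing the proportions changes what the corpus over- or under-represents, which is exactly the second question raised above.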
  • 20. Selecting Texts: Corpus collection is usually recursive. First, some texts from a range of sources are gathered. Next, the texts are analyzed to identify recurring clusters of linguistic features. This enables us to establish provisional categories of texts, grouped on the basis of shared linguistic features. Then more texts are collected to reflect these feature distributions, and the analysis is repeated on the enlarged corpus. The process thus proceeds in a cyclical fashion until we collect a large corpus whose contents reflect the proportions in which the various key features are observable in large bodies of text.
  • 21. Spoken Data: A Special Case: With a corpus of spoken language, there are no obvious objective measures that can be used to define the target population. The spoken data should represent variables like gender, social class, age, and religion. The conversations that form the corpus should reflect the diversity of the spoken language.
  • 22. A Note on ‘Skewing’: Skewing refers to a form of bias in data whereby a particular feature is either over- or under-represented to a degree that distorts the general picture. As corpora grow larger, problems with skewing usually recede gradually.
  • 23. There are some questions that should be answered before starting to form the corpus. Language: Will the corpus be monolingual, bilingual, or multilingual? Time: Will the corpus be synchronic or diachronic? In a synchronic corpus, the constituent texts come from one specific period of time, whereas the texts making up a diachronic corpus come from an extended period. Mode: Will the corpus include written texts, spoken texts, or both? The status of chat room conversations, which have the characteristics of both spoken and written texts, is another point that requires attention in corpus formation.
  • 24. Medium: Medium refers to the channel in which the text appears. A simple classification here would distinguish print media and spoken media. The former include books, newspapers, magazines, journals, dissertations, movie scripts, government documents, and legal statutes. Spoken media include face-to-face conversations, broadcasts and podcasts, public meetings, and educational settings. Once again, traditional categories become blurred when we add the web to the mix. Some ‘new’ text types (blogs and social networking sites, for example) are exclusive to the web, but many documents exist in both print and electronic media.
  • 25. Dealing with Sublanguages: When we think about the vocabulary of a language, it is useful to make a broad distinction between core usages and sublanguages. The word deuce is part of a sublanguage: it belongs to the vocabulary of tennis. A word like important, on the other hand, belongs to the core vocabulary of English. The following question arises at this point: will we include the sublanguages?
  • 26. Collecting Written Data: In the past, the work of lexicographers was not so easy. Earlier corpora made extensive use of scanning and keyboarding, which were both slow and labour-intensive processes. Today it is possible to find various texts in digital form.
  • 27. Collecting Spoken Data: Traditionally, spoken data has been difficult and expensive to collect. Consequently, although the majority of communicative events in a language occur in spoken mode, few corpora include high proportions of spoken material. For instance, only 10 per cent of the BNC is spoken. Nowadays, web-derived spoken data, which offers up-to-date material in large quantities and at low cost, begins to look like an attractive alternative.
  • 28. Collecting Data from the Web: The question of whether the web is a corpus is a hotly debated topic in language engineering circles. For lexicography, it is better to see the web as a source of texts from which a lexicographic corpus can be assembled.
  • 29. Sample Size: There are arguments for using complete texts rather than extracts. In many registers, the discourse structure and rhetorical features of a text may vary as it proceeds from its opening paragraphs, through its central sections, to the concluding chapters. The BNC’s solution to this was to ensure that 40,000-word samples were taken variously from the beginning, middle, and end of its source documents.
  • 30. Copyright and Permissions: Unless a corpus is made up of much older texts, most of its source material is likely to be protected by copyright. So corpus-builders should get permission from the copyright owners to include the documents in their corpus. This is not an easy task; it is one of the most time-consuming aspects of the project. It is recommended that corpus-builders never offer to pay for permission to include a text. Once money starts changing hands, a precedent would be established that could have fatal consequences for corpus-creation efforts worldwide.
  • 31. Processing and Annotating the Data: To give the corpus its final form from its raw state, some operations are carried out.
  • 32. Clean-up, Standardization, and Text Encoding: Essentially, this is the process of taking a heterogeneous collection of input documents and converting them all to a standard, usable form. For instance, non-linguistic sounds in spoken data (like erm, ooh, mhm) and unusable material in written data (like indexes, tables, diagrams) are not included in the corpus.
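One small part of the clean-up step, removing non-linguistic sounds from transcribed speech, might look like the sketch below. The filler list contains only the three examples named above (erm, ooh, mhm); the function name, the treatment of trailing punctuation, and the whitespace normalization are assumptions rather than a description of any particular corpus project's pipeline.

```python
import re

# Only the three fillers named in the slide; a real project would have its own list.
FILLERS = ("erm", "ooh", "mhm")

# Match a filler as a whole word, plus an immediately following comma or
# full stop and any trailing whitespace, so no stray punctuation is left.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b[,.]?\s*",
    re.IGNORECASE,
)


def clean_utterance(text: str) -> str:
    """Remove fillers from a transcribed utterance and normalize spacing."""
    without_fillers = _FILLER_RE.sub("", text)
    return re.sub(r"\s+", " ", without_fillers).strip()


print(clean_utterance("Erm, I think, mhm, it was  ooh quite late"))
```

The word-boundary anchors keep the pattern from touching ordinary words that merely contain a filler as a substring, such as "smooth".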
  • 33. Documentation: Providing each input text with a unique ‘header document’ which records its essential features. Headers typically give bibliographic information (title, author’s name, date and place of publication, and the like) and precisely locate each text in whatever typology is being used.
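A header can be thought of as a small structured record attached to each text, which then lets corpus users select texts by their place in the typology. In the sketch below, the bibliographic fields follow the list above and the `mode` and `medium` fields follow the design questions discussed earlier; the class name, the field names, the `select` helper, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Header:
    """Header document for one corpus text (field names are illustrative)."""
    text_id: str   # unique identifier for the text
    title: str     # bibliographic information
    author: str
    date: str
    place: str
    mode: str      # position in the typology, e.g. "written" or "spoken"
    medium: str    # e.g. "newspaper", "book", "conversation"


def select(headers: List[Header], **criteria: str) -> List[str]:
    """Return the ids of texts whose headers match every given criterion."""
    return [
        h.text_id
        for h in headers
        if all(getattr(h, key) == value for key, value in criteria.items())
    ]


# Invented sample headers, not real corpus entries.
headers = [
    Header("T001", "(title)", "(author)", "2005", "(place)", "written", "newspaper"),
    Header("T002", "(title)", "(author)", "2006", "(place)", "written", "book"),
    Header("T003", "(title)", "(author)", "2006", "(place)", "spoken", "conversation"),
]
```

For example, `select(headers, mode="written")` picks out the written texts, and adding `medium="book"` narrows the selection further.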
  • 34. Linguistic Annotation: Enriching raw text by adding grammatical information which will enable corpus users to frame sophisticated queries and extract maximum benefit from the data. For instance, She is tagged as a personal pronoun, and Really is tagged as a general adverb. A well-tagged corpus allows us to focus on each pattern in turn and view a manageable number of examples.
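To illustrate how tags support queries, the sketch below stores a sentence as (word, tag) pairs and retrieves words by tag. The sentence is invented and annotated by hand, and the tag labels (`PRON`, `ADV`, and so on) are made up for this example; they are not the codes of any particular tagset, and no automatic tagger is involved.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, part-of-speech tag)

# A hand-annotated example sentence; the tag labels are illustrative only.
tagged_sentence: List[TaggedToken] = [
    ("She", "PRON"),      # personal pronoun
    ("really", "ADV"),    # general adverb
    ("enjoyed", "VERB"),
    ("the", "DET"),
    ("party", "NOUN"),
]


def words_with_tag(tokens: List[TaggedToken], tag: str) -> List[str]:
    """Return every word carrying the given tag, in text order."""
    return [word for word, token_tag in tokens if token_tag == tag]


def following_tags(tokens: List[TaggedToken], tag: str) -> List[str]:
    """Return the tag of the token that follows each occurrence of `tag`."""
    return [
        tokens[i + 1][1]
        for i, (_, token_tag) in enumerate(tokens[:-1])
        if token_tag == tag
    ]
```

A query such as `following_tags(tagged_sentence, "ADV")` is a toy version of the pattern-focused queries described above: it asks what kind of word an adverb is modifying without having to read every example by eye.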
  • 35. Final Thoughts: In this part, a methodology for building a corpus for use in lexicography has been outlined. This is certainly a difficult task, and there is no perfect corpus, since language is diverse and dynamic. The aim is to form a balanced, standardized, well-tagged corpus. For many kinds of research, a corpus with meticulously detailed headers and fine-grained linguistic annotation is precisely what is needed.
  • 36. Summary (translated from Turkish): Lexicographic Evidence: This part has explained how the data that will serve as the source for a planned dictionary are designed, collected, and processed. First, the importance of the dictionary compilers’ own vocabulary should be emphasized. It is certain, however, that an individual’s vocabulary, no matter how large, is not sufficient for such a task. In the past, the most widely used method of data collection was the gathering of citations. Although this method has lost some of its importance today, it is still in use; special programmes have even been developed to collect data from volunteers over the internet. With the development of computer and internet technologies, corpus building has become the most widely used method of data collection. Throughout the part, it has been stressed that those who compile a corpus face many questions and that their work is very difficult. Language is diverse and dynamic, so a perfect corpus cannot be expected. The lexicographers’ aim should be to build a corpus that is balanced, that best represents the language in use, that takes into account the variables arising from the people who use the language and the settings in which it is used, and that is organized in a way that is useful for preparing a dictionary.