This document summarizes a research paper on using statistical analysis of function words to analyze authorship in forensic linguistics. It discusses using t-tests to cluster texts by the same author and to discriminate between authors based on the frequency of words such as "as", "it", "that", and "there". The study found that t-tests were better at discrimination than at clustering and that the analysis broke down as the amount of text was reduced. It concludes that function word analysis may be a useful forensic linguistics tool, but that it works better on longer texts and needs further analysis for shorter ones.
3. Playing detective?
In forensic science – investigators look for clues
that the culprit leaves unwittingly;
In linguistics – ‘unconscious language’
i.e. Function Words (de Vel, 2001; Argamon
& Levitan, 2005; Burrows, 2003)
Rather old idea (Wallace & Mosteller, 1964);
revisited in Holmes & Forsyth (1995).
5. Advantages of Function Words in FL
‘Unconscious language’
Numerous even in a relatively short text.
Can be easily counted
Related to the Daubert Criteria
Enables corpus analysis (Key Words in Context)
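A Key Words in Context listing can be sketched in a few lines. This is only an illustration of the idea, not the concordance software used in the study; the function name `kwic`, the `width` parameter and the sample sentences (taken from the example slides) are assumptions made for the sketch.

```python
import re

def kwic(text, keyword, width=30):
    """Return one concordance line per whole-word occurrence of `keyword`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

sample = ("We woke up early to catch the ferry and it couldn't have been easier. "
          "it was there strictly for the tourists")

for line in kwic(sample, "it"):
    print(line)
```

Each printed line shows one hit in square brackets with its context, which is what lets an analyst assign a semantic role to every occurrence by hand.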
6. The Daubert Criteria
1. The theory must have been tested;
2. It must have been subjected to peer review and
publication;
3. It must have a known error rate;
4. It must be generally accepted in the scientific community.
(Tiersma & Solan, 2002, cited in Coulthard,
2004; Chaski, 1997; Grant, 2007)
7. Implications for linguists
Increased pressure on the linguists to use
mathematical methods, repeatable procedures;
Forensic linguists must serve justice;
‘Beyond reasonable doubt’ in criminal cases
(Grant, 2010)
‘Raise legitimate doubt’ in civil cases (ibid.)
The method is King, not the expert.
8. It is ‘a challenge to the academic community to
test the error rate and at the same time to fix an
acceptable statistical equivalent for “beyond
reasonable doubt”’
Coulthard (2004: 476)
It is ‘the linguist’s responsibility to create
theoretically sound hypotheses’ and test them
Chaski (2001: 2)
9. Idiolect
Defined as the idiosyncratic use of dialect, idiolect
is a way of speaking (and, consequently, writing)
that is unique for each individual
Chaski (1997).
‘the totality of the possible utterances of one
speaker at one time in using a language to
interact with one other speaker’
Bloch (1948, cited in Grieve,
2007: 255).
10. Theory
Grant (2010) - two theoretical frameworks:
Idiolect is linked to neuroscience
The author is influenced by the language he/she
is exposed to.
De Vel’s (2001) and Argamon & Levitan’s
(2005) claims about certain function words
being unconscious linguistic choices – also a
theory.
11. Theory (cont.)
Grant (2010):
‘simple detection of consistency and determination of
distinctiveness’ would be able to help practical
authorship analysis more than even a strong theory.
12. Hypotheses
The use of function words is unique to each
individual (could be limited by context or genre) -
idiolect;
The frequency of certain function words is an
authorship marker (e.g. Holmes & Forsyth, 1995);
The frequency of semantic roles that certain
function words play is also an authorship
marker.
13. Semantic Roles
Semantic roles are the word’s functions in the
specific context of the sentence.
The words I analysed were AS, IT, THAT and
THERE
Criteria: frequency (corpus) and explicit multiple
meanings
14. AS
Function | Examples
Start of time adjunct clause | As we approached the small hut; as I followed the masses
Fixed phrase as [adj/adv] as | As easily as; as soon as; as well as
AS + noun phrase | as a museum; as the red-light district
AS at the start of a manner adjunct | as you can imagine; as the locals do
AS could be replaced with because | big push for the Chinese people to learn English, as they have now made it mandatory in their schools
AS is used for comparison | as if they knew we were on their turf; still as a board; the same as fall back in Chicago
15. IT
Function | Examples
IT serves as a dummy subject:
IT + [to be] + predicative + infinitive | It's hard to enjoy a festival the same way
IT + [to be] or other verb phrase (+ adj/noun phrase) + relative clause (that, if etc.) | It turns out I'll be going to at least four
IT + [to be] + time reference | it's time for Pendulum
16. IT (cont.)
Function | Examples
IT + seem/feel/any other perception verb | it stops feeling like Hannover
IT + [to be] + noun phrase | it would have been a great day
IT refers to something mentioned before | We woke up early to catch the ferry and it couldn't have been easier.
IT is a part of a fixed phrase | We made it to Macau in less than 2 hours
17. THAT
Function | Examples
THAT begins a subordinate clause | I also couldn't help but notice that when I looked toward the island
THAT could be replaced with which | It was the spot on the beach that was shaped like a triangle
THAT is a determiner | That night, we all reconvened at the hotel
18. THERE
Function | Examples
THERE serves as a dummy subject | there are a few longhaired dogs
THERE refers to a place | it was there strictly for the tourists
19. My Dataset
Feature | Author A | Author B
Type of text | Travel Blog | Travel Blog
Gender (self-declared) | Female | Male
Mother tongue and variety (self-declared) | English (American) | English (perhaps Irish)
Website URL (the data source) | http://www.travelblog.org | http://www.getjealous.com

Size of K corpus (words)
Author | 9 texts | 7 texts | 5 texts | 3 texts | Q text
Author A | 20,875 | 16,118 | 11,024 | 6,260 | 2,479
Author B | 7,991 | 6,176 | 4,241 | 2,611 | 750
20. Methodology
Texts were imported into TEXTSTAT concordance software;
Words AS, IT, THAT and THERE were chosen for their
explicit diverse meanings in the sentence;
Quantitative analysis was used to determine how different
(or similar) the authors were in terms of their frequency of
use of function words and their meanings;
The number of texts was reduced to see if at some point
analysis breaks down (compare to Grant, 2007);
Statistical technique used – T-TEST
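The counting step can be sketched as follows. The study used TEXTSTAT for this; the code below is only an assumed stand-in that counts the four marker words and normalises them per 1,000 words so that texts of different lengths can be compared. The tokenisation rule, the treatment of contractions (counting "it's" as IT, as in the example slides) and the names `tokenize` and `marker_rates` are assumptions. It counts word forms only; assigning semantic roles still requires reading each occurrence in context.

```python
import re
from collections import Counter

MARKERS = ("as", "it", "that", "there")

def tokenize(text):
    # Lowercase, normalise curly apostrophes, keep contractions as one token.
    text = text.lower().replace("\u2019", "'")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)

def marker_rates(text, markers=MARKERS, per=1000):
    """Frequency of each marker word per `per` tokens of running text."""
    tokens = tokenize(text)
    # Assumption: a contraction counts towards its first part ("it's" -> "it").
    bases = [token.split("'")[0] for token in tokens]
    counts = Counter(base for base in bases if base in markers)
    total = len(tokens)
    return {m: (counts[m] * per / total if total else 0.0) for m in markers}

example = "It's hard, as you can imagine, and it was there."
print(marker_rates(example))
```

Normalising by text length matters here because the K corpora of the two authors differ considerably in size.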
21. Matrix of Probabilities
Application | PSA values | Meaning
Clustering | PSA > 90% | Success
Clustering and Differentiating | PSA ≥ 95% | ‘Beyond Reasonable Doubt’
Differentiating | PSA < 85% | Definite Failure (error rate at 15% causes reasonable doubt).
Clustering and Differentiating | PSA > 50% | Balance of probabilities – suitable for civil court.
PSA = probability of same authorship
Clustering = the author of both texts is likely to be the same
person
Differentiating = texts were written by different authors
Beyond reasonable doubt: 95%
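The matrix can be read mechanically as a list of threshold rules. The sketch below is one possible reading, not the author's procedure: it simply returns every row of the matrix whose application and threshold match a given PSA value, without deciding which label takes precedence. The names `MATRIX` and `matching_rows` are hypothetical.

```python
# Each rule: (applications it covers, test on PSA in percent, meaning on the slide).
MATRIX = [
    ({"clustering"}, lambda psa: psa > 90, "Success"),
    ({"clustering", "differentiating"}, lambda psa: psa >= 95, "Beyond reasonable doubt"),
    ({"differentiating"}, lambda psa: psa < 85, "Definite failure"),
    ({"clustering", "differentiating"}, lambda psa: psa > 50, "Balance of probabilities"),
]

def matching_rows(psa, application):
    """Return the meanings of all matrix rows that apply to this PSA value."""
    return [meaning for apps, test, meaning in MATRIX
            if application in apps and test(psa)]

print(matching_rows(96, "clustering"))
print(matching_rows(60, "differentiating"))
```

Several rows can match at once, because the criminal and civil standards overlap; which one is reported depends on the court in question.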
22. Findings: T-Test
Clustering: analysing each marker of one author against the values of that marker in the Q text by the same author. How likely is that person to have produced the Q text?
Discriminating: analysing each marker of one author against the values of that marker in the Q text by the other author. How likely is it that the K and Q texts were produced by the same person?
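The comparison of K texts against a Q text can be sketched as a one-sample t-test. The slides do not say which variant of the t-test was used; a one-sample test is assumed here because the calculator listed under Websites is a one-sample one. The rates in the example are made up for illustration, the function name `one_sample_t` is hypothetical, and how the test result maps onto a PSA value is not stated in the slides, so the sketch stops at the comparison with a critical value from standard t-tables.

```python
import math
import statistics

# Two-tailed critical values of t at the 5% level, keyed by degrees of freedom
# (3, 5, 7 and 9 known texts give 2, 4, 6 and 8 degrees of freedom).
T_CRIT_05 = {2: 4.303, 4: 2.776, 6: 2.447, 8: 2.306}

def one_sample_t(known_rates, q_rate):
    """t statistic for the K-text marker rates against the single Q-text rate."""
    n = len(known_rates)
    mean = statistics.mean(known_rates)
    sd = statistics.stdev(known_rates)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("marker rate is identical in all K texts")
    t = (mean - q_rate) / (sd / math.sqrt(n))
    return t, n - 1

# Hypothetical rates of one marker per 1,000 words in five K texts.
known = [10.0, 12.0, 11.0, 9.0, 13.0]

for q in (11.5, 14.0):
    t, df = one_sample_t(known, q)
    differs = abs(t) > T_CRIT_05[df]
    print(f"Q rate {q}: t = {t:.3f}, df = {df}, differs at 5% level: {differs}")
```

With only three K texts the critical value rises to 4.303, so a difference has to be much larger before it registers.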
23. Findings: Reliability of markers
All texts by one author compared against each other
Every semantic role of each function word was included
Special attention: success of the test depends on the amount
of text
Not all markers are reliable; their frequency can be too
low in a short text
Marker | Clustering | Discrimination
AS | Very inconsistent | Consistent
IT | Very consistent | Very consistent
THAT | Depends on the amount of text (A: yes; B: no) | Depends on the amount of text (A: yes; B: no)
THERE | Very consistent | Very consistent
24. T-Test: Success
Beyond Reasonable Doubt: 95% or more
Function Word | Function | Clustering (A, B), Discriminating
AS | Start of time adjunct clause | FAIL YES BRD NO BRD
AS | Fixed phrase as [adj/adv] as | BRD FAIL FAIL YES BRD
AS | AS + noun phrase | FAIL BRD YES YES NO
AS | AS at the start of a manner adjunct | FAIL YES BRD N/A NO
AS | AS could be replaced with because | BRD BRD N/A N/A N/A
AS | AS is used for comparison | YES BRD BRD FAIL NO
25. T-Test: Success (cont.)
Function Word | Function | Clustering (A, B), Discriminating
IT | Dummy subject | YES YES BRD FAIL BRD
IT | Dummy subject at the start of the sentence | FAIL FAIL FAIL FAIL NO
THAT | That begins a subordinate clause | BRD YES FAIL FAIL NO
THAT | That could be replaced with which | FAIL FAIL BRD BRD BRD
THAT | That is a determiner | FAIL FAIL FAIL YES BRD
THERE | Dummy subject | YES BRD N/A FAIL NO
THERE | Dummy subject at the start of the sentence | FAIL FAIL N/A FAIL BRD
26. Results
Marker | Success | Failure | Explanation
AS | 50% | 33.33% | A fairly reliable marker. Would do in civil court.
IT | 80% | 20% | The most reliable marker in this study. IT at the start of the sentence has no linguistic theory behind it, and failure was expected.
THAT | 46.67% | 53.33% | Also in Mackevic (2011): “Very unreliable across all authors – enormous error rates; PSA shooting over 50% most of the time.”
THERE | 30% | 50% | Marker totally unreliable.
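The percentages for THAT and THERE can be reproduced from the cells of the T-test tables with a simple tally. The scoring rule used here is an inference, not something the slides state: YES and BRD are counted as successes, FAIL and NO as failures, and N/A cells count towards the total but towards neither rate. The sketch covers only THAT and THERE, whose cell values are copied from the tables; `OUTCOMES` and `rates` are hypothetical names.

```python
from collections import Counter

# Cell values copied row by row from the T-test tables.
OUTCOMES = {
    "THAT": ["BRD YES FAIL FAIL NO", "FAIL FAIL BRD BRD BRD", "FAIL FAIL FAIL YES BRD"],
    "THERE": ["YES BRD N/A FAIL NO", "FAIL FAIL N/A FAIL BRD"],
}

def rates(rows):
    """Return (success %, failure %) over all cells of a marker's rows."""
    cells = [cell for row in rows for cell in row.split()]
    counts = Counter(cells)
    success = counts["YES"] + counts["BRD"]   # assumed to count as success
    failure = counts["FAIL"] + counts["NO"]   # assumed to count as failure
    total = len(cells)                        # N/A cells stay in the total
    return round(100 * success / total, 2), round(100 * failure / total, 2)

for marker, rows in OUTCOMES.items():
    print(marker, rates(rows))
```

Under this rule the success and failure rates need not add up to 100%, since N/A cells belong to neither group.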
27. Discussion of Results
Most of the markers – much better at
discriminating than at clustering;
A lot depends on the text’s length – when I
started removing texts from the corpus (9, then 7,
then 5 and finally 3), analysis began breaking
down;
6000 words for the reference corpus –
approximate benchmark.
Possible conclusion: function words are really
better for longer texts, which also occur in
forensic settings.
28. Why did T-test fail?
Possible explanation: some markers occurred very rarely
They had little linguistic significance (no theory behind)
Analysis broke down with very consistent markers. Why?
Possibly, because the amount of text (number of words)
was insufficient
For comparison: Grant (2010) also reports his
analysis breaking down when the amount of text is
reduced
Perhaps qualitative analysis is better for shorter texts
But it works against the Daubert Criteria
29. Recommendations
Use grammar reference books for semantic roles of
function words and more detailed division of
roles
Choose different words (look at what worked for other
authors)
Try more texts, but short ones (e.g. 50 texts of 400
words each)
Try more statistical techniques
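The recommendation to try many short texts could be approximated by cutting an existing corpus into equal-sized samples. This is only a sketch of that idea under the assumption that splitting on whitespace is an acceptable word count; it is not a procedure from the study, and `chunk_words` is a hypothetical name.

```python
def chunk_words(text, size=400):
    """Split `text` into consecutive samples of exactly `size` words."""
    words = text.split()
    # A trailing fragment shorter than `size` is dropped so that all samples
    # have the same length and their marker counts stay comparable.
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, size)]

# Placeholder text of 1,000 words, used only to show the behaviour.
demo = " ".join(f"w{i}" for i in range(1000))
samples = chunk_words(demo)
print(len(samples), [len(s.split()) for s in samples])
```

Samples cut from one long text are not independent in the way separately written texts are, so results on such samples would need to be interpreted with care.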
30. Conclusion
Function words – potentially another tool in a forensic
linguist’s toolbox
T-Test – good analytical tool;
It returns exact results with certain error rates that are
easy to interpret (consistent with Daubert criteria)
However, it also has some limitations and additional
analysis may be needed to complete the picture
T-Test works with discriminating better than with
clustering
Analysis breaks down with small corpora
31. References
NB: The references are from the original paper; some authors present in this
list may not have been cited in the presentation
Books and Journals
Argamon, S. & Levitan, S. (2005) Measuring the Usefulness of Function Words for Authorship
Attribution [Online]. Available at:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.6935&rep=rep1&type=pdf [Accessed
12 September 2010]
Burrows, J. (2003). Questions of Authorship: Attribution and Beyond. Computers and the Humanities
[Online] 37, pp. 5-23. Available from: http://www.springerlink.com/content/nv46t75125472350/
[Accessed 1 August 2010].
Chaski, C. E. (1997). Who Wrote It? Steps Towards a Science of Authorship Identification. National
Institute of Justice Journal. (September Issue) [Online]. Available from:
http://www.ncjrs.gov/pdffiles/jr000233.pdf [Accessed 31 January 2010].
Chaski, C. E. (2001). Empirical evaluations of language-based author identification techniques. The
International Journal of Speech, Language and the Law [Online] 8 (1), pp. 1-65. Available from:
http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1690/1151 [Accessed 12 June
2008].
Chaski, C. E. (2005). Who’s at the Keyboard? Authorship Attribution in Digital Evidence
Investigations. International Journal of Digital Evidence [Online] 4 (1), pp. 1-14. Available from:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.3852&rep=rep1&type=pdf [Accessed
31 January 2010].
Coulthard, M. (1998). Identifying the Author. Cahiers de Linguistique Française [Online] 20, pp. 139-161.
Available at: http://clf.unige.ch/display.php?idFichier=168 [Accessed 28 January 2010].
Coulthard, M. (2004). Author Identification, Idiolect and Linguistic Uniqueness. Applied Linguistics
[Online] 25 (4), pp. 431-447. Available at:
http://www.business-english.ch/downloads/Malcolm%20Coulthard/AppLing.art.final.pdf [Accessed 27 January 2010].
Coulthard, M. & Johnson, A. (2007). An Introduction to Forensic Linguistics: Language in Evidence.
Abingdon: Routledge.
De Vel, O. (2001). Multi-Topic E-mail Authorship Attribution Forensics. In: ACM Conference on
Computer Security – Workshop on data mining for security applications. November 8,
2001. Philadelphia, PA [Online]. Available at:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9951&rep=rep1&type=pdf [Accessed
31 August 2010].
Grant, T. (2007). Quantifying Evidence in Forensic Authorship Analysis. The International Journal of
Speech, Language and the Law [Online] 14 (1), pp. 1-25. Available at:
http://www.equinoxjournals.com/IJSLL/article/view/3955/2428 [Accessed 12 July 2010].
Grant, T. (2008). Dr Tim Grant: How text-messaging slips can help catch murderers. The
Independent [Online]. (Last updated 9 September 2009). Available at:
http://www.independent.co.uk/opinion/commentators/dr-tim-grant-how-textmessaging-slips-can-help-catch-murderers-923503.html
[Accessed 11 September 2010].
Grant, T. D. (2010). Txt 4n6: idiolect free authorship analysis? In: Routledge Handbook of Forensic
Linguistics. Abingdon: Routledge.
Grant, T. & Baker, K. (2001). Identifying reliable, valid markers of authorship: a response to Chaski.
The International Journal of Speech, Language and the Law [Online] 8 (1), pp. 66-79. Available at:
http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1691/1150 [Accessed 12 June
2008].
Holmes, D. I. & Forsyth, R. S. (1995). The Federalist Revisited: New Directions in Authorship
Attribution. Literary and Linguistic Computing [Online] 10 (2), pp. 111-127. Available from:
http://llc.oxfordjournals.org/cgi/reprint/10/2/111 [Accessed 1 August 2010] .
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Mitchell, E. (2008). The Case for Forensic Linguistics. BBC News [Online]. (Last updated 8
September 2008). Available at: http://news.bbc.co.uk/1/hi/sci/tech/7600769.stm [Accessed 11
September 2010]
Rudman, J. (1998). The State of Authorship Attribution Studies: Some Problems and Solutions. Computers and
the Humanities [Online] 31, pp. 351–365. Available from:
http://www.springerlink.com/content/l023q7047388133x/fulltext.pdf
[Accessed 2 August 2010].
Websites:
Textstat
http://neon.niederlandistik.fu-berlin.de/textstat/
T-test Calculator
http://www.graphpad.com/quickcalcs/OneSampleT1.cfm
T-Tables
http://www.statsoft.com/textbook/distribution-tables/#t
Editor's Notes
Now in the UK – the expert, not the method; becoming more like the US.
Two particularly interesting aspects, and also areas of concern, in my research were the text length and the number of texts. As repeatedly pointed out by Coulthard (2004), Coulthard and Johnson (2007) and Chaski (2001), the texts that forensic experts usually work with are very few and short: around 100-400 words. In her 2001 study, Chaski used texts that varied in length between 93 and 556 words. However, Wallace and Mosteller in their study on the Federalist Papers looked at 85 essays, each 900 to 3,500 words in length. The number of texts matters as much as their length does: e.g. Grant (2007) examines 63 texts (3 authors with 21 texts per author). This study examines 20 texts (18 K (known) ones and 2 Q (query) ones), which also differ in length (see Table 5 for details). Although they are longer than usual forensic texts, they are relatively few. This may make the analysis more difficult, as Grant's (2007) analysis breaks down when he reduces the number of texts per author. However, this study attempts to replicate, at least in part, a forensic experiment, and the research difficulties may be seen as real-world challenges.