All the Text’s a Stage; And All the
 Function Words Merely Players?

  Statistical Analysis of Authorship

           Vlad Mackevic
          Aston University
Work of a Modern Forensic Linguist
Playing detective?

In forensic science – investigators look for clues
that the culprit leaves unwittingly;
In linguistics – ‘unconscious language’,
i.e. function words (de Vel, 2001; Argamon
& Levitan, 2005; Burrows, 2003)
Rather old idea (Mosteller & Wallace, 1964);
revisited in Holmes & Forsyth (1995).
Authorship analysis using function words forensic linguistics
Advantages of Function Words in FL

‘Unconscious language’

Numerous even in a relatively short text.

Can be easily counted
      Related to the Daubert Criteria
      Enables corpus analysis (Key Words in Context)
The Daubert Criteria

1. The theory must have been tested;
2. It must have been subjected to peer review and
publication;
3. It must have a known error rate;
4. It must be generally accepted in the scientific community.
                 (Tiersma & Solan, 2002, cited in Coulthard,
                 2004; Chaski, 1997; Grant, 2007)
Implications for linguists

Increased pressure on linguists to use
mathematical methods and repeatable procedures;
Forensic linguists must serve justice;
‘Beyond reasonable doubt’ in criminal cases
(Grant, 2010)
‘Raise legitimate doubt’ in civil cases (ibid.)

The method is King, not the expert.
It is ‘a challenge to the academic community to
test the error rate and at the same time to fix an
acceptable statistical equivalent for “beyond
reasonable doubt”’
                            Coulthard (2004: 476)

It is ‘the linguist’s responsibility to create
theoretically sound hypotheses’ and test them
                                    Chaski (2001: 2)
Idiolect
Defined as the idiosyncratic use of dialect, idiolect
is a way of speaking (and, consequently, writing)
that is unique to each individual
                                     Chaski (1997).
‘the totality of the possible utterances of one
speaker at one time in using a language to
interact with one other speaker’
                          Bloch (1948, cited in Grieve,
                                           2007: 255).
Theory

Grant (2010) - two theoretical frameworks:
      Idiolect is linked to neuroscience
      The author is influenced by the language he/she
      is exposed to.
De Vel’s (2001) and Argamon & Levitan’s
(2005) claims about certain function words
being unconscious linguistic choices – also a
theory.
Theory (cont.)

Grant (2010):

‘simple detection of consistency and determination of
distinctiveness’ would be able to help practical
authorship analysis more than even a strong theory.
Hypotheses

The use of function words is unique to each
individual (could be limited by context or genre) -
idiolect;
The frequency of certain function words is an
authorship marker (e.g. Holmes & Forsyth, 1995);
The frequency of semantic roles that certain
function words play is also an authorship
marker.
Semantic Roles

A word’s semantic roles are the functions it performs
in the specific context of the sentence.

The words I analysed were AS, IT, THAT and
THERE

Criteria: frequency (corpus) and explicit multiple
meanings
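
The frequency criterion can be illustrated with a minimal Python sketch. It counts the four chosen words and normalises the counts per 1,000 running words. The normalisation, the tokeniser and the function name rate_per_1000 are assumptions made for illustration; the slides do not say how frequencies were computed. Semantic roles still have to be assigned by reading each concordance line.

```python
import re
from collections import Counter

# The four function words analysed in this study.
FUNCTION_WORDS = ("as", "it", "that", "there")

def rate_per_1000(text, words=FUNCTION_WORDS):
    """Return occurrences of each target word per 1,000 running words.

    Tokens are runs of ASCII letters, so "it's" is split into "it" and "s"
    and still counts as one occurrence of IT (a simplifying assumption).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000 * counts[w] / total for w in words}

sample = "It was there strictly for the tourists, as you can imagine."
print(rate_per_1000(sample))
```

Normalising by text length matters here because the texts of the two authors differ considerably in size.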
AS
Function                                   Examples

Start of time adjunct clause               As we approached the small hut;
                                           as I followed the masses
Fixed Phrase as [adj/adv] as               As easily as; as soon as, as well as

AS + Noun Phrase                           as a museum; as the red-light district

AS at the start of a manner adjunct        as you can imagine; as the locals do

AS could be replaced with because          big push for the Chinese people to learn English, as
                                           they have now made it mandatory in their schools
AS is used for comparison                  as if they knew we were on their turf;
                                           still as a board;
                                           the same as fall back in Chicago;
IT
Function                                       Examples

IT serves as a dummy subject

IT + [to be] + predicative + infinitive        It's hard to enjoy a festival the same way

IT + [to be] or other verb phrase (+           It turns out I'll be going to at least four
adj/noun phrase) + relative clause (that, if
etc.)
IT + [to be] + time reference                  it's time for Pendulum
IT (cont.)
Function                                   Examples

IT + seem/feel/any other perception verb   it stops feeling like Hannover

IT + [to be] + noun phrase                 it would have been a great day

IT refers to something mentioned before    We woke up early to catch the ferry and it
                                           couldn't have been easier.

IT is a part of a fixed phrase             We made it to Macau in less than 2 hours
THAT
Function                             Examples

THAT begins a subordinate clause     I also couldn't help but notice that when I
                                     looked toward the island
THAT could be replaced with which    It was the spot on the beach that was
                                     shaped like a triangle
THAT is a determiner                 That night, we all reconvened at the hotel
THERE
Function                          Examples


THERE serves as a dummy subject   there are a few longhaired dogs

THERE refers to a place           it was there strictly for the tourists
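
A first-pass split between the two roles of THERE could be sketched as below. This is only a rough heuristic, assuming that THERE followed directly by a form of "to be" or a similar verb is a dummy subject and that every other THERE refers to a place. The verb list and the name classify_there are illustrative and not part of the study, and every label would still need manual checking (for example, "over there is the ferry" would be mislabelled).

```python
import re

# THERE followed by a verb form typical of existential clauses, or a contraction.
# The verb list is illustrative, not exhaustive.
DUMMY_THERE = re.compile(
    r"\bthere(?:\s+(?:is|are|was|were|will|would|has|have|had|seems?)\b"
    r"|'(?:s|re|ll|d)\b)",
    re.IGNORECASE,
)
ANY_THERE = re.compile(r"\bthere\b", re.IGNORECASE)

def classify_there(sentence):
    """Label each THERE in the sentence as 'dummy subject' or 'place'."""
    labels = []
    for m in ANY_THERE.finditer(sentence):
        is_dummy = DUMMY_THERE.match(sentence, m.start()) is not None
        labels.append("dummy subject" if is_dummy else "place")
    return labels

print(classify_there("there are a few longhaired dogs"))
print(classify_there("it was there strictly for the tourists"))
```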
My Dataset
                                Author A                              Author B
   Type of text                 Travel Blog                           Travel Blog
   Gender (self-declared)       Female                                Male
   Mother tongue and variety    English (American)                    English (perhaps Irish)
   (self-declared)
   Website URL (data source)    http://www.travelblog.org             http://www.getjealous.com

                 Size of K corpus (K = known texts; Q = questioned text)

              9 texts           7 texts           5 texts           3 texts           Q text

Author A               20,875         16,118            11,024                6,260            2,479

Author B                7,991             6,176             4,241             2,611             750
Methodology
Texts were imported into TEXTSTAT concordance software;
The words AS, IT, THAT and THERE were chosen for their
explicit, diverse meanings in the sentence;
Quantitative analysis was used to determine how different
(or similar) the authors were in their frequency of
use of function words and of their meanings;
The number of texts was reduced to see whether the
analysis breaks down at some point (compare Grant, 2007);
Statistical technique used: t-test
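
The comparison step might look like the following sketch. The slides name only a "t-test"; the one-sample variant is an assumption, suggested by the one-sample calculator listed under Websites in the references. The marker rates are invented for illustration, and the critical values are the standard two-tailed 5% values for the degrees of freedom that correspond to 3, 5, 7 and 9 known texts. How the study turned a test result into a PSA value is not stated, so the sketch stops at the significance decision.

```python
from math import sqrt
from statistics import mean, stdev

# Two-tailed critical t values at the 5% level, from standard t-tables,
# keyed by degrees of freedom (n - 1 for n = 3, 5, 7 and 9 known texts).
T_CRIT_05 = {2: 4.303, 4: 2.776, 6: 2.447, 8: 2.306}

def one_sample_t(known_rates, q_rate):
    """t statistic for H0: the mean rate in the K texts equals the Q-text rate."""
    n = len(known_rates)
    return (mean(known_rates) - q_rate) / (stdev(known_rates) / sqrt(n))

def consistent_with_q(known_rates, q_rate):
    """True if the Q-text rate does not differ significantly at the 5% level."""
    df = len(known_rates) - 1
    return abs(one_sample_t(known_rates, q_rate)) < T_CRIT_05[df]

# Hypothetical rates of one marker per 1,000 words in five known texts.
known = [8.0, 9.0, 10.0, 11.0, 12.0]
print(round(one_sample_t(known, 10.5), 3))
print(consistent_with_q(known, 10.5))
print(consistent_with_q(known, 14.0))
```

With fewer known texts the critical value rises sharply (4.303 for three texts against 2.306 for nine), which is one reason a test of this kind loses power as the corpus shrinks. The sketch also fails if all known rates are identical, because the standard deviation is then zero.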
Matrix of Probabilities
Application                      PSA values            Meaning
Clustering                       PSA > 90%             Success
Clustering and Differentiating   PSA ≥ 95%             ‘Beyond Reasonable Doubt’
Differentiating                  PSA < 85%             Definite Failure (error rate at
                                                       15% causes reasonable doubt).
Clustering and Differentiating   PSA > 50%             Balance of probabilities –
                                                       suitable for civil court.

              PSA = probability of same authorship
              Clustering = the author of both texts is likely to be the same
              person
              Differentiating = texts were written by different authors
              Beyond reasonable doubt: 95%
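
The matrix can be written down as a small lookup, sketched here in Python. The rows are copied literally from the table above and their thresholds overlap, so one PSA value can satisfy several rows. The function name meanings and the list-of-rows representation are assumptions for illustration; the sketch adds no interpretation beyond what the table states.

```python
# Each row of the matrix: (applications, condition on PSA, meaning).
# PSA is expressed as a fraction between 0 and 1.
MATRIX = [
    ({"clustering"}, lambda p: p > 0.90, "Success"),
    ({"clustering", "differentiating"}, lambda p: p >= 0.95, "Beyond reasonable doubt"),
    ({"differentiating"}, lambda p: p < 0.85, "Definite failure"),
    ({"clustering", "differentiating"}, lambda p: p > 0.50, "Balance of probabilities"),
]

def meanings(psa, application):
    """Return the meaning of every matrix row that applies to this PSA value."""
    return [label for apps, test, label in MATRIX
            if application in apps and test(psa)]

print(meanings(0.96, "clustering"))
print(meanings(0.60, "differentiating"))
```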
Findings: T-Test

Clustering                  Discriminating
Analysing each marker       Analysing each marker
of one author               of one author
against the values of       against the values of
that marker in the Q text   that marker in the Q text
by the same author          by the other author
How likely is that person   How likely is it that the
to have produced the        K and Q texts were
text?                       produced by the same
                            person?
Findings: Reliability of markers
         All texts by one author compared against each other
         Every semantic role of each function word was included
         Special attention: success of the test depends on the amount
         of text
         Not all markers are reliable; their frequency can be too
         low in a short text
Marker             Clustering                  Discrimination
AS                 Very inconsistent           Consistent
IT                 Very consistent             Very consistent
THAT               Depends on the amount of    Depends on the amount of
                   text (A: yes; B: no)        text (A: yes; B: no)
THERE              Very consistent             Very consistent
T-Test: Success
                       BRD = Beyond Reasonable Doubt: 95% or more

Function  Function                                    Clustering                  Discriminating
Word
                                                      A             B
AS        Start of time adjunct clause                FAIL   YES    BRD    NO     BRD
          Fixed Phrase as [adj/adv] as                BRD    FAIL   FAIL   YES    BRD
          AS + Noun Phrase                            FAIL   BRD    YES    YES    NO
          AS at the start of a manner adjunct         FAIL   YES    BRD    N/A    NO
          AS could be replaced with because           BRD    BRD    N/A    N/A    N/A
          AS is used for comparison                   YES    BRD    BRD    FAIL   NO
Function  Function                                    Clustering                  Discriminating
Word
                                                      A             B
IT        Dummy subject                               YES    YES    BRD    FAIL   BRD
          Dummy subject at the start of the sentence  FAIL   FAIL   FAIL   FAIL   NO
THAT      That begins a subordinate clause            BRD    YES    FAIL   FAIL   NO
          That could be replaced with which           FAIL   FAIL   BRD    BRD    BRD
          That is a determiner                        FAIL   FAIL   FAIL   YES    BRD
THERE     Dummy subject                               YES    BRD    N/A    FAIL   NO
          Dummy subject at the start of the sentence  FAIL   FAIL   N/A    FAIL   BRD
Results
Marker   Success Failure   Explanation
AS       50%      33.33%   A fairly reliable marker. Would do in civil court.
IT       80%      20%      The most reliable marker in this study.
                           IT at the start of the sentence has no linguistic
                           theory behind it, and failure was expected.
THAT     46.67%   53.33%   Also in Mackevic (2011):
                           “Very unreliable across all authors – enormous
                           error rates; PSA shooting over 50% most of the
                           time. ”
THERE    30%      50%      Marker totally unreliable.
Discussion of Results
Most of the markers are much better at
discriminating than at clustering;
A lot depends on the text’s length: when I
started removing texts from the corpus (9, then 7,
then 5 and finally 3), the analysis began breaking
down;
        6,000 words for the reference corpus is an
        approximate benchmark.
Possible conclusion: function words work
better for longer texts, which also occur in
forensic settings.
Why did T-test fail?

Possible explanation: some markers occurred very rarely
They had little linguistic significance (no theory behind them)
Analysis broke down even with very consistent markers. Why?
Possibly because the amount of text (number of words)
was insufficient
        For comparison: Grant (2010) also reports his
        analysis breaking down when the amount of text is
        reduced
  Perhaps qualitative analysis is better for shorter texts
        But it works against the Daubert Criteria
Recommendations

Use grammar reference books for the semantic roles of
function words and for a more detailed division of
roles
Choose different words (look at what worked for other
authors)
Try more texts, but short ones (e.g. 50 texts of 400
words each)
Try more statistical techniques
Conclusion
Function words – potentially another tool in a forensic
linguist’s toolbox
T-test – a good analytical tool;
It returns exact results with certain error rates that are
easy to interpret (consistent with the Daubert Criteria)
However, it also has some limitations, and additional
analysis may be needed to complete the picture
The t-test works better for discriminating than for
clustering
Analysis breaks down with small corpora
References
NB: The references are from the original paper; some authors present in this
              list may not have been cited in the presentation
  Books and Journals

  Argamon, S. & Levitan, S. (2005) Measuring the Usefulness of Function Words for Authorship
  Attribution [Online]. Available at:
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.6935&rep=rep1&type=pdf [Accessed
  12 September 2010]
  Burrows, J. (2003). Questions of Authorship: Attribution and Beyond. Computers and the Humanities
  [Online] 37, pp. 5-23. Available from: http://www.springerlink.com/content/nv46t75125472350/
  [Accessed 1 August 2010].
  Chaski, C. E. (1997). Who Wrote It? Steps Towards a Science of Authorship Identification. National
  Institute of Justice Journal. (September Issue) [Online]. Available from:
  http://www.ncjrs.gov/pdffiles/jr000233.pdf [Accessed 31 January 2010].
  Chaski, C. E. (2001). Empirical evaluations of language-based author identification techniques. The
  International Journal of Speech, Language and the Law [Online] 8 (1), pp. 1-65. Available from:
  http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1690/1151 [Accessed 12 June
  2008].
  Chaski, C. E. (2005). Who’s at the Keyboard? Authorship Attribution in Digital Evidence
  Investigations. International Journal of Digital Evidence [Online] 4 (1), pp. 1-14. Available from:
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.3852&rep=rep1&type=pdf [Accessed
  31 January 2010].
Coulthard, M. (1998). Identifying the Author. Cahiers de Linguistique Française [Online] 20, pp. 139-
161. Available at: http://clf.unige.ch/display.php?idFichier=168 [Accessed 28 January 2010].
Coulthard, M. (2004). Author Identification, Idiolect and Linguistic Uniqueness. Applied Linguistics
[Online] 25 (4), pp. 431-447. Available at: http://www.business-
english.ch/downloads/Malcolm%20Coulthard/AppLing.art.final.pdf [Accessed 27 January 2010].
Coulthard, M. & Johnson, A. (2007). An Introduction to Forensic Linguistics: Language in Evidence.
Abingdon: Routledge.
De Vel, O. (2001). Multi-Topic E-mail Authorship Attribution Forensics. In: ACM Conference on
Computer Security – Workshop on data mining for security applications. November 8,
2001. Philadelphia, PA [Online]. Available at:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9951&rep=rep1&type=pdf [Accessed
31 August 2010].
Grant, T. (2007). Quantifying Evidence in Forensic Authorship Analysis. The International Journal of
Speech, Language and the Law [Online] 14 (1), pp. 1-25. Available at:
http://www.equinoxjournals.com/IJSLL/article/view/3955/2428 [Accessed 12 July 2010].
Grant, T. (2008). Dr Tim Grant: How text-messaging slips can help catch murderers. The
Independent [Online]. (Last updated 9 September 2009). Available at:
http://www.independent.co.uk/opinion/commentators/dr-tim-grant-how-textmessaging-slips-can-
help-catch-murderers-923503.html [Accessed 11 September 2010].
Grant, T. D. (2010). Txt 4n6: idiolect free authorship analysis? In: Routledge Handbook of Forensic
Linguistics. Abingdon: Routledge.
De Vel, O. (2001). Multi-Topic E-mail Authorship Attribution Forensics. In: ACM Conference on
Computer Security – Workshop on data mining for security applications. November 8,
2001.Phildelphia, PA [Online]. Available at:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9951&rep=rep1&type=pdf [Accessed
31 August 2010].
De Vel, O. (2001). Multi-Topic E-mail Authorship Attribution Forensics. In: ACM Conference on
Computer Security – Workshop on data mining for security applications. November 8,
2001.Phildelphia, PA [Online]. Available at:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9951&rep=rep1&type=pdf [Accessed
31 August 2010].
Grant, T. (2007). Quantifying Evidence in Forensic Authorship Analysis. The international Journal of
Speech, Language and the Law [Online] 14 (1), pp. 1-25. Available at:
http://www.equinoxjournals.com/IJSLL/article/view/3955/2428 [Accessed 12 July 2010].
Grant, T. (2008). Dr Tim Grant: How text-messaging slips can help catch murderers. The
Independent [Online]. (Last updated 9 September 2009). Available at:
http://www.independent.co.uk/opinion/commentators/dr-tim-grant-how-textmessaging-slips-can-
help-catch-murderers-923503.html [Accessed 11 September 2010].
Grant, T. D. (2010). Txt 4n6: idiolect free authorship analysis? In: Roultledge Handbook of Forensic
Lingusitics. Abingdon: Routledge
Grant, T. & Baker, K. (2001). Identifying reliable, valid markers of authorship: a response to Chaski.
The International Journal of Speech, Language and the Law [Online] 8 (1), pp. 66-79. Available at:
http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1691/1150 [Accessed 12 June
2008].
Holmes, D. I. & Forsyth, R. S. (1995). The Federalist Revisited: New Directions in Authorship
Attribution. Literary and Linguistic Computing [Online] 10 (2), pp. 111-127. Available from:
http://llc.oxfordjournals.org/cgi/reprint/10/2/111 [Accessed 1 August 2010].
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Mitchell, E. (2008). The Case for Forensic Linguistics. BBC News [Online]. (Last updated 8
September 2008). Available at: http://news.bbc.co.uk/1/hi/sci/tech/7600769.stm [Accessed 11
September 2010].
Rudman, J. (1998). The State of Authorship Attribution Studies: Some Problems and Solutions. Computers and
the Humanities [Online] 31, pp. 351–365. Available from:
http://www.springerlink.com/content/l023q7047388133x/fulltext.pdf
[Accessed 2 August 2010].

Websites:

Textstat
http://neon.niederlandistik.fu-berlin.de/textstat/
T-test Calculator
http://www.graphpad.com/quickcalcs/OneSampleT1.cfm
T-Tables
http://www.statsoft.com/textbook/distribution-tables/#t

Mais conteúdo relacionado

Mais procurados

Forensic linguistics
Forensic linguisticsForensic linguistics
Forensic linguisticsAbbou Zohra
 
Discourse analysis (Linguistics Forms and Functions)
Discourse analysis (Linguistics Forms and Functions)Discourse analysis (Linguistics Forms and Functions)
Discourse analysis (Linguistics Forms and Functions)Satya Permadi
 
Sociolinguistics_April 15th, 2019
Sociolinguistics_April 15th, 2019Sociolinguistics_April 15th, 2019
Sociolinguistics_April 15th, 2019ilhamseptian02
 
Language maintenance and shift
Language maintenance and shift Language maintenance and shift
Language maintenance and shift Farah Nadia
 
A HISTORICAL DEVELOPMENT OF CONTRASTIVE ANALYSIS: A RELEVANT REVIEW IN SECOND...
A HISTORICAL DEVELOPMENT OF CONTRASTIVE ANALYSIS: A RELEVANT REVIEW IN SECOND...A HISTORICAL DEVELOPMENT OF CONTRASTIVE ANALYSIS: A RELEVANT REVIEW IN SECOND...
A HISTORICAL DEVELOPMENT OF CONTRASTIVE ANALYSIS: A RELEVANT REVIEW IN SECOND...ijejournal
 
Language death and language loss
Language death and language lossLanguage death and language loss
Language death and language lossDesi Puspitasariku
 
Language standardization: How and why
Language standardization: How and whyLanguage standardization: How and why
Language standardization: How and whyadm-2012
 
Code Switching, Types and Reasons
Code Switching, Types and ReasonsCode Switching, Types and Reasons
Code Switching, Types and ReasonsSohail Khan
 
Stylistics - Norm and Deviation.
Stylistics - Norm and Deviation.Stylistics - Norm and Deviation.
Stylistics - Norm and Deviation.AleeenaFarooq
 
Mutual intelligibility
Mutual intelligibilityMutual intelligibility
Mutual intelligibilityMuslimah Alg
 
Forensic linguistics ppt by roshna
Forensic linguistics ppt by roshnaForensic linguistics ppt by roshna
Forensic linguistics ppt by roshnaG.P.G.C Mardan
 
Sociolinguistics
SociolinguisticsSociolinguistics
SociolinguisticsAlicia Ruiz
 

Mais procurados (20)

Contrastive rhetoric
Contrastive rhetoricContrastive rhetoric
Contrastive rhetoric
 
Chapter 6 Language and Politics
Chapter 6 Language and PoliticsChapter 6 Language and Politics
Chapter 6 Language and Politics
 
Forensic linguistics
Forensic linguisticsForensic linguistics
Forensic linguistics
 
Discourse analysis (Linguistics Forms and Functions)
Discourse analysis (Linguistics Forms and Functions)Discourse analysis (Linguistics Forms and Functions)
Discourse analysis (Linguistics Forms and Functions)
 
Forensic Linguistics
Forensic LinguisticsForensic Linguistics
Forensic Linguistics
 
Sociolinguistics_April 15th, 2019
Sociolinguistics_April 15th, 2019Sociolinguistics_April 15th, 2019
Sociolinguistics_April 15th, 2019
 
Presentation.
Presentation.Presentation.
Presentation.
 
Language maintenance and shift
Language maintenance and shift Language maintenance and shift
Language maintenance and shift
 
Test Usefulness
Test UsefulnessTest Usefulness
Test Usefulness
 
A HISTORICAL DEVELOPMENT OF CONTRASTIVE ANALYSIS: A RELEVANT REVIEW IN SECOND...
A HISTORICAL DEVELOPMENT OF CONTRASTIVE ANALYSIS: A RELEVANT REVIEW IN SECOND...A HISTORICAL DEVELOPMENT OF CONTRASTIVE ANALYSIS: A RELEVANT REVIEW IN SECOND...
A HISTORICAL DEVELOPMENT OF CONTRASTIVE ANALYSIS: A RELEVANT REVIEW IN SECOND...
 
Intro to-stylistics
Intro to-stylisticsIntro to-stylistics
Intro to-stylistics
 
Language death and language loss
Language death and language lossLanguage death and language loss
Language death and language loss
 
Stylistics
Stylistics Stylistics
Stylistics
 
Language standardization: How and why
Language standardization: How and whyLanguage standardization: How and why
Language standardization: How and why
 
Code Switching, Types and Reasons
Code Switching, Types and ReasonsCode Switching, Types and Reasons
Code Switching, Types and Reasons
 
Forensic Linguistics
Forensic LinguisticsForensic Linguistics
Forensic Linguistics
 
Stylistics - Norm and Deviation.
Stylistics - Norm and Deviation.Stylistics - Norm and Deviation.
Stylistics - Norm and Deviation.
 
Mutual intelligibility
Mutual intelligibilityMutual intelligibility
Mutual intelligibility
 
Forensic linguistics ppt by roshna
Forensic linguistics ppt by roshnaForensic linguistics ppt by roshna
Forensic linguistics ppt by roshna
 
Sociolinguistics
SociolinguisticsSociolinguistics
Sociolinguistics
 

Destaque

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
Forensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical ApplicationsForensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical Applicationsdahveed123
 
forensic linguistics
forensic linguisticsforensic linguistics
forensic linguisticsGhazal Parsi
 
The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...yosra Yassora
 
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...osify
 
Co authorship and attribution
Co authorship and attributionCo authorship and attribution
Co authorship and attributionJenny Delasalle
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...Maarten van Wesel
 
My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...Ahmed Mater
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaTraian Rebedea
 
Scientometrics and semantic maps for development (Author: Iina Hellsten)
Scientometrics and semantic maps for development (Author: Iina Hellsten)Scientometrics and semantic maps for development (Author: Iina Hellsten)
Scientometrics and semantic maps for development (Author: Iina Hellsten)Sarah Cummings
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarismguestf17a2e
 
Police and veteran training
Police and veteran trainingPolice and veteran training
Police and veteran trainingEddie Black
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk Vijay Ganti
 
Training programme for Community Police Officers
Training programme for Community Police OfficersTraining programme for Community Police Officers
Training programme for Community Police OfficersPeople's Trust, Jaipur
 
Police Stress and Trauma Paper
Police Stress and Trauma PaperPolice Stress and Trauma Paper
Police Stress and Trauma PaperMeghan Mohon
 
Plagiarism and its detection
Plagiarism and its detectionPlagiarism and its detection
Plagiarism and its detectionankit_saluja
 

Destaque (20)

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Forensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical ApplicationsForensic Linguistics:The Practical Applications
Forensic Linguistics:The Practical Applications
 
forensic linguistics
forensic linguisticsforensic linguistics
forensic linguistics
 
The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...The routledge handbook of forensic linguistics routledge handbooks in applied...
The routledge handbook of forensic linguistics routledge handbooks in applied...
 
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
Support Vector Machine (SVM) Based Classifier For Khmer Printed Character-set...
 
Co authorship and attribution
Co authorship and attributionCo authorship and attribution
Co authorship and attribution
 
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...Choosing The Right Tool For The Job; How  Maastricht  University Is Selecting...
Choosing The Right Tool For The Job; How Maastricht University Is Selecting...
 
Term Paper
Term PaperTerm Paper
Term Paper
 
My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...My Graduation Project Documentation: Plagiarism Detection System for English ...
My Graduation Project Documentation: Plagiarism Detection System for English ...
 
Automatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corporaAutomatic plagiarism detection system for specialized corpora
Automatic plagiarism detection system for specialized corpora
 
Plag detection
Plag detectionPlag detection
Plag detection
 
Scientometrics and semantic maps for development (Author: Iina Hellsten)
Scientometrics and semantic maps for development (Author: Iina Hellsten)Scientometrics and semantic maps for development (Author: Iina Hellsten)
Scientometrics and semantic maps for development (Author: Iina Hellsten)
 
Using Technology To Detect Plagiarism
Using Technology To Detect PlagiarismUsing Technology To Detect Plagiarism
Using Technology To Detect Plagiarism
 
Police and veteran training
Police and veteran trainingPolice and veteran training
Police and veteran training
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk
 
Training programme for Community Police Officers
Training programme for Community Police OfficersTraining programme for Community Police Officers
Training programme for Community Police Officers
 
Police Stress and Trauma Paper
Police Stress and Trauma PaperPolice Stress and Trauma Paper
Police Stress and Trauma Paper
 
Plagiarism and its detection
Plagiarism and its detectionPlagiarism and its detection
Plagiarism and its detection
 
police officers
police officerspolice officers
police officers
 
Criminal profiling
Criminal profilingCriminal profiling
Criminal profiling
 

Authorship analysis using function words forensic linguistics

  • 1. All the Text's a Stage; And All the Function Words Merely Players? Statistical Analysis of Authorship. Vlad Mackevic, Aston University
  • 2. Work of a Modern Forensic Linguist
  • 3. Playing detective? In forensic science – investigators look for clues that the culprit leaves unwittingly; In linguistics – 'unconscious language', i.e. Function Words (de Vel, 2001; Argamon & Levitan, 2005; Burrows, 2003). Rather old idea (Wallace & Mosteller, 1964); revisited in Holmes & Forsyth (1995).
  • 5. Advantages of Function Words in FL
      - 'Unconscious language'
      - Numerous even in a relatively short text
      - Can be easily counted: related to the Daubert Criteria; enables corpus analysis (Key Words in Context)
  • 6. The Daubert Criteria 1. The theory must have been tested; 2. It must have been subjected to peer review and publication; 3. It must have a known error rate; 4. It must be generally accepted in the scientific community. (Tiersma & Solan, 2002, cited in Coulthard, 2004; Chaski, 1997; Grant, 2007)
  • 7. Implications for linguists: Increased pressure on the linguists to use mathematical methods, repeatable procedures; Forensic linguists must serve justice; 'Beyond reasonable doubt' in criminal cases (Grant, 2010); 'Raise legitimate doubt' in civil cases (ibid.). The method is King, not the expert.
  • 8. It is 'a challenge to the academic community to test the error rate and at the same time to fix an acceptable statistical equivalent for "beyond reasonable doubt"' (Coulthard, 2004: 476). It is 'the linguist's responsibility to create theoretically sound hypotheses' and test them (Chaski, 2001: 2).
  • 9. Idiolect: Defined as the idiosyncratic use of dialect, idiolect is a way of speaking (and, consequently, writing) that is unique for each individual (Chaski, 1997). 'the totality of the possible utterances of one speaker at one time in using a language to interact with one other speaker' (Bloch, 1948, cited in Grieve, 2007: 255).
  • 10. Theory: Grant (2010) - two theoretical frameworks: idiolect is linked to neuroscience; the author is influenced by the language he/she is exposed to. De Vel's (2001) and Argamon & Levitan's (2005) claims about certain function words being unconscious linguistic choices – also a theory.
  • 11. Theory (cont.) Grant (2010): 'simple detection of consistency and determination of distinctiveness' would be able to help practical authorship analysis more than even a strong theory.
  • 12. Hypotheses The use of function words is unique to each individual (could be limited by context or genre) - idiolect; The frequency of certain function words is an authorship marker (e.g. Holmes & Forsyth, 1995); The frequency of semantic roles that certain function words play is also an authorship marker.
  • 13. Semantic Roles: Semantic roles are the word's functions in the specific context of the sentence. The words I analysed were AS, IT, THAT and THERE. Criteria: frequency (corpus) and explicit multiple meanings.
  • 14. AS (Function: Examples)
      - Start of time adjunct clause: As we approached the small hut; as I followed the masses
      - Fixed phrase as [adj/adv] as: As easily as; as soon as, as well as
      - AS + noun phrase: as a museum; as the red-light district
      - AS at the start of a manner adjunct: as you can imagine; as the locals do
      - AS could be replaced with because: big push for the Chinese people to learn English, as they have now made it mandatory in their schools
      - AS is used for comparison: as if they knew we were on their turf; still as a board; the same as fall back in Chicago
  • 15. IT (Function: Examples)
      IT serves as a dummy subject:
      - IT + [to be] + predicament + infinitive: It's hard to enjoy a festival the same way
      - IT + [to be] or other verb phrase (+ adj/noun phrase) + relative clause (that, if etc.): It turns out I'll be going to at least four
      - IT + [to be] + time reference: it's time for Pendulum
  • 16. IT (cont.) (Function: Examples)
      - IT + seem/feel/any other perception verb: it stops feeling like Hannover
      - IT + [to be] + noun phrase: it would have been a great day
      - IT refers to something mentioned before: We woke up early to catch the ferry and it couldn't have been easier.
      - IT is a part of a fixed phrase: We made it to Macau in less than 2 hours
  • 17. THAT (Function: Examples)
      - THAT begins a subordinate clause: I also couldn't help but notice that when I looked toward the island
      - THAT could be replaced with which: It was the spot on the beach that was shaped like a triangle
      - THAT is a determiner: That night, we all reconvened at the hotel
  • 18. THERE (Function: Examples)
      - THERE serves as a dummy subject: there are a few longhaired dogs
      - THERE refers to a place: it was there strictly for the tourists
  • 19. My Dataset
      Feature | Author A | Author B
      Type of text | Travel Blog | Travel Blog
      Gender (self-declared) | Female | Male
      Mother tongue and variety (self-declared) | English (American) | English (perhaps Irish)
      Website URL - the data source | http://www.travelblog.org | http://www.getjealous.com
      Size of K corpus | 9 texts | 7 texts | 5 texts | 3 texts | Q text
      Author A | 20,875 | 16,118 | 11,024 | 6,260 | 2,479
      Author B | 7,991 | 6,176 | 4,241 | 2,611 | 750
  • 20. Methodology Texts were imported into TEXTSTAT concordance software; Words AS, IT, THAT and THERE were chosen for their explicit diverse meanings in the sentence; Quantitative analysis was used to determine how different (or similar) the authors were in terms of their frequency of use of function words and their meanings; The number of texts was reduced to see if at some point analysis breaks down (compare to Grant, 2007); Statistical technique used – T-TEST
  • 21. Matrix of Probabilities
      Application | PSA values | Meaning
      Clustering | PSA > 90% | Success
      Clustering and Differentiating | PSA ≥ 95% | ‘Beyond Reasonable Doubt’
      Differentiating | PSA < 85% | Definite Failure (error rate at 15% causes reasonable doubt)
      Clustering and Differentiating | PSA > 50% | Balance of probabilities – suitable for civil court
      PSA = probability of same authorship
      Clustering = the author of both texts is likely to be the same person
      Differentiating = texts were written by different authors
      Beyond reasonable doubt: 95%
  • 22. Findings: T-Test
      - Clustering: analysing each marker of one author against the values of that marker in the Q text by the same author. How likely is that person to have produced the Q text?
      - Discriminating: analysing each marker of one author against the values of that marker in the Q text by the other author. How likely is it that the K and Q texts have been produced by the same person?
  • 23. Findings: Reliability of markers
      - All texts by one author compared against each other
      - Every semantic role of each function word was included
      - Special attention: success of the test depends on the amount of text
      - Not all markers are reliable; their frequency can be too low in a short text
      Marker | Clustering | Discrimination
      AS | Very inconsistent | Consistent
      IT | Very consistent | Very consistent
      THAT | Depends on the amount of text (A - yes; B - no) | Depends on the amount of text (A - yes; B - no)
      THERE | Very consistent | Very consistent
  • 24. T-Test: Success. Beyond Reasonable Doubt (BRD): 95% or more
      Function Word | Function | Clustering (A, B) | Discriminating
      AS | Start of time adjunct clause | FAIL YES BRD NO BRD
      AS | Fixed phrase as [adj/adv] as | BRD FAIL FAIL YES BRD
      AS | AS + noun phrase | FAIL BRD YES YES NO
      AS | AS at the start of a manner adjunct | FAIL YES BRD N/A NO
      AS | AS could be replaced with because | BRD BRD N/A N/A N/A
      AS | AS is used for comparison | YES BRD BRD FAIL NO
  • 25. T-Test: Success (cont.)
      Function Word | Function | Clustering (A, B) | Discriminating
      IT | Dummy subject | YES YES BRD FAIL BRD
      IT | Dummy subject at the start of the sentence | FAIL FAIL FAIL FAIL NO
      THAT | That begins a subordinate clause | BRD YES FAIL FAIL NO
      THAT | That could be replaced with which | FAIL FAIL BRD BRD BRD
      THAT | That is a determiner | FAIL FAIL FAIL YES BRD
      THERE | Dummy subject | YES BRD N/A FAIL NO
      THERE | Dummy subject at the start of the sentence | FAIL FAIL N/A FAIL BRD
  • 26. Results
      Marker | Success | Failure | Explanation
      AS | 50% | 33.33% | A fairly reliable marker. Would do in civil court.
      IT | 80% | 20% | The most reliable marker in this study. IT at the start of the sentence has no linguistic theory behind it, and failure was expected.
      THAT | 46.67% | 53.33% | Also in Mackevic (2011): “Very unreliable across all authors – enormous error rates; PSA shooting over 50% most of the time.”
      THERE | 30% | 50% | Marker totally unreliable.
  • 27. Discussion of Results: Most of the markers are much better at discriminating than at clustering. A lot depends on the text's length: when I started removing texts from the corpus (9, then 7, then 5 and finally 3), analysis began breaking down. 6000 words for the reference corpus – approximate benchmark. Possible conclusion: function words are really better for longer texts, which also occur in forensic settings.
  • 28. Why did T-test fail? Possible explanation: some markers occurred very rarely; they had little linguistic significance (no theory behind them). Analysis broke down with very consistent markers. Why? Possibly because the amount of text (number of words) was insufficient. For comparison: Grant (2010) also reports his analysis breaking down when the amount of text is reduced. Perhaps qualitative analysis is better for shorter texts, but it works against the Daubert Criteria.
  • 29. Recommendations: Use grammar reference books for semantic roles of function words and a more detailed division of roles; Choose different words (look at what worked for other authors); Try more texts, but short ones (e.g. 50 texts of 400 words each); Try more statistical techniques.
  • 30. Conclusion: Function words – potentially another tool in a forensic linguist's toolbox. T-Test – good analytical tool; it returns exact results with certain error rates that are easy to interpret (consistent with Daubert criteria). However, it also has some limitations and additional analysis may be needed to complete the picture. T-Test works with discriminating better than with clustering. Analysis breaks down with small corpora.
  • 31. References. NB: The references are from the original paper; some authors present in this list may not have been cited in the presentation. Books and Journals:
      - Argamon, S. & Levitan, S. (2005). Measuring the Usefulness of Function Words for Authorship Attribution [Online]. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.6935&rep=rep1&type=pdf [Accessed 12 September 2010].
      - Burrows, J. (2003). Questions of Authorship: Attribution and Beyond. Computers and the Humanities [Online] 37, pp. 5-23. Available from: http://www.springerlink.com/content/nv46t75125472350/ [Accessed 1 August 2010].
      - Chaski, C. E. (1997). Who Wrote It? Steps Towards a Science of Authorship Identification. National Institute of Justice Journal. (September Issue) [Online]. Available from: http://www.ncjrs.gov/pdffiles/jr000233.pdf [Accessed 31 January 2010].
      - Chaski, C. E. (2001). Empirical evaluations of language-based author identification techniques. The International Journal of Speech, Language and the Law [Online] 8 (1), pp. 1-65. Available from: http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1690/1151 [Accessed 12 June 2008].
      - Chaski, C. E. (2005). Who's at the Keyboard? Authorship Attribution in Digital Evidence Investigations. International Journal of Digital Evidence [Online] 4 (1), pp. 1-14. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.3852&rep=rep1&type=pdf [Accessed 31 January 2010].
  • 32. References (cont.)
      - Coulthard, M. (1998). Identifying the Author. Cahiers de Linguistique Française [Online] 20, pp. 139-161. Available at: http://clf.unige.ch/display.php?idFichier=168 [Accessed 28 January 2010].
      - Coulthard, M. (2004). Author Identification, Idiolect and Linguistic Uniqueness. Applied Linguistics [Online] 25 (4), pp. 431-447. Available at: http://www.business-english.ch/downloads/Malcolm%20Coulthard/AppLing.art.final.pdf [Accessed 27 January 2010].
      - Coulthard, M. & Johnson, A. (2007). An Introduction to Forensic Linguistics: Language in Evidence. Abingdon: Routledge.
      - De Vel, O. (2001). Multi-Topic E-mail Authorship Attribution Forensics. In: ACM Conference on Computer Security – Workshop on data mining for security applications. November 8, 2001. Philadelphia, PA [Online]. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9951&rep=rep1&type=pdf [Accessed 31 August 2010].
      - Grant, T. (2007). Quantifying Evidence in Forensic Authorship Analysis. The International Journal of Speech, Language and the Law [Online] 14 (1), pp. 1-25. Available at: http://www.equinoxjournals.com/IJSLL/article/view/3955/2428 [Accessed 12 July 2010].
      - Grant, T. (2008). Dr Tim Grant: How text-messaging slips can help catch murderers. The Independent [Online]. (Last updated 9 September 2009). Available at: http://www.independent.co.uk/opinion/commentators/dr-tim-grant-how-textmessaging-slips-can-help-catch-murderers-923503.html [Accessed 11 September 2010].
      - Grant, T. D. (2010). Txt 4n6: idiolect free authorship analysis? In: Routledge Handbook of Forensic Linguistics. Abingdon: Routledge.
  • 33. References (cont.)
      - Grant, T. & Baker, K. (2001). Identifying reliable, valid markers of authorship: a response to Chaski. The International Journal of Speech, Language and the Law [Online] 8 (1), pp. 66-79. Available at: http://www.equinoxjournals.com/ojs/index.php/IJSLL/article/view/1691/1150 [Accessed 12 June 2008].
      - Holmes, D. I. & Forsyth, R. S. (1995). The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing [Online] 10 (2), pp. 111-127. Available from: http://llc.oxfordjournals.org/cgi/reprint/10/2/111 [Accessed 1 August 2010].
      - Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
      - Mitchell, E. (2008). The Case for Forensic Linguistics. BBC News [Online]. (Last updated 8 September 2008). Available at: http://news.bbc.co.uk/1/hi/sci/tech/7600769.stm [Accessed 11 September 2010].
  • 34. References (cont.)
      - Rudman, J. (1998). The State of Authorship Attribution Studies: Some Problems and Solutions. Computers and the Humanities [Online] 31, pp. 351–365. Available from: http://www.springerlink.com/content/l023q7047388133x/fulltext.pdf [Accessed 2 August 2010].
      Websites:
      - Textstat: http://neon.niederlandistik.fu-berlin.de/textstat/
      - T-test Calculator: http://www.graphpad.com/quickcalcs/OneSampleT1.cfm
      - T-Tables: http://www.statsoft.com/textbook/distribution-tables/#t
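
As a rough illustration of the quantitative step described in slides 20 and 22, the sketch below counts how often a function word occurs per 1,000 words in each known (K) text and then computes a one-sample t statistic against the rate found in the query (Q) text. This is a minimal sketch under stated assumptions, not the author's actual procedure: the slides used TEXTSTAT and an online t-test calculator and also coded semantic roles by hand, whereas this code counts raw word forms only. The function names `rate_per_1000` and `one_sample_t` and the normalisation per 1,000 words are assumptions made for illustration.

```python
import math
import re
from statistics import mean, stdev


def rate_per_1000(text, marker):
    """Occurrences of `marker` per 1,000 word tokens (case-insensitive).

    Tokenisation is deliberately crude: runs of letters only, so "it's"
    becomes the tokens "it" and "s". Semantic roles are NOT distinguished.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return 1000.0 * tokens.count(marker.lower()) / len(tokens)


def one_sample_t(known_rates, query_rate):
    """Return (t, degrees_of_freedom) for H0: mean(known_rates) == query_rate.

    The t value still has to be looked up in a t-table (or passed to a
    statistics package) to obtain a probability.
    """
    n = len(known_rates)
    if n < 2:
        raise ValueError("need at least two known texts")
    sd = stdev(known_rates)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("known rates have zero variance; t is undefined")
    t = (mean(known_rates) - query_rate) / (sd / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Invented toy texts, standing in for K texts by one author and a Q text.
    known_texts = [
        "It was there that we made it to the ferry as the locals do.",
        "As we approached the hut it seemed that there was nobody home.",
        "That night it rained, and it was as cold as it gets.",
    ]
    query_text = "It turns out that it was easy, as you can imagine."
    for marker in ("as", "it", "that", "there"):
        k_rates = [rate_per_1000(t, marker) for t in known_texts]
        q_rate = rate_per_1000(query_text, marker)
        t_value, df = one_sample_t(k_rates, q_rate)
        print(f"{marker.upper():5s} t = {t_value:7.3f}, df = {df}")
```

With only a handful of K texts the degrees of freedom are very small, which is one way to see why the analysis in slide 27 becomes unstable as texts are removed from the corpus.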

Editor's Notes

  1. Now in the UK – the expert, not the method; becoming more like the US.
  2. Two particularly interesting aspects and also areas of concern in my research were the text length and the number of texts. As repeatedly pointed out by Coulthard (2004), Coulthard and Johnson (2007) and Chaski (2001), the texts that forensic experts usually work with are very few and short: around 100-400 words. In her 2001 study, Chaski used texts that varied in length between 93 and 556 words. However, Wallace and Mosteller in their study on the Federalist Papers looked at 85 essays, each 900 to 3,500 words in length. The number of texts matters as much as their length does: e.g. Grant (2007) examines 63 texts (3 authors with 21 texts per author). This study examines 20 texts (18 K (known) ones and 2 Q (query) ones), which also differ in length (see Table 5 for details). Although they are longer than usual forensic texts, they are relatively few. This may make the analysis more difficult, as Grant's (2007) analysis breaks down when he reduces the number of texts per author. However, this study attempts to replicate, at least in part, a forensic experiment, and research difficulties may be seen as real-world challenges.