Basic Definitions
Authorship Attribution– Reza Ramezani
Danielle Jones
• Last seen 18th June 2001.
• After her disappearance a series of text
messages were sent from her phone.
• Linguistic analysis showed that the later
messages were sent by her uncle,
Stuart Campbell.
• Campbell was convicted of Danielle’s
murder 19th December 2002 in part
because of the linguistic evidence.
Jenny Nicholl
• Last seen 30th June 2005.
• After her disappearance a series of
text messages were sent from her phone.
• Linguistic analysis showed that the
later messages were sent by her
classmate, David Hodgson.
• Hodgson was convicted
of Jenny’s murder 19th
February 2008 in part
because of the linguistic evidence.
Authorship Attribution
• Definition
– In the typical authorship attribution problem, a text of unknown authorship is
assigned to one candidate author, given a set of candidate authors for whom text
samples of undisputed authorship are available.
– From a machine learning point of view, this can be viewed as a multiclass, single-
label text-categorization task.
– This task is also called authorship (or author) identification.
• Idea
– The main idea behind authorship attribution is that, by measuring some textual
features, texts written by different authors can be distinguished.
– Authorship attribution is supported by statistical or computational methods.
– This scientific field takes advantage of research advances in areas such as machine
learning, information retrieval, and natural language processing.
Supervised Learning
• Representation
– Research in authorship attribution has been dominated by attempts to define features
for quantifying writing style, a line of research known as “stylometry”.
• Sentence length, word length, word frequencies, character frequencies, and
vocabulary richness functions
– Rudman (1998) estimated that nearly 1,000 different measures had been proposed.
Stylometric Features
• Lexical and Character Features
– Consider a text as a mere sequence of word-tokens or characters, respectively.
• Syntactic and Semantic Features
– Require deeper linguistic analysis
• Application-Specific Features
– Can be defined only in certain text domains or languages.
1. Lexical Features
• A simple and natural way to view a text is as a:
– Sequence of tokens grouped into sentences, with each token corresponding to a
word, number, or punctuation mark.
• Method
– The very first attempts to attribute authorship were based on simple measures
such as sentence length counts and word length counts.
• Advantage
– They can be applied to any language and any corpus with no additional
requirements except the availability of a tokenizer.
• The most straightforward approach to represent texts is by vectors of word
frequencies (vast majority of authorship attribution studies).
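The sentence-length and word-length measures mentioned above can be sketched in a few lines. This is only an illustration: the regex-based sentence splitter and word pattern are simplifying assumptions, not a tokenizer used in any of the cited studies, and `length_measures` is a hypothetical name.

```python
import re

def length_measures(text):
    """Return (mean sentence length in words, mean word length in characters)."""
    # Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Naive word pattern (assumption): runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0, 0.0
    mean_sentence_length = len(words) / len(sentences)
    mean_word_length = sum(len(w) for w in words) / len(words)
    return mean_sentence_length, mean_word_length

sample = "I came. I saw it all. Then I left."
print(length_measures(sample))
```

A real study would swap in a proper tokenizer for the target language; the measures themselves stay the same.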
Classification & Authorship Attribution
• Difference in style-based and topic-based text classification
– The most common words (articles, prepositions, pronouns, etc.) are found to be
among the best features to discriminate between authors.
– Such words are usually excluded from the feature set of the topic-based text-
classification methods since they do not carry any semantic information, and they
are usually called “function words”.
• Various sets of function words have been used for English, but limited information
was provided about how they were selected; reported set sizes include 150, 303, 365,
480, and 675 words.
• A Simple Method
– Extract the most frequent words found in the available corpus (comprising all the
texts of the candidate authors).
– Then, a decision has to be made about the number of frequent words that will
be used as features.
• In the earlier studies, sets of at most 100 frequent words were considered adequate
to represent the style of an author.
– Another factor that affects the feature-set size is the classification algorithm that
will be used since many algorithms over-fit the training data when the
dimensionality of the problem increases.
– Some machine learning algorithms (such as SVMs) can deal with thousands of
features.
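The simple method above can be sketched as follows: count words over all training texts, keep the k most frequent as features, and represent each text by their relative frequencies. The helper names (`tokenize`, `frequent_word_features`, `to_vector`) and the toy corpus are hypothetical, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequent_word_features(corpus, k):
    """Pick the k most frequent words over all texts of all candidate authors."""
    counts = Counter()
    for texts in corpus.values():
        for text in texts:
            counts.update(tokenize(text))
    # Sort by descending count, then alphabetically, so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

def to_vector(text, features):
    """Relative frequency of each feature word in the text, in feature order."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in features]

corpus = {
    "author_a": ["The dog and the cat sat on the mat.", "The end of the road."],
    "author_b": ["A dog of a kind, and a cat of a sort."],
}
features = frequent_word_features(corpus, k=3)
vector = to_vector("The cat and the dog.", features)
print(features, vector)
```

Note that the selected words are function words ("the", "a", "of"), which is the typical outcome when ranking by raw frequency.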
Methods (Cont’d)
• Required Routines
– Tokenizer (Word Extraction)
– Conversion to lowercase
– Stemmers
– Lemmatizers
– Detectors of common homographic forms
• Disadvantages
– The bag-of-words approach provides a simple and efficient solution, but
disregards word-order (i.e., contextual) information.
• One Possible Solution
– n-grams
Methods (Cont’d)
• n-grams
– n contiguous words also known as word collocations
• Features
– The dimensionality of the problem following this approach increases considerably
with n to account for all the possible combinations between words.
– The representation produced by this approach is very sparse since most of the
word combinations are not encountered in a given (especially short) text, making it
very difficult to be handled effectively by a classification algorithm.
– The classification accuracy achieved by word n-grams is not always better than
individual word features.
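A small sketch makes the dimensionality and sparsity points concrete. It extracts word n-grams from a toy sentence (the sentence and the function name `word_ngrams` are illustrative assumptions) and compares the number of distinct unigrams and bigrams.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return all runs of n contiguous tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat and the cat slept".split()
unigrams = Counter(word_ngrams(tokens, 1))
bigrams = Counter(word_ngrams(tokens, 2))

# Distinct features grow with n, while most individual counts stay at 1.
print(len(unigrams), len(bigrams))
print(bigrams.most_common(1))
```

Even in ten tokens, the bigram vocabulary is already larger than the unigram vocabulary and almost every bigram occurs once, which is the sparsity problem described above.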
Methods (Cont’d)
• Writing Error Measures
– Spelling errors
• Letter omissions
• Insertions
– Formatting errors
• “all caps” words
– This method needs an accurate spell checker.
• Vocabulary Richness
– The vocabulary richness functions are attempts to quantify the diversity of the
vocabulary of a text.
– Typical examples are the type-token ratio V/N, where V is the size of the
vocabulary (unique tokens) and N is the total number of tokens of the text.
• Unreliable Measures
– Vocabulary size (V) depends heavily on text length (as the text length increases, the
vocabulary also increases, quickly at the beginning and then more and more slowly).
– Various functions have been proposed to achieve stability over text length,
including K (Yule, 1944), and R (Honore, 1979), with questionable results.
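As an illustration, the type-token ratio and Yule's K can be computed as below. The formula used for K is the commonly cited form, K = 10^4 (Σ i² V_i − N) / N², where V_i is the number of word types occurring exactly i times; the slide does not give the formula, so treat this as an assumption about which variant is meant.

```python
from collections import Counter

def type_token_ratio(tokens):
    """V / N: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2 (commonly cited form)."""
    n = len(tokens)
    # freq_of_freqs[i] = V_i, the number of types that occur exactly i times.
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(i * i * v_i for i, v_i in freq_of_freqs.items())
    return 1e4 * (s2 - n) / (n * n)

tokens = "a rose is a rose is a rose".split()
print(type_token_ratio(tokens), yules_k(tokens))
```

Both functions expect a non-empty token list; running them on texts of different lengths by the same author is a quick way to see how unstable V/N is.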
2. Character Features
• A text is viewed as a mere sequence of characters.
• Character-level Measures
– Alphabetic characters count
– Digit characters count
– Uppercase and lowercase characters count
– Punctuation marks count
– And so on …
• Feature
– This type of information is easily available for any natural language and corpus
– It has been proven to be quite useful to quantify the writing style.
• Character n-gram
– Extract frequencies of n-grams on the character level.
• Features
– An advantage of this representation is its tolerance to noise.
– When a text contains spelling errors, the character n-gram representation is not
affected dramatically.
• The words “simplistic” and “simpilstc” would still produce common character
trigrams.
– For oriental languages where the tokenization procedure is quite hard, character
n-grams offer a suitable solution.
– The procedure of extracting the most frequent n-grams is language-independent
and requires no special tools.
• Compression-based Approaches
– Will be discussed later …
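The noise-tolerance point can be checked directly on the two spellings from the slide. The sketch below extracts character trigrams and lists the ones both spellings share; `char_ngrams` is a hypothetical helper name.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count all character n-grams of the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

correct = char_ngrams("simplistic", 3)
misspelt = char_ngrams("simpilstc", 3)
shared = sorted(set(correct) & set(misspelt))
print(shared)
```

As whole word tokens the two spellings have nothing in common, whereas at the trigram level the leading "sim" and "imp" still match, so the misspelling only partially changes the representation.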
3. Syntactic Features
• Employing syntactic information
• Idea
– The idea is that authors tend to unconsciously use similar syntactic patterns.
– Therefore, syntactic information is considered a more reliable authorial
fingerprint in comparison to lexical information.
– This type of information requires robust and accurate NLP tools able to
perform syntactic analysis of texts.
• The syntactic measure extraction is a language dependent procedure
• Such features will produce noisy datasets due to unavoidable errors made by
the parser.
• Rewrite Rule
– Rewrite rule frequencies are extracted from a full parse tree of each sentence.
– Each rewrite rule expresses a part of the syntactic analysis.
– Consider the following rewrite rule:
A : PP → P : PREP + PC : NP
– It means that an adverbial prepositional phrase is constituted by a preposition
followed by a noun phrase as a prepositional complement.
– This information describes “how the words are combined to form phrases or other
structures”.
– Experimental results have shown that this type of measure performs better than
lexical and character features.
• It needs an accurate, fully automated parser able to provide a detailed syntactic
analysis of sentences.
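Assuming a parser has already produced a tree, counting rewrite rules is a simple traversal. In this sketch the tree is a hand-written nested tuple `(label, child, ...)` with plain category labels; the functional labels of the rule above (A, P, PC) are omitted for brevity, and the tree format and function name are assumptions rather than any parser's actual output.

```python
from collections import Counter

def rewrite_rules(tree, counts=None):
    """Count one rule 'PARENT -> CHILD + CHILD ...' per phrase-level node."""
    if counts is None:
        counts = Counter()
    label, *children = tree
    # Only children that are themselves nodes (tuples) take part in a rule;
    # nodes that directly dominate a word (a plain string) yield no rule.
    child_labels = [c[0] for c in children if isinstance(c, tuple)]
    if child_labels:
        counts[f"{label} -> {' + '.join(child_labels)}"] += 1
    for child in children:
        if isinstance(child, tuple):
            rewrite_rules(child, counts)
    return counts

tree = ("S",
        ("NP", ("DET", "the"), ("N", "letter")),
        ("VP", ("V", "arrived"),
               ("PP", ("PREP", "by"), ("NP", ("N", "post")))))
rules = rewrite_rules(tree)
for rule, count in sorted(rules.items()):
    print(count, rule)
```

Summing these counters over all sentences of a text gives the rewrite rule frequency vector for that text.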
Methods (Cont’d)
• Phrase (Chunk) Analysis
– Another attempt to exploit syntactic information was proposed by Stamatatos.
– This sentence would be analyzed as follows:
– NP[Another attempt] VP[to exploit] NP[syntactic information] VP[was proposed]
PP[by Stamatatos]
– Where NP, VP, and PP stand for noun phrase, verb phrase, and prepositional
phrase, respectively.
– This type of information is simpler than Rewrite Rules.
– It could be extracted automatically with relatively high accuracy.
– The extracted measures referred to noun phrase counts, verb phrase counts,
length of noun phrases, length of verb phrases, and so on…
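Given chunker output in the bracketed form shown above, the phrase counts and phrase lengths can be derived with a small parser. The input format is taken from the slide's example; the function name and the regular expression are illustrative assumptions.

```python
import re
from collections import defaultdict

def chunk_measures(chunked):
    """Return {label: (chunk count, mean chunk length in words)}."""
    lengths = defaultdict(list)
    # Each chunk looks like LABEL[word word ...].
    for label, words in re.findall(r"(\w+)\[([^\]]*)\]", chunked):
        lengths[label].append(len(words.split()))
    return {label: (len(v), sum(v) / len(v)) for label, v in lengths.items()}

chunked = ("NP[Another attempt] VP[to exploit] NP[syntactic information] "
           "VP[was proposed] PP[by Stamatatos]")
measures = chunk_measures(chunked)
print(measures)
```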
Methods (Cont’d)
• Part-of-Speech (POS)
– A POS tagger is a tool that assigns a tag of morpho-syntactic information to each
word-token based on contextual information.
– Several researchers have used POS tag frequencies or POS tag n-gram frequencies
to represent style.
– POS tag information provides only a hint of the structural analysis of sentences
since it is not clear:
• How the words are combined to form phrases
• How the phrases are combined into higher level structures
4. Semantic Features
• Low-Level vs. High-Level
– Previous methods are at context level (low-level), not semantic level (high-level)
– NLP tools can be applied successfully to low-level tasks such as:
• Sentence splitting
• POS tagging
• Text chunking
• Partial parsing
– More complicated tasks cannot yet be handled adequately by current NLP
technology for unrestricted text, such as:
• Full syntactic parsing
• Semantic analysis
• Pragmatic analysis
– As a result, very few attempts have been made to exploit high-level features for
stylometric purposes.
Semantic Features Tools
• Produce semantic dependency graphs
– Gamon, Michael. "Linguistic correlates of style: authorship classification with
deep linguistic analysis features." In Proceedings of the 20th international
conference on Computational Linguistics, p. 611. Association for Computational
Linguistics, 2004.
• Extracting semantic measures based on WordNet
– McCarthy, Philip M., Gwyneth A. Lewis, David F. Dufty, and Danielle S.
McNamara. "Analyzing writing styles with Coh-Metrix." In Proceedings of the
Florida Artificial Intelligence Research Society International Conference
(FLAIRS), pp. 764-769. 2006.
• Using the theory of Systemic Functional Grammar (SFG)
– Argamon, Shlomo, Casey Whitelaw, Paul Chase, Sobhan Raj Hota, Navendu
Garg, and Shlomo Levitan. "Stylistic text classification using functional lexical
features." Journal of the American Society for Information Science and
Technology 58, no. 6 (2007): 802-822.
5. Application-Specific Features
• Application-Specific Features
– One can define application-specific measures to better represent the nuances of
style in a given text domain.
– Structural measures can be defined to quantify authorial style in a specific domain.
• Such as e-mail messages and online-forum messages
– Structural measures include:
• The use of greetings and farewells in the messages,
• Types of signatures,
• Use of indentation,
• Paragraph length,
• and so on …
– Other types of application-specific features can be defined only for certain natural
languages, such as Greek.
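A few of the structural measures listed above could be extracted from a plain-text message roughly as follows. The greeting and farewell word lists are hypothetical placeholders, the prefix and substring matching is deliberately crude, and paragraphs are assumed to be separated by blank lines.

```python
GREETINGS = ("hi", "hello", "dear")          # hypothetical word list
FAREWELLS = ("regards", "cheers", "thanks")  # hypothetical word list

def structural_measures(message):
    """Return a few structural style measures for one e-mail or forum message."""
    lines = [ln for ln in message.strip().splitlines() if ln.strip()]
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if not lines:
        return {"has_greeting": False, "has_farewell": False,
                "indented_lines": 0, "mean_paragraph_words": 0.0}
    opening = lines[0].strip().lower()
    closing = " ".join(lines[-2:]).lower()   # look for a farewell in the last two lines
    return {
        "has_greeting": opening.startswith(GREETINGS),
        "has_farewell": any(word in closing for word in FAREWELLS),
        "indented_lines": sum(ln.startswith((" ", "\t")) for ln in lines),
        "mean_paragraph_words": sum(len(p.split()) for p in paragraphs) / len(paragraphs),
    }

message = "Hi Sam,\n\nThe report is attached.\n  Please review section two.\n\nRegards,\nAlex"
measures = structural_measures(message)
print(measures)
```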
Attribution Methods
• Profile-Based Approaches
– Probabilistic models
– Compression models
– Common n-grams and variants
• Instance-Based Approaches
– Vector space models
– Similarity-based models
– Meta-learning models
• Hybrid Approaches
– Average Methods
Attribution Methods (Cont’d)
• Profile-Based Approaches
– Cumulatively (per author)
• Concatenating all the available training texts per author in one big file
(author’s profile)
• The stylometric measures extracted from the concatenated file may be quite
different in comparison to each of the original training texts.
• Instance-Based Approaches
– Individually (per training text)
1. Profile-Based Approaches
1.1. Probabilistic Models
1.2. Compression Models
• Compression Models
– Such methods do not produce a concrete vector representation of the author’s
profile.
• Steps
– Initially, all the training texts of each candidate author a are concatenated into one
file xa, and a compression algorithm is called to produce a compressed file C(xa).
– Then, the unseen text x is appended to each file xa, and the compression algorithm is
called again to produce C(xa + x).
– The difference in bit-wise size of the compressed files, d(x, xa) = C(xa + x) − C(xa),
indicates the similarity of the unseen text with each candidate author: the smaller
the difference, the more similar the texts.
– These models are applied only to character sequences, not word sequences.
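The steps above can be sketched with an off-the-shelf compressor. Here zlib stands in for whatever compression algorithm a given study used (an assumption), and `compressed_size`, `attribute`, and the toy training texts are hypothetical.

```python
import zlib

def compressed_size(text):
    """C(.): size in bits of the zlib-compressed UTF-8 encoding of the text."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def attribute(unseen, training):
    """Return (best author, distances) using d(x, x_a) = C(x_a + x) - C(x_a)."""
    distances = {}
    for author, texts in training.items():
        profile = " ".join(texts)  # x_a: all training texts of the author, concatenated
        distances[author] = (compressed_size(profile + " " + unseen)
                             - compressed_size(profile))
    # The smallest increase in compressed size marks the most similar author.
    return min(distances, key=distances.get), distances

training = {
    "a": ["the quick brown fox jumps over the lazy dog"] * 3,
    "b": ["lorem ipsum dolor sit amet consectetur adipiscing elit"] * 3,
}
best, distances = attribute("the quick brown fox jumps over the lazy dog", training)
print(best, distances)
```

The unseen text adds very little to the compressed size of a profile that already contains similar character sequences, which is why the difference works as a similarity signal.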
1.3. Common n-grams (CNG)
1.3. Common n-grams (CNG) (Cont’d)
• Parameters
– The CNG method has two important parameters that should be tuned:
– The profile size L
• How many strings constitute the profile.
– The character n-gram length n
• How long the strings that constitute the profile are.
– Keselj et al. (2003) reported their best results for 1,000 ≤ L ≤ 5,000 and 3 ≤ n ≤ 5.
– The CNG distance function performs well when the training corpus is relatively
balanced.
– But it fails in imbalanced cases where at least one author’s profile is shorter than L.
• CNG variant
– Proposed to solve the class-imbalance problem.
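A minimal sketch of the CNG idea follows. The profile is the L most frequent character n-grams with normalised frequencies, and the dissimilarity is the relative-difference measure usually attributed to Keselj et al. (2003), summed over the union of both profiles; the slide does not state the formula, so treat it as an assumption. The function names and toy inputs are hypothetical.

```python
from collections import Counter

def cng_profile(text, n, size):
    """Profile: the `size` (L) most frequent character n-grams, normalised."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    # Rank by descending count, then alphabetically, for deterministic ties.
    ranked = sorted(grams.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return {g: c / total for g, c in ranked}

def cng_distance(p1, p2):
    """Sum over the union of both profiles of (2 * (f1 - f2) / (f1 + f2)) ** 2."""
    d = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)  # missing n-grams count as frequency 0
        d += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return d

same = cng_distance(cng_profile("abcabc", 2, 10), cng_profile("abcabc", 2, 10))
disjoint = cng_distance(cng_profile("aaaa", 2, 10), cng_profile("bbbb", 2, 10))
print(same, disjoint)
```

Every n-gram that appears in only one profile contributes the maximum value of 4, which hints at why the measure misbehaves when one author's profile holds fewer than L n-grams.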
2. Instance-Based Approaches
2.1. Vector Space Models
• Definition
– Each text can be considered as a vector in a multivariate space.
– Then, a variety of powerful statistical and machine learning algorithms can be
used to build a classification model, including:
• Discriminant Analysis
• Decision Trees
• Neural Networks
• Genetic Algorithms
• Memory-based Learners
• Classifier Ensemble Methods
• and so on.
– Some of these algorithms can effectively handle high-dimensional, noisy, and
sparse data, allowing more expressive representations of texts.
– The effectiveness of these methods is diminished by class imbalance.
2.2. Similarity-based Models
• Idea
– Calculation of pairwise similarity measures between the unseen text and all the
training texts,
– And then estimating the most likely author based on a nearest-neighbor algorithm.
• Example
– Compression Model
• Each training text is compressed into a separate file using an off-the-shelf algorithm.
• C(x) is the bit-wise size of the compressed file x.
• The difference C(x + y) − C(x) indicates the similarity of a training text x with the
unseen text y.
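The general idea (pairwise similarity against every training text, then a nearest-neighbor decision) works with any similarity measure. The sketch below swaps in cosine similarity over character trigram counts instead of the compression-based measure, purely as an illustrative assumption; all names and texts are hypothetical.

```python
import math
from collections import Counter

def trigram_counts(text):
    """Character trigram counts of a text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def nearest_author(unseen, training_texts):
    """training_texts: list of (author, text) pairs, one entry per training text."""
    target = trigram_counts(unseen)
    scored = [(cosine(target, trigram_counts(text)), author)
              for author, text in training_texts]
    return max(scored)[1]  # author of the single most similar training text (1-NN)

training_texts = [
    ("a", "the quick brown fox"),
    ("a", "a quick brown dog"),
    ("b", "lorem ipsum dolor sit"),
]
predicted = nearest_author("quick brown cat", training_texts)
print(predicted)
```

Unlike the profile-based sketch, each training text is kept as a separate instance here, so differences between texts of the same author are preserved.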
2.3. Meta-learning Models
• Definition
– More complex algorithms specifically designed for authorship attribution.
– The main goal is to use such meta-data to understand how automatic learning can
become flexible in solving different kinds of learning problems:
• Hence to improve the performance of existing learning algorithms.
– The most interesting approach of this kind is the unmasking method.
Unmasking Method
• Unmasking Method
– In the unmasking method, for each unseen text, an SVM classifier is built to
discriminate it from the training texts of each candidate author.
– Thus, for n candidate authors, n classifiers are built for each unseen text.
– Then, in an iterative procedure, a predefined number of the most important
features for each classifier are removed and the drop in accuracy is measured.
– At the beginning, all the classifiers have more or less the same very high accuracy.
– After a few iterations, the accuracy of the classifier that discriminates between the
unseen text and the true author drops sharply, while the accuracy of the other
classifiers remains relatively high.
– This happens because the differences between the unseen text and the other
authors are manifold, so removing a few features does not affect the accuracy
dramatically.
3. Hybrid Approaches
• Hybrid Approaches
– Methods that borrow some elements from both profile-based and instance-based
approaches.
• Example
– All the training text samples are represented separately, as happens with the
instance-based approaches.
– The representation vectors for the texts of each author are averaged feature-wise
to produce a single profile vector for each author, as happens with the
profile-based approaches.
– The distance of the profile of an unseen text from the profile of each author is then
calculated by a weighted feature-wise function.
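The hybrid example can be sketched as follows: per-text vectors are averaged into one profile per author, and the unseen text is assigned to the closest profile. The slide does not specify the weighted feature-wise function, so the weighted absolute difference used here, the weights, and the toy vectors are all assumptions.

```python
def centroid(vectors):
    """Feature-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def weighted_distance(u, v, weights):
    """Weighted feature-wise absolute difference (the weighting is an assumption)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, u, v))

def attribute(unseen_vector, vectors_by_author, weights):
    """Average each author's text vectors into a profile, then pick the closest one."""
    profiles = {author: centroid(vs) for author, vs in vectors_by_author.items()}
    return min(profiles,
               key=lambda a: weighted_distance(unseen_vector, profiles[a], weights))

vectors_by_author = {
    "a": [[0.2, 0.0], [0.4, 0.2]],   # one vector per training text
    "b": [[0.0, 0.6], [0.2, 0.8]],
}
predicted = attribute([0.3, 0.2], vectors_by_author, weights=[1.0, 1.0])
print(predicted)
```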
Natural Language Processing Advancements By Deep Learning: A SurveyNatural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A Survey
LiCord: Language Independent Content Word Finder
LiCord: Language Independent Content Word FinderLiCord: Language Independent Content Word Finder
LiCord: Language Independent Content Word Finder
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
Supporting the authoring process with linguistic software
Supporting the authoring process with linguistic softwareSupporting the authoring process with linguistic software
Supporting the authoring process with linguistic software
Natural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxNatural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptx

Authorship attribution

  • 3. Authorship Attribution – Reza Ramezani. Importance. Danielle Jones • Last seen 18th June 2001. • After her disappearance, a series of text messages was sent from her phone. • Linguistic analysis showed that the later messages were sent by her uncle, Stuart Campbell. • Campbell was convicted of Danielle’s murder on 19th December 2002, in part because of the linguistic evidence. Jenny Nicholl • Last seen 30th June 2005. • After her disappearance, a series of text messages was sent from her phone. • Linguistic analysis showed that the later messages were sent by her classmate, David Hodgson. • Hodgson was convicted of Jenny’s murder on 19th February 2008, in part because of the linguistic evidence.
  • 4. Authorship Attribution • Definition – In the typical authorship attribution problem, a text of unknown authorship is assigned to one candidate author, given a set of candidate authors for whom text samples of undisputed authorship are available. – From a machine learning point of view, this can be viewed as a multiclass, single-label text-categorization task. – This task is also called authorship (or author) identification. • Idea – The main idea behind authorship attribution is that authors can be distinguished by measuring textual features of their writing. – Authorship attribution is supported by statistical or computational methods. – The field takes advantage of research advances in areas such as machine learning, information retrieval, and natural language processing.
  • 5. Supervised Learning
  • 6. Stylometry • Representation – Research in authorship attribution has been driven by attempts to define features for quantifying writing style, a line of research known as “stylometry”. • Examples: sentence length, word length, word frequencies, character frequencies, and vocabulary richness functions. – Rudman (1998) estimated that about 1,000 different measures had been proposed.
  • 7. Stylometric Features • Lexical and Character Features – Consider a text as a mere sequence of word-tokens or characters, respectively. • Syntactic and Semantic Features – Require deeper linguistic analysis. • Application-Specific Features – Can be defined only in certain text domains or languages.
  • 8. 1. Lexical Features • A simple and natural way to view a text is as a sequence of tokens grouped into sentences, with each token corresponding to a word, number, or punctuation mark. • Method – The very first attempts to attribute authorship were based on simple measures such as sentence length counts and word length counts. • Advantage – They can be applied to any language and any corpus with no additional requirements except the availability of a tokenizer. • The most straightforward approach, used in the vast majority of authorship attribution studies, is to represent texts as vectors of word frequencies.
  • 9. Classification & Authorship Attribution • Difference between style-based and topic-based text classification – The most common words (articles, prepositions, pronouns, etc.) are found to be among the best features to discriminate between authors. – Such words, usually called “function words”, are typically excluded from the feature set of topic-based text-classification methods since they do not carry any semantic information. • Various sets of function words have been used for English (150, 303, 365, 480, and 675 words), but limited information was provided about how they were selected.
  • 10. Methods • A Simple Method – Extract the most frequent words found in the available corpus (comprising all the texts of the candidate authors). – Then, a decision has to be made about the number of frequent words that will be used as features. • In the earlier studies, sets of at most 100 frequent words were considered adequate to represent the style of an author. – Another factor that affects the feature-set size is the classification algorithm that will be used, since many algorithms overfit the training data when the dimensionality of the problem increases. – Some machine learning algorithms (such as SVM) can deal with thousands of features.
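The simple method above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the regular-expression tokenizer, the helper names `tokenize`, `top_words`, and `frequency_vector`, and the toy corpus are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def top_words(corpus, k):
    # Count words over all texts of all candidate authors, keep the k most frequent.
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


def frequency_vector(text, vocabulary):
    # Relative frequency of each vocabulary word in one text.
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]


corpus = ["The cat sat on the mat.", "The dog and the cat ran to the park."]
vocab = top_words(corpus, 3)
vectors = [frequency_vector(text, vocab) for text in corpus]
```

The choice of `k` is the feature-set-size decision mentioned above; the resulting vectors can be passed to any classifier.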
  • 11. Methods (Cont’d) • Required Routines – Tokenizer (word extraction) – Conversion to lowercase – Stemmers – Lemmatizers – Detectors of common homographic forms • Disadvantages – The bag-of-words approach provides a simple and efficient solution, but disregards word-order (i.e., contextual) information. • One Possible Solution – n-grams
  • 12. Methods (Cont’d) • n-grams – Sequences of n contiguous words, also known as word collocations. • Features – The dimensionality of the problem increases considerably with n, to account for all the possible combinations of words. – The representation produced by this approach is very sparse, since most word combinations are not encountered in a given (especially short) text, making it very difficult for a classification algorithm to handle effectively. – The classification accuracy achieved by word n-grams is not always better than that of individual word features.
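A minimal sketch of word n-gram extraction, assuming whitespace tokenization and a toy sentence (both illustrative choices, not from the slides). The helper name `word_ngrams` is hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, n):
    # Slide a window of n contiguous tokens over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()
bigram_counts = Counter(word_ngrams(tokens, 2))
trigram_counts = Counter(word_ngrams(tokens, 3))
```

Even in this tiny example every trigram occurs only once, which hints at why the representation becomes sparse as n grows.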
  • 13. Methods (Cont’d) • Writing Error Measures – Spelling errors • Letter omissions • Insertions – Formatting errors • “All caps” words – This method needs an accurate spell checker.
  • 14. Methods • Vocabulary Richness – Vocabulary richness functions are attempts to quantify the diversity of the vocabulary of a text. – A typical example is the type-token ratio V/N, where V is the size of the vocabulary (unique tokens) and N is the total number of tokens of the text. • Unreliable Measures – Vocabulary size (V) depends heavily on text length (as the text length increases, the vocabulary also increases, quickly at the beginning and then more and more slowly). – Various functions have been proposed to achieve stability over text length, including K (Yule, 1944) and R (Honoré, 1979), with questionable results.
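The two measures named above can be sketched as follows. The slides give only V/N; the formula used here for Yule's K, K = 10^4 · (Σ i²·V_i − N) / N² with V_i the number of word types occurring exactly i times, is the commonly cited form and is an assumption with respect to this deck. Function names and sample tokens are hypothetical.

```python
from collections import Counter


def type_token_ratio(tokens):
    # V / N: number of unique tokens divided by total number of tokens.
    return len(set(tokens)) / len(tokens)


def yules_k(tokens):
    # freq_of_freq[i] = number of word types that occur exactly i times (V_i).
    n = len(tokens)
    freq_of_freq = Counter(Counter(tokens).values())
    s2 = sum(i * i * v_i for i, v_i in freq_of_freq.items())
    return 1e4 * (s2 - n) / (n * n)


short_text = "the cat sat".split()
long_text = short_text * 4  # same vocabulary, four times the length

ttr_short = type_token_ratio(short_text)
ttr_long = type_token_ratio(long_text)
k_value = yules_k("a b a b".split())
```

Repeating the same three words four times leaves the vocabulary unchanged but lowers the type-token ratio, which illustrates its dependence on text length.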
  • 15. 2. Character Features • A text is viewed as a mere sequence of characters. • Character-level Measures – Alphabetic character count – Digit character count – Uppercase and lowercase character counts – Punctuation mark count – And so on … • Features – This type of information is easily available for any natural language and corpus. – It has been proven to be quite useful for quantifying writing style.
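As a rough sketch, the character-level measures listed above can be computed with Python's built-in string methods. The function name `char_features`, the dictionary keys, and the use of `string.punctuation` (ASCII punctuation only) are illustrative assumptions.

```python
import string


def char_features(text):
    # Simple character-class counts over the raw text.
    return {
        "alpha": sum(c.isalpha() for c in text),
        "digit": sum(c.isdigit() for c in text),
        "upper": sum(c.isupper() for c in text),
        "lower": sum(c.islower() for c in text),
        "punct": sum(c in string.punctuation for c in text),
    }


features = char_features("Hello, World 42!")
```

In practice the counts would usually be normalized by text length so that texts of different sizes are comparable.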
  • 16. Methods • Character n-grams – Extract frequencies of n-grams at the character level. • Features – An advantage of this representation is its tolerance to noise. – In cases of spelling or other lexical errors, the character n-gram representation is not affected dramatically. • The words “simplistic” and “simpilstc” would produce many common character 3-grams. – For oriental languages, where tokenization is quite hard, character n-grams offer a suitable solution. – The procedure of extracting the most frequent n-grams is language-independent and requires no special tools. • Compression-based Approaches – Will be discussed later …
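A minimal sketch of character n-gram extraction and of its tolerance to a single spelling error. The word pair "attribution" / "atribution" (one omitted letter) is a hypothetical example chosen for illustration, and `char_ngrams` is an assumed helper name.

```python
from collections import Counter


def char_ngrams(text, n):
    # Count every contiguous substring of length n.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


correct = char_ngrams("attribution", 3)
misspelled = char_ngrams("atribution", 3)

# 3-grams that survive the spelling error.
shared = set(correct) & set(misspelled)
```

A word-level representation would treat the two spellings as unrelated tokens, whereas most of the 3-grams are preserved.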
  • 17. 3. Syntactic Features • Employing syntactic information • Idea – The idea is that authors tend to unconsciously use similar syntactic patterns. – Therefore, syntactic information is considered a more reliable authorial fingerprint than lexical information. – This type of information requires robust and accurate NLP tools able to perform syntactic analysis of texts. • Syntactic measure extraction is a language-dependent procedure. • Such features will produce noisy datasets due to unavoidable errors made by the parser.
  • 18. Methods • Rewrite Rules – Rewrite rule frequencies are extracted from a full parse tree of each sentence. – Rewrite rules are used to analyze the syntactic structure of sentences. – Consider the following rewrite rule: A : PP → P : PREP + PC : NP – It means that an adverbial prepositional phrase is constituted by a preposition followed by a noun phrase as a prepositional complement. – This information describes “how the words are combined to form phrases or other structures”. – Experimental results have shown that this type of measure performs better than lexical and character features. • It needs an accurate, fully automated parser able to provide a detailed syntactic analysis of sentences.
  • 19. Methods (Cont’d) • Paragraph Analysis – Another attempt to exploit syntactic information was proposed by Stamatatos. – This sentence would be analyzed as follows: – NP[Another attempt] VP[to exploit] NP[syntactic information] VP[was proposed] PP[by Stamatatos] – where NP, VP, and PP stand for noun phrase, verb phrase, and prepositional phrase, respectively. – This type of information is simpler than rewrite rules. – It can be extracted automatically with relatively high accuracy. – The extracted measures refer to noun phrase counts, verb phrase counts, length of noun phrases, length of verb phrases, and so on …
  • 20. Methods (Cont’d) • Part-of-Speech (POS) – A POS tagger is a tool that assigns a tag of morpho-syntactic information to each word-token based on contextual information. – Several researchers have used POS tag frequencies or POS tag n-gram frequencies to represent style. – POS tag information provides only a hint of the structural analysis of sentences since it is not clear: • How the words are combined to form phrases • How the phrases are combined into higher-level structures
  • 21. 4. Semantic Features • Low-Level vs. High-Level – The previous methods operate at the context level (low-level), not the semantic level (high-level). – NLP tools can be applied successfully to low-level tasks such as: • Sentence splitting • POS tagging • Text chunking • Partial parsing – More complicated tasks cannot yet be handled adequately by current NLP technology for unrestricted text, such as: • Full syntactic parsing • Semantic analysis • Pragmatic analysis – As a result, very few attempts have been made to exploit high-level features for stylometric purposes.
  • 22. Semantic Features Tools • Producing semantic dependency graphs – Gamon, Michael. "Linguistic correlates of style: authorship classification with deep linguistic analysis features." In Proceedings of the 20th International Conference on Computational Linguistics, p. 611. Association for Computational Linguistics, 2004. • Extracting semantic measures based on WordNet – McCarthy, Philip M., Gwyneth A. Lewis, David F. Dufty, and Danielle S. McNamara. "Analyzing writing styles with Coh-Metrix." In Proceedings of the Florida Artificial Intelligence Research Society International Conference (FLAIRS), pp. 764-769. 2006. • Using the theory of Systemic Functional Grammar (SFG) – Argamon, Shlomo, Casey Whitelaw, Paul Chase, Sobhan Raj Hota, Navendu Garg, and Shlomo Levitan. "Stylistic text classification using functional lexical features." Journal of the American Society for Information Science and Technology 58, no. 6 (2007): 802-822.
  • 23. 5. Application-Specific Features • Application-Specific Features – One can define application-specific measures to better represent the nuances of style in a given text domain. – Structural measures can be defined to quantify authorial style in special domains, • such as e-mail messages and online-forum messages. – Structural measures include: • The use of greetings and farewells in the messages, • Types of signatures, • Use of indentation, • Paragraph length, • and so on … – Other types of application-specific features can be defined only for certain natural languages, such as Greek.
  • 24. Attribution Methods • Profile-Based Approaches – Probabilistic models – Compression models – Common n-grams and variants • Instance-Based Approaches – Vector space models – Similarity-based models – Meta-learning models • Hybrid Approaches – Average Methods
  • 25. Attribution Methods (Cont’d) • Profile-Based Approaches – Training texts are treated cumulatively (per author). • All the available training texts per author are concatenated in one big file (the author’s profile). • The stylometric measures extracted from the concatenated file may be quite different from those of each of the original training texts. • Instance-Based Approaches – Training texts are treated individually: each text is a separate instance of its author’s style.
  • 26. 1. Profile-Based Approaches
  • 27. 1.1. Probabilistic Models
  • 28. 1.2. Compression Models • Compression Models – Such methods do not produce a concrete vector representation of the author’s profile. • Steps – Initially, all the training texts of each candidate author are concatenated into one text xa, and a compression algorithm is called to produce a compressed file C(xa). – Then, the unseen text x is appended to each text xa, and the compression algorithm is called again to produce each C(xa + x). – The difference in bit-wise size of the compressed files, d(x, xa) = C(xa + x) − C(xa), indicates the similarity of the unseen text with each candidate author: the smaller the difference, the more similar the texts. – These models are applied only to character sequences, not word sequences.
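The steps above can be sketched with an off-the-shelf compressor. This is a minimal illustration, assuming `zlib` as the compression algorithm (the slides do not name one); the helper names and the two toy author profiles are hypothetical.

```python
import zlib


def compressed_bits(text):
    # C(.): size in bits of the zlib-compressed text.
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))


def compression_distance(unseen, profile):
    # d(x, xa) = C(xa + x) - C(xa)
    return compressed_bits(profile + unseen) - compressed_bits(profile)


def attribute(unseen, profiles):
    # Pick the author whose profile absorbs the unseen text most cheaply.
    return min(profiles, key=lambda a: compression_distance(unseen, profiles[a]))


# Toy profiles: each value is the concatenation of one author's training texts.
profiles = {
    "A": "the quick brown fox jumps over the lazy dog. " * 5,
    "B": "colorless green ideas sleep furiously in the night. " * 5,
}
unseen = "the quick brown fox jumps over the lazy dog. "
predicted = attribute(unseen, profiles)
```

The unseen text repeats material already present in profile A, so appending it adds very little to the compressed size, while appending it to profile B adds noticeably more.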
  • 29. 1.3. Common n-grams (CNG)
  • 30. 1.3. Common n-grams (CNG) (Cont’d) • Parameters – The CNG method has two important parameters that should be tuned: – The profile size L • How many strings constitute the profile. – The character n-gram length n • How long the strings that constitute the profile are. – Keselj et al. (2003) reported their best results for 1,000 ≤ L ≤ 5,000 and 3 ≤ n ≤ 5. – The CNG distance function performs well when the training corpus is relatively balanced. – But it fails in imbalanced cases where at least one author’s profile is shorter than L. • CNG variant – Proposed to solve the problem of class imbalance.
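A rough sketch of how the two parameters L and n enter a CNG-style comparison. The transcript does not preserve the distance function, so the relative-difference dissimilarity used here, the sum over all n-grams in either profile of (2(f1 − f2)/(f1 + f2))², is an assumption based on the usual description of Keselj et al. (2003); `build_profile` and `cng_distance` are hypothetical names.

```python
from collections import Counter


def build_profile(text, n, L):
    # Profile = the L most frequent character n-grams with normalized frequencies.
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(L)}


def cng_distance(p1, p2):
    # Assumed dissimilarity: squared relative difference of frequencies,
    # with frequency 0 for n-grams missing from a profile.
    dist = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        dist += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return dist


profile_a = build_profile("abab", n=2, L=10)
profile_b = build_profile("cdcd", n=2, L=10)
same = cng_distance(profile_a, profile_a)
disjoint = cng_distance(profile_a, profile_b)
```

Each n-gram found in only one profile contributes the maximum value of 4, which suggests why a profile with fewer than L n-grams distorts the comparison in imbalanced corpora.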
  • 31. 2. Instance-Based Approaches
  • 32. 2.1. Vector Space Models • Definition – Each text can be considered as a vector in a multivariate space. – Then, a variety of powerful statistical and machine learning algorithms can be used to build a classification model, including: • Discriminant Analysis • SVM • Decision Trees • Neural Networks • Genetic Algorithms • Memory-based Learners • Classifier Ensemble Methods • and so on. – Some of these algorithms can effectively handle high-dimensional, noisy, and sparse data, allowing more expressive representations of texts. – The effectiveness of these methods is diminished by the presence of class imbalance.
  • 33. 2.2. Similarity-based Models • Idea – Pairwise similarity measures are calculated between the unseen text and all the training texts, – and the most likely author is then estimated with a nearest-neighbor algorithm. • Example – Compression Model • Each training text is compressed into a separate file using an off-the-shelf algorithm. • C(x) is the bit-wise size of the compression of file x. • The difference C(x + y) − C(x) indicates the similarity of a training text x with the unseen text y.
  • 34. 2.3. Meta-learning Models
    • Definition
      – More complex algorithms specifically designed for authorship attribution.
      – The main goal is to use meta-data to understand how automatic learning can become flexible in solving different kinds of learning problems,
        • and hence to improve the performance of existing learning algorithms.
      – The most interesting approach of this kind is the unmasking method.
  • 35. Unmasking Method
    • Unmasking Method
      – For each unseen text, an SVM classifier is built to discriminate it from the training texts of each candidate author.
      – Thus, for n candidate authors, n classifiers are built for each unseen text.
      – Then, in an iterative procedure, a predefined number of the most important features of each classifier is removed and the drop in accuracy is measured.
      – At the beginning, all the classifiers have more or less the same very high accuracy.
      – After a few iterations, the accuracy of the classifier that discriminates between the unseen text and the true author drops sharply, while the accuracy of the other classifiers remains relatively high.
      – This happens because the differences between the unseen text and the other authors are manifold, so removing a few features does not affect the accuracy dramatically, whereas texts by the same author differ in only a few features.
  • 36. 3. Hybrid Approaches
    • Hybrid Approaches
      – Methods that borrow elements from both profile-based and instance-based approaches.
    • Example
      – All the training text samples are represented separately, as in the instance-based approaches.
      – The representation vectors of each author's texts are averaged feature-wise to produce a single profile vector per author, as in the profile-based approaches.
      – The distance between the profile of an unseen text and the profile of each author is then calculated by a weighted feature-wise function.
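A minimal Python sketch of this hybrid example follows. The slides do not specify the weighted feature-wise function, so a weighted absolute difference (weighted L1) is assumed here, and the names `mean_vector`, `weighted_distance`, and `hybrid_attribute` are hypothetical.

```python
def mean_vector(vectors):
    """Average equal-length feature vectors feature-wise into one profile."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def weighted_distance(u, v, weights):
    """Assumed distance: weighted sum of absolute feature differences."""
    return sum(w * abs(a - b) for a, b, w in zip(u, v, weights))


def hybrid_attribute(unseen_vector, training, weights):
    """Return the author whose averaged profile is closest to the unseen text.

    training maps an author name to a list of per-text feature vectors.
    """
    profiles = {author: mean_vector(vecs) for author, vecs in training.items()}
    return min(
        profiles,
        key=lambda author: weighted_distance(unseen_vector, profiles[author], weights),
    )
```

The per-text vectors keep the instance-based representation, while the averaging step collapses them into one profile per author.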

Editor's Notes

  1. Lemmatisation (or lemmatization), in linguistics, is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Inflection = tone, melody.
  2. Oriental = the East, such as China. Nuance: 1. A subtle or slight degree of difference, as in meaning, feeling, or tone; a gradation. 2. Expression or appreciation of subtle shades of meaning, feeling, or tone: a rich artistic performance, full of nuance.
  3. Adverbial = adverbial phrase; prepositional = preposition.
  4. Pragmatic = realistic.
  5. Nuances = subtleties, tone, purport; farewells = goodbyes; indentation = indent (inward offset of text).
  6. When the important features are removed, the accuracy of the classifier that indicates the correct class degrades more than that of all the others, because the important features have the greatest effect in that classifier.