Authorship Attribution & Forensic
Linguistics with Python/Scikit-Learn/Pandas

Kostas Perifanos, Search & Analytics Engineer
@perifanoskostas
Learner Analytics & Data Science Team
Definition
“Automated authorship attribution is the problem
of identifying the author of an anonymous text, or
text whose authorship is in doubt” [Love, 2002]
Domains of application
● Author attribution
● Author verification
● Plagiarism detection
● Author profiling [age, education, gender]
● Stylistic inconsistencies [multiple collaborators/authors]
● Can also be applied to computer code, music scores, ...
“Automated authorship attribution is the problem
of identifying the author of an anonymous text, or
text whose authorship is in doubt”
“Automation”, “identification”, “text”: Machine Learning
A classification problem
● Define classes
● Extract features
● Train ML classifier
● Evaluate
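The four steps above can be sketched end to end on a toy corpus. This is a minimal illustration assuming scikit-learn is installed; the two "authors", their texts and all variable names are made up and are not taken from the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# 1. Define classes: one label per text (hypothetical authors and texts)
texts = [
    "the cat sat on the mat", "the cat ate the fish", "the cat likes the mat",
    "stocks fell sharply today", "stocks rose again today", "markets fell today",
]
labels = ["A", "A", "A", "B", "B", "B"]

# 2. Extract features and 3. train a classifier, chained in one pipeline
clf = make_pipeline(TfidfVectorizer(analyzer="word"), LinearSVC())
clf.fit(texts, labels)

# 4. Evaluate (here only on the training texts, to keep the sketch short;
#    a real evaluation needs held-out data)
train_accuracy = clf.score(texts, labels)
predictions = clf.predict(["the cat sat", "stocks fell"])
print(train_accuracy, predictions)
```

Chaining the vectorizer and the classifier keeps feature extraction and training in one object, which makes it harder to evaluate on features fitted with test data.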
Class definition[s]
● AuthorA, AuthorB, AuthorC, …
● Author vs rest-of-the-world [1-class classification problem]
● Or even, in extended contexts, a clustering problem
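The "author vs rest-of-the-world" setting can be sketched with a one-class model. This is only one possible approach, assuming scikit-learn's OneClassSVM over character n-gram TF-IDF features; the texts and the `nu` value are invented for illustration and the talk does not say which one-class method, if any, was used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical texts known to be written by the target author
known = [
    "well, I reckon the weather will turn soon",
    "well, I reckon we should leave early",
    "I reckon the train will be late again",
    "well, the weather was fine, I reckon",
]
# Texts of unknown origin to be verified
unknown = ["well, I reckon it will rain", "Quarterly revenue exceeded forecasts."]

vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
X_known = vec.fit_transform(known)

# nu bounds the fraction of training texts treated as outliers (an arbitrary choice here)
detector = OneClassSVM(kernel="linear", nu=0.1).fit(X_known)

# +1 = looks like the target author, -1 = rest of the world
verdicts = detector.predict(vec.transform(unknown))
print(verdicts)
```

Only texts by the target author are needed for training, which is the point of the one-class formulation; with so few texts the verdicts themselves are not meaningful.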
Feature extraction
● Lexical features
● Character features
● Syntactic features
● Application specific
Feature extraction
● Lexical features
  ● Word length, sentence length etc
  ● Vocabulary richness [lexical density: ratio of function words to content words]
  ● Word frequencies
  ● Word n-grams
  ● Spelling errors
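A few of the lexical features listed above can be computed with the standard library alone. The sketch below is illustrative: the function name is hypothetical, the sentence splitter is deliberately naive, and the type-token ratio is used as a simple stand-in for vocabulary richness rather than the function/content word ratio mentioned on the slide. It assumes a non-empty text.

```python
import re

def lexical_features(text):
    """Return a few simple lexical measurements for one text."""
    # Naive sentence split on ., ! and ? (an assumption; real text needs a proper tokenizer)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # type-token ratio: distinct words / total words, one common richness measure
        "type_token_ratio": len(set(words)) / len(words),
    }

features = lexical_features("The cat sat. The cat ran away!")
print(features)
```

For the example sentence pair there are 7 words in 2 sentences and 5 distinct words, so the average sentence length is 3.5 and the type-token ratio is 5/7.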
Feature extraction
● Character features
  ● Character types (letters, digits, punctuation)
  ● Character n-grams (fixed and variable length)
  ● Compression methods [Entropy, which is really nice but for another talk :) ]
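The first character feature, character types, can be sketched as a proportion profile per text. The function name and the extra "whitespace" and "other" buckets are assumptions made for this illustration; compression methods are not covered here.

```python
import string
from collections import Counter

def char_type_profile(text):
    """Proportion of letters, digits, punctuation and whitespace in a text."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            counts["letter"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        elif ch in string.punctuation:
            counts["punctuation"] += 1
        elif ch.isspace():
            counts["whitespace"] += 1
        else:
            counts["other"] += 1
    total = len(text)
    return {kind: n / total for kind, n in counts.items()}

profile = char_type_profile("Call me at 5, ok?")
print(profile)
```

Because `str.isalpha` is Unicode-aware, the same function counts Greek letters as letters; note that `string.punctuation` only covers ASCII punctuation.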
Feature extraction
● Syntactic features
  ● Part-of-speech tags [e.g. Verbs (VB), Nouns (NN), Prepositions (PP) etc]
  ● Sentence and phrase structure
  ● Errors
Feature extraction
● Semantic features
  ● Synonyms
  ● Semantic dependencies
● Application specific features
  ● Structural
  ● Content specific
  ● Language specific
Demo application
Let’s apply a classification algorithm to texts, using word n-grams, character n-grams and POS n-grams
Data set (1): 12867 tweets from 10 users, in Greek, collected in 2012 [4]
Data set (2): 1157 judgments from 2 judges, in English [5]
But what’s an “n-gram”?
[…] an n-gram is a contiguous sequence of n items from a given sequence of text. [http://en.wikipedia.org/wiki/N-gram]
So, for the sentence above:
word 2-grams (or bigrams): [ (an, n-gram), (n-gram, is), (is, a), (a, contiguous), … ]
char 2-grams: [ ‘an’, ‘n ’, ‘ n’, ‘n-’, ‘-g’, … ]
We will use the TF-IDF weighted frequencies of both word and character n-grams as features.
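The word and character bigrams of the example sentence can be reproduced with two small helper functions. The function names are hypothetical and tokenization is a plain whitespace split, which is an assumption of this sketch.

```python
def word_ngrams(text, n):
    """Contiguous sequences of n whitespace-separated tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Contiguous sequences of n characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "an n-gram is a contiguous sequence"
print(word_ngrams(sentence, 2)[:4])
# [('an', 'n-gram'), ('n-gram', 'is'), ('is', 'a'), ('a', 'contiguous')]
print(char_ngrams(sentence, 2)[:5])
# ['an', 'n ', ' n', 'n-', '-g']
```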
Enter Python
Flashback [or, transforming experiments to accepted papers in t<=2h]
A few months earlier, Dec 13, just one day before my holidays I get this call...
Load the dataset
# assume we have the data in 10 tsv files, one file per author.
# each file consists of two columns, id and actual text
from os import listdir
from os.path import isfile, join
import pandas as pd

def load_corpus(input_dir):
    trainfiles = [f for f in listdir(input_dir) if isfile(join(input_dir, f))]
    trainset = []
    for filename in trainfiles:
        # tab-separated input; read both columns as strings
        df = pd.read_csv(join(input_dir, filename), sep="\t",
                         dtype={'id': object, 'text': object})
        # the file name doubles as the author label
        for row in df['text']:
            trainset.append({"label": filename, "text": row})
    return trainset
Extract features [1]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

word_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                              max_features=2000, binary=False)
# min_df=1 replaces the slide's min_df=0, which recent scikit-learn versions reject
char_vector = TfidfVectorizer(analyzer="char", ngram_range=(2, 3),
                              max_features=2000, binary=False, min_df=1)

# split the loaded records into parallel lists of texts and author labels
corpus, classes = [], []
for item in trainset:
    corpus.append(item["text"])
    classes.append(item["label"])

# our vectors are the feature union of word/char ngrams
vectorizer = FeatureUnion([("chars", char_vector), ("words", word_vector)])

# load corpus, use fit_transform to get vectors
X = vectorizer.fit_transform(corpus)
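What FeatureUnion does to the feature matrix can be checked on a two-document toy corpus: the combined matrix has one column per character n-gram plus one column per word bigram. The corpus is made up, and `max_features` is left out so the counts are easy to verify by hand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ["the cat sat on the mat", "the dog sat on the log"]  # made-up corpus

words = TfidfVectorizer(analyzer="word", ngram_range=(2, 2))
chars = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
union = FeatureUnion([("chars", chars), ("words", words)])

X_union = union.fit_transform(docs)

# Look up the fitted members by name to inspect their vocabularies
fitted = dict(union.transformer_list)
n_char = len(fitted["chars"].vocabulary_)
n_word = len(fitted["words"].vocabulary_)
print(X_union.shape, n_char, n_word)
```

The two documents share the bigrams "sat on" and "on the", so the word vectorizer contributes 8 columns; the character columns come first because "chars" is listed first in the union.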
Extract features [2]
import nltk
import scipy.sparse as sp   # sp is assumed to be scipy.sparse

# generate POS tags using nltk, return the sequence as whitespace separated string
def pos_tags(txt):
    tokens = nltk.word_tokenize(txt)
    return " ".join([tag for (word, tag) in nltk.pos_tag(tokens)])

# combine word and char ngrams with POS-ngrams
tag_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                             binary=False, max_features=2000, decode_error='ignore')
# tags is not defined on the slide; presumably the POS sequence of each text
tags = [pos_tags(text) for text in corpus]
X1 = vectorizer.fit_transform(corpus)
X2 = tag_vector.fit_transform(tags)
# concatenate the two matrices
X = sp.hstack((X1, X2), format='csr')
Extract features [2.1]
# this last part is a little bit tricky
X = sp.hstack((X1, X2), format='csr')

There was no (obvious) way to use FeatureUnion here.
X1 and X2 are sparse matrices, so we use hstack (from scipy.sparse, imported as sp) to stack the two matrices horizontally (column-wise).
The link below documents the dense NumPy counterpart:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html
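The effect of the horizontal stacking is easiest to see on two tiny matrices. The values below are arbitrary stand-ins for TF-IDF weights; only the shapes matter.

```python
import numpy as np
import scipy.sparse as sp

# Two small sparse matrices with the same number of rows (documents)
X1 = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                             [0.0, 3.0, 0.0]]))
X2 = sp.csr_matrix(np.array([[0.0, 4.0],
                             [5.0, 0.0]]))

# Columns of X2 are appended to the right of X1; the result stays sparse
X = sp.hstack((X1, X2), format="csr")

print(X.shape)      # (2, 5): 3 + 2 columns, rows unchanged
print(X.toarray())
```

Both inputs must have the same number of rows, which holds in the talk's setting because each row corresponds to the same text in both matrices.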
Put everything together

Author: a function of the feature vector components
● word ngrams
● character ngrams
● POS tags ngrams (optional)
Fit the model and evaluate (10-fold-CV)
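import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score   # formerly sklearn.cross_validation

# loss='hinge' is the current name of the slide's loss='l1'
model = LinearSVC(loss='hinge', dual=True)
# X is the feature matrix built above; LinearSVC accepts it in sparse form
scores = cross_val_score(estimator=model, X=X, y=np.asarray(classes), cv=10)

print("10-fold cross validation results:", "mean score = ", scores.mean(),
      "std=", scores.std(), ", num folds =", len(scores))

Results: 96% accuracy for two authors, using 10-fold CV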
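The cross-validation step can be tried on a toy corpus without any data files. This sketch is not the talk's setup: the texts are generated, only three folds are used because the corpus is tiny, and the vectorizer is placed inside a pipeline as a design choice of this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up corpus: 6 texts for each of two hypothetical authors
texts = (["the cat sat on the mat number %d" % i for i in range(6)] +
         ["stocks fell sharply on day %d" % i for i in range(6)])
labels = np.array(["A"] * 6 + ["B"] * 6)

# Putting the vectorizer inside the pipeline refits TF-IDF on each training
# fold, so no vocabulary statistics leak from the held-out fold
model = make_pipeline(TfidfVectorizer(), LinearSVC())

# cv=3 because the toy corpus is tiny; the slide uses cv=10
scores = cross_val_score(model, texts, labels, cv=3)
print("mean score =", scores.mean(), "std =", scores.std(), "num folds =", len(scores))
```

For a classifier, `cross_val_score` with an integer `cv` uses stratified folds, so every fold contains texts from both authors.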
Evaluate (train set vs test set)
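from sklearn.model_selection import train_test_split   # formerly sklearn.cross_validation
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
import matplotlib.pyplot as pl

model = LinearSVC(loss='hinge', dual=True)   # 'hinge' is the current name of loss='l1'
y = classes   # author labels collected earlier
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = model.fit(X_train, y_train).predict(X_test)

# rows of cm are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
print(cm)
pl.matshow(cm)
pl.title('Confusion matrix')
pl.colorbar()
pl.ylabel('True label')
pl.xlabel('Predicted label')
pl.show()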
Confusion Matrix

[10 x 10 matrix as printed by print(cm), presumably for the ten-author tweet data set. The column layout did not survive extraction; only the outer columns can be read off reliably.]
first column:  57 0 3 0 5 9 3 8 8 2
second column: 1 71 0 1 4 11 1 12 4 6
last column:   2 0 0 2 3 12 5 8 10 61
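Summary figures can be read off a confusion matrix directly: with true labels on the rows and predicted labels on the columns, the diagonal holds the correct predictions. The 3 x 3 matrix below is hypothetical and is not the matrix from the talk.

```python
import numpy as np

# Hypothetical 3-author confusion matrix (rows: true author, columns: predicted)
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 1, 9]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true author (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted author (column-wise)

print(accuracy)    # 23 / 30
print(recall)      # [0.8, 0.6, 0.9]
print(precision)   # [0.8, 0.75, 0.75]
```

Per-author recall is useful when authors contribute very different numbers of texts, since overall accuracy is then dominated by the most prolific authors.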
Interesting questions
● Many authors?
● Short texts / “micro messages”?
● Is writing style affected by time/age?
● Can we detect “mood”?
● Psychological profiles?
● What about obfuscation?
● Even more subtle problems [PAN Workshop 2013]
● Other applications (code, music scores etc)
References & Libraries
1. Authorship Attribution: An Introduction, Harold Love, 2002
2. A Survey of Modern Authorship Attribution Methods, Efstathios Stamatatos, 2007
3. Authorship Attribution, Patrick Juola, 2008
4. Authorship Attribution in Greek Tweets Using Author's Multilevel N-Gram Profiles, G. Mikros, Kostas Perifanos, 2012
5. Authorship Attribution with Latent Dirichlet Allocation, Seroussi, Zukerman, Bohnert, 2011

Python libraries:
● Pandas: http://pandas.pydata.org/
● Scikit-learn: http://scikit-learn.org/stable/
● nltk: http://www.nltk.org/

Data:
www.csse.monash.edu.au/research/umnl/data
Demo Python code:
https://gist.github.com/kperi/f0730ff3028f7be86b15
Questions?
Thank you!

Authorship attribution pydata london
