4. Authorship recognition
Stylometry:
– An authorship recognition system based solely on writing style.
– Not handwriting
– Only linguistic style: word choice, sentence length, parts‐of‐speech usage, …
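As a rough illustration of what "linguistic style" features can look like, here is a minimal sketch that measures a few surface properties of a text. The function name `style_features` and the particular four features are assumptions made for this example; they are not the feature set used by the systems in these slides.

```python
import re

def style_features(text):
    """Return a few simple style measurements for a text (illustrative only)."""
    # Split on sentence-ending punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept for contractions).
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "unique_word_ratio": len(set(words)) / len(words),
        # "Short" is assumed here to mean three letters or fewer.
        "short_word_ratio": sum(len(w) <= 3 for w in words) / len(words),
    }

sample = "The road was long. It was cold and dark."
features = style_features(sample)
print(features)
```

Part-of-speech features would need a tagger, which is left out to keep the sketch dependency-free.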
7. [Diagram: Document of unknown authorship → extract features → Machine Learning System → determine authorship]
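The pipeline on this slide (document in, features extracted, a learned model decides the author) can be sketched end to end in a few lines. This is a toy under stated assumptions: the two features, the names `extract_features` and `determine_authorship`, and the nearest-neighbor rule are all chosen for illustration and are not the classifiers evaluated later in the deck.

```python
import math
import re

def extract_features(text):
    """Map a document to a small numeric style vector (illustrative features)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    avg_sentence_len = len(words) / max(len(sentences), 1)
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    return (avg_sentence_len, avg_word_len)

def determine_authorship(unknown_text, known_samples):
    """Return the author whose known sample is closest in feature space (1-NN)."""
    target = extract_features(unknown_text)
    best_author, best_dist = None, math.inf
    for author, text in known_samples:
        dist = math.dist(target, extract_features(text))
        if dist < best_dist:
            best_author, best_dist = author, dist
    return best_author

# Hypothetical suspects with one writing sample each.
known = [
    ("author_a", "I went out. It was cold. I came back. I sat down."),
    ("author_b", "Notwithstanding considerable meteorological adversity, "
                 "the expedition continued unabated throughout the afternoon."),
]
guess = determine_authorship("We ate. We left. It was late.", known)
print(guess)
```

A realistic system would use many more features, normalize them to comparable scales, and train on several samples per suspect.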
8. Assumptions
• Writing style is invariant.
– It’s like a fingerprint, you can’t really change it.
– Authorship recognition can identify you if there are sufficient writing samples and a set of suspects.
9. Adversarial Attacks
• Imitation or framing attack
– Where one author imitates another author
– Participants were asked to imitate Cormac McCarthy in writing about their day
• Obfuscation attack
– Where an author hides his regular style
– Participants were asked to write about their neighborhood in a different style
M. Brennan and R. Greenstadt. Practical attacks against authorship recognition techniques. In Proceedings of the Twenty‐First Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, CA, 2009.
10. Accuracy in detecting authorship of regular documents
[Figure: accuracy (0 to 1) versus number of authors (5 to 40). Series: 9‐Feature (NN), Synonym‐Based, Writeprints Baseline (SVM), Random.]
More than 80% accurate in detecting authorship with 40 authors in regular documents.
11. Accuracy in detecting authorship of obfuscated documents
[Figure: accuracy (0 to 1) versus number of authors (5 to 40). Series: 9‐Feature (NN), Synonym‐Based, Writeprints Baseline (SVM), Random.]
Accuracy is less than random chance in obfuscated documents.
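To make "random chance" concrete: assuming the Random series corresponds to guessing uniformly among the candidate authors (an assumption; the slides do not define it), the expected accuracy is 1/N for N authors. The helper names below are hypothetical.

```python
def chance_accuracy(num_authors):
    """Expected accuracy of uniform random guessing among num_authors suspects."""
    if num_authors < 1:
        raise ValueError("num_authors must be at least 1")
    return 1.0 / num_authors

def is_below_chance(accuracy, num_authors):
    """True when a classifier does worse than uniform random guessing."""
    return accuracy < chance_accuracy(num_authors)

for n in (5, 10, 20, 40):
    print(n, chance_accuracy(n))
```

With 40 authors the chance level is only 0.025, so a classifier that falls below it is being actively misled and not merely uninformed.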
12. Accuracy in detecting authorship of imitated documents
[Figure: accuracy (0 to 1) versus number of authors (5 to 40). Series: 9‐Feature (NN), Synonym‐Based, Writeprints Baseline (SVM), Random.]
Accuracy is less than random chance in imitated documents.
14. Extended‐Brennan‐Greenstadt Corpus
• 56 authors
– 12 of the participants are from Drexel University
– The rest are paid workers from Amazon Mechanical Turk
• Three kinds of writing samples
– Regular writings (5000 words)
– Imitated writing
• A 500‐word article describing a day
• Imitate Cormac McCarthy from ‘The Road’
– Obfuscated writing
• A 500‐word article describing their neighborhood
• Hide own writing style
16. Feature Changes in Imitation and Obfuscation
[Figure: bar chart of the change in each feature for Imitation and Obfuscation; horizontal axis from ‐80 to 100. Features shown: Personal pronoun, Sentence count, Particle, Short words, Verb, Unique words, Adverb, Existential there, Average syllables per word, Average word length, Adjective, Cardinal number, Gunning‐Fog readability index, Average sentence length.]
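Two of the quantities on this slide are easy to sketch. The Gunning‐Fog index is 0.4 × (average sentence length + percentage of words with three or more syllables). The version below simplifies the standard definition: syllables are estimated by counting vowel groups, and proper nouns or common suffixes are not excluded. The `percent_change` helper assumes the chart's axis is a percentage change relative to regular writing, which the slide does not state explicitly.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: number of vowel groups (a heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Simplified Gunning-Fog readability index of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # "Complex" words are those with three or more estimated syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

def percent_change(regular_value, deceptive_value):
    """Change of a feature in deceptive writing, relative to regular writing."""
    return 100.0 * (deceptive_value - regular_value) / regular_value

regular = "The cat sat. The dog ran."
deceptive = "The extraordinary feline remained stationary. The dog ran."
print(percent_change(gunning_fog(regular), gunning_fog(deceptive)))
```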
17. Problem with the dataset: Topic Similarity
• All the deceptive documents were of the same topic.
• Non‐content‐specific features have the same effect as content‐specific features.
[Figure: Effect of different feature set in detecting adversarial authorship. F‐measure (0 to 1) for the Syntactic, Lexical, and Content feature sets across different writing samples: imitation, regular, obfuscation.]
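The chart on this slide reports an F‐measure per feature set. As a reminder of how that score is computed, here is a minimal sketch assuming the balanced form (F1, the harmonic mean of precision and recall); the slide does not say which weighting was used, and the function name is hypothetical.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (F1) from raw counts; returns 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 8 deceptive documents caught, 2 false alarms, 2 missed.
print(f_measure(8, 2, 2))
```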
18. Hemingway‐Faulkner Imitation Corpus
• Articles from the International Imitation Hemingway Contest (2000‐2005)
• Articles from the Faux Faulkner Contest (2001‐2005)
• Original excerpts of Ernest Hemingway and William Faulkner
20. Long‐term deception: A Gay Girl In Damascus
[Images: Thomas MacMaster; fake picture of Amina Arraf.]
– Original author was a 40‐year‐old American citizen, Thomas MacMaster.
– Pretended to be a Syrian gay woman, Amina Arraf.
– The author worked for at least 5 years to create a new style.