Natural Language Processing Tools for the Digital Humanities

Christopher Manning
Stanford University
Digital Humanities 2011
http://nlp.stanford.edu/~manning/courses/DigitalHumanities/
  	
  
Commencement 2010
  
My humanities qualifications
•  B.A. (Hons), Australian National University
•  Ph.D. Linguistics, Stanford University
•  But:
    –  I’m not sure I’ve ever taken a real humanities class (if you discount linguistics classes and high school English…)
SO, FEEL FREE TO ASK QUESTIONS!
  
Text

The promise
Phrase Net visualization of Pride & Prejudice (* (in|at) *)
http://www-958.ibm.com/software/data/cognos/manyeyes/

“How I write” [code]
•  I think you tend to get too much of people showing the glitzy output of something
•  So, for this tutorial, at least in the slides I’m trying to include the low-level hacking and plumbing
•  It’s a standard truism of data mining that more time goes into “data preparation” than anything else. Definitely goes for text processing.
  
Outline
1.  Introduction
2.  Getting some text
3.  Words
4.  Collocations, etc.
5.  NLP Frameworks and tools
6.  Part-of-speech tagging
7.  Named entity recognition
8.  Parsing
9.  Coreference resolution
10.  The rest of the languages of the world
11.  Parting words
  
2. GETTING SOME TEXT

First step: Text
•  To do anything, you need some texts!
    –  Many sites give you various sorts of search-and-display interfaces
    –  But, normally you just can’t do what you want in NLP for the Digital Humanities unless you have a copy of the texts sitting on your computer
    –  This may well change in the future: There is increasing use of cloud computing models where you might be able to upload code to run it on data on a server
         •  or, conversely, upload data to be processed by code on a server
  	
  	
  
First step: Text
•  People in the audience are probably more familiar with the state of play here than me, but my impression is:
    1.  There are increasingly good supplies of critical texts in well-marked-up XML available commercially for license to university libraries
    2.  There are various more community-based efforts to produce good digitized collections, but most of those seem to be available to “friends” rather than to anybody with a web browser
    3.  There’s Project Gutenberg
        •  Plain text, or very simple HTML, which may or may not be automatically generated
        •  Unicode UTF-8 if you’re lucky, US-ASCII if you’re not
  
1. Early English Books Online
•  TEI-compliant XML texts
•  http://eebo.chadwyck.com/

2. Old Bailey Online

3. Project Gutenberg
  
Running example: H. Rider Haggard
•  The hugely popular King Solomon's Mines (1885) by H. Rider Haggard is sometimes considered the first of the “Lost World” or “Imperialist Romance” genres
•  Allan Quatermain (1887)
•  She (1887)
•  Nada the Lily (1892)
•  Ayesha: The Return of She (1905)
•  She and Allan (1921)
•  Zip file at: http://nlp.stanford.edu/~manning/courses/DigitalHumanities/
  	
  
Interfaces to tools
•  Web applications
•  Programming APIs
•  GUI applications
•  Command-line applications
  
You’ll need to program
•  Lisa Spiro, TAMU Digital Scholarship 2009:
  
    I’m a digital humanist with only limited programming
    skills (Perl & XSLT). Enhancing my programming
    skills would allow me to:
        •  Avoid so much tedious, manual work
        •  Do citation analysis
        •  Pre-process texts (remove the junk)
        •  Automatically download web pages
        •  And much more…

You’ll need to program
•  Program in what?
    –  Perl
         •  Traditional seat-of-the-pants scripting language for text processing (it nailed flexible regex). I use it some below….
    –  Python
         •  Cleaner, more modern scripting language with a lot of energy, and the best-documented NLP framework, NLTK.
    –  Java
         •  There are more NLP tools for Java than any other language. And it’s one of the most popular languages in general. Good regular expressions, Unicode, etc.
  
You’ll need to program
•  Program with what?
    –  There are some general skills that you’ll want that cut across programming languages
         •  Regular expressions
         •  XML, especially XPath and XSLT
         •  Unicode
•  But I’m wisely not going to try to teach programming or these skills in this tutorial
  	
  
Grabbing files from websites
•  wget (Linux) or curl (Mac OS X, BSD)
     –  wget http://www.gutenberg.org/browse/authors/h
     –  curl -O http://www.gutenberg.org/browse/authors/h
•  If you really want to use your browser, there are things you can get like this Firefox plug-in
     –  DownThemAll http://www.downthemall.net/
   but then you just can’t do things as flexibly
  
Grabbing files from websites
#!/usr/bin/perl
# Read the Gutenberg author index page and print one curl command per download.
# Skip ahead to the first line that mentions Haggard.
while (<>) { last if (m/Haggard/); }
while (<>) {
    # Stop at the next author on the page.
    last if (m/Hague/);
    if (m!pgdbetext"><a href="/ebooks/(\d+)">(.*)</a> \(English\)!) {
        $title = $2;
        $num = $1;
        $title =~ s/<br>/ /g;
        $title =~ s/\r//g;
        print "curl -o \"$title $num.txt\" http://www.gutenberg.org/cache/epub/$num/pg$num.txt\n";
        # Expect only one of the html to exist
        print "curl -o \"$title $num.html\" http://www.gutenberg.org/files/$num/$num-h/$num-h.htm\n";
        print "curl -o \"$title $num-g.html\" http://www.gutenberg.org/cache/epub/$num/pg$num.html\n";
    }
}
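For anyone who would rather read Python than Perl, here is a rough sketch of the same idea. It is an illustration, not a tested scraper: the function name `download_commands` is made up, the `pgdbetext` markup is simply copied from the Perl pattern above (the page as it looked in 2011, so the live site may well differ), and only the plain-text download command is generated.

```python
import re

# Pattern copied from the Perl script: book number and title of an
# English-language entry on the author index page (2011 markup, assumed).
ENTRY = re.compile(r'pgdbetext"><a href="/ebooks/(\d+)">(.*)</a> \(English\)')


def download_commands(index_html, start="Haggard", stop="Hague"):
    """Return curl commands for entries between the start and stop markers."""
    commands = []
    started = False
    for line in index_html.split("\n"):
        if not started:
            # Skip everything up to and including the first line naming the author.
            started = start in line
            continue
        if stop in line:
            break  # reached the next author
        match = ENTRY.search(line)
        if match:
            num, title = match.group(1), match.group(2)
            title = title.replace("<br>", " ").replace("\r", "")
            commands.append(
                f'curl -o "{title} {num}.txt" '
                f"http://www.gutenberg.org/cache/epub/{num}/pg{num}.txt"
            )
    return commands
```

Writing the returned lines to a file gives the same kind of shell script that the Perl version prints.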
  
	
  
Grabbing files from websites
wget http://www.gutenberg.org/browse/authors/h
perl getHaggard.pl < h > h.sh
chmod 755 h.sh
./h.sh
# and a bit of futzing by hand that I will leave out….

•  Often you want the 90% solution: automating nothing would be slow and painful, but automating everything is more trouble than it’s worth for a one-off process
  
Typical text problems

"Devilish strange!" thought he, chuckling to himself; "queer business! Capital trick of the cull in the cloak to make another person's brat stand the brunt for his own---capital! ha! ha! Won't do, though. He must be a sly fox to get out of the Mint without my

[Page 59 ]

knowledge. I've a shrewd guess where he's taken refuge; but I'll ferret him out. These bloods will pay well for his capture; if not, he'll pay well to get out of their hands; so I'm safe either way---ha! ha! Blueskin," he added aloud, and motioning that worthy, "follow me."

Upon which, he set off in the direction of the entry. His progress, however, was checked by loud acclamations, announcing the arrival of the Master of the Mint and his train.

Baptist Kettleby (for so was the Master named) was a "goodly portly man, and a corpulent," whose fair round paunch bespoke the affection he entertained for good liquor and good living. He had a quick, shrewd, merry eye, and a look in which duplicity was agreeably veiled by good humour. It was easy to discover that he was a knave, but equally easy to perceive that he was a pleasant fellow; a combination of qualities by no means of rare occurrence. So far as regards his attire, Baptist was not seen to advantage. No great lover of state or state costume at any time, he was

[Page 60 ]

generally, towards the close of an evening, completely in dishabille, and in this condition he now presented himself to his subjects. His shirt was unfastened, his vest unbuttoned, his hose ungartered; his feet were stuck into a pair of pantoufles, his arms into a greasy flannel dressing-gown, his head into a thrum-cap, the cap into a tie-periwig, and the wig into a gold-edged hat. A white apron was tied round his waist, and into the apron was thrust a short thick truncheon, which looked very much like a rolling-pin.

The Master of the Mint was accompanied by another gentleman almost as portly as himself, and quite as deliberate in his movements. The costume of this personage was somewhat singular, and might have passed for a masquerading habit, had not the imperturbable gravity of his demeanour forbidden any such supposition. It consisted of a close jerkin of brown frieze, ornamented with a triple row of brass buttons; loose Dutch slops, made very wide in the seat and very tight at the knees; red stockings with black clocks, and

[Page 61 ]

a fur cap. The owner of this dress had a broad weather-beaten face, small twinkling eyes, and a bushy, grizzled beard. Though he walked by the side of the governor, he seldom exchanged a word with him, but appeared wholly absorbed in the contemplations inspired by a broad-bowled Dutch pipe.
  
There are always text-processing gotchas …
•  … and not dealing with them can badly degrade the quality of subsequent NLP processing.
1.  The Gutenberg *.txt files frequently represent italics with _underscores_.
2.  There may be file headers and footers
3.  Elements like headings may be run together with following sentences if not demarcated or eliminated (example later).
  
There are always text-processing gotchas …
#!/usr/bin/perl
# Print only the body of a Gutenberg text: the lines between the
# "*** START" and "*** END" marker lines.
$finishedHeader = 0;
$startedFooter = 0;
while ($line = <>) {
  if ($line =~ /^\*\*\*\s*END/ && $finishedHeader) {
    $startedFooter = 1;
  }
  if ($finishedHeader && ! $startedFooter) {
    $line =~ s/_//g;  # minor cleanup of italics
    print $line;
  }
  if ($line =~ /^\*\*\*\s*START/ && ! $finishedHeader) {
    $finishedHeader = 1;
  }
}
if ( ! ($finishedHeader && $startedFooter)) {
  print STDERR "**** Probable book format problem!\n";
}
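The same header and footer stripping can be sketched in Python. This is a loose translation under my own assumptions, not a drop-in replacement: the name `strip_gutenberg` is invented, and where the Perl version prints a warning to STDERR this sketch raises an exception instead.

```python
import re

# Marker lines, as in the Perl script: "***" followed by START or END.
START = re.compile(r"^\*\*\*\s*START")
END = re.compile(r"^\*\*\*\s*END")


def strip_gutenberg(lines):
    """Return the body lines between the START and END marker lines.

    Raises ValueError when either marker is missing, which plays the role
    of the "Probable book format problem" warning.
    """
    body = []
    finished_header = False
    started_footer = False
    for line in lines:
        if finished_header and END.match(line):
            started_footer = True
            break
        if finished_header:
            body.append(line.replace("_", ""))  # drop _italics_ markers
        elif START.match(line):
            finished_header = True
    if not (finished_header and started_footer):
        raise ValueError("probable book format problem")
    return body
```

Failing loudly on a missing marker is a deliberate choice here: silently returning an empty or untrimmed text is exactly the kind of gotcha that degrades later processing.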
  
3. WORDS

In the beginning was the word
•  Word counts
•  Word counts are the basis of all the simple, first-order methods of text analysis
    –  tag clouds, collocations, topic models
•  Sometimes you can get a fair distance with word counts
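A minimal sketch of word counting in Python, to make "word counts" concrete. The tokenization rule is my own simplification (lower-cased runs of ASCII letters, with internal straight apostrophes allowed); real texts with curly quotes, hyphens, or accented letters need more care.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased word tokens in a string."""
    # A token is a run of letters, optionally with internal apostrophes (she's).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)
```

For example, `word_counts(open("She 3155.txt", encoding="utf-8").read()).most_common(50)` would list the fifty most frequent word forms in a cleaned copy of She, assuming a file of that name in the current directory.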
  
Word clouds of the She novels (Wordle, http://wordle.net/, by Jonathan Feinberg)
•  She (1887)
•  Ayesha: The Return of She (1905)
•  She and Allan (1921)
•  Wisdom's Daughter: The Life and Love Story of She-Who-Must-Be-Obeyed (1923)
  
  
Google Books Ngram Viewer
http://ngrams.googlelabs.com/

Google Books Ngram Viewer
•  … you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer … Digital humanities needs gateway drugs. … “Culturomics” sounds like an 80s new wave band. If we’re going to coin neologisms, let’s at least go with Sean Gillies’ satirical alternative: Freakumanities.… For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading
  
Language change: as least as
C. D. Manning. 2003. Probabilistic Syntax
•  I found this example in Russo R., 2001, Empire Falls (on p.3!):
    –  By the time their son was born, though, Honus Whiting was beginning to understand and privately share his wife’s opinion, as least as it pertained to Empire Falls.
•  What’s interesting about it?

Language change: as least as
•  A language change in progress? I found a bunch of other examples:
     –  Indeed, the will and the means to follow through are as least as important as the initial commitment to deficit reduction.
     –  As many of you know he had his boat built at the same time as mine and it’s as least as well maintained and equipped.
•  Apparently not a “dialect”
     –  Second, if the required disclosures are made by on-screen notice, the disclosure of the vendor’s legal name and address must appear on one of several specified screens on the vendor’s electronic site and must be at least as legible and set in a font as least as large as the text of the offer itself.
  
Language change: as least as
  
  
4. COLLOCATIONS, ETC.

Using a text editor
•  You can get a fair distance with a text editor that allows multi-file searches, regular expressions, etc.
     –  It’s like a little concordancer that’s good for close reading
          •  jEdit http://www.jedit.org/

          •  BBEdit on Mac OS X
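What a regex-capable editor gives you is essentially a keyword-in-context (KWIC) view. A small sketch of that in Python follows; the function name `kwic` and the default context width are my own choices, and this is nowhere near a full concordancer.

```python
import re


def kwic(text, pattern, width=30):
    """Return keyword-in-context lines for every regex match in text."""
    flat = " ".join(text.split())  # collapse line breaks so context reads cleanly
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the matches line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines
```

Calling `kwic(novel_text, r"\bas least as\b")` on a text held in a hypothetical `novel_text` string would pull out every instance of the construction discussed earlier, each with a little context on both sides.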
  
Traditional Concordancers
•  WordSmith Tools   Commercial; Windows
     –  http://www.lexically.net/wordsmith/
•  Concordance   Commercial; Windows
     –  http://www.concordancesoftware.co.uk/
•  AntConc   Free; Windows, Mac OS X (only under X11); Linux
     –  http://www.antlab.sci.waseda.ac.jp/antconc_index.html
•  CasualConc   Free; Mac OS X
     –  http://sites.google.com/site/casualconc/
          •  by Yasu Imao
  
The decline of honour
  
5. NLP FRAMEWORKS AND TOOLS

The Big 3 NLP Frameworks
•  GATE – General Architecture for Text Engineering (U. Sheffield)
          •  http://gate.ac.uk/
          •  Java, quite well maintained (now)
          •  Includes tons of components
•  UIMA – Unstructured Information Management Architecture. Originally IBM; now an Apache project
          •  http://uima.apache.org/
          •  Professional, scalable, etc.
          •  But, unless you’re comfortable with XML, Eclipse, Java or C++, etc., I think it’s a non-starter
•  NLTK – Natural Language Toolkit (started by Steven Bird)
          •  http://www.nltk.org/
          •  Big community; large Python package; corpora and books about it
          •  But it’s code modules and API, no GUI or command-line tools
          •  Like R for NLP. But, hey, R’s becoming very successful….
  
The main NLP Packages
•  NLTK   Python
     –  http://www.nltk.org/
•  OpenNLP
     –  http://incubator.apache.org/opennlp/
•  Stanford NLP
     –  http://nlp.stanford.edu/software/
•  LingPipe
     –  http://alias-i.com/lingpipe/
•  More one-off packages than I can fit on this slide
     –  http://nlp.stanford.edu/links/statnlp.html
  
NLP tools: Rules of thumb for 2011
1.  Unless you’re unlucky, the tool you want to use will work with Unicode (at least BMP), so most any characters are okay
2.  Unless you’re lucky, the tool you want to use will work only on completely plain text, or extremely simple XML-style mark-up (e.g., <s> … </s> around sentences, recognized by regexp)
3.  By default, you should assume that any tool for English was trained on American newswire
  
GATE	
  
Rule-based NLP and Statistical/Machine Learning NLP
•  Most work on NLP in the 1960s, 70s and 80s was with hand-built grammars and morphological analyzers (finite state transducers), etc.
    –  ANNIE in GATE is still in this space
•  Most academic research work in NLP in the 1990s and 2000s uses probabilistic or, more generally, machine learning methods (“Statistical NLP”)
    –  The Stanford NLP tools and MorphAdorner, which we will come to soon, are in this space
  
Rule-based NLP and Statistical/Machine Learning NLP
•  Hand-built grammars are fine for tasks in a closed space which do not involve reasoning about contexts
     –  E.g., finding the possible morphological parses of a word
•  In the old days they worked really badly on “real text”
     –  They were always insufficiently tolerant of the variability of real language
     –  But, built with modern, empirical approaches, they can do reasonably well
          •  ANNIE is an example of this
  
Rule-based NLP and Statistical/Machine Learning NLP
•  In Statistical NLP:
    –  You gather corpus data, and usually hand-annotate it with the kind of information you want to provide, such as part-of-speech
    –  You then train (or “learn”) a model that tries to predict annotations based on features of words and their contexts via numeric feature weights
    –  You then apply the trained model to new text
•  This tends to work much better on real text
    –  It more flexibly handles contextual and other evidence
•  But the technology is still far from perfect: it requires annotated data, and degrades (sometimes very badly) when there are mismatches between the training data and the runtime data
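The gather, train, apply cycle above can be shown with about the simplest learner there is: a most-frequent-tag baseline. To be clear, this is a toy of my own and not how the Stanford tools or MorphAdorner work; it uses raw counts rather than learned feature weights and ignores context entirely. The names `train` and `apply_model` are hypothetical.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn each word's most frequent tag from hand-annotated (word, tag) pairs."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    model = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]  # fallback for unseen words
    return model, default_tag


def apply_model(model, default_tag, words):
    """Tag new text with the trained model."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]
```

Even this toy shows the mismatch problem: any word that never appeared in the annotated data just gets the fallback tag, so text unlike the training corpus is tagged badly.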
  
How much hardware do you need?
•  NLP software often needs plenty of RAM (especially) and processing power
•  But these days we have really powerful laptops!
•  Some of the software I show you could run on a machine with 256 MB of RAM (e.g., Stanford Parser), but much of it requires more
•  Stanford CoreNLP requires a machine with 4GB of RAM
•  I ran everything in this tutorial on the laptop I’m presenting on … 4GB RAM, 2.8 GHz Core 2 Duo
•  But it wasn’t always pleasant writing the slides while software was running….
  
How much hardware do you need?
•  Why do you need more hardware?
    –  More speed
        •  It took me 95 minutes to run Ayesha, the Return of She through Stanford CoreNLP on my laptop….
    –  More scale
        •  You’d like to be able to analyze 1 million books
•  Order of magnitude rules of thumb:
    –  POS tagging, NER, etc: 5,000–10,000 words/second
    –  Parsing: 1–10 sentences per second
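To see what those rules of thumb mean at scale, here is a back-of-the-envelope calculation. The figure of 100,000 words per book is my own round-number assumption, and the helper name `hours_to_process` is made up.

```python
def hours_to_process(n_books, words_per_book, words_per_second):
    """Estimated single-machine wall-clock hours at a given throughput."""
    return n_books * words_per_book / words_per_second / 3600
```

Under that assumption, POS tagging 1 million books at 5,000 words/second comes to about 5,556 hours on one machine (roughly 231 days), and at 10,000 words/second about 2,778 hours (roughly 116 days). Parsing, at a handful of sentences per second, would take far longer still.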
  
How much hardware do you need?
•  Luckily, most of our problems are trivially parallelizable
    –  Each book/chapter can be run separately, perhaps on a separate machine
•  What do we actually use?
    –  We do most of our computing on rack-mounted Linux servers
         •  Currently 4 x quad core Xeon processors with 24 GB of RAM seem about the sweet spot
         •  About $3500 per machine … not like the old days
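"Trivially parallelizable" means each book can be handed to a separate worker with no coordination between them. A minimal Python sketch of that pattern follows; `count_words` is a stand-in for whatever real analysis you run, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """Stand-in for a real per-book analysis step (hypothetical)."""
    return len(text.split())


def process_books(books, worker=count_words, max_workers=4):
    """Run worker on each book independently; books maps title -> text."""
    titles = list(books)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps results in the same order as the inputs.
        results = pool.map(worker, (books[t] for t in titles))
        return dict(zip(titles, results))
```

Threads keep the sketch short, but for CPU-heavy work written in pure Python the interpreter lock means they will not give a real speed-up; `ProcessPoolExecutor`, or simply launching one tagger process per book across several machines, is the more realistic route.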
  
6. PART-OF-SPEECH TAGGING

Part-of-Speech Tagging
•  Part-of-speech tagging is normally done by a sequence model (acronyms: HMM, CRF, MEMM/CMM)
     –  A POS tag is to be placed above each word
     –  The model considers a local context of possible previous and following POS tags, the current word, neighboring words, and features of them (capitalized?, ends in -ing?)
     –  Each such feature has a weight, and the evidence is combined, and the most likely sequence of tags (according to the model) is chosen

     RB     NNP    NNP     RB     VBD     ,    JJ      NNS
     When   Mr.    Holly   last   wrote   ,    many    years
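Choosing "the most likely sequence of tags" is usually done with Viterbi-style dynamic programming. The sketch below is a toy under my own assumptions, not the Stanford tagger's implementation: scores are simple additive numbers (think log-probabilities or summed feature weights) looked up in two hypothetical dictionaries, `emit` for word/tag evidence and `trans` for tag-to-tag evidence, and the input is assumed to be non-empty.

```python
def viterbi(words, tags, emit, trans, unknown=-10.0):
    """Return the highest-scoring tag sequence for a non-empty word list.

    emit[(word, tag)] and trans[(prev_tag, tag)] are additive scores;
    missing entries get the `unknown` penalty.
    """
    # best[i][t] = (score of the best path ending in tag t at position i, previous tag)
    best = [{t: (emit.get((words[0], t), unknown), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + trans.get((p, t), unknown), p) for p in tags
            )
            column[t] = (score + emit.get((word, t), unknown), prev)
        best.append(column)
    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))
```

The point of decoding whole sequences is that context can overrule the locally best guess: if "last" on its own scores slightly better as an adjective, a strong score for an adverb followed by a past-tense verb can still make RB win before "wrote".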
  
Stanford POS tagger
http://nlp.stanford.edu/software/tagger.shtml
$ java -mx1g -cp ../Software/stanford-postagger-full-2011-06-19/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ../Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger -outputFormat tsv -tokenizerOptions untokenizable=allKeep -textFile "She 3155.txt" > "She 3155.tsv"
Loading default properties from trained tagger ../Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger
Reading POS tagger model from ../Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger ... done [2.2 sec].
Jun 15, 2011 8:17:15 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+1FBD, decimal: 8125)
Tagged 132377 words at 5559.72 words per second.

(Slide callout on the warning: Greek stand-alone Koronis character (a little obscure?))
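Once you have the tagger output, a few lines of Python will read it back in. This sketch assumes, and it is only an assumption about the `tsv` output format, that each line holds a word, a tab, and its tag, with a blank line between sentences. Both function names are made up.

```python
from collections import Counter


def read_tagged_tsv(lines):
    """Yield sentences as lists of (word, tag) pairs from word<TAB>tag lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        word, tag = line.split("\t")[:2]
        sentence.append((word, tag))
    if sentence:
        yield sentence


def tag_frequencies(sentences):
    """Count how often each POS tag occurs."""
    return Counter(tag for sentence in sentences for _, tag in sentence)
```

A line without a tab raises an error rather than being skipped, so a wrong guess about the file format shows up immediately. To use it on the file produced above, open `She 3155.tsv` with `encoding="utf-8"` and pass the file object to `read_tagged_tsv`.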
  
Stanford POS tagger
•  For the second time you do it…
$ alias stanfordtag "java -mx1g -cp /Users/manning/Software/stanford-postagger-full-2011-06-19/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model /Users/manning/Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger -outputFormat tsv -tokenizerOptions untokenizable=allKeep -textFile"
$ stanfordtag "RiderHaggard/King Solomon's Mines 2166.txt" > "tagged/King Solomon's Mines 2166.tsv"
Reading POS tagger model from /Users/manning/Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger ... done [2.1 sec].
Tagged 98178 words at 9807.99 words per second.
  
MorphAdorner
http://morphadorner.northwestern.edu/
•  MorphAdorner is a set of NLP tools developed at Northwestern by Martin Mueller and colleagues specifically for English language fiction, over a long historical period from Early Modern English (EME) onwards
    –  lemmatizer, named entity recognizer, POS tagger, spelling standardizer, etc.
•  Aims to deal with variation in word breaking and spelling over this period
•  Includes its own POS tag set: NUPOS
  
MorphAdorner
$ ./adornplaintext temp temp/3155.txt
2011-06-15 20:30:52,111 INFO  - MorphAdorner version 1.0
2011-06-15 20:30:52,111 INFO  - Initializing, please wait...
2011-06-15 20:30:52,318 INFO  - Using Trigram tagger.
2011-06-15 20:30:52,319 INFO  - Using I retagger.
2011-06-15 20:30:53,578 INFO  - Loaded word lexicon with 151,922 entries in 2 seconds.
2011-06-15 20:30:55,920 INFO  - Loaded suffix lexicon with 214,503 entries in 3 seconds.
2011-06-15 20:30:57,927 INFO  - Loaded transition matrix in 3 seconds.
2011-06-15 20:30:58,137 INFO  - Loaded 162,248 standard spellings in 1 second.
2011-06-15 20:30:58,697 INFO  - Loaded 5,434 alternative spellings in 1 second.
2011-06-15 20:30:58,703 INFO  - Loaded 349 more alternative spellings in 14 word classes in 1 second.
2011-06-15 20:30:58,713 INFO  - Loaded 0 names into name standardizer in < 1 second.
2011-06-15 20:30:58,779 INFO  - 1 file to process.
2011-06-15 20:30:58,789 INFO  - Before processing input texts: Free memory: 105,741,696, total memory: 480,694,272
2011-06-15 20:30:58,789 INFO  - Processing file 'temp/3155.txt' .
2011-06-15 20:30:58,789 INFO  - Adorning temp/3155.txt with parts of speech.
2011-06-15 20:30:58,832 INFO  - Loaded text from temp/3155.txt in 1 second.
2011-06-15 20:31:01,498 INFO  -    Extracted 131,875 words in 4,556 sentences in 3 seconds.
2011-06-15 20:31:03,860 INFO  -       lines: 1,000; words: 27,756
2011-06-15 20:31:04,364 INFO  -       lines: 2,000; words: 58,728
2011-06-15 20:31:04,676 INFO  -       lines: 3,000; words: 84,735
2011-06-15 20:31:04,990 INFO  -       lines: 4,000; words: 115,396
2011-06-15 20:31:05,152 INFO  -       lines: 4,556; words: 131,875
2011-06-15 20:31:05,152 INFO  -    Part of speech adornment completed in 4 seconds. 36,100 words adorned per second.
2011-06-15 20:31:05,152 INFO  -    Generating other adornments.
2011-06-15 20:31:13,840 INFO  -    Adornments written to temp/3155-005.txt in 9 seconds.
2011-06-15 20:31:13,840 INFO
  	
  -­‐	
  All	
  files	
  adorned	
  in	
  16	
  seconds.	
  
	
  
Ah, the old days!
$ ./adornplaintext temp temp/Hunter Quartermain.txt
2011-06-15 17:18:15,551 INFO  - MorphAdorner version 1.0
2011-06-15 17:18:15,552 INFO  - Initializing, please wait...
2011-06-15 17:18:15,730 INFO  - Using Trigram tagger.
2011-06-15 17:18:15,731 INFO  - Using I retagger.
2011-06-15 17:18:16,972 INFO  - Loaded word lexicon with 151,922 entries in 2 seconds.
2011-06-15 17:18:18,684 INFO  - Loaded suffix lexicon with 214,503 entries in 2 seconds.
2011-06-15 17:18:20,662 INFO  - Loaded transition matrix in 2 seconds.
2011-06-15 17:18:20,887 INFO  - Loaded 162,248 standard spellings in 1 second.
2011-06-15 17:18:21,300 INFO  - Loaded 5,434 alternative spellings in 1 second.
2011-06-15 17:18:21,303 INFO  - Loaded 349 more alternative spellings in 14 word classes in 1 second.
2011-06-15 17:18:21,312 INFO  - Loaded 0 names into name standardizer in 1 second.
2011-06-15 17:18:21,381 INFO  - No files found to process.

•  But it works better if you make sure the filename has no spaces in it
  	
  
Comparing taggers: Penn Treebank vs. NUPOS

Penn Treebank           NUPOS
Holly      NNP          Holly      n1
,          ,            ,          ,
if         IN           if         cs
you        PRP          you        pn22
will       MD           will       vmb
accept     VB           accept     vvi
the        DT           the        dt
trust      NN           trust      n1
,          ,            ,          ,
I          PRP          I          pns11
am         VBP          am         vbm
going      VBG          going      vvg
to         TO           to         pc-acp
leave      VB           leave      vvi
you        PRP          you        pn22
that       IN           that       d
boy        NN           boy's      ng1
's         POS
sole       JJ           sole       j
guardian   NN           guardian   n1
.          .            .          .
  
	
                                                      	
  
Stylistic factors from POS
[Figure: bar chart of JJ, MD, and DT tag counts (y-axis 0 to 14000) for four novels: She, Ayesha, She and Allan, Wisdom's Daughter]
  
7. NAMED ENTITY RECOGNITION (NER)
  
Named Entity Recognition
     –  “the Chad problem”
Germany’s representative to the European Union’s veterinary committee Werner Zwingman said on Wednesday consumers should …

IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.
Conditional Random Fields (CRFs)

   O       PER     PER     O      O       O     O      O
 When      Mr.     Holly   last   wrote   ,     many   years

•  We again use a sequence model – different problem, but same technology
      –  Indeed, sequence models are used for lots of tasks that can be construed as labeling tasks that require only local context (to do quite well)
•  There is a background label – O – and labels for each class
•  Entities are both segmented and categorized
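
To make "segmented and categorized" concrete, here is a minimal sketch of turning a per-token label sequence like the one above into entity spans. The function name `extract_entities` and the rule that adjacent tokens with the same label form one entity are illustrative assumptions, not how the Stanford tools do it internally.

```python
from itertools import groupby

def extract_entities(tokens, labels, background="O"):
    """Group runs of identical non-background labels into (text, label) spans.

    Assumption for this sketch: adjacent tokens carrying the same label
    belong to the same entity.
    """
    entities = []
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        words = [token for token, _ in run]
        if label != background:
            entities.append((" ".join(words), label))
    return entities

tokens = ["When", "Mr.", "Holly", "last", "wrote", ",", "many", "years"]
labels = ["O", "PER", "PER", "O", "O", "O", "O", "O"]
print(extract_entities(tokens, labels))
```

The two PER tokens come back as one segment ("Mr. Holly") with one category (PER); everything labeled O is dropped.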
  
Stanford NER Features
•  Word features: current word, previous word, next word, a word is anywhere in a +/– 4 word window
•  Orthographic features:
     –  Jenny  →  Xxxx
     –  IL-2  →  XX-#
•  Prefixes and Suffixes:
     –  Jenny  →  <J, <Je, <Jen, …, nny>, ny>, y>
•  Label sequences
•  Lots of feature conjunctions
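
The orthographic and affix features can be sketched in a few lines. This is an illustration only: the run-length cap of 3 in `word_shape` and the affix length limit in `affixes` are assumptions chosen to reproduce the slide's examples, and Stanford NER's real word-shape functions differ in detail.

```python
import re

def word_shape(word, max_run=3):
    """Map characters to classes (X upper, x lower, # digit) and cap long runs."""
    classes = []
    for ch in word:
        if ch.isupper():
            classes.append("X")
        elif ch.islower():
            classes.append("x")
        elif ch.isdigit():
            classes.append("#")
        else:
            classes.append(ch)  # keep punctuation such as "-" as is
    shape = "".join(classes)
    # Shorten any run of more than max_run identical symbols to max_run.
    return re.sub(r"(.)\1{%d,}" % max_run, lambda m: m.group(1) * max_run, shape)

def affixes(word, max_len=3):
    """Prefix features marked with '<' and suffix features marked with '>'."""
    n = min(max_len, len(word))
    prefixes = ["<" + word[:i] for i in range(1, n + 1)]
    suffixes = [word[-i:] + ">" for i in range(n, 0, -1)]
    return prefixes + suffixes

print(word_shape("Jenny"), word_shape("IL-2"))
print(affixes("Jenny"))
```

Features like these let the model generalize to names it never saw in training: an unseen capitalized word still has the shape `Xxxx`.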
  
Stanford NER
             http://nlp.stanford.edu/software/CRF-NER.shtml
$ java -mx500m -Dfile.encoding=utf-8 -cp Software/stanford-ner-2011-06-19/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier Software/stanford-ner-2011-06-19/classifiers/all.3class.distsim.crf.ser.gz -textFile "RiderHaggard/She 3155.txt" > "ner/She 3155.ner"
  
	
  
For thou shalt rule this <LOCATION>England</LOCATION>--"
"But we have a queen already," broke in <LOCATION>Leo</LOCATION>, hastily.
"It is naught, it is naught," said <PERSON>Ayesha</PERSON>; "she can be overthrown."
At this we both broke out into an exclamation of dismay, and explained that we should as soon think of overthrowing ourselves.
"But here is a strange thing," said <PERSON>Ayesha</PERSON>, in astonishment; "a queen whom her people love! Surely the world must have changed since I dwelt in <LOCATION>Kôr</LOCATION>."
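
Once you have inline-tagged output like this, a short script can tally the entities. The sketch below assumes the three tag names PERSON, LOCATION, and ORGANIZATION (only the first two appear in the excerpt above); the function name `count_entities` is illustrative.

```python
import re
from collections import Counter

# Matches <LABEL>text</LABEL> for the assumed three entity classes.
TAG = re.compile(r"<(PERSON|LOCATION|ORGANIZATION)>(.*?)</\1>")

def count_entities(tagged_text):
    """Count (label, entity text) pairs in inline-XML NER output."""
    return Counter(TAG.findall(tagged_text))

sample = (
    '"But we have a queen already," broke in <LOCATION>Leo</LOCATION>, hastily. '
    '"It is naught, it is naught," said <PERSON>Ayesha</PERSON>; '
    '"But here is a strange thing," said <PERSON>Ayesha</PERSON>'
)
for (label, name), n in count_entities(sample).most_common():
    print(label, name, n)
```

A tally like this also makes tagger errors easy to spot: in the excerpt, the character Leo is labeled as a LOCATION.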
  
8. PARSING
  
Statistical parsing
•  One of the big successes of 1990s statistical NLP was the development of statistical parsers
•  These are trained from hand-parsed sentences (“treebanks”), and know statistics about phrase structure and word relationships, and use them to assign the most likely structure to a new sentence
•  They will return a sentence parse for any sequence of words. And it will usually be mostly right
•  There are many opportunities for exploiting this richer level of analysis, which have only been partly realized.
  
Phrase structure Parsing
•  Phrase structure representations have dominated American linguistics since the 1930s
•  They focus on showing words that go together to form natural groups (constituents) that behave alike
•  They are good for showing and querying details of sentence structure and embedding
  
(S
  (NP
    (NP (NNS Bills))
    (PP (IN on)
      (NP (NNS ports) (CC and) (NN immigration))))
  (VP (VBD were)
    (VP (VBN submitted)
      (PP (IN by)
        (NP (NNP Senator) (NNP Brownback))))))
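
As a small illustration of "querying details of sentence structure", the sketch below reads a Penn Treebank style bracketed parse of this sentence and lists every constituent with a given label. The bracketing is written out by hand from the tree on this slide, and `parse_tree`, `leaves`, and `constituents` are illustrative helpers, not part of any parser's API.

```python
def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return read()

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]

def constituents(node, label):
    """Yield the text of every constituent with the given label (pre-order)."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield " ".join(leaves(node))
    for child in node[1]:
        yield from constituents(child, label)

tree = parse_tree(
    "(S (NP (NP (NNS Bills)) (PP (IN on) (NP (NNS ports) (CC and) (NN immigration))))"
    " (VP (VBD were) (VP (VBN submitted) (PP (IN by) (NP (NNP Senator) (NNP Brownback))))))"
)
print(list(constituents(tree, "NP")))
```

Tools such as Tregex offer a much richer pattern language for the same kind of query; this only shows the underlying idea.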
Dependency parsing
•    A dependency parse shows which words in a sentence modify other words
•    The key notions are governors and their dependents
•    Widespread use: Pāṇini, early Arabic grammarians, diagramming sentences, …
  

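nsubjpass(submitted, Bills)
auxpass(submitted, were)
prep(submitted, by)
prep(Bills, on)
pobj(on, ports)
cc(ports, and)
conj(ports, immigration)
pobj(by, Brownback)
nn(Brownback, Senator)
appos(Brownback, Republican)
prep(Republican, of)
pobj(of, Kansas)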
Stanford Dependencies
•  SD is a particular dependency representation designed for easy extraction of meaning relationships [de Marneffe & Manning, 2008]
    –  Its basic form, shown on the last slide, keeps each word as is
    –  A “collapsed” form focuses on relations between main words
  

nsubjpass(submitted, Bills)
auxpass(submitted, were)
agent(submitted, Brownback)
prep_on(Bills, ports)
prep_on(Bills, immigration)
conj_and(ports, immigration)
nn(Brownback, Senator)
appos(Brownback, Republican)
prep_of(Republican, Kansas)
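
The step from the basic form to the collapsed form can be sketched roughly as follows. This is a simplified illustration, not the Stanford Parser's algorithm: it folds prep + pobj into prep_X, turns "by" after a passive verb into agent, and folds cc + conj into conj_X, but it does not propagate relations across conjuncts (so it will not produce prep_on(Bills, immigration)). Words stand in for tokens, which only works when no word repeats in the sentence.

```python
def collapse(deps):
    """Collapse basic (relation, governor, dependent) triples, in simplified form."""
    pobj = {gov: dep for rel, gov, dep in deps if rel == "pobj"}
    cc = {gov: dep for rel, gov, dep in deps if rel == "cc"}
    passive_verbs = {gov for rel, gov, dep in deps if rel == "auxpass"}
    collapsed = []
    for rel, gov, dep in deps:
        if rel == "prep" and dep in pobj:
            if dep == "by" and gov in passive_verbs:
                collapsed.append(("agent", gov, pobj[dep]))
            else:
                collapsed.append(("prep_" + dep, gov, pobj[dep]))
        elif rel == "conj" and gov in cc:
            collapsed.append(("conj_" + cc[gov], gov, dep))
        elif rel in ("pobj", "cc"):
            continue  # function words are absorbed into the relation names
        else:
            collapsed.append((rel, gov, dep))
    return collapsed

basic = [
    ("nsubjpass", "submitted", "Bills"), ("auxpass", "submitted", "were"),
    ("prep", "submitted", "by"), ("pobj", "by", "Brownback"),
    ("prep", "Bills", "on"), ("pobj", "on", "ports"),
    ("cc", "ports", "and"), ("conj", "ports", "immigration"),
    ("nn", "Brownback", "Senator"), ("appos", "Brownback", "Republican"),
    ("prep", "Republican", "of"), ("pobj", "of", "Kansas"),
]
for rel, gov, dep in collapse(basic):
    print(f"{rel}({gov}, {dep})")
```

The collapsed relations link content words directly, which is what makes the representation convenient for extracting "who did what to whom".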
Statistical Parsers
•  There are now many good statistical parsers that are freely downloadable
    –  Constituency parsers
         •  Collins/Bikel Parser
         •  Berkeley Parser
         •  BLLIP Parser = Charniak/Johnson Parser
    –  Dependency parsers
         •  MaltParser
         •  MST Parser
•  But I’ll show the Stanford Parser
  	
  
Tregex/Tgrep2 – Tools for searching over syntax
  	
  
dreadful things

She                                     Ayesha
amod(day-18, dreadful-17)               amod(clouds-5, dreadful-2)
amod(day-45, dreadful-44)               amod(debt-26, dreadful-25)
amod(feast-33, dreadful-32)             amod(doom-21, dreadful-20)
amod(fits-51, dreadful-50)              amod(fashion-50, dreadful-47)
amod(form-59, dreadful-58)              amod(form-10, dreadful-7)
amod(laugh-9, dreadful-8)               amod(oath-42, dreadful-41)
amod(manifestation-9, dreadful-8)       amod(road-23, dreadful-22)
amod(manner-29, dreadful-28)            amod(silence-5, dreadful-4)
amod(marshes-17, dreadful-16)           amod(threat-19, dreadful-18)
amod(people-12, dreadful-11)
amod(people-46, dreadful-45)
amod(place-16, dreadful-15)
amod(place-6, dreadful-5)
amod(sight-5, dreadful-4)
amod(spot-13, dreadful-12)
amod(thing-41, dreadful-40)
amod(thing-5, dreadful-4)
amod(tragedy-22, dreadful-21)
amod(wilderness-43, dreadful-42)
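
A list like this can be pulled out of typed-dependency output with a few lines of scripting. The sketch below assumes one relation per line in the `rel(governor-index, dependent-index)` format shown above; the function name `modified_by` is illustrative.

```python
import re
from collections import Counter

# rel(governor-index, dependent-index), e.g. amod(day-18, dreadful-17)
DEP = re.compile(r"(\w+)\((.+?)-(\d+), (.+?)-(\d+)\)")

def modified_by(lines, adjective, relation="amod"):
    """Count the governors that a given adjective modifies."""
    counts = Counter()
    for line in lines:
        for rel, gov, _gov_idx, dep, _dep_idx in DEP.findall(line):
            if rel == relation and dep == adjective:
                counts[gov] += 1
    return counts

lines = [
    "amod(day-18, dreadful-17)",
    "amod(day-45, dreadful-44)",
    "amod(feast-33, dreadful-32)",
    "amod(place-16, dreadful-15)",
    "amod(place-6, dreadful-5)",
    "nsubj(said-3, Ayesha-4)",
]
print(modified_by(lines, "dreadful").most_common())
```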
  
Making use of dependency structure
J. Engelberg, Costly Information Processing (AFA, 2009):
•  An efficient market should immediately incorporate all publicly available information.
•  But many studies have shown there is a lag
     –  And the lag is greater on Fridays (!)
•  An explanation for this is that there is a cost to information processing
•  Engelberg tests and shows that “soft” (textual) information takes longer to be absorbed than “hard” (numeric) information … it’s higher cost information processing
•  But “soft” information has value beyond “hard” information
     –  It’s especially valuable for predicting further out in time
  
         	
  	
  
Evidence from earnings announcements
                              [Engelberg AFA 2009]

•  But how do you use the “soft” information?
•  Simply using the proportion of “negative” words (from the Harvard General Inquirer lexicon) is a useful predictive feature of future stock behavior
       Although sales remained steady, the firm continues to suffer from rising oil prices.
•  But this [or text categorization] is not enough. In order to refine my analysis, I need to know that the negative sentiment is about oil prices.
•  He thus turns to the typed dependencies representation of the Stanford Parser.
    –  Words that negative words relate to are grouped into 1 of 6 categories [5 word lists or “other”]
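
Both steps can be sketched in miniature. Everything specific here is a stand-in: the one-word `NEGATIVE` set is a toy substitute for the Harvard General Inquirer lexicon, the dependency triples are hand-written in the style of collapsed Stanford Dependencies rather than real parser output, and the function names are illustrative. It is not Engelberg's actual procedure.

```python
NEGATIVE = {"suffer"}  # toy stand-in for a real negative-word lexicon

def negative_fraction(tokens, lexicon):
    """Proportion of word tokens that appear in the negative lexicon."""
    words = [t.lower() for t in tokens if t.isalpha()]
    if not words:
        return 0.0
    return sum(w in lexicon for w in words) / len(words)

def negative_targets(deps, lexicon):
    """Return (relation, dependent) pairs governed by a negative word."""
    return [(rel, dep) for rel, gov, dep in deps if gov.lower() in lexicon]

tokens = ("Although sales remained steady , the firm continues "
          "to suffer from rising oil prices .").split()

# Hand-written (relation, governor, dependent) triples for illustration.
deps = [
    ("nsubj", "continues", "firm"),
    ("xcomp", "continues", "suffer"),
    ("prep_from", "suffer", "prices"),
    ("nn", "prices", "oil"),
    ("amod", "prices", "rising"),
]

print(negative_fraction(tokens, NEGATIVE))
print(negative_targets(deps, NEGATIVE))
```

The word count alone says the sentence is somewhat negative; the dependency lookup says the negativity attaches to "prices", which could then be mapped to one of the topic categories.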
  
Evidence from earnings announcements
                                       [Engelberg 2009]


•  In a regression model with many standard quantitative predictors…
    –  Just the negative word fraction is a significant predictor of 3 day or 80 day post earnings announcement abnormal returns (CAR)
            •  Coefficient −0.173, p < 0.05 for 80 day CAR
    –  Negative sentiment about different things has differential effects
            •  Fundamentals: −0.198, p < 0.01 for 80 day CAR
            •  Future: −0.356, p < 0.05 for 80 day CAR
            •  Other: −0.023, p < 0.01 for 80 day CAR
    –  Only some of which analysts pay attention to
            •  Analyst forecast-for-quarter-ahead earnings is predicted by negative sentiment on Environment and Other but not Fundamentals or Future!
  
Syntactic Packaging and Implicit Sentiment
                 [Greene 2007; Greene and Resnik 2009]

•  Positive or negative sentiment can be carried by words (e.g., adjectives), but often it isn’t….
    –  These sentences differ in sentiment, even though the words aren’t so different:
           •  A soldier veered his jeep into a crowded market and killed three civilians
           •  A soldier’s jeep veered into a crowded market and three civilians were killed
•  As a measurable version of such issues of linguistic perspective, they define OPUS features
     –  For domain relevant terms, OPUS features pair the word with a syntactic Stanford Dependency:
           •  killed:DOBJ     NSUBJ:soldier     killed:NSUBJ
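
A rough sketch of how features of this shape could be generated from dependency triples follows. The naming scheme (`term:REL` when the term is the governor, `REL:term` when it is the dependent) is inferred from the three examples on this slide; it is an assumption and may not match the exact definition in Greene and Resnik's work. The dependency triples are hand-written for the first example sentence.

```python
def opus_like_features(deps, domain_terms):
    """Pair each domain-relevant term with the relations it takes part in."""
    features = []
    for rel, gov, dep in deps:
        if gov in domain_terms:
            features.append(f"{gov}:{rel.upper()}")
        if dep in domain_terms:
            features.append(f"{rel.upper()}:{dep}")
    return features

# "A soldier veered his jeep into a crowded market and killed three civilians"
deps = [
    ("nsubj", "killed", "soldier"),
    ("dobj", "killed", "civilians"),
]
print(opus_like_features(deps, domain_terms={"killed", "soldier"}))
```

In the passive version of the sentence ("three civilians were killed"), "killed" would have no NSUBJ:soldier or killed:DOBJ feature, which is how the features pick up the difference in framing.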
  
Predicting Opinions of the Death Penalty
                   [Greene 2007; Greene and Resnik 2009]

•  Collected pro- and anti- death penalty texts from websites with manual checking
•  Evaluation is by cross-validation: training on some pro- and anti- sites and testing on documents from others [can’t use site-specific nuances]
•  Baseline is word and word bigram features in a support vector machine [SVM = good classifier]
  
              Condition                                     SVM accuracy
              Baseline                                      72.0%
              With OPUS features                            88.1%

•  58% error reduction!
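
The error-reduction figure follows from the two accuracies in the table: the error rate falls from 28.0% to 11.9%. A small check (the function name is illustrative):

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Percentage of the baseline's errors removed by the new system."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

# (28.0 - 11.9) / 28.0 is about 0.575, which the slide rounds to 58%.
print(round(relative_error_reduction(72.0, 88.1), 1))
```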
  
9. COREFERENCE RESOLUTION
  
Coreference resolution
•  The goal is to work out which (noun) phrases refer to the same entities in the world
    –  Sarah asked her father to look at her. He appreciated that his eldest daughter wanted to speak frankly.
•  ≈ anaphora resolution ≈ pronoun resolution ≈ entity resolution
  
Coreference resolution warnings
•  Warning: The tools we have looked at so far work one sentence at a time – or use the whole document but ignore all structure and just count – but coreference uses the whole document
•  The resources used will grow with the document size – you might want to try a chapter not a novel
•  Coreference systems normally require processing with parsers, NER, etc. first, and use of lexicons
  
Coreference resolution warnings
•  English-only for the moment….
•  While there are some papers on coreference resolution in other languages, I am aware of no downloadable coreference systems for any language other than English
•  For English, there are a good number of downloadable systems, but their performance remains modest. It’s just not like POS tagging, NER or parsing
  
Coreference resolution warnings
Nevertheless, it’s not yet known to the State of California to cause cancer, so let’s continue….
  
Stanford CoreNLP
           http://nlp.stanford.edu/software/corenlp.shtml

•  Stanford CoreNLP is our new package that ties together a bunch of NLP tools
    –  POS tagging
    –  Named Entity Recognition
    –  Parsing
    –  and Coreference Resolution
•  Output is an XML representation [only choice at present]
•  Contains a state-of-the-art coreference system!
  
Stanford CoreNLP
$ java -mx3g -Dfile.encoding=utf-8 -cp "Software/stanford-corenlp-2011-06-08/stanford-corenlp-2011-06-08.jar:Software/stanford-corenlp-2011-06-08/stanford-corenlp-models-2011-06-08.jar:Software/stanford-corenlp-2011-06-08/xom.jar:Software/stanford-corenlp-2011-06-08/jgrapht.jar" edu.stanford.nlp.pipeline.StanfordCoreNLP -file "RiderHaggard/Hunter Quatermain's Story 2728.txt" -outputDirectory corenlp
  
	
  
What Stanford CoreNLP gives
   –  Sarah asked her father to look at her .
   –  He appreciated that his eldest daughter wanted to speak frankly .
•  Coreference resolution graph
   –  sentence 1, headword 1 (gov)
   –  sentence 1, headword 3

   –  sentence 1, headword 4 (gov)
   –  sentence 2, headword 1
   –  sentence 2, headword 4
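
To read a graph like this, map each (sentence, headword) pair back to a word and group mentions under their governing mention. The sketch below hard-codes the two tokenized sentences and the links as read off this slide (indices are 1-based); the `links` dictionary and the helper names are illustrative and are not CoreNLP's output format or API.

```python
from collections import defaultdict

sentences = [
    ["Sarah", "asked", "her", "father", "to", "look", "at", "her", "."],
    ["He", "appreciated", "that", "his", "eldest", "daughter",
     "wanted", "to", "speak", "frankly", "."],
]

# mention (sentence, headword) -> governing mention (sentence, headword)
links = {(1, 3): (1, 1), (2, 1): (1, 4), (2, 4): (1, 4)}

def word(mention):
    """Look up the head word of a 1-based (sentence, headword) mention."""
    sent, head = mention
    return sentences[sent - 1][head - 1]

def clusters(links):
    """Group mentions by governing mention, with the governor listed first."""
    groups = defaultdict(list)
    for mention, gov in sorted(links.items()):
        groups[gov].append(mention)
    return {gov: [gov] + mentions for gov, mentions in groups.items()}

for gov, mentions in clusters(links).items():
    print(word(gov), "<-", [word(m) for m in mentions[1:]])
```

Read this way, the graph says that the first "her" refers to Sarah, and that "He" and "his" refer to the father.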
  
10. THE REST OF THE LANGUAGES OF THE WORLD
  
           	
  
English-only?
•  There are a lot of languages out there in the world!
•  But there are a lot more NLP tools for English than anything else
•  However, there is starting to be fairly reasonable support (or the ability to build it) for most of the top 50 or so languages…
•  I’ll say a little about that, since some people are definitely interested, even if I’ve covered mainly English
  
POS taggers for many languages?
•  Two choices:
    1.  Find a tagger with an existing model for the language (and period) of interest
    2.  Find POS-tagged training data for the language (and period) of interest and train your own tagger
        •  Most downloadable taggers allow you to train new models – e.g., the Stanford POS tagger
             –  But it may involve considerable data preparation work and understanding and not be for the faint-hearted
  
POS taggers for many languages?
•  One tagger with good existing multi-lingual support
    –  TreeTagger (Helmut Schmid)
         •  http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
         •  Bulgarian, Chinese, Dutch, English, Estonian, French, Old French, Galician, German, Greek, Italian, Latin, Portuguese, Russian, Spanish, Swahili
         •  Free for non-commercial use, not open source; Linux, Mac, Sparc (not Windows)
    –  Stanford POS Tagger presently comes with:
         •  English, Arabic, Chinese, German
•  One place to look for more resources:
    –  http://nlp.stanford.edu/links/statnlp.html
         •  But it’s always out of date, so also try a Google search
  	
  
Chinese example
•  Chinese doesn’t put spaces between words
    –  Nor did Ancient Greek
•  So almost all tools first require word segmentation
         •  I demonstrate the Stanford Chinese Word Segmenter
         •  http://nlp.stanford.edu/software/segmenter.shtml
•  Even in English, words need some segmentation – often called tokenization
         •  It was being implicitly done before further processing in the examples till now:  “I’ll go.”  →  “  I  ’ll  go  .  ”
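
For a feel of what English tokenization involves, here is a deliberately tiny regular-expression tokenizer that reproduces the “I’ll go.” example. It is a toy: the clitic list is an assumption, it does not handle forms such as "don't" the way Penn Treebank tokenization does, and the tokenizers built into the Stanford tools cover many more cases.

```python
import re

TOKEN = re.compile(
    r"[’'](?:ll|re|ve|s|d|m)\b"   # clitics such as ’ll, ’s, ’re
    r"|\w+"                        # runs of letters and digits
    r"|[^\w\s]"                    # any single punctuation mark
)

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN.findall(text)

print(tokenize("“I’ll go.”"))
```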
  	
  
Chinese example
•  $ ../Software/stanford-chinese-segmenter-2010-03-08/segment.sh ctb Xinhua.txt utf-8 0 > Xinhua.seg
•  $ java -mx300m -cp ../Software/stanford-postagger-full-2011-05-18/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ../Software/stanford-postagger-full-2011-05-18/models/chinese.tagger -textFile Xinhua.seg > Xinhua.tag
  
Chinese example
# space before   below!
$ perl -pe 'if ( ! m/^\s*$/ && ! m/^.{100}/) { s/$/  /; }' < Xinhua.seg > Xinhua.seg.fixed
$ java -mx600m -cp ../Software/stanford-parser-2011-06-15/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 ../Software/stanford-parser-2011-04-17/chineseFactored.ser.gz Xinhua.seg.fixed > Xinhua.parsed
$ java -mx1g -cp ../Software/stanford-parser-2011-06-15/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 -outputFormat typedDependencies ../Software/stanford-parser-2011-04-17/chineseFactored.ser.gz Xinhua.seg.fixed > Xinhua.sd
  
Other tools
•  Dependency parsers are now available for many languages, especially via MaltParser:
    –  http://maltparser.org/
•  For instance, it’s used to provide a Russian parser among the resources here:
    –  http://corpus.leeds.ac.uk/mocky/
•  The OPUS (Open Parallel Corpus) collects tools for various languages:
    –  http://opus.lingfil.uu.se/trac/wiki/Tagging%20and%20Parsing
•  Look around!
  
Data sources
•  Parsers depend on annotated data (treebanks)
•  You can use a parser trained on news articles, but better resources for humanities scholars will depend on community efforts to produce better data
•  One effort is the construction of Greek and Latin dependency treebanks by the Perseus Project:
    –  http://nlp.perseus.tufts.edu/syntax/treebank/
  	
  
11. PARTING WORDS
  
Applications? (beyond word counts)
•  There are starting to be a few applications in the humanities using richer NLP methods:
•  But only a few….
  
Applications? (beyond word counts)
–  Cameron Blevins. 2011. Topic Modeling Historical Sources: Analyzing the Diary of Martha Ballard. DH 2011.
    •  Uses (latent variable) topic models (LDA and friends)
         –  Topic models are primarily used to find themes or topics running through a group of texts
         –  But, here, also helpful for dealing with spelling variation (!)
         –  Uses MALLET (http://mallet.cs.umass.edu/), a toolkit with a fair amount of stuff for text classification, sequence tagging and topic models
              »  We also have the Stanford Topic Modeling Toolbox
                      •  http://nlp.stanford.edu/software/tmt/tmt-0.3/
    •  Examines change in diary entry topics over time
  
Applications? (beyond word counts)
–  David K. Elson, Nicholas Dames, Kathleen R. McKeown. 2010. Extracting Social Networks from Literary Fiction. ACL 2010.
    •  How size of community in novel or world relates to amount of conversation
         –  (Stanford) NER tagger to identify people and organizations
         –  Heuristically matching to name variants/shortenings
         –  System for speech attribution (Elson & McKeown 2010)
         –  Social network construction
    •  Results showing that urban novel social networks are not richer than those in rural settings, etc.
  
Applications? (beyond word counts)
–  Aditi Muralidharan. 2011. A Visual Interface for Exploring Language Use in Slave Narratives. DH 2011. http://bebop.berkeley.edu/wordseer
    •  A visualization and reading interface to American Slave Narratives
         –  (Stanford) Parser used to allow searching of particular grammatical relationships: grammatical search
         –  Visualization tools to show a word’s distribution in text and to provide a “collapsed concordance” view – and for close reading
    •  Example application is exploring relationship with God
  
Parting words

This talk has been about tools – they’re what I know

But you should focus on disciplinary insight – not on building corpora and tools, but on using them as tools for producing disciplinary research
                                        	
  
Natural Language Processing Tools for the Digital Humanities

  • 1. Natural  Language  Processing   Tools  for  the  Digital  Humanities   Christopher  Manning   Stanford  University   Digital  Humanities  2011   http://nlp.stanford.edu/~manning/courses/DigitalHumanities/    
  • 3. My  humanities  qualifications   •  B.A.  (Hons),  Australian  National  University   •  Ph.D.  Linguistics,  Stanford  University   •  But:   –  I’m  not  sure  I’ve  ever  taken  a  real  humanities  class   (if  you  discount  linguistics  classes  and  high  school   English…)  
  • 4. SO,  FEEL  FREE  TO  ASK   QUESTIONS!  
  • 6. The  promise   Phrase  Net  visualization  of     Pride  &  Prejudice  (*  (in|at)  *)   http://www-958.ibm.com/software/data/cognos/manyeyes/
  • 7. “How  I  write”  [code]   •  I  think  you  tend  to  get  too  much  of  people   showing  the  glitzy  output  of  something   •  So,  for  this  tutorial,  at  least  in  the  slides  I’m   trying  to  include  the  low-­‐level  hacking  and   plumbing   •  It’s  a  standard  truism  of  data  mining  that  more   time  goes  into  “data  preparation”  than  anything   else.  Definitely  goes  for  text  processing.  
  • 8. Outline   1.  Introduction   2.  Getting  some  text   3.  Words   4.  Collocations,  etc.   5.  NLP  Frameworks  and  tools   6.  Part-­‐of-­‐speech  tagging   7.  Named  entity  recognition   8.  Parsing   9.  Coreference  resolution   10.  The  rest  of  the  languages  of  the  world   11.  Parting  words  
  • 10. First  step:  Text   •  To  do  anything,  you  need  some  texts!   –  Many  sites  give  you  various  sorts  of  search-­‐and-­‐ display  interfaces   –  But,  normally  you  just  can’t  do  what  you  want  in  NLP   for  the  Digital  Humanities  unless  you  have  a  copy  of   the  texts  sitting  on  your  computer   –  This  may  well  change  in  the  future:  There  is   increasing  use  of  cloud  computing  models  where  you   might  be  able  to  upload  code  to  run  it  on  data  on  a   server   •  or,  conversely,  upload  data  to  be  processed  by  code  on  a  server      
  • 11. First  step:  Text   •  People  in  the  audience  are  probably  more  familiar   with  the  state  of  play  here  than  me,  but  my   impression  is:   1.  There  are  increasingly  good  supplies  of  critical  texts   in  well-­‐marked-­‐up  XML  available  commercially  for   license  to  university  libraries   2.  There  are  various,  more  community  efforts  to   produce  good  digitized  collections,  but  most  of   those  seem  to  be  available  to  “friends”  rather  than   to  anybody  with  a  web  browser   3.  There’s  Project  Gutenberg     •  Plain  text,  or  very  simple  HTML,  which  may  or  may  not  be   automatically  generated   •  Unicode  utf-­‐8  if  you’re  lucky,  US-­‐ASCII  if  you’re  not  
  • 12. 1.  Early  English  Books  Online   •  TEI-­‐compliant  XML  texts   •  http://eebo.chadwyck.com/  
  • 13. 2.  Old  Bailey  Online  
  • 15. Running  example:  H.  Rider  Haggard   •  The  hugely  popular  King  Solomon's  Mines  (1885)  by  H.   Rider  Haggard  is  sometimes  considered  the  first  of  the   “Lost  World”  or  “Imperialist  Romance”  genres   •  Allan  Quatermain  (1887)   •  She  (1887)   •  Nada  the  Lily  (1892)   •  Ayesha:  The  Return  of  She   (1905)   •  She  and  Allan  (1921)   •  Zip  file  at:   http://nlp.stanford.edu/~manning/courses/DigitalHumanities/    
  • 16. Interfaces  to  tools   Web   Programming   applications   APIs   Command-­‐ GUI   line   applications   applications  
  • 17. You’ll  need  to  program   •  Lisa  Spiro,  TAMU  Digital  Scholarship  2009:   I’m a digital humanist with only limited programming skills (Perl & XSLT). Enhancing my programming skills would allow me to: •  Avoid so much tedious, manual work •  Do citation analysis •  Pre-process texts (remove the junk) •  Automatically download web pages •  And much more…
  • 18. You’ll need to program •  Program in what? –  Perl •  Traditional seat-of-the-pants scripting language for text processing (it nailed flexible regex). I use it some below…. –  Python •  Cleaner, more modern scripting language with a lot of energy, and the best-documented NLP framework, NLTK. –  Java •  There are more NLP tools for Java than any other language. And it’s one of the most popular languages in general. Good regular expressions, Unicode, etc.
  • 19. You’ll need to program •  Program with what? –  There are some general skills that you’ll want that cut across programming languages •  Regular expressions •  XML, especially XPath and XSLT •  Unicode •  But I’m wisely not going to try to teach programming or these skills in this tutorial
  • 20. Grabbing files from websites •  wget (Linux) or curl (Mac OS X, BSD) –  wget http://www.gutenberg.org/browse/authors/h –  curl -O http://www.gutenberg.org/browse/authors/h •  If you really want to use your browser, there are things you can get like this Firefox plug-in –  DownThemAll http://www.downthemall.net/ but then you just can’t do things as flexibly
  • 21. Grabbing files from websites
      #!/usr/bin/perl
      # Skip ahead to the Haggard entries in the author index page
      while (<>) { last if (m/Haggard/); }
      while (<>) {
          last if (m/Hague/);   # the next author marks the end of the Haggard entries
          if (m!pgdbetext"><a href="/ebooks/(\d+)">(.*)</a> \(English\)!) {
              $title = $2;
              $num = $1;
              $title =~ s/<br>/ /g;
              $title =~ s/\r//g;
              print "curl -o \"$title $num.txt\" http://www.gutenberg.org/cache/epub/$num/pg$num.txt\n";
              # Expect only one of the html to exist
              print "curl -o \"$title $num.html\" http://www.gutenberg.org/files/$num/$num-h/$num-h.htm\n";
              print "curl -o \"$title $num-g.html\" http://www.gutenberg.org/cache/epub/$num/pg$num.html\n";
          }
      }
  • 22. Grabbing  files  from  websites   wget  http://www.gutenberg.org/browse/authors/h   perl  getHaggard.pl  <  h  >  h.sh   chmod  755  h.sh   ./h.sh   #  and  a  bit  of  futzing  by  hand  that  I  will  leave  out….     •  Often  you  want  the  90%  solution:  automating   nothing  would  be  slow  and  painful,  but  automating   everything  is  more  trouble  than  it’s  worth  for  a  one-­‐ off  process  
  • 23. Typical  text  problems   "Devilish  strange!"  thought  he,  chuckling  to  himself;  "queer  business!  Capital  trick  of  the  cull  in  the  cloak  to  make  another  person's  brat  stand  the  brunt   for  his  own-­‐-­‐-­‐capital!  ha!  ha!  Won't  do,  though.  He  must  be  a  sly  fox  to  get  out  of  the  Mint  without  my     [Page  59  ]     knowledge.  I've  a  shrewd  guess  where  he's  taken  refuge;  but  I'll  ferret  him  out.  These  bloods  will  pay  well  for  his  capture;  if  not,  he'll  pay  well  to  get  out   of  their  hands;  so  I'm  safe  either  way-­‐-­‐-­‐ha!  ha!  Blueskin,"  he  added  aloud,  and  motioning  that  worthy,  "follow  me."   Upon  which,  he  set  off  in  the  direction  of  the  entry.  His  progress,  however,  was  checked  by  loud  acclamations,  announcing  the  arrival  of  the  Master  of   the  Mint  and  his  train.   Baptist  Kettleby  (for  so  was  the  Master  named)  was  a  "goodly  portly  man,  and  a  corpulent,"  whose  fair  round  paunch  bespoke  the  affection  he   entertained  for  good  liquor  and  good  living.  He  had  a  quick,  shrewd,  merry  eye,  and  a  look  in  which  duplicity  was  agreeably  veiled  by  good  humour.  It   was  easy  to  discover  that  he  was  a  knave,  but  equally  easy  to  perceive  that  he  was  a  pleasant  fellow;  a  combination  of  qualities  by  no  means  of  rare   occurrence.  So  far  as  regards  his  attire,  Baptist  was  not  seen  to  advantage.  No  great  lover  of  state  or  state  costume  at  any  time,  he  was     [Page  60  ]     generally,  towards  the  close  of  an  evening,  completely  in  dishabille,  and  in  this  condition  he  now  presented  himself  to  his  subjects.  
His  shirt  was   unfastened,  his  vest  unbuttoned,  his  hose  ungartered;  his  feet  were  stuck  into  a  pair  of  pantoufles,  his  arms  into  a  greasy  flannel  dressing-­‐gown,  his   head  into  a  thrum-­‐cap,  the  cap  into  a  tie-­‐periwig,  and  the  wig  into  a  gold-­‐edged  hat.  A  white  apron  was  tied  round  his  waist,  and  into  the  apron  was   thrust  a  short  thick  truncheon,  which  looked  very  much  like  a  rolling-­‐pin.   The  Master  of  the  Mint  was  accompanied  by  another  gentleman  almost  as  portly  as  himself,  and  quite  as  deliberate  in  his  movements.  The  costume  of   this  personage  was  somewhat  singular,  and  might  have  passed  for  a  masquerading  habit,  had  not  the  imperturbable  gravity  of  his  demeanour   forbidden  any  such  supposition.  It  consisted  of  a  close  jerkin  of  brown  frieze,  ornamented  with  a  triple  row  of  brass  buttons;  loose  Dutch  slops,  made   very  wide  in  the  seat  and  very  tight  at  the  knees;  red  stockings  with  black  clocks,  and     [Page  61  ]     a  fur  cap.  The  owner  of  this  dress  had  a  broad  weather-­‐beaten  face,  small  twinkling  eyes,  and  a  bushy,  grizzled  beard.  Though  he  walked  by  the  side  of   the  governor,  he  seldom  exchanged  a  word  with  him,  but  appeared  wholly  absorbed  in  the  contemplations  inspired  by  a  broad-­‐bowled  Dutch  pipe.  
  • 24. There  are  always  text-­‐processing   gotchas  …   •  …  and  not  dealing  with  them  can  badly  degrade   the  quality  of  subsequent  NLP  processing.   1.  The  Gutenberg  *.txt  files  frequently  represent   italics  with  _underscores_.   2.  There  may  be  file  headers  and  footers   3.  Elements  like  headings  may  be  run  together   with  following  sentences  if  not  demarcated  or   eliminated  (example  later).  
  • 25. There are always text-processing gotchas …
      #!/usr/bin/perl
      # Print only the body of a Gutenberg text: the lines between the
      # "*** START" and "*** END" markers, with _italics_ underscores removed
      $finishedHeader = 0;
      $startedFooter = 0;
      while ($line = <>) {
        if ($line =~ /^\*\*\*\s*END/ && $finishedHeader) {
          $startedFooter = 1;
        }
        if ($finishedHeader && ! $startedFooter) {
          $line =~ s/_//g;  # minor cleanup of italics
          print $line;
        }
        if ($line =~ /^\*\*\*\s*START/ && ! $finishedHeader) {
          $finishedHeader = 1;
        }
      }
      if ( ! ($finishedHeader && $startedFooter)) {
        print STDERR "**** Probable book format problem!\n";
      }
  • 27. In  the  beginning  was  the  word   •  Word  counts   •  Word  counts  are  the  basis  of  all  the  simple,  first   order  methods  of  text  analysis   –  tag  clouds,  collocations,  topic  models   •  Sometimes  you  can  get  a  fair  distance  with  word   counts  
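To make the word-count idea concrete, here is a minimal sketch in Python. It is not code from the tutorial: the tokenizer is a deliberately crude regular expression, and the function name `word_counts` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text, split it into word tokens, and count them."""
    # Crude tokenizer: runs of letters, optionally with one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


sample = "She looked at him, and he looked at her. She said nothing."
counts = word_counts(sample)
for word, n in counts.most_common(3):
    print(word, n)
```

A tag cloud is essentially a picture of `counts.most_common(...)`; for real texts you would probably want a better tokenizer and some decision about function words.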
  • 28. She  (1887)   http://wordle.net/    Jonathan  Feinberg  
  • 29. Ayesha:  The  Return  of  She  (1905)  
  • 30. She  and  Allan  (1921)  
  • 31. Wisdom's  Daughter:  The  Life  and  Love  Story  of  She-­‐Who-­‐Must-­‐Be-­‐Obeyed  (1923)  
  • 32. Wisdom's  Daughter:  The  Life  and  Love  Story  of  She-­‐Who-­‐Must-­‐Be-­‐Obeyed  (1923)  
  • 33. Google  Books  Ngram  Viewer   http://ngrams.googlelabs.com/  
  • 34. Google  Books  Ngram  Viewer   •  …  you  have  to  be  the  most  jaded  or  cynical  scholar   not  to  be  excited  by  the  release  of  the   Google  Books  Ngram  Viewer  …  Digital  humanities   needs  gateway  drugs.  …  “Culturomics”   sounds  like  an  80s  new  wave  band.  If  we’re  going  to   coin  neologisms,  let’s  at  least  go  with  Sean  Gillies’   satirical  alternative:  Freakumanities.…  For  me,  the   biggest  problem  with  the  viewer  and  the  data  is  that   you  cannot  seamlessly  move  from  distant  reading  to   close  reading  
  • 35. Language  change:  as  least  as   C.  D.  Manning.  2003.  Probabilistic  Syntax     •  I  found  this  example  in  Russo  R.,  2001,  Empire   Falls  (on  p.3!):   –  By  the  time  their  son  was  born,  though,  Honus   Whiting  was  beginning  to  understand  and   privately  share  his  wife’s  opinion,  as  least  as  it   pertained  to  Empire  Falls.   •  What’s  interesting  about  it?  
  • 36. Language  change:  as  least  as   •  A  language  change  in  progress?  I  found  a  bunch  of  other   examples:   –  Indeed,  the  will  and  the  means  to  follow  through  are  as   least  as  important  as  the  initial  commitment  to  deficit   reduction.   –  As  many  of  you  know  he  had  his  boat  built  at  the  same   time  as  mine  and  it’s  as  least  as  well  maintained  and   equipped.   •  Apparently  not  a  “dialect”   –  Second,  if  the  required  disclosures  are  made  by  on-­‐screen   notice,  the  disclosure  of  the  vendor’s  legal  name  and  address   must  appear  on  one  of  several  specified  screens  on  the  vendor’s   electronic  site  and  must  be  at  least  as  legible  and  set  in  a  font   as  least  as  large  as  the  text  of  the  offer  itself.  
  • 37. Language  change:  as  least  as  
  • 38. Language  change:  as  least  as  
  • 40. Using  a  text  editor   •  You  can  get  a  fair  distance  with  a  text  editor  that   allows  multi-­‐file  searches,  regular  expressions,   etc.   –  It’s  like  a  little  concordancer  that’s  good  for  close   reading   •  jEdit        http://www.jedit.org/               •  BBedit  on  Windows  
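What a regex-capable editor gives you here is roughly a keyword-in-context (KWIC) listing. A small Python sketch of the same idea follows; the function name `kwic`, the context width, and the sample text are assumptions for illustration, not part of any tool named on these slides.

```python
import re


def kwic(text, pattern, width=30):
    """Return keyword-in-context lines for every regex match in text."""
    flat = " ".join(text.split())  # collapse line breaks so contexts read cleanly
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the matches line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


sample = ("The will and the means to follow through are as least as important "
          "as the initial commitment. It is at least as well maintained.")
for line in kwic(sample, r"\bas least as\b"):
    print(line)
```

Changing the pattern to `r"\b(as|at) least as\b"` lists both variants side by side, which is the kind of quick comparison the "as least as" slides rely on.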
  • 41.
  • 42. Traditional  Concordancers   •  WordSmith  Tools        Commercial;  Windows   –  http://www.lexically.net/wordsmith/   •  Concordance          Commercial;  Windows   –  http://www.concordancesoftware.co.uk/   •  AntConc      Free;  Windows,  Mac  OS  X  (only  under  X11);  Linux   –  http://www.antlab.sci.waseda.ac.jp/antconc_index.html   •  CasualConc      Free;  Mac  OS  X   –  http://sites.google.com/site/casualconc/   •  by  Yasu  Imao  
  • 43.
  • 44.
  • 45.
  • 46. The  decline  of  honour  
  • 47. 5.  NLP  FRAMEWORKS   AND  TOOLS  
  • 48. The Big 3 NLP Frameworks •  GATE – General Architecture for Text Engineering (U. Sheffield) •  http://gate.ac.uk/ •  Java, quite well maintained (now) •  Includes tons of components •  UIMA – Unstructured Information Management Architecture. Originally IBM; now Apache project •  http://uima.apache.org/ •  Professional, scalable, etc. •  But, unless you’re comfortable with XML, Eclipse, Java or C++, etc., I think it’s a non-starter •  NLTK – Natural Language Toolkit (started by Steven Bird) •  http://www.nltk.org/ •  Big community; large Python package; corpora and books about it •  But it’s code modules and API, no GUI or command-line tools •  Like R for NLP. But, hey, R’s becoming very successful….
  • 49. The  main  NLP  Packages   •  NLTK      Python   –  http://www.nltk.org/   •  OpenNLP   –  http://incubator.apache.org/opennlp/   •  Stanford  NLP   –  http://nlp.stanford.edu/software/   •  LingPipe   –  http://alias-­‐i.com/lingpipe/     •  More  one-­‐off  packages  than  I  can  fit  on  this  slide   –  http://nlp.stanford.edu/links/statnlp.html  
  • 50. NLP  tools:  Rules  of  thumb  for  2011   1.  Unless  you’re  unlucky,  the  tool  you  want  to  use   will  work  with  Unicode  (at  least  BMP),  so  most   any  characters  are  okay   2.  Unless  you’re  lucky,  the  tool  you  want  to  use   will  work  only  on  completely  plain  text,  or   extremely  simple  XML-­‐style  mark-­‐up  (e.g.,  <s>   …  </s>  around  sentences,  recognized  by  regexp)   3.  By  default,  you  should  assume  that  any  tool  for   English  was  trained  on  American  newswire  
  • 52. Rule-­‐based  NLP  and  Statistical/ Machine  Learning  NLP   •  Most  work  on  NLP  in  the  1960s,  70s  and  80s  was   with  hand-­‐built  grammars  and  morphological   analyzers  (finite  state  transducers),  etc.   –  ANNIE  in  GATE  is  still  in  this  space   •  Most  academic  research  work  in  NLP  in  the   1990s  and  2000s  use  probabilistic  or  more   generally  machine  learning  methods  (“Statistical   NLP”)   –  The  Stanford  NLP  tools  and  MorphAdorner,   which  we  will  come  to  soon,  are  in  this  space  
  • 53. Rule-­‐based  NLP  and  Statistical/ Machine  Learning  NLP   •  Hand-­‐built  grammars  are  fine  for  tasks  in  a  closed   space  which  do  not  involve  reasoning  about   contexts   –  E.g.,  finding  the  possible  morphological  parses  of  a   word   •  In  the  old  days  they  worked  really  badly  on  “real   text”     –  They  were  always  insufficiently  tolerant  of  the   variability  of  real  language   –  But,  built  with  modern,  empirical  approaches,  they   can  do  reasonably  well   •  ANNIE  is  an  example  of  this  
  • 54. Rule-­‐based  NLP  and  Statistical/ Machine  Learning  NLP   •  In  Statistical  NLP:   –  You  gather  corpus  data,  and  usually  hand-­‐annotate  it  with  the   kind  of  information  you  want  to  provide,  such  as  part-­‐of-­‐speech   –  You  then  train  (or  “learn”)  a  model  that  learns  to  try  to  predict   annotations  based  on  features  of  words  and  their  contexts  via   numeric  feature  weights   –  You  then  apply  the  trained  model  to  new  text   •  This  tends  to  work  much  better  on  real  text   –  It  more  flexibly  handles  contextual  and  other  evidence   •  But  the  technology  is  still  far  from  perfect,  it  requires  annotated   data,  and  degrades  (sometimes  very  badly)  when  there  are   mismatches  between  the  training  data  and  the  runtime  data  
  • 55. How  much  hardware  do  you  need?   •  NLP  software  often  needs  plenty  of  RAM  (especially)   and  processing  power   •  But  these  days  we  have  really  powerful  laptops!   •  Some  of  the  software  I  show  you  could  run  on  a   machine  with  256  MB  of  RAM  (e.g.,  Stanford   Parser),  but  much  of  it  requires  more   •  Stanford  CoreNLP  requires  a  machine  with  4GB  of   RAM   •  I  ran  everything  in  this  tutorial  on  the  laptop  I’m   presenting  on  …  4GB  RAM,  2.8  GHz  Core  2  Duo   •  But  it  wasn’t  always  pleasant  writing  the  slides  while   software  was  running….  
  • 56. How  much  hardware  do  you  need?   •  Why  do  you  need  more  hardware?   –  More  speed   •  It  took  me  95  minutes  to  run  Ayesha,  the  Return  of  She   through  Stanford  CoreNLP  on  my  laptop….   –  More  scale   •  You’d  like  to  be  able  to  analyze  1  million  books   •  Order  of  magnitude  rules  of  thumb:   –  POS  tagging,  NER,  etc:  5–10,000  words/second   –  Parsing:  1–10  sentences  per  second  
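The rules of thumb above turn into a quick back-of-the-envelope estimate. The sketch below is plain arithmetic in Python; the word and sentence counts are the figures reported for She elsewhere in these slides (132,377 words from the tagger log, 4,556 sentences from the MorphAdorner log), and the function name `seconds_needed` is made up.

```python
def seconds_needed(units, rate_low, rate_high):
    """Return (best_case, worst_case) processing time in seconds."""
    return units / rate_high, units / rate_low


# POS tagging at 5,000-10,000 words per second
tag_fast, tag_slow = seconds_needed(132_377, 5_000, 10_000)
# Parsing at 1-10 sentences per second
parse_fast, parse_slow = seconds_needed(4_556, 1, 10)

print(f"tagging: {tag_fast:.0f}-{tag_slow:.0f} s")
print(f"parsing: {parse_fast / 60:.0f}-{parse_slow / 60:.0f} min")
```

The gap between seconds for tagging and minutes to over an hour for parsing is the practical reason scale matters once you move from one novel to a library.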
  • 57. How  much  hardware  do  you  need?   •  Luckily,  most  of  our  problems  are  trivially   parallelizable   –  Each  book/chapter  can  be  run  separately,  perhaps   on  a  separate  machine   •  What  do  we  actually  use?   –  We  do  most  of  our  computing  on  rack  mounted   Linux  servers   •  Currently  4  x  quad  core  Xeon  processors  with  24  GB  of   RAM  seem  about  the  sweet  spot   •  About  $3500  per  machine  …  not  like  the  old  days  
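"Trivially parallelizable" means each chapter can be handed to a separate worker with no coordination. Here is a minimal sketch using Python's standard library; `analyze` is a placeholder for a real NLP step, and the chapter strings are invented. A thread pool keeps the sketch simple; for CPU-bound work you would more likely use separate processes or separate machines, as the slide suggests.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(chapter):
    """Stand-in for an expensive NLP step; here it only counts whitespace tokens."""
    return len(chapter.split())


def analyze_all(chapters, workers=4):
    # Each chapter is independent, so the pool can run them in any order;
    # map() still returns the results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, chapters))


chapters = ["My visitor", "The years roll by", "The sherd of Amenartas is real"]
print(analyze_all(chapters))
```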
  • 59. Part-of-Speech Tagging •  Part-of-speech tagging is normally done by a sequence model (acronyms: HMM, CRF, MEMM/CMM) –  A POS tag is to be placed above each word –  The model considers a local context of possible previous and following POS tags, the current word, neighboring words, and features of them (capitalized?, ends in -ing?) –  Each such feature has a weight, and the evidence is combined, and the most likely sequence of tags (according to the model) is chosen
      RB    NNP  NNP    RB    VBD    ,  JJ    NNS
      When  Mr.  Holly  last  wrote  ,  many  years
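The "combine weighted evidence, then pick the most likely tag sequence" step is usually done with the Viterbi algorithm. The toy Python sketch below shows only that search step. All weights, the three-tag inventory, and the helper names (`emit`, `trans`, `LEXICON`, `TRANS`) are invented for illustration; a real tagger learns thousands of feature weights from annotated data.

```python
def viterbi(words, tags, emit, trans):
    """Return the highest-scoring tag sequence under additive scores.

    emit(word, tag) and trans(prev_tag, tag) return numbers that play the
    role of summed feature weights."""
    # best[i][t] = (score of the best path ending in tag t at word i, previous tag)
    best = [{t: (emit(words[0], t), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p][0] + trans(p, t))
            column[t] = (best[i - 1][prev][0] + trans(prev, t) + emit(words[i], t), prev)
        best.append(column)
    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Hand-set toy weights (assumptions, not learned values).
LEXICON = {("last", "RB"): 1.0, ("last", "VBD"): 0.8, ("wrote", "VBD"): 2.0}
TRANS = {("NNP", "RB"): 0.5, ("RB", "VBD"): 1.0, ("NNP", "VBD"): 1.0}


def emit(word, tag):
    score = LEXICON.get((word.lower(), tag), 0.0)
    if word[0].isupper() and tag == "NNP":
        score += 2.0  # "capitalized?" feature
    return score


def trans(prev_tag, tag):
    return TRANS.get((prev_tag, tag), 0.0)


words = ["Mr.", "Holly", "last", "wrote"]
print(viterbi(words, ["NNP", "RB", "VBD"], emit, trans))
```

Note that "last" on its own scores almost as well as a verb as it does as an adverb; it is the transition evidence from the neighboring tags that settles the choice, which is the point of using a sequence model rather than tagging each word in isolation.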
  • 60. Stanford  POS  tagger   http://nlp.stanford.edu/software/tagger.shtml   $  java  -­‐mx1g  -­‐cp  ../Software/stanford-­‐postagger-­‐full-­‐2011-­‐06-­‐19/ stanford-­‐postagger.jar  edu.stanford.nlp.tagger.maxent.MaxentTagger  -­‐ model  ../Software/stanford-­‐postagger-­‐full-­‐2011-­‐06-­‐19/models/ left3words-­‐distsim-­‐wsj-­‐0-­‐18.tagger  -­‐outputFormat  tsv  -­‐tokenizerOptions   untokenizable=allKeep  -­‐textFile  She  3155.txt  >  She  3155.tsv   Loading  default  properties  from  trained  tagger  ../Software/stanford-­‐ postagger-­‐full-­‐2011-­‐06-­‐19/models/left3words-­‐distsim-­‐wsj-­‐0-­‐18.tagger   Reading  POS  tagger  model  from  ../Software/stanford-­‐postagger-­‐ full-­‐2011-­‐06-­‐19/models/left3words-­‐distsim-­‐wsj-­‐0-­‐18.tagger  ...  done  [2.2   sec].   Jun  15,  2011  8:17:15  PM  edu.stanford.nlp.process.PTBLexer  next   Greek  stand-­‐ alone   WARNING:  Untokenizable:  ?  (U+1FBD,  decimal:  8125)   Koronis   character  (a   Tagged  132377  words  at  5559.72  words  per  second.   little   obscure?)  
  • 61. Stanford POS tagger
      • For the second time you do it…
      $ alias stanfordtag "java -mx1g -cp /Users/manning/Software/stanford-postagger-full-2011-06-19/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model /Users/manning/Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger -outputFormat tsv -tokenizerOptions untokenizable=allKeep -textFile"
      $ stanfordtag "RiderHaggard/King Solomon's Mines 2166.txt" > "tagged/King Solomon's Mines 2166.tsv"
      Reading POS tagger model from /Users/manning/Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger ... done [2.1 sec].
      Tagged 98178 words at 9807.99 words per second.
  • 62. MorphAdorner   http://morphadorner.northwestern.edu/   •  MorphAdorner  is  a  set  of  NLP  tools  developed  at   Northwestern  by  Martin  Mueller  and  colleagues   specifically  for  English  language  fiction,  over  a   long  historical  period  from  EME  onwards   –  lemmatizer,  named  entity  recognizer,  POS   tagger,  spelling  standardizer,  etc.   •  Aims  to  deal  with  variation  in  word  breaking  and   spelling  over  this  period   •  Includes  its  own  POS  tag  set:  NUPOS  
  • 63. MorphAdorner
      $ ./adornplaintext temp temp/3155.txt
      2011-06-15 20:30:52,111 INFO  - MorphAdorner version 1.0
      2011-06-15 20:30:52,111 INFO  - Initializing, please wait...
      2011-06-15 20:30:52,318 INFO  - Using Trigram tagger.
      2011-06-15 20:30:52,319 INFO  - Using I retagger.
      2011-06-15 20:30:53,578 INFO  - Loaded word lexicon with 151,922 entries in 2 seconds.
      2011-06-15 20:30:55,920 INFO  - Loaded suffix lexicon with 214,503 entries in 3 seconds.
      2011-06-15 20:30:57,927 INFO  - Loaded transition matrix in 3 seconds.
      2011-06-15 20:30:58,137 INFO  - Loaded 162,248 standard spellings in 1 second.
      2011-06-15 20:30:58,697 INFO  - Loaded 5,434 alternative spellings in 1 second.
      2011-06-15 20:30:58,703 INFO  - Loaded 349 more alternative spellings in 14 word classes in 1 second.
      2011-06-15 20:30:58,713 INFO  - Loaded 0 names into name standardizer in < 1 second.
      2011-06-15 20:30:58,779 INFO  - 1 file to process.
      2011-06-15 20:30:58,789 INFO  - Before processing input texts: Free memory: 105,741,696, total memory: 480,694,272
      2011-06-15 20:30:58,789 INFO  - Processing file 'temp/3155.txt' .
      2011-06-15 20:30:58,789 INFO  - Adorning temp/3155.txt with parts of speech.
      2011-06-15 20:30:58,832 INFO  - Loaded text from temp/3155.txt in 1 second.
      2011-06-15 20:31:01,498 INFO  -    Extracted 131,875 words in 4,556 sentences in 3 seconds.
      2011-06-15 20:31:03,860 INFO  -       lines: 1,000; words: 27,756
      2011-06-15 20:31:04,364 INFO  -       lines: 2,000; words: 58,728
      2011-06-15 20:31:04,676 INFO  -       lines: 3,000; words: 84,735
      2011-06-15 20:31:04,990 INFO  -       lines: 4,000; words: 115,396
      2011-06-15 20:31:05,152 INFO  -       lines: 4,556; words: 131,875
      2011-06-15 20:31:05,152 INFO  -    Part of speech adornment completed in 4 seconds. 36,100 words adorned per second.
      2011-06-15 20:31:05,152 INFO  -    Generating other adornments.
      2011-06-15 20:31:13,840 INFO  -    Adornments written to temp/3155-005.txt in 9 seconds.
      2011-06-15 20:31:13,840 INFO  - All files adorned in 16 seconds.
  • 64. Ah, the old days!
      $ ./adornplaintext temp temp/Hunter Quartermain.txt
      2011-06-15 17:18:15,551 INFO  - MorphAdorner version 1.0
      2011-06-15 17:18:15,552 INFO  - Initializing, please wait...
      2011-06-15 17:18:15,730 INFO  - Using Trigram tagger.
      2011-06-15 17:18:15,731 INFO  - Using I retagger.
      2011-06-15 17:18:16,972 INFO  - Loaded word lexicon with 151,922 entries in 2 seconds.
      2011-06-15 17:18:18,684 INFO  - Loaded suffix lexicon with 214,503 entries in 2 seconds.
      2011-06-15 17:18:20,662 INFO  - Loaded transition matrix in 2 seconds.
      2011-06-15 17:18:20,887 INFO  - Loaded 162,248 standard spellings in 1 second.
      2011-06-15 17:18:21,300 INFO  - Loaded 5,434 alternative spellings in 1 second.
      2011-06-15 17:18:21,303 INFO  - Loaded 349 more alternative spellings in 14 word classes in 1 second.
      2011-06-15 17:18:21,312 INFO  - Loaded 0 names into name standardizer in 1 second.
      2011-06-15 17:18:21,381 INFO  - No files found to process.
      • But it works better if you make sure the filename has no spaces in it
  • 65. Comparing taggers: Penn Treebank vs. NUPOS
      Penn Treebank          NUPOS
      word       tag         word       tag
      Holly      NNP         Holly      n1
      ,          ,           ,          ,
      if         IN          if         cs
      you        PRP         you        pn22
      will       MD          will       vmb
      accept     VB          accept     vvi
      the        DT          the        dt
      trust      NN          trust      n1
      ,          ,           ,          ,
      I          PRP         I          pns11
      am         VBP         am         vbm
      going      VBG         going      vvg
      to         TO          to         pc-acp
      leave      VB          leave      vvi
      you        PRP         you        pn22
      that       IN          that       d
      boy        NN          boy's      ng1
      's         POS
      sole       JJ          sole       j
      guardian   NN          guardian   n1
      .          .           .          .
  • 67. Stylistic factors from POS
      [Bar chart: counts of JJ, MD and DT tags (y-axis 0 to 14,000) in She, Ayesha, She and Allan, and Wisdom's Daughter]
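Counts like the ones in this chart can be tallied from tagger output with a few lines of code. The sketch below assumes the tagged file has one `word<TAB>tag` pair per line with blank lines between sentences, which is my understanding of the tagger's `tsv` output format; check your own output before relying on it. The function name is hypothetical.

```python
from collections import Counter


def tag_counts(tsv_text, tags=("JJ", "MD", "DT")):
    """Count selected POS tags in word<TAB>tag lines; other lines are skipped."""
    counts = Counter()
    for line in tsv_text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[1] in tags:
            counts[parts[1]] += 1
    return {tag: counts[tag] for tag in tags}


if __name__ == "__main__":
    sample = "the\tDT\nsole\tJJ\nguardian\tNN\n\nwill\tMD\n"
    print(tag_counts(sample))
```

Raw counts depend on book length, so for comparing novels it is probably more informative to divide by the total number of tagged words.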
  • 68. 7.  NAMED  ENTITY   RECOGNITION   (NER)  
  • 69. Named Entity Recognition
      – “the Chad problem”
      Germany's representative to the European Union's veterinary committee Werner Zwingman said on Wednesday consumers should …
      IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.
  • 70. Conditional Random Fields (CRFs)
      O     PER  PER    O     O      O  O     O
      When  Mr.  Holly  last  wrote  ,  many  years
      • We again use a sequence model – different problem, but same technology
          – Indeed, sequence models are used for lots of tasks that can be construed as labeling tasks that require only local context (to do quite well)
      • There is a background label – O – and labels for each class
      • Entities are both segmented and categorized
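To make "segmented and categorized" concrete, here is a small sketch that turns a per-word label sequence like the one above into entity spans. It assumes plain class labels with `O` as background, as on the slide; `extract_entities` is a hypothetical helper, not part of Stanford NER.

```python
def extract_entities(words, labels, background="O"):
    """Group runs of identical non-background labels into (text, label) spans."""
    entities = []
    current_words, current_label = [], background
    for word, label in zip(words, labels):
        if label != current_label:
            # The label changed: close any open entity and start afresh.
            if current_label != background:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], label
        if label != background:
            current_words.append(word)
    if current_label != background:
        entities.append((" ".join(current_words), current_label))
    return entities


if __name__ == "__main__":
    words = "When Mr. Holly last wrote , many years".split()
    labels = ["O", "PER", "PER", "O", "O", "O", "O", "O"]
    print(extract_entities(words, labels))
```

One limitation of this labeling scheme: two adjacent entities of the same class would be merged into one span.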
  • 71. Stanford NER Features
      • Word features: current word, previous word, next word, a word is anywhere in a +/– 4 word window
      • Orthographic features:
          – Jenny    Xxxx
          – IL-2     XX-#
      • Prefixes and Suffixes:
          – Jenny    <J, <Je, <Jen, …, nny>, ny>, y>
      • Label sequences
      • Lots of feature conjunctions
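A rough sketch of the orthographic and affix features listed above. These are simplified stand-ins, not Stanford NER's actual feature functions: `word_shape` maps every character individually, so it gives `Xxxxx` for `Jenny`, whereas the slide's `Xxxx` suggests the real shape function shortens long runs.

```python
def word_shape(word):
    """Map uppercase to X, lowercase to x, digits to #; keep other characters."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("#")
        else:
            shape.append(ch)
    return "".join(shape)


def affixes(word, max_len=3):
    """Return (prefixes, suffixes) up to max_len characters, marked with < and >."""
    n_max = min(max_len, len(word))
    prefixes = ["<" + word[:n] for n in range(1, n_max + 1)]
    suffixes = [word[-n:] + ">" for n in range(n_max, 0, -1)]
    return prefixes, suffixes


if __name__ == "__main__":
    print(word_shape("IL-2"), word_shape("Jenny"))
    print(affixes("Jenny"))
```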
  • 72. Stanford NER
      http://nlp.stanford.edu/software/CRF-NER.shtml
      $ java -mx500m -Dfile.encoding=utf-8 -cp Software/stanford-ner-2011-06-19/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier Software/stanford-ner-2011-06-19/classifiers/all.3class.distsim.crf.ser.gz -textFile "RiderHaggard/She 3155.txt" > "ner/She 3155.ner"
      For thou shalt rule this <LOCATION>England</LOCATION>----”
      "But we have a queen already," broke in <LOCATION>Leo</LOCATION>, hastily.
      "It is naught, it is naught," said <PERSON>Ayesha</PERSON>; "she can be overthrown.”
      At this we both broke out into an exclamation of dismay, and explained that we should as soon think of overthrowing ourselves.
      "But here is a strange thing," said <PERSON>Ayesha</PERSON>, in astonishment; "a queen whom her people love! Surely the world must have changed since I dwelt in <LOCATION>Kôr</LOCATION>."
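Once the text carries inline tags like these, pulling out and counting the entities is a small regular-expression job. A sketch, assuming the three class names PERSON, LOCATION and ORGANIZATION and the inline `<TAG>…</TAG>` format shown above; `entity_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches <PERSON>...</PERSON> etc.; \1 forces the closing tag to match the opening one.
ENTITY = re.compile(r"<(PERSON|LOCATION|ORGANIZATION)>(.*?)</\1>", re.DOTALL)


def entity_counts(tagged_text):
    """Count (class, text) pairs found in inline-tagged NER output."""
    return Counter((m.group(1), m.group(2)) for m in ENTITY.finditer(tagged_text))


if __name__ == "__main__":
    sample = ('broke in <LOCATION>Leo</LOCATION>, hastily. '
              'said <PERSON>Ayesha</PERSON>; said <PERSON>Ayesha</PERSON>, in astonishment')
    for (label, text), n in entity_counts(sample).most_common():
        print(label, text, n)
```

Counting by (class, text) pair also makes errors easy to spot, such as the hero Leo being tagged as a LOCATION in the sample output.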
  • 74. Statistical parsing
      • One of the big successes of 1990s statistical NLP was the development of statistical parsers
      • These are trained from hand-parsed sentences (“treebanks”), and know statistics about phrase structure and word relationships, and use them to assign the most likely structure to a new sentence
      • They will return a sentence parse for any sequence of words. And it will usually be mostly right
      • There are many opportunities for exploiting this richer level of analysis, which have only been partly realized.
  • 75. Phrase structure Parsing
      • Phrase structure representations have dominated American linguistics since the 1930s
      • They focus on showing words that go together to form natural groups (constituents) that behave alike
      • They are good for showing and querying details of sentence structure and embedding
      [Tree diagram, written here in bracketed form:]
      (S (NP (NP (NNS Bills)) (PP (IN on) (NP (NNS ports) (CC and) (NN immigration)))) (VP (VBD were) (VP (VBN submitted) (PP (IN by) (NP (NNP Senator) (NNP Brownback))))))
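  • 76. Dependency parsing
      • A dependency parse shows which words in a sentence modify other words
      • The key notions are governors and their dependents
      • Widespread use: Pāṇini, early Arabic grammarians, diagramming sentences, …
      [Dependency diagram for “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas”, written here as relation(governor, dependent):]
      nsubjpass(submitted, Bills)
      auxpass(submitted, were)
      prep(submitted, by)
      pobj(by, Brownback)
      nn(Brownback, Senator)
      appos(Brownback, Republican)
      prep(Republican, of)
      pobj(of, Kansas)
      prep(Bills, on)
      pobj(on, ports)
      cc(ports, and)
      conj(ports, immigration)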
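  • 77. Stanford Dependencies
      • SD is a particular dependency representation designed for easy extraction of meaning relationships [de Marneffe & Manning, 2008]
          – Its basic form, shown on the last slide, keeps each word as is
          – A “collapsed” form focuses on relations between main words
      [Collapsed dependency diagram for the same sentence, written here as relation(governor, dependent):]
      nsubjpass(submitted, Bills)
      auxpass(submitted, were)
      agent(submitted, Brownback)
      nn(Brownback, Senator)
      appos(Brownback, Republican)
      prep_of(Republican, Kansas)
      prep_on(Bills, ports)
      prep_on(Bills, immigration)
      conj_and(ports, immigration)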
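The core idea of collapsing can be sketched in a few lines: a `prep` edge to a preposition plus that preposition's `pobj` edge become one `prep_X` edge between the content words. This is a deliberate simplification and not the Stanford Parser's algorithm; in particular it does not produce `agent` for the passive "by" phrase, does not collapse or propagate conjunctions, and assumes each word string is unique in the sentence. The triple format (relation, governor, dependent) is an assumption for the example.

```python
def collapse_preps(deps):
    """Fold prep(g, p) + pobj(p, d) into prep_p(g, d); keep other relations."""
    pobj = {gov: dep for rel, gov, dep in deps if rel == "pobj"}
    preps = {dep for rel, gov, dep in deps if rel == "prep"}
    collapsed = []
    for rel, gov, dep in deps:
        if rel == "prep" and dep in pobj:
            collapsed.append(("prep_" + dep, gov, pobj[dep]))
        elif rel == "pobj" and gov in preps:
            continue  # already folded into a prep_X relation
        else:
            collapsed.append((rel, gov, dep))
    return collapsed


if __name__ == "__main__":
    basic = [
        ("nsubjpass", "submitted", "Bills"),
        ("prep", "Bills", "on"),
        ("pobj", "on", "ports"),
        ("prep", "Republican", "of"),
        ("pobj", "of", "Kansas"),
    ]
    for triple in collapse_preps(basic):
        print(triple)
```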
  • 78. Statistical  Parsers     •  There  are  now  many  good  statistical  parsers  that   are  freely  downloadable   –  Constituency  parsers   •  Collins/Bikel  Parser   •  Berkeley  Parser   •  BLLIP  Parser  =  Charniak/Johnson  Parser   –  Dependency  parsers   •  MaltParser   •  MST  Parser   •  But  I’ll  show  the  Stanford  Parser    
  • 79. Tregex/Tgrep2  –  Tools  for  searching   over  syntax    
  • 80. dreadful things
      She:
      amod(day-18, dreadful-17)
      amod(day-45, dreadful-44)
      amod(feast-33, dreadful-32)
      amod(fits-51, dreadful-50)
      amod(form-59, dreadful-58)
      amod(laugh-9, dreadful-8)
      amod(manifestation-9, dreadful-8)
      amod(manner-29, dreadful-28)
      amod(marshes-17, dreadful-16)
      amod(people-12, dreadful-11)
      amod(people-46, dreadful-45)
      amod(place-16, dreadful-15)
      amod(place-6, dreadful-5)
      amod(sight-5, dreadful-4)
      amod(spot-13, dreadful-12)
      amod(thing-41, dreadful-40)
      amod(thing-5, dreadful-4)
      amod(tragedy-22, dreadful-21)
      amod(wilderness-43, dreadful-42)
      Ayesha:
      amod(clouds-5, dreadful-2)
      amod(debt-26, dreadful-25)
      amod(doom-21, dreadful-20)
      amod(fashion-50, dreadful-47)
      amod(form-10, dreadful-7)
      amod(oath-42, dreadful-41)
      amod(road-23, dreadful-22)
      amod(silence-5, dreadful-4)
      amod(threat-19, dreadful-18)
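Lists like these can be produced by filtering typed-dependency output for one relation and one dependent word. A sketch, assuming dependencies are printed in the `relation(governor-index, dependent-index)` form shown above; `modified_by` is a hypothetical helper name.

```python
import re

# relation(governor-index, dependent-index), e.g. amod(day-18, dreadful-17)
DEP = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def modified_by(dep_text, adjective, relation="amod"):
    """Return the governors that `adjective` attaches to via `relation`."""
    nouns = []
    for m in DEP.finditer(dep_text):
        rel, gov, _gov_idx, dep, _dep_idx = m.groups()
        if rel == relation and dep == adjective:
            nouns.append(gov)
    return nouns


if __name__ == "__main__":
    sample = """amod(day-18, dreadful-17)
nsubj(came-3, day-18)
amod(clouds-5, dreadful-2)"""
    print(modified_by(sample, "dreadful"))
```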
  • 81. Making use of dependency structure
      J. Engelberg, Costly Information Processing (AFA, 2009):
      • An efficient market should immediately incorporate all publicly available information.
      • But many studies have shown there is a lag
          – And the lag is greater on Fridays (!)
      • An explanation for this is that there is a cost to information processing
      • Engelberg tests and shows that “soft” (textual) information takes longer to be absorbed than “hard” (numeric) information … it's higher cost information processing
      • But “soft” information has value beyond “hard” information
          – It's especially valuable for predicting further out in time
  • 82. Evidence from earnings announcements [Engelberg AFA 2009]
      • But how do you use the “soft” information?
      • Simply using the proportion of “negative” words (from the Harvard General Inquirer lexicon) is a useful predictive feature of future stock behavior
          Although sales remained steady, the firm continues to suffer from rising oil prices.
      • But this [or text categorization] is not enough. In order to refine my analysis, I need to know that the negative sentiment is about oil prices.
      • He thus turns to the typed dependencies representation of the Stanford Parser.
          – Words that negative words relate to are grouped into 1 of 6 categories [5 word lists or “other”]
  • 83. Evidence from earnings announcements [Engelberg 2009]
      • In a regression model with many standard quantitative predictors…
          – Just the negative word fraction is a significant predictor of 3 day or 80 day post earnings announcement abnormal returns (CAR)
              • Coefficient −0.173, p < 0.05 for 80 day CAR
          – Negative sentiment about different things has differential effects
              • Fundamentals: −0.198, p < 0.01 for 80 day CAR
              • Future: −0.356, p < 0.05 for 80 day CAR
              • Other: −0.023, p < 0.01 for 80 day CAR
          – Only some of which analysts pay attention to
              • Analyst forecast-for-quarter-ahead earnings is predicted by negative sentiment on Environment and Other but not Fundamentals or Future!
  • 84. Syntactic Packaging and Implicit Sentiment [Greene 2007; Greene and Resnik 2009]
      • Positive or negative sentiment can be carried by words (e.g., adjectives), but often it isn't….
          – These sentences differ in sentiment, even though the words aren't so different:
              • A soldier veered his jeep into a crowded market and killed three civilians
              • A soldier's jeep veered into a crowded market and three civilians were killed
      • As a measurable version of such issues of linguistic perspective, they define OPUS features
          – For domain relevant terms, OPUS features pair the word with a syntactic Stanford Dependency:
              • killed:DOBJ    NSUBJ:soldier    killed:NSUBJ
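One way to read the three example features above is: when a domain term is the governor of a dependency, emit `term:RELATION`; when it is the dependent, emit `RELATION:term`. The sketch below implements only that reading, which is my inference from the slide's examples and may not match the full feature definition in Greene and Resnik's work. The triple format and the function name are hypothetical.

```python
def opus_like_features(deps, domain_terms):
    """Pair each domain term with the relations it takes part in.

    deps: list of (relation, governor, dependent) triples.
    """
    feats = []
    for rel, gov, dep in deps:
        if gov in domain_terms:
            feats.append(gov + ":" + rel.upper())
        if dep in domain_terms:
            feats.append(rel.upper() + ":" + dep)
    return feats


if __name__ == "__main__":
    # "A soldier ... killed three civilians"
    deps = [("nsubj", "killed", "soldier"), ("dobj", "killed", "civilians")]
    print(opus_like_features(deps, {"killed", "soldier"}))
```

In the second sentence of the slide ("three civilians were killed") there is no active subject for "killed", so a feature like `killed:NSUBJ` would be absent, which is the kind of contrast these features are meant to capture.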
  • 85. Predicting Opinions of the Death Penalty [Greene 2007; Greene and Resnik 2009]
      • Collected pro- and anti- death penalty texts from websites with manual checking
      • Evaluation is by cross-validation: training on some pro- and anti- sites and testing on documents from others [can't use site-specific nuances]
      • Baseline is word and word bigram features in a support vector machine [SVM = good classifier]
      Condition             SVM accuracy
      Baseline              72.0%
      With OPUS features    88.1%
      • 58% error reduction!
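The "58% error reduction" follows from the two accuracies: the error rate falls from 28.0% to 11.9%, and (28.0 − 11.9) / 28.0 ≈ 0.575. A small sketch of that arithmetic (the function name is just for illustration):

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Percentage of the baseline's errors removed, given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


if __name__ == "__main__":
    print(round(relative_error_reduction(72.0, 88.1), 1))
```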
  • 86. 9.  COREFERENCE   RESOLUTION  
  • 87. Coreference  resolution   •  The  goal  is  to  work  out  which  (noun)  phrases   refer  to  the  same  entities  in  the  world   –  Sarah  asked  her  father  to  look  at  her.  He   appreciated  that  his  eldest  daughter  wanted  to   speak  frankly.   •  ≈  anaphora  resolution  ≈  pronoun  resolution  ≈   entity  resolution  
  • 88. Coreference  resolution  warnings   •  Warning:  The  tools  we  have  looked  at  so  far  work   one  sentence  at  a  time  –  or  use  the  whole   document  but  ignore  all  structure  and  just  count   –  but  coreference  uses  the  whole  document   •  The  resources  used  will  grow  with  the  document   size  –  you  might  want  to  try  a  chapter  not  a  novel   •  Coreference  systems  normally  require   processing  with  parsers,  NER,  etc.  first,  and  use   of  lexicons  
  • 89. Coreference resolution warnings
      • English-only for the moment….
      • While there are some papers on coreference resolution in other languages, I am aware of no downloadable coreference systems for any language other than English
      • For English, there are a good number of downloadable systems, but their performance remains modest. It's just not like POS tagging, NER or parsing
  • 90. Coreference  resolution  warnings   Nevertheless,  it’s  not  yet  known  to  the  State  of   California  to  cause  cancer,  so  let’s  continue….  
  • 91. Stanford CoreNLP
      http://nlp.stanford.edu/software/corenlp.shtml
      • Stanford CoreNLP is our new package that ties together a bunch of NLP tools
          – POS tagging
          – Named Entity Recognition
          – Parsing
          – and Coreference Resolution
      • Output is an XML representation [only choice at present]
      • Contains a state-of-the-art coreference system!
  • 92. Stanford CoreNLP
      $ java -mx3g -Dfile.encoding=utf-8 -cp "Software/stanford-corenlp-2011-06-08/stanford-corenlp-2011-06-08.jar:Software/stanford-corenlp-2011-06-08/stanford-corenlp-models-2011-06-08.jar:Software/stanford-corenlp-2011-06-08/xom.jar:Software/stanford-corenlp-2011-06-08/jgrapht.jar" edu.stanford.nlp.pipeline.StanfordCoreNLP -file "RiderHaggard/Hunter Quatermain's Story 2728.txt" -outputDirectory corenlp
  • 93. What Stanford CoreNLP gives
      – Sarah asked her father to look at her .
      – He appreciated that his eldest daughter wanted to speak frankly .
      • Coreference resolution graph
          – sentence 1, headword 1 (gov)
              – sentence 1, headword 3
          – sentence 1, headword 4 (gov)
              – sentence 2, headword 1
              – sentence 2, headword 4
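The coreference graph above can be read as links from each governing mention to the mentions that refer back to it: "Sarah" (sentence 1, word 1) governs "her" (1, 3), and "father" (1, 4) governs "He" (2, 1) and "his" (2, 4). A sketch of grouping such links into clusters, using a hypothetical list of (governor, mention) pairs rather than the actual CoreNLP XML schema:

```python
from collections import defaultdict


def coref_clusters(links):
    """Group (governor, mention) links into {governor: [governor, mentions...]}."""
    grouped = defaultdict(list)
    for gov, mention in links:
        grouped[gov].append(mention)
    return {gov: [gov] + mentions for gov, mentions in grouped.items()}


if __name__ == "__main__":
    # Each mention is (sentence number, headword position), as on the slide.
    links = [((1, 1), (1, 3)), ((1, 4), (2, 1)), ((1, 4), (2, 4))]
    sentences = {
        1: "Sarah asked her father to look at her .".split(),
        2: "He appreciated that his eldest daughter wanted to speak frankly .".split(),
    }
    for cluster in coref_clusters(links).values():
        print([sentences[s][w - 1] for s, w in cluster])
```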
  • 95. THE  REST  OF  THE   LANGUAGES  OF  THE   WORLD    
  • 96. English-only?
      • There are a lot of languages out there in the world!
      • But there are a lot more NLP tools for English than anything else
      • However, there is starting to be fairly reasonable support (or the ability to build it) for most of the top 50 or so languages…
      • I'll say a little about that, since some people are definitely interested, even if I've covered mainly English
  • 97. POS taggers for many languages?
      • Two choices:
          1. Find a tagger with an existing model for the language (and period) of interest
          2. Find POS-tagged training data for the language (and period) of interest and train your own tagger
      • Most downloadable taggers allow you to train new models – e.g., the Stanford POS tagger
          – But it may involve considerable data preparation work and understanding and not be for the faint-hearted
  • 98. POS taggers for many languages?
      • One tagger with good existing multi-lingual support
          – TreeTagger (Helmut Schmid)
              • http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
              • Bulgarian, Chinese, Dutch, English, Estonian, French, Old French, Galician, German, Greek, Italian, Latin, Portuguese, Russian, Spanish, Swahili
              • Free for non-commercial, not open source; Linux, Mac, Sparc (not Windows)
          – Stanford POS Tagger presently comes with:
              • English, Arabic, Chinese, German
      • One place to look for more resources:
          – http://nlp.stanford.edu/links/statnlp.html
              • But it's always out of date, so also try a Google search
  • 99. Chinese example
      • Chinese doesn't put spaces between words
          – Nor did Ancient Greek
      • So almost all tools first require word segmentation
      • I demonstrate the Stanford Chinese Word Segmenter
      • http://nlp.stanford.edu/software/segmenter.shtml
      • Even in English, words need some segmentation – often called tokenization
      • It was being implicitly done before further processing in the examples till now: “I'll go.” → “ I 'll go . ”
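A very crude regular-expression sketch of English tokenization, enough to reproduce the "I'll go." example with straight quotes. It is not the tokenizer used by the Stanford tools: it splits off any apostrophe-initial piece, so it would turn "don't" into "don" + "'t" rather than the Penn Treebank style "do" + "n't", and it treats every punctuation character as its own token.

```python
import re

# A word, or an apostrophe followed by letters ('ll, 's), or one punctuation mark.
TOKEN = re.compile(r"\w+|'\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, clitic and punctuation tokens."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    print(tokenize('"I\'ll go."'))
```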
  • 100. Chinese example
      $ ../Software/stanford-chinese-segmenter-2010-03-08/segment.sh ctb Xinhua.txt utf-8 0 > Xinhua.seg
      $ java -mx300m -cp ../Software/stanford-postagger-full-2011-05-18/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ../Software/stanford-postagger-full-2011-05-18/models/chinese.tagger -textFile Xinhua.seg > Xinhua.tag
  • 101. Chinese example
      #  space before  below!
      $ perl -pe 'if ( ! m/^\s*$/ && ! m/^.{100}/) { s/$/ /; }' < Xinhua.seg > Xinhua.seg.fixed
      $ java -mx600m -cp ../Software/stanford-parser-2011-06-15/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 ../Software/stanford-parser-2011-04-17/chineseFactored.ser.gz Xinhua.seg.fixed > Xinhua.parsed
      $ java -mx1g -cp ../Software/stanford-parser-2011-06-15/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 -outputFormat typedDependencies ../Software/stanford-parser-2011-04-17/chineseFactored.ser.gz Xinhua.seg.fixed > Xinhua.sd
  • 102. Other tools
      • Dependency parsers are now available for many languages, especially via MaltParser:
          – http://maltparser.org/
      • For instance, it's used to provide a Russian parser among the resources here:
          – http://corpus.leeds.ac.uk/mocky/
      • The OPUS (Open Parallel Corpus) collects tools for various languages:
          – http://opus.lingfil.uu.se/trac/wiki/Tagging%20and%20Parsing
      • Look around!
  • 103. Data sources
      • Parsers depend on annotated data (treebanks)
      • You can use a parser trained on news articles, but better resources for humanities scholars will depend on community efforts to produce better data
      • One effort is the construction of Greek and Latin dependency treebanks by the Perseus Project:
          – http://nlp.perseus.tufts.edu/syntax/treebank/
  • 105. Applications?  (beyond  word  counts)   •  There  are  starting  to  be  a  few  applications  in  the   humanities  using  richer  NLP  methods:   •  But  only  a  few….  
  • 106. Applications? (beyond word counts)
      – Cameron Blevins. 2011. Topic Modeling Historical Sources: Analyzing the Diary of Martha Ballard. DH 2011.
          • Uses (latent variable) topic models (LDA and friends)
              – Topic models are primarily used to find themes or topics running through a group of texts
              – But, here, also helpful for dealing with spelling variation (!)
              – Uses MALLET (http://mallet.cs.umass.edu/), a toolkit with a fair amount of stuff for text classification, sequence tagging and topic models
                  » We also have the Stanford Topic Modeling Toolbox
                      • http://nlp.stanford.edu/software/tmt/tmt-0.3/
          • Examines change in diary entry topics over time
  • 107. Applications?  (beyond  word  counts)   –  David  K.  Elson,  Nicholas  Dames,  Kathleen  R.   McKeown.  2010.  Extracting  Social  Networks  from   Literary  Fiction.  ACL  2010.   •  How  size  of  community  in  novel  or  world  relates  to   amount  of  conversation   –  (Stanford)  NER  tagger  to  identify  people  and  organizations   –  Heuristically  matching  to  name  variants/shortenings   –  System  for  speech  attribution  (Elson  &  McKeown  2010)   –  Social  network  construction   •  Results  showing  that  urban  novel  social  networks  are   not  richer  than  those  in  rural  settings,  etc.  
  • 108. Applications? (beyond word counts)
      – Aditi Muralidharan. 2011. A Visual Interface for Exploring Language Use in Slave Narratives. DH 2011. http://bebop.berkeley.edu/wordseer
          • A visualization and reading interface to American Slave Narratives
              – (Stanford) Parser used to allow searching of particular grammatical relationships: grammatical search
              – Visualization tools to show a word's distribution in text and to provide a “collapsed concordance” view – and for close reading
          • Example application is exploring relationship with God
  • 109. Parting  words     This  talk  has  been  about  tools  –     they’re  what  I  know     But  you  should  focus  on  disciplinary  insight  –   not  on  building  corpora  and  tools,  but  on  using    them  as  tools  for  producing  disciplinary  research