Natural Language Processing Tools for the Digital Humanities

Christopher Manning
Stanford University
Digital Humanities 2011

http://nlp.stanford.edu/~manning/courses/DigitalHumanities/

My humanities qualifications

•  B.A. (Hons), Australian National University
•  Ph.D. Linguistics, Stanford University
•  But:
   –  I’m not sure I’ve ever taken a real humanities class (if you discount linguistics classes and high school English…)

SO, FEEL FREE TO ASK QUESTIONS!

The promise

Phrase Net visualization of Pride & Prejudice (* (in|at) *)

http://www-958.ibm.com/software/data/cognos/manyeyes/

“How I write” [code]

•  I think you tend to get too much of people showing the glitzy output of something
•  So, for this tutorial, at least in the slides, I’m trying to include the low-level hacking and plumbing
•  It’s a standard truism of data mining that more time goes into “data preparation” than anything else. Definitely goes for text processing.

Outline

1.  Introduction
2.  Getting some text
3.  Words
4.  Collocations, etc.
5.  NLP Frameworks and tools
6.  Part-of-speech tagging
7.  Named entity recognition
8.  Parsing
9.  Coreference resolution
10.  The rest of the languages of the world
11.  Parting words

2. GETTING SOME TEXT

First step: Text

•  To do anything, you need some texts!
   –  Many sites give you various sorts of search-and-display interfaces
   –  But, normally you just can’t do what you want in NLP for the Digital Humanities unless you have a copy of the texts sitting on your computer
   –  This may well change in the future: There is increasing use of cloud computing models where you might be able to upload code to run it on data on a server
      •  or, conversely, upload data to be processed by code on a server

First step: Text

•  People in the audience are probably more familiar with the state of play here than me, but my impression is:
   1.  There are increasingly good supplies of critical texts in well-marked-up XML available commercially for license to university libraries
   2.  There are various, more community-based efforts to produce good digitized collections, but most of those seem to be available to “friends” rather than to anybody with a web browser
   3.  There’s Project Gutenberg
      •  Plain text, or very simple HTML, which may or may not be automatically generated
      •  Unicode UTF-8 if you’re lucky, US-ASCII if you’re not

1.  Early English Books Online
   •  TEI-compliant XML texts
   •  http://eebo.chadwyck.com/

2.  Old Bailey Online

Running example: H. Rider Haggard

•  The hugely popular King Solomon's Mines (1885) by H. Rider Haggard is sometimes considered the first of the “Lost World” or “Imperialist Romance” genres
•  Allan Quatermain (1887)
•  She (1887)
•  Nada the Lily (1892)
•  Ayesha: The Return of She (1905)
•  She and Allan (1921)
•  Zip file at: http://nlp.stanford.edu/~manning/courses/DigitalHumanities/

Interfaces to tools

[Diagram: web applications, programming APIs, command-line applications, GUI applications]

You’ll need to program

•  Lisa Spiro, TAMU Digital Scholarship 2009:

I’m a digital humanist with only limited programming
skills (Perl & XSLT). Enhancing my programming
skills would allow me to:
•  Avoid so much tedious, manual work
•  Do citation analysis
•  Pre-process texts (remove the junk)
•  Automatically download web pages
•  And much more…

You’ll need to program

•  Program in what?
   –  Perl
      •  Traditional seat-of-the-pants scripting language for text processing (it nailed flexible regex). I use it some below….
   –  Python
      •  Cleaner, more modern scripting language with a lot of energy, and the best-documented NLP framework, NLTK.
   –  Java
      •  There are more NLP tools for Java than any other language. And it’s one of the most popular languages in general. Good regular expressions, Unicode, etc.

You’ll need to program

•  Program with what?
   –  There are some general skills that you’ll want that cut across programming languages
      •  Regular expressions
      •  XML, especially XPath and XSLT
      •  Unicode
•  But I’m wisely not going to try to teach programming or these skills in this tutorial


Grabbing files from websites

•  wget (Linux) or curl (Mac OS X, BSD)
   –  wget http://www.gutenberg.org/browse/authors/h
   –  curl -O
•  If you really want to use your browser, there are things you can get like this Firefox plug-in
   –  DownThemAll http://www.downthemall.net/
   but then you just can’t do things as flexibly

Grabbing files from websites

#!/usr/bin/perl
# Reads the Gutenberg author index page on stdin and prints one set of
# curl commands per English-language Haggard ebook.

# Skip ahead to the start of the Haggard entries
while (<>) {
  last if (m/Haggard/);
}

while (<>) {
  # Stop at the next author
  last if (m/Hague/);
  if (m!pgdbetext"><a href="/ebooks/(\d+)">(.*)</a> \(English\)!) {
    $title = $2;
    $num = $1;
    $title =~ s/<br>/ /g;
    $title =~ s/\r//g;
    print "curl -o \"$title $num.txt\" http://www.gutenberg.org/cache/epub/$num/pg$num.txt\n";
    # Expect only one of the html to exist
    print "curl -o \"$title $num.html\" http://www.gutenberg.org/files/$num/$num-h/$num-h.htm\n";
    print "curl -o \"$title $num-g.html\" http://www.gutenberg.org/cache/epub/$num/pg$num.html\n";
  }
}

Grabbing files from websites

wget
perl getHaggard.pl < h > h.sh
chmod 755 h.sh
./h.sh
# and a bit of futzing by hand that I will leave out….

•  Often you want the 90% solution: automating nothing would be slow and painful, but automating everything is more trouble than it’s worth for a one-off process

Typical text problems

"Devilish strange!" thought he, chuckling to himself; "queer business! Capital trick of the cull in the cloak to make another person's brat stand the brunt for his own---capital! ha! ha! Won't do, though. He must be a sly fox to get out of the Mint without my

[Page 59 ]

knowledge. I've a shrewd guess where he's taken refuge; but I'll ferret him out. These bloods will pay well for his capture; if not, he'll pay well to get out of their hands; so I'm safe either way---ha! ha! Blueskin," he added aloud, and motioning that worthy, "follow me."

Upon which, he set off in the direction of the entry. His progress, however, was checked by loud acclamations, announcing the arrival of the Master of the Mint and his train.

Baptist Kettleby (for so was the Master named) was a "goodly portly man, and a corpulent," whose fair round paunch bespoke the affection he entertained for good liquor and good living. He had a quick, shrewd, merry eye, and a look in which duplicity was agreeably veiled by good humour. It was easy to discover that he was a knave, but equally easy to perceive that he was a pleasant fellow; a combination of qualities by no means of rare occurrence. So far as regards his attire, Baptist was not seen to advantage. No great lover of state or state costume at any time, he was

[Page 60 ]

generally, towards the close of an evening, completely in dishabille, and in this condition he now presented himself to his subjects. His shirt was unfastened, his vest unbuttoned, his hose ungartered; his feet were stuck into a pair of pantoufles, his arms into a greasy flannel dressing-gown, his head into a thrum-cap, the cap into a tie-periwig, and the wig into a gold-edged hat. A white apron was tied round his waist, and into the apron was thrust a short thick truncheon, which looked very much like a rolling-pin.

The Master of the Mint was accompanied by another gentleman almost as portly as himself, and quite as deliberate in his movements. The costume of this personage was somewhat singular, and might have passed for a masquerading habit, had not the imperturbable gravity of his demeanour forbidden any such supposition. It consisted of a close jerkin of brown frieze, ornamented with a triple row of brass buttons; loose Dutch slops, made very wide in the seat and very tight at the knees; red stockings with black clocks, and

[Page 61 ]

a fur cap. The owner of this dress had a broad weather-beaten face, small twinkling eyes, and a bushy, grizzled beard. Though he walked by the side of the governor, he seldom exchanged a word with him, but appeared wholly absorbed in the contemplations inspired by a broad-bowled Dutch pipe.

There are always text-processing gotchas …

•  … and not dealing with them can badly degrade the quality of subsequent NLP processing.
   1.  The Gutenberg *.txt files frequently represent italics with _underscores_.
   2.  There may be file headers and footers
   3.  Elements like headings may be run together with following sentences if not demarcated or eliminated (example later).

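There are always text-processing gotchas …

#!/usr/bin/perl
# Prints only the body of a Project Gutenberg text: everything between the
# "*** START" and "*** END" marker lines, with italics underscores removed.

$finishedHeader = 0;
$startedFooter = 0;

while ($line = <>) {
  # The footer begins at the "*** END" marker
  if ($line =~ /^\*\*\*\s*END/ && $finishedHeader) {
    $startedFooter = 1;
  }
  if ($finishedHeader && ! $startedFooter) {
    $line =~ s/_//g;   # minor cleanup of italics
    print $line;
  }
  # The header ends at the "*** START" marker
  if ($line =~ /^\*\*\*\s*START/ && ! $finishedHeader) {
    $finishedHeader = 1;
  }
}

if ( ! ($finishedHeader && $startedFooter)) {
  print STDERR "**** Probable book format problem!\n";
}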

In the beginning was the word

•  Word counts
•  Word counts are the basis of all the simple, first order methods of text analysis
   –  tag clouds, collocations, topic models
•  Sometimes you can get a fair distance with word counts
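
To make "word counts" concrete, here is a minimal Python sketch. The tokenizer (lowercased runs of letters, with an optional internal apostrophe) and the function name `word_counts` are my own simplifying assumptions for illustration; a real project would want a more careful tokenizer.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, split it into crude word tokens, and count them."""
    # Assumption: a token is a run of ASCII letters, optionally with one
    # internal apostrophe (so "she's" stays together). Accented letters and
    # hyphenated words are not handled.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

sample = "She who must be obeyed. She is obeyed; she's Ayesha."
counts = word_counts(sample)
print(counts.most_common(3))
```

The resulting `Counter` is the raw material for tag clouds and for comparing word frequencies across books.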

She (1887) (http://wordle.net/, Jonathan Feinberg)

Ayesha: The Return of She (1905)

She and Allan (1921)

Wisdom's Daughter: The Life and Love Story of She-Who-Must-Be-Obeyed (1923)

Google Books Ngram Viewer

http://ngrams.googlelabs.com/

Google Books Ngram Viewer

•  … you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer … Digital humanities needs gateway drugs. … “Culturomics” sounds like an 80s new wave band. If we’re going to coin neologisms, let’s at least go with Sean Gillies’ satirical alternative: Freakumanities.… For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading

Language change: as least as

C. D. Manning. 2003. Probabilistic Syntax

•  I found this example in Russo R., 2001, Empire Falls (on p.3!):
   –  By the time their son was born, though, Honus Whiting was beginning to understand and privately share his wife’s opinion, as least as it pertained to Empire Falls.
•  What’s interesting about it?

Language change: as least as

•  A language change in progress? I found a bunch of other examples:
   –  Indeed, the will and the means to follow through are as least as important as the initial commitment to deficit reduction.
   –  As many of you know he had his boat built at the same time as mine and it’s as least as well maintained and equipped.
•  Apparently not a “dialect”
   –  Second, if the required disclosures are made by on-screen notice, the disclosure of the vendor’s legal name and address must appear on one of several specified screens on the vendor’s electronic site and must be at least as legible and set in a font as least as large as the text of the offer itself.

Language change: as least as

Using a text editor

•  You can get a fair distance with a text editor that allows multi-file searches, regular expressions, etc.
   –  It’s like a little concordancer that’s good for close reading
•  jEdit http://www.jedit.org/
•  BBedit on Windows
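
If you would rather script the search than use an editor, a keyword-in-context (KWIC) listing is only a few lines of Python. This is a rough sketch under my own assumptions: the function name `kwic`, the 30-character context width, and the sample text (adapted from the "as least as" examples) are all illustrative choices.

```python
import re

def kwic(text, pattern, width=30):
    """Return (left context, match, right context) for each regex match."""
    flat = " ".join(text.split())  # collapse line breaks so contexts read cleanly
    hits = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

sample = ("it's as least as well maintained and equipped. "
          "It must be at least as legible as the offer itself.")
for left, hit, right in kwic(sample, r"\bas least as\b"):
    print(f"{left:>30} [{hit}] {right}")
```

Changing the pattern to `r"\b(as|at) least as\b"` lists the standard and the innovative variant side by side.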

Traditional Concordancers

•  WordSmith Tools: Commercial; Windows
   –  http://www.lexically.net/wordsmith/
•  Concordance: Commercial; Windows
   –  http://www.concordancesoftware.co.uk/
•  AntConc: Free; Windows, Mac OS X (only under X11); Linux
   –  http://www.antlab.sci.waseda.ac.jp/antconc_index.html
•  CasualConc: Free; Mac OS X
   –  http://sites.google.com/site/casualconc/
      •  by Yasu Imao

The decline of honour

5. NLP FRAMEWORKS AND TOOLS

The Big 3 NLP Frameworks

•  GATE – General Architecture for Text Engineering (U. Sheffield)
   •  http://gate.ac.uk/
   •  Java, quite well maintained (now)
   •  Includes tons of components
•  UIMA – Unstructured Information Management Architecture. Originally IBM; now Apache project
   •  http://uima.apache.org/
   •  Professional, scalable, etc.
   •  But, unless you’re comfortable with XML, Eclipse, Java or C++, etc., I think it’s a non-starter
•  NLTK – Natural Language Toolkit (started by Steven Bird)
   •  http://www.nltk.org/
   •  Big community; large Python package; corpora and books about it
   •  But it’s code modules and API, no GUI or command-line tools
   •  Like R for NLP. But, hey, R’s becoming very successful….

The main NLP Packages

•  NLTK (Python)
   –  http://www.nltk.org/
•  OpenNLP
   –  http://incubator.apache.org/opennlp/
•  Stanford NLP
   –  http://nlp.stanford.edu/software/
•  LingPipe
   –  http://alias-i.com/lingpipe/
•  More one-off packages than I can fit on this slide
   –  http://nlp.stanford.edu/links/statnlp.html

NLP tools: Rules of thumb for 2011

1.  Unless you’re unlucky, the tool you want to use will work with Unicode (at least BMP), so most any characters are okay
2.  Unless you’re lucky, the tool you want to use will work only on completely plain text, or extremely simple XML-style mark-up (e.g., <s> … </s> around sentences, recognized by regexp)
3.  By default, you should assume that any tool for English was trained on American newswire

Rule-based NLP and Statistical/Machine Learning NLP

•  Most work on NLP in the 1960s, 70s and 80s was with hand-built grammars and morphological analyzers (finite state transducers), etc.
   –  ANNIE in GATE is still in this space
•  Most academic research work in NLP in the 1990s and 2000s uses probabilistic or more generally machine learning methods (“Statistical NLP”)
   –  The Stanford NLP tools and MorphAdorner, which we will come to soon, are in this space

Rule-based NLP and Statistical/Machine Learning NLP

•  Hand-built grammars are fine for tasks in a closed space which do not involve reasoning about contexts
   –  E.g., finding the possible morphological parses of a word
•  In the old days they worked really badly on “real text”
   –  They were always insufficiently tolerant of the variability of real language
   –  But, built with modern, empirical approaches, they can do reasonably well
      •  ANNIE is an example of this

Rule-based NLP and Statistical/Machine Learning NLP

•  In Statistical NLP:
   –  You gather corpus data, and usually hand-annotate it with the kind of information you want to provide, such as part-of-speech
   –  You then train (or “learn”) a model that learns to try to predict annotations based on features of words and their contexts via numeric feature weights
   –  You then apply the trained model to new text
•  This tends to work much better on real text
   –  It more flexibly handles contextual and other evidence
•  But the technology is still far from perfect, it requires annotated data, and degrades (sometimes very badly) when there are mismatches between the training data and the runtime data

How much hardware do you need?

•  NLP software often needs plenty of RAM (especially) and processing power
•  But these days we have really powerful laptops!
•  Some of the software I show you could run on a machine with 256 MB of RAM (e.g., Stanford Parser), but much of it requires more
•  Stanford CoreNLP requires a machine with 4GB of RAM
•  I ran everything in this tutorial on the laptop I’m presenting on … 4GB RAM, 2.8 GHz Core 2 Duo
•  But it wasn’t always pleasant writing the slides while software was running….

How much hardware do you need?

•  Why do you need more hardware?
   –  More speed
      •  It took me 95 minutes to run Ayesha, the Return of She through Stanford CoreNLP on my laptop….
   –  More scale
      •  You’d like to be able to analyze 1 million books
•  Order of magnitude rules of thumb:
   –  POS tagging, NER, etc: 5,000–10,000 words/second
   –  Parsing: 1–10 sentences per second
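
To get a feel for what these rules of thumb mean at scale, here is a back-of-the-envelope calculation in Python. The figure of 100,000 words per book is my own round-number assumption (the novels used here run to roughly 100,000 to 130,000 words), and `hours_to_process` is just an illustrative helper.

```python
def hours_to_process(n_books, words_per_book, words_per_second):
    """Wall-clock hours for a single machine running at a fixed throughput."""
    return n_books * words_per_book / words_per_second / 3600

WORDS_PER_BOOK = 100_000  # assumption: a round figure for one novel

for rate in (5_000, 10_000):
    hours = hours_to_process(1_000_000, WORDS_PER_BOOK, rate)
    print(f"{rate} words/s: {hours:,.0f} hours (~{hours / 24:,.0f} days)")
```

Under these assumptions, tagging a million books takes on the order of 116 to 231 machine-days, which is why spreading the work over several machines matters.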

How much hardware do you need?

•  Luckily, most of our problems are trivially parallelizable
   –  Each book/chapter can be run separately, perhaps on a separate machine
•  What do we actually use?
   –  We do most of our computing on rack mounted Linux servers
      •  Currently 4 x quad core Xeon processors with 24 GB of RAM seem about the sweet spot
      •  About $3500 per machine … not like the old days
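
"Trivially parallelizable" can be sketched in a few lines of Python: each chapter is handed to a worker independently and the results come back in input order. The names `process_all` and `count_words` are hypothetical, and `count_words` is only a stand-in for a real analysis step. Threads are enough when each worker mostly waits on an external tool; for CPU-heavy pure-Python work you would more likely use separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(text):
    """Stand-in for a real per-chapter analysis step."""
    return len(text.split())

def process_all(chapters, worker, max_workers=4):
    """Apply worker to every chapter independently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chapters))

chapters = [
    "It was a dreadful place .",
    "She spoke .",
    "We fled the caves of Kor .",
]
print(process_all(chapters, count_words))
```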

6. PART-OF-SPEECH TAGGING

Part-of-Speech Tagging

•  Part-of-speech tagging is normally done by a sequence model (acronyms: HMM, CRF, MEMM/CMM)
   –  A POS tag is to be placed above each word
   –  The model considers a local context of possible previous and following POS tags, the current word, neighboring words, and features of them (capitalized?, ends in -ing?)
   –  Each such feature has a weight, and the evidence is combined, and the most likely sequence of tags (according to the model) is chosen

RB    NNP  NNP    RB    VBD    ,  JJ    NNS
When  Mr.  Holly  last  wrote  ,  many  years
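
The idea of weighted features can be sketched in Python as follows. Everything here is illustrative: the feature names, the hand-set numbers in `WEIGHTS`, and the helper functions are my own assumptions, not the internals of any real tagger. The sketch also scores a single position in isolation, whereas a real sequence model learns its weights from annotated text and searches over whole tag sequences.

```python
def token_features(words, i, prev_tag):
    """Binary features for the word at position i, given the previous tag."""
    word = words[i]
    feats = {f"word={word.lower()}", f"prev_tag={prev_tag}"}
    if word[0].isupper():
        feats.add("capitalized")
    if word.lower().endswith("ing"):
        feats.add("ends_in_ing")
    if i + 1 < len(words):
        feats.add(f"next_word={words[i + 1].lower()}")
    return feats

def score_tags(feats, weights, tags):
    """Combine the evidence: sum the weights of active features per tag."""
    return {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}

# Hypothetical hand-set weights; a trained tagger would learn these
WEIGHTS = {
    ("capitalized", "NNP"): 2.0,
    ("prev_tag=NNP", "NNP"): 1.0,
    ("ends_in_ing", "VBG"): 2.5,
    ("word=last", "RB"): 0.5,
    ("word=last", "JJ"): 0.4,
}

words = "When Mr. Holly last wrote".split()
scores = score_tags(token_features(words, 2, "NNP"), WEIGHTS,
                    ["NNP", "RB", "JJ", "VBG"])
print(max(scores, key=scores.get))
```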

Stanford POS tagger

http://nlp.stanford.edu/software/tagger.shtml

$ java -mx1g -cp ../Software/stanford-postagger-full-2011-06-19/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ../Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger -outputFormat tsv -tokenizerOptions untokenizable=allKeep -textFile She\ 3155.txt > She\ 3155.tsv

Loading default properties from trained tagger ../Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger
Reading POS tagger model from ../Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger ... done [2.2 sec].
Jun 15, 2011 8:17:15 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ? (U+1FBD, decimal: 8125)
Tagged 132377 words at 5559.72 words per second.

[Annotation on the warning: Greek stand-alone Koronis character (a little obscure?)]

Stanford POS tagger

•  For the second time you do it…

$ alias stanfordtag "java -mx1g -cp /Users/manning/Software/stanford-postagger-full-2011-06-19/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model /Users/manning/Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger -outputFormat tsv -tokenizerOptions untokenizable=allKeep -textFile"

$ stanfordtag RiderHaggard/King\ Solomon\'s\ Mines\ 2166.txt > tagged/King\ Solomon\'s\ Mines\ 2166.tsv

Reading POS tagger model from /Users/manning/Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger ... done [2.1 sec].
Tagged 98178 words at 9807.99 words per second.

MorphAdorner

http://morphadorner.northwestern.edu/

•  MorphAdorner is a set of NLP tools developed at Northwestern by Martin Mueller and colleagues specifically for English language fiction, over a long historical period from Early Modern English (EME) onwards
   –  lemmatizer, named entity recognizer, POS tagger, spelling standardizer, etc.
•  Aims to deal with variation in word breaking and spelling over this period
•  Includes its own POS tag set: NUPOS

MorphAdorner

$ ./adornplaintext temp temp/3155.txt
2011-06-15 20:30:52,111 INFO - MorphAdorner version 1.0
2011-06-15 20:30:52,111 INFO - Initializing, please wait...
2011-06-15 20:30:52,318 INFO - Using Trigram tagger.
2011-06-15 20:30:52,319 INFO - Using I retagger.
2011-06-15 20:30:53,578 INFO - Loaded word lexicon with 151,922 entries in 2 seconds.
2011-06-15 20:30:55,920 INFO - Loaded suffix lexicon with 214,503 entries in 3 seconds.
2011-06-15 20:30:57,927 INFO - Loaded transition matrix in 3 seconds.
2011-06-15 20:30:58,137 INFO - Loaded 162,248 standard spellings in 1 second.
2011-06-15 20:30:58,697 INFO - Loaded 5,434 alternative spellings in 1 second.
2011-06-15 20:30:58,703 INFO - Loaded 349 more alternative spellings in 14 word classes in 1 second.
2011-06-15 20:30:58,713 INFO - Loaded 0 names into name standardizer in < 1 second.
2011-06-15 20:30:58,779 INFO - 1 file to process.
2011-06-15 20:30:58,789 INFO - Before processing input texts: Free memory: 105,741,696, total memory: 480,694,272
2011-06-15 20:30:58,789 INFO - Processing file 'temp/3155.txt' .
2011-06-15 20:30:58,789 INFO - Adorning temp/3155.txt with parts of speech.
2011-06-15 20:30:58,832 INFO - Loaded text from temp/3155.txt in 1 second.
2011-06-15 20:31:01,498 INFO - Extracted 131,875 words in 4,556 sentences in 3 seconds.
2011-06-15 20:31:03,860 INFO - lines: 1,000; words: 27,756
2011-06-15 20:31:04,364 INFO - lines: 2,000; words: 58,728
2011-06-15 20:31:04,676 INFO - lines: 3,000; words: 84,735
2011-06-15 20:31:04,990 INFO - lines: 4,000; words: 115,396
2011-06-15 20:31:05,152 INFO - lines: 4,556; words: 131,875
2011-06-15 20:31:05,152 INFO - Part of speech adornment completed in 4 seconds. 36,100 words adorned per second.
2011-06-15 20:31:05,152 INFO - Generating other adornments.
2011-06-15 20:31:13,840 INFO - Adornments written to temp/3155-005.txt in 9 seconds.
2011-06-15 20:31:13,840 INFO - All files adorned in 16 seconds.

Ah, the old days!

$ ./adornplaintext temp temp/Hunter Quartermain.txt
2011-06-15 17:18:15,551 INFO - MorphAdorner version 1.0
2011-06-15 17:18:15,552 INFO - Initializing, please wait...
2011-06-15 17:18:15,730 INFO - Using Trigram tagger.
2011-06-15 17:18:15,731 INFO - Using I retagger.
2011-06-15 17:18:16,972 INFO - Loaded word lexicon with 151,922 entries in 2 seconds.
2011-06-15 17:18:18,684 INFO - Loaded suffix lexicon with 214,503 entries in 2 seconds.
2011-06-15 17:18:20,662 INFO - Loaded transition matrix in 2 seconds.
2011-06-15 17:18:20,887 INFO - Loaded 162,248 standard spellings in 1 second.
2011-06-15 17:18:21,300 INFO - Loaded 5,434 alternative spellings in 1 second.
2011-06-15 17:18:21,303 INFO - Loaded 349 more alternative spellings in 14 word classes in 1 second.
2011-06-15 17:18:21,312 INFO - Loaded 0 names into name standardizer in 1 second.
2011-06-15 17:18:21,381 INFO - No files found to process.

•  But it works better if you make sure the filename has no spaces in it

Comparing taggers: Penn Treebank vs. NUPOS

Penn Treebank        NUPOS
Holly     NNP        Holly     n1
,         ,          ,         ,
if        IN         if        cs
you       PRP        you       pn22
will      MD         will      vmb
accept    VB         accept    vvi
the       DT         the       dt
trust     NN         trust     n1
,         ,          ,         ,
I         PRP        I         pns11
am        VBP        am        vbm
going     VBG        going     vvg
to        TO         to        pc-acp
leave     VB         leave     vvi
you       PRP        you       pn22
that      IN         that      d
boy       NN         boy's     ng1
's        POS
sole      JJ         sole      j
guardian  NN         guardian  n1
.         .          .         .

Stylistic factors from POS

[Chart: counts of JJ, MD and DT tags (axis 0 to 14000) in She, Ayesha, She and Allan, and Wisdom's Daughter]
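
Counts like these can be pulled out of tagger output with a short script. The Python sketch below assumes the tagged file has one `word<TAB>tag` pair per line with blank lines between sentences, which is my understanding of the `-outputFormat tsv` output used earlier; check that against your own files. The function name `tag_counts` and the sample lines are illustrative.

```python
from collections import Counter

def tag_counts(tsv_lines, tags=("JJ", "MD", "DT")):
    """Count selected POS tags in word<TAB>tag lines, skipping other lines."""
    counts = Counter()
    for line in tsv_lines:
        parts = line.rstrip("\n").split("\t")
        # Blank sentence separators and malformed lines are ignored
        if len(parts) == 2 and parts[1] in tags:
            counts[parts[1]] += 1
    return counts

sample = [
    "the\tDT\n", "sole\tJJ\n", "guardian\tNN\n",
    "\n",
    "you\tPRP\n", "will\tMD\n", "accept\tVB\n", "the\tDT\n", "trust\tNN\n",
]
print(tag_counts(sample))
```

For a real file, pass an open file handle instead of the sample list, e.g. `tag_counts(open(path, encoding="utf-8"))`.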

7. NAMED ENTITY RECOGNITION (NER)

Named Entity Recognition – “the Chad problem”

Germany’s representative to the European Union’s veterinary committee Werner Zwingman said on Wednesday consumers should …

IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

Conditional Random Fields (CRFs)

O     PER  PER    O     O      O  O     O
When  Mr.  Holly  last  wrote  ,  many  years

•  We again use a sequence model – different problem, but same technology
   –  Indeed, sequence models are used for lots of tasks that can be construed as labeling tasks that require only local context (to do quite well)
•  There is a background label – O – and labels for each class
•  Entities are both segmented and categorized

Stanford NER Features

•  Word features: current word, previous word, next word, a word is anywhere in a +/– 4 word window
•  Orthographic features:
   –  Jenny → Xxxx
   –  IL-2 → XX-#
•  Prefixes and Suffixes:
   –  Jenny → <J, <Je, <Jen, …, nny>, ny>, y>
•  Label sequences
•  Lots of feature conjunctions
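
The orthographic and affix features are easy to sketch in Python. This is my own illustration rather than Stanford NER's actual feature code: in particular, capping runs of the same character class at three is an assumption chosen so that the output matches the two examples above, and the maximum affix length of 3 is likewise arbitrary.

```python
import re

def word_shape(word, max_run=3):
    """Map characters to classes (X, x, #), then cap long runs of one class."""
    classes = []
    for ch in word:
        if ch.isupper():
            classes.append("X")
        elif ch.islower():
            classes.append("x")
        elif ch.isdigit():
            classes.append("#")
        else:
            classes.append(ch)  # punctuation is kept as-is
    shape = "".join(classes)
    # Assumption: runs longer than max_run are truncated to max_run
    return re.sub(r"(.)\1{%d,}" % max_run, lambda m: m.group(1) * max_run, shape)

def affixes(word, max_len=3):
    """Prefixes marked with '<' and suffixes marked with '>'."""
    longest = min(max_len, len(word))
    prefixes = ["<" + word[:n] for n in range(1, longest + 1)]
    suffixes = [word[-n:] + ">" for n in range(longest, 0, -1)]
    return prefixes + suffixes

print(word_shape("Jenny"), word_shape("IL-2"))
print(affixes("Jenny"))
```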

Stanford NER

http://nlp.stanford.edu/software/CRF-NER.shtml

$ java -mx500m -Dfile.encoding=utf-8 -cp Software/stanford-ner-2011-06-19/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier Software/stanford-ner-2011-06-19/classifiers/all.3class.distsim.crf.ser.gz -textFile RiderHaggard/She\ 3155.txt > ner/She\ 3155.ner

For thou shalt rule this <LOCATION>England</LOCATION>----”

"But we have a queen already," broke in <LOCATION>Leo</LOCATION>, hastily.

"It is naught, it is naught," said <PERSON>Ayesha</PERSON>; "she can be overthrown.”

At this we both broke out into an exclamation of dismay, and explained that we should as soon think of overthrowing ourselves.

"But here is a strange thing," said <PERSON>Ayesha</PERSON>, in astonishment; "a queen whom her people love! Surely the world must have changed since I dwelt in <LOCATION>Kôr</LOCATION>."

Statistical parsing

•  One of the big successes of 1990s statistical NLP was the development of statistical parsers
•  These are trained from hand-parsed sentences (“treebanks”), and know statistics about phrase structure and word relationships, and use them to assign the most likely structure to a new sentence
•  They will return a sentence parse for any sequence of words. And it will usually be mostly right
•  There are many opportunities for exploiting this richer level of analysis, which have only been partly realized.

Phrase structure Parsing

•  Phrase structure representations have dominated American linguistics since the 1930s
•  They focus on showing words that go together to form natural groups (constituents) that behave alike
•  They are good for showing and querying details of sentence structure and embedding

Bills on ports and immigration were submitted by Senator Brownback

(S
  (NP
    (NP (NNS Bills))
    (PP (IN on)
      (NP (NNS ports) (CC and) (NN immigration))))
  (VP (VBD were)
    (VP (VBN submitted)
      (PP (IN by)
        (NP (NNP Senator) (NNP Brownback))))))

Dependency parsing

•  A dependency parse shows which words in a sentence modify other words
•  The key notions are governors and their dependents
•  Widespread use: Pāṇini, early Arabic grammarians, diagramming sentences, …

[Dependency tree, shown as relation(governor, dependent)]
nsubjpass(submitted, Bills)
auxpass(submitted, were)
prep(submitted, by)
pobj(by, Brownback)
nn(Brownback, Senator)
appos(Brownback, Republican)
prep(Republican, of)
pobj(of, Kansas)
prep(Bills, on)
pobj(on, ports)
cc(ports, and)
conj(ports, immigration)

Stanford Dependencies

•  SD is a particular dependency representation designed for easy extraction of meaning relationships [de Marneffe & Manning, 2008]
   –  Its basic form in the last slide has each word as is
   –  A “collapsed” form focuses on relations between main words

[Collapsed dependency tree, shown as relation(governor, dependent)]
nsubjpass(submitted, Bills)
auxpass(submitted, were)
agent(submitted, Brownback)
nn(Brownback, Senator)
appos(Brownback, Republican)
prep_of(Republican, Kansas)
prep_on(Bills, ports)
prep_on(Bills, immigration)
conj_and(ports, immigration)

Statistical Parsers

•  There are now many good statistical parsers that are freely downloadable
   –  Constituency parsers
      •  Collins/Bikel Parser
      •  Berkeley Parser
      •  BLLIP Parser = Charniak/Johnson Parser
   –  Dependency parsers
      •  MaltParser
      •  MST Parser
•  But I’ll show the Stanford Parser


Tregex/Tgrep2 – Tools for searching over syntax

dreadful things

She                                 Ayesha
amod(day-18, dreadful-17)           amod(clouds-5, dreadful-2)
amod(day-45, dreadful-44)           amod(debt-26, dreadful-25)
amod(feast-33, dreadful-32)         amod(doom-21, dreadful-20)
amod(fits-51, dreadful-50)          amod(fashion-50, dreadful-47)
amod(form-59, dreadful-58)          amod(form-10, dreadful-7)
amod(laugh-9, dreadful-8)           amod(oath-42, dreadful-41)
amod(manifestation-9, dreadful-8)   amod(road-23, dreadful-22)
amod(manner-29, dreadful-28)        amod(silence-5, dreadful-4)
amod(marshes-17, dreadful-16)       amod(threat-19, dreadful-18)
amod(people-12, dreadful-11)
amod(people-46, dreadful-45)
amod(place-16, dreadful-15)
amod(place-6, dreadful-5)
amod(sight-5, dreadful-4)
amod(spot-13, dreadful-12)
amod(thing-41, dreadful-40)
amod(thing-5, dreadful-4)
amod(tragedy-22, dreadful-21)
amod(wilderness-43, dreadful-42)
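
A list like this can be collected from typed dependency output with a regular expression. The Python sketch below assumes one dependency per line in the `relation(governor-index, dependent-index)` form shown above; the names `DEP` and `governors_of` and the sample lines are illustrative.

```python
import re
from collections import Counter

# relation(governor-index, dependent-index), e.g. amod(day-18, dreadful-17)
DEP = re.compile(r"(\w+)\((.+?)-(\d+), (.+?)-(\d+)\)")

def governors_of(dep_lines, relation, dependent):
    """Count the governors that a given word attaches to via a relation."""
    counts = Counter()
    for line in dep_lines:
        m = DEP.match(line.strip())
        if m and m.group(1) == relation and m.group(4) == dependent:
            counts[m.group(2)] += 1
    return counts

sample = [
    "amod(day-18, dreadful-17)",
    "amod(place-16, dreadful-15)",
    "amod(place-6, dreadful-5)",
    "nsubj(spoke-2, She-1)",
]
print(governors_of(sample, "amod", "dreadful").most_common())
```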

Making use of dependency structure

J. Engelberg Costly Information Processing (AFA, 2009):

•  An efficient market should immediately incorporate all publicly available information.
•  But many studies have shown there is a lag
   –  And the lag is greater on Fridays (!)
•  An explanation for this is that there is a cost to information processing
•  Engelberg tests and shows that soft (textual) information takes longer to be absorbed than hard (numeric) information … it’s higher cost information processing
•  But soft information has value beyond hard information
   –  It’s especially valuable for predicting further out in time

Evidence from earnings announcements
[Engelberg AFA 2009]

•  But how do you use the soft information?
•  Simply using proportion of negative words (from the Harvard General Inquirer lexicon) is a useful predictive feature of future stock behavior

   Although sales remained steady, the firm continues to suffer from rising oil prices.

•  But this [or text categorization] is not enough. In order to refine my analysis, I need to know that the negative sentiment is about oil prices.
•  He thus turns to use of the typed dependencies representation of the Stanford Parser.
   –  Words that negative words relate to are grouped into 1 of 6 categories [5 word lists or “other”]

Evidence from earnings announcements
[Engelberg 2009]

•  In a regression model with many standard quantitative predictors…
   –  Just the negative word fraction is a significant predictor of 3 day or 80 day post earnings announcement abnormal returns (CAR)
      •  Coefficient −0.173, p < 0.05 for 80 day CAR
   –  Negative sentiment about different things has differential effects
      •  Fundamentals: −0.198, p < 0.01 for 80 day CAR
      •  Future: −0.356, p < 0.05 for 80 day CAR
      •  Other: −0.023, p < 0.01 for 80 day CAR
   –  Only some of which analysts pay attention to
      •  Analyst forecast-for-quarter-ahead earnings is predicted by negative sentiment on Environment and Other but not Fundamentals or Future!

Syntactic Packaging and Implicit Sentiment
[Greene 2007; Greene and Resnik 2009]

•  Positive or negative sentiment can be carried by words (e.g., adjectives), but often it isn’t….
   –  These sentences differ in sentiment, even though the words aren’t so different:
      •  A soldier veered his jeep into a crowded market and killed three civilians
      •  A soldier’s jeep veered into a crowded market and three civilians were killed
•  As a measurable version of such issues of linguistic perspective, they define OPUS features
   –  For domain relevant terms, OPUS features pair the word with a syntactic Stanford Dependency:
      •  killed:DOBJ   NSUBJ:soldier   killed:NSUBJ

Predicting Opinions of the Death Penalty
[Greene 2007; Greene and Resnik 2009]

•  Collected pro- and anti- death penalty texts from websites with manual checking
•  Evaluation is by cross-validation: training on some pro- and anti- sites and testing on documents from others [can’t use site-specific nuances]
•  Baseline is word and word bigram features in a support vector machine [SVM = good classifier]

Condition             SVM accuracy
Baseline              72.0%
With OPUS features    88.1%

•  58% error reduction!
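
For anyone wondering where the error reduction figure comes from, it is the relative drop in error rate, taking the error rate to be 100% minus the reported accuracy:

```latex
\text{error reduction}
  = \frac{(100 - 72.0) - (100 - 88.1)}{100 - 72.0}
  = \frac{28.0 - 11.9}{28.0}
  = \frac{16.1}{28.0}
  \approx 57.5\%
```

which rounds to the 58% quoted above.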

9. COREFERENCE RESOLUTION

Coreference resolution

•  The goal is to work out which (noun) phrases refer to the same entities in the world
   –  Sarah asked her father to look at her. He appreciated that his eldest daughter wanted to speak frankly.
•  ≈ anaphora resolution ≈ pronoun resolution ≈ entity resolution

Coreference resolution warnings

•  Warning: The tools we have looked at so far work one sentence at a time – or use the whole document but ignore all structure and just count – but coreference uses the whole document
•  The resources used will grow with the document size – you might want to try a chapter not a novel
•  Coreference systems normally require processing with parsers, NER, etc. first, and use of lexicons

Coreference resolution warnings

•  English-only for the moment….
•  While there are some papers on coreference resolution in other languages, I am aware of no downloadable coreference systems for any language other than English
•  For English, there are a good number of downloadable systems, but their performance remains modest. It’s just not like POS tagging, NER or parsing

Coreference resolution warnings

Nevertheless, it’s not yet known to the State of California to cause cancer, so let’s continue….

Stanford CoreNLP

http://nlp.stanford.edu/software/corenlp.shtml

•  Stanford CoreNLP is our new package that ties together a bunch of NLP tools
–  POS tagging
–  Named Entity Recognition
–  Parsing
–  and Coreference Resolution
•  Output is an XML representation [only choice at present]
•  Contains a state-of-the-art coreference system!

Stanford CoreNLP

$ java -mx3g -Dfile.encoding=utf-8 \
  -cp "Software/stanford-corenlp-2011-06-08/stanford-corenlp-2011-06-08.jar:Software/stanford-corenlp-2011-06-08/stanford-corenlp-models-2011-06-08.jar:Software/stanford-corenlp-2011-06-08/xom.jar:Software/stanford-corenlp-2011-06-08/jgrapht.jar" \
  edu.stanford.nlp.pipeline.StanfordCoreNLP \
  -file "RiderHaggard/Hunter Quatermain's Story 2728.txt" \
  -outputDirectory corenlp
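
When the same command has to be run over many files, it can help to assemble it in a script. The sketch below only builds the argument list for the call shown on this slide, using the jar names, class name, and the `-file` and `-outputDirectory` options that appear there; the function name and its parameters are made up for illustration, and nothing is executed.

```python
import os


def corenlp_command(corenlp_dir, version, input_file, output_dir, memory="3g"):
    """Build the argument list for the java call shown on the slide (nothing is run)."""
    jars = [
        f"stanford-corenlp-{version}.jar",
        f"stanford-corenlp-models-{version}.jar",
        "xom.jar",
        "jgrapht.jar",
    ]
    # os.pathsep is ":" on Linux/Mac and ";" on Windows.
    classpath = os.pathsep.join(os.path.join(corenlp_dir, jar) for jar in jars)
    return [
        "java", f"-mx{memory}", "-Dfile.encoding=utf-8",
        "-cp", classpath,
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-file", input_file,
        "-outputDirectory", output_dir,
    ]


cmd = corenlp_command(
    "Software/stanford-corenlp-2011-06-08",
    "2011-06-08",
    "RiderHaggard/Hunter Quatermain's Story 2728.txt",
    "corenlp",
)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the job. Because the arguments are a list rather than a shell string, the spaces and apostrophe in the file name need no quoting.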

What Stanford CoreNLP gives

–  Sarah asked her father to look at her .
–  He appreciated that his eldest daughter wanted to speak frankly .

•  Coreference resolution graph
–  sentence 1, headword 1 (gov)
–  sentence 1, headword 3
–  sentence 1, headword 4 (gov)
–  sentence 2, headword 1
–  sentence 2, headword 4
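
The graph is easier to read once the (sentence, headword) positions are turned back into words. The sketch below does that for the two tokenized sentences above. It assumes that positions are 1-based and that each mention without "(gov)" attaches to the nearest preceding "(gov)" mention; that grouping is a reading of the listing, not something the output states.

```python
sentences = [
    "Sarah asked her father to look at her .".split(),
    "He appreciated that his eldest daughter wanted to speak frankly .".split(),
]

# (sentence, headword, is_governor) entries copied from the graph, 1-based.
# Assumption: a non-governor mention belongs to the nearest preceding governor.
graph = [(1, 1, True), (1, 3, False), (1, 4, True), (2, 1, False), (2, 4, False)]


def word_at(sentence_number, headword_number):
    """Look up the token at a 1-based sentence and word position."""
    return sentences[sentence_number - 1][headword_number - 1]


chains = []
for sentence_number, headword_number, is_governor in graph:
    word = word_at(sentence_number, headword_number)
    if is_governor:
        chains.append([word])     # start a new chain at each governor
    else:
        chains[-1].append(word)   # attach to the current chain

print(chains)
```

On this reading the system links "her" to "Sarah" and "He" and "his" to "father", while the second "her" and "his eldest daughter" do not appear in the graph at all.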

10. THE REST OF THE LANGUAGES OF THE WORLD

English-only?

•  There are a lot of languages out there in the world!
•  But there are a lot more NLP tools for English than anything else
•  However, there is starting to be fairly reasonable support (or the ability to build it) for most of the top 50 or so languages…
•  I’ll say a little about that, since some people are definitely interested, even if I’ve covered mainly English

POS taggers for many languages?

•  Two choices:
1.  Find a tagger with an existing model for the language (and period) of interest
2.  Find POS-tagged training data for the language (and period) of interest and train your own tagger
•  Most downloadable taggers allow you to train new models – e.g., the Stanford POS tagger
–  But it may involve considerable data preparation work and understanding, and not be for the faint-hearted

POS taggers for many languages?

•  One tagger with good existing multi-lingual support
–  TreeTagger (Helmut Schmid)
•  http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
•  Bulgarian, Chinese, Dutch, English, Estonian, French, Old French, Galician, German, Greek, Italian, Latin, Portuguese, Russian, Spanish, Swahili
•  Free for non-commercial, not open source; Linux, Mac, Sparc (not Windows)
–  Stanford POS Tagger presently comes with:
•  English, Arabic, Chinese, German
•  One place to look for more resources:
–  http://nlp.stanford.edu/links/statnlp.html
•  But it’s always out of date, so also try a Google search


Chinese example

•  Chinese doesn’t put spaces between words
–  Nor did Ancient Greek
•  So almost all tools first require word segmentation
•  I demonstrate the Stanford Chinese Word Segmenter
•  http://nlp.stanford.edu/software/segmenter.shtml
•  Even in English, words need some segmentation – often called tokenization
•  It was being implicitly done before further processing in the examples till now:

“I’ll go.”  →  “ I ’ll go . ”
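
To make the tokenization step concrete, here is a minimal regular-expression sketch that reproduces the split shown above. It is not the tokenizer the Stanford tools use: the pattern and the short list of clitics are assumptions chosen for this example, and it does not handle cases such as "n't".

```python
import re

# Alternatives, tried in order:
#   1. an apostrophe plus a common clitic (’ll, 's, ’re, ...)
#   2. a run of letters or digits
#   3. any single character that is neither a word character nor whitespace
TOKEN = re.compile(r"[’'](?:ll|s|re|ve|d|m)\b|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN.findall(text)


tokens = tokenize("“I’ll go.”")
print(" ".join(tokens))
```

Joining the tokens with spaces gives the same six-token sequence as on the slide: opening quote, I, ’ll, go, full stop, closing quote.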

Chinese example

•  $ ../Software/stanford-chinese-segmenter-2010-03-08/segment.sh ctb Xinhua.txt utf-8 0 > Xinhua.seg

•  $ java -mx300m -cp ../Software/stanford-postagger-full-2011-05-18/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ../Software/stanford-postagger-full-2011-05-18/models/chinese.tagger -textFile Xinhua.seg > Xinhua.tag

Chinese example

# note the space in s/$/ / below!

$ perl -pe 'if ( ! m/^\s*$/ && ! m/^.{100}/) { s/$/ /; }' < Xinhua.seg > Xinhua.seg.fixed

$ java -mx600m -cp ../Software/stanford-parser-2011-06-15/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 ../Software/stanford-parser-2011-04-17/chineseFactored.ser.gz Xinhua.seg.fixed > Xinhua.parsed

$ java -mx1g -cp ../Software/stanford-parser-2011-06-15/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 -outputFormat typedDependencies ../Software/stanford-parser-2011-04-17/chineseFactored.ser.gz Xinhua.seg.fixed > Xinhua.sd
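
For readers who do not use Perl, the one-liner on this slide can be approximated in Python: it appends a single space to every line that is not blank and is shorter than 100 characters. This is a sketch, not an exact port. The function name is made up, and Python counts characters, whereas Perl without its Unicode switches counts bytes, so the 100 threshold will behave differently on UTF-8 Chinese text.

```python
import re


def pad_short_lines(lines, width=100):
    """Append a space to every non-blank line shorter than `width` characters."""
    fixed = []
    for line in lines:
        body = line.rstrip("\n")
        newline = line[len(body):]        # keep the original line ending
        if not re.match(r"^\s*$", body) and len(body) < width:
            body += " "                   # same effect as s/$/ / in the Perl version
        fixed.append(body + newline)
    return fixed


example = pad_short_lines(["ab\n", "\n", "x" * 100 + "\n"])
print(example[:2])
```

To apply it to the segmenter output, read `Xinhua.seg` with `encoding="utf-8"`, pass the lines through `pad_short_lines`, and write the result to `Xinhua.seg.fixed`.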

Other tools

•  Dependency parsers are now available for many languages, especially via MaltParser:
–  http://maltparser.org/
•  For instance, it’s used to provide a Russian parser among the resources here:
–  http://corpus.leeds.ac.uk/mocky/
•  The OPUS (Open Parallel Corpus) collects tools for various languages:
–  http://opus.lingfil.uu.se/trac/wiki/Tagging%20and%20Parsing
•  Look around!

Data sources

•  Parsers depend on annotated data (treebanks)
•  You can use a parser trained on news articles, but better resources for humanities scholars will depend on community efforts to produce better data
•  One effort is the construction of Greek and Latin dependency treebanks by the Perseus Project:
–  http://nlp.perseus.tufts.edu/syntax/treebank/

Applications? (beyond word counts)

•  There are starting to be a few applications in the humanities using richer NLP methods:
•  But only a few….

Applications? (beyond word counts)

–  Cameron Blevins. 2011. Topic Modeling Historical Sources: Analyzing the Diary of Martha Ballard. DH 2011.
•  Uses (latent variable) topic models (LDA and friends)
–  Topic models are primarily used to find themes or topics running through a group of texts
–  But, here, also helpful for dealing with spelling variation (!)
–  Uses MALLET (http://mallet.cs.umass.edu/), a toolkit with a fair amount of stuff for text classification, sequence tagging and topic models
»  We also have the Stanford Topic Modeling Toolbox
•  http://nlp.stanford.edu/software/tmt/tmt-0.3/
•  Examines change in diary entry topics over time

Applications? (beyond word counts)

–  David K. Elson, Nicholas Dames, Kathleen R. McKeown. 2010. Extracting Social Networks from Literary Fiction. ACL 2010.
•  How size of community in novel or world relates to amount of conversation
–  (Stanford) NER tagger to identify people and organizations
–  Heuristically matching to name variants/shortenings
–  System for speech attribution (Elson & McKeown 2010)
–  Social network construction
•  Results showing that urban novel social networks are not richer than those in rural settings, etc.
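
The last step of the pipeline above, social network construction, can be pictured as counting how often pairs of characters take part in the same conversation. The sketch below is a simplified illustration of that idea only. It assumes the earlier steps (NER, name matching, speech attribution) have already produced a list of speakers per conversation; the function name and the toy data are hypothetical, and this is not the authors' system.

```python
from collections import Counter
from itertools import combinations


def conversation_network(conversations):
    """Count, for each pair of characters, how many conversations they share."""
    edges = Counter()
    for speakers in conversations:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(speakers)), 2):
            edges[pair] += 1
    return edges


# Hypothetical toy input: one list of speakers per detected conversation.
toy = [["Elizabeth", "Darcy"], ["Elizabeth", "Jane"], ["Darcy", "Elizabeth", "Bingley"]]
network = conversation_network(toy)
print(network[("Darcy", "Elizabeth")])
```

The edge weights form a weighted graph whose size and density can then be compared across novels.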

Applications? (beyond word counts)

–  Aditi Muralidharan. 2011. A Visual Interface for Exploring Language Use in Slave Narratives. DH 2011. http://bebop.berkeley.edu/wordseer
•  A visualization and reading interface to American Slave Narratives
–  (Stanford) Parser used to allow searching of particular grammatical relationships: grammatical search
–  Visualization tools to show a word’s distribution in text and to provide a “collapsed concordance” view – and for close reading
•  Example application is exploring relationship with God

Parting words

This talk has been about tools – they’re what I know

But you should focus on disciplinary insight – not on building corpora and tools, but on using them as tools for producing disciplinary research
