1. Less Grammar, More Features
David Hall, Greg Durrett and Dan Klein @ Berkeley
能地宏 (@nozyh)
NII
2. Claim of this paper
‣ To resolve ambiguity in low-level NLP tasks, features drawn
  from the surface forms of words are sufficient

Sentiment analysis
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts
Stanford University, Stanford, CA 94305, USA
richard@socher.org, {aperelyg,jcchuang,ang}@cs.stanford.edu, {jeaneis,manning,cgpotts}@stanford.edu
Abstract
Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When …
[Figure: parse tree of "This film does n't care about cleverness , wit or any other kind of intelligent humor ." with a sentiment label at every node]
Figure 1: Example of the Recursive Neural Tensor Network accurately predicting 5 sentiment classes, very negative to very positive (– –, –, 0, +, + +), at every node of a parse tree and capturing the negation and its scope in this sentence.
Socher et al.'13

Performance better than Deep Learning

Parsing
Better than the Berkeley parser in many languages
6. He eats sushi with chopsticks
[Two candidate parse trees: one where the PP "with chopsticks" attaches to the NP "sushi", one where it attaches to the VP]
A naive PCFG performs poorly
VP -> V NP   0.2
NP -> NP PP  0.15
VP -> VP PP  0.1
VP attachment: 0.1 × 0.2 = 0.02
NP attachment: 0.2 × 0.15 = 0.03
A PCFG is not enough to resolve the ambiguity
F1-score: 72.1 (rule probabilities estimated from the Treebank)
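The arithmetic above can be sketched in a few lines (a toy illustration, not the authors' code): a derivation's probability under a naive PCFG is just the product of its rule probabilities, independent of the words, so the wrong attachment wins.

```python
# Rule probabilities from the slide.
rule_prob = {
    ("VP", ("V", "NP")): 0.2,
    ("NP", ("NP", "PP")): 0.15,
    ("VP", ("VP", "PP")): 0.1,
}

def derivation_prob(rules):
    """Probability of a derivation = product of its rule probabilities."""
    p = 1.0
    for rule in rules:
        p *= rule_prob[rule]
    return p

# NP attachment: eats [sushi [with chopsticks]]
np_attach = derivation_prob([("VP", ("V", "NP")), ("NP", ("NP", "PP"))])
# VP attachment (the correct reading): [eats sushi] [with chopsticks]
vp_attach = derivation_prob([("VP", ("VP", "PP")), ("VP", ("V", "NP"))])
# np_attach (0.03) beats vp_attach (0.02): the PCFG prefers the wrong parse
```

Because the scores never look at the words, "with chopsticks" and "with tuna" get the same attachment decision.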
7. He eats sushi with chopsticks
[Head-lexicalized parse tree: each node annotated with its head word, e.g. VP[eat], PP[with], NP[sushi]]
Head lexicalization
Eisner'96; Collins'97
• Propagates information from the leaf nodes upward
• Can capture the (eats, with) relation
Drawbacks:
• The number of rules becomes enormous
• Poor extensibility to other languages (depends on head information)
8. Latent annotation (state splitting)
Matsuzaki et al.'05; Petrov et al.'06
He eats sushi with chopsticks
[Parse tree with split symbols: S, PP-1, V-1, VP-2, N-3, NP-4, VP-3, P-1, NP-2]
• Estimates a hidden state at each node
• The implementation in the current Berkeley Parser; F1-score: 90.2
9. Summary of previous methods
‣ Previous methods have basically extracted global information
  by increasing the number of CFG rules
‣ lexicalization: annotate subtrees with head information
 - shift-reduce approaches also fall into this category
   Zhang and Clark'09; Zhu et al.'13
   e.g. VP[eat] → VP[eat] PP[with]
‣ annotate nodes with coarse information
 - based on linguistic analysis: Klein and Manning'03 (Stanford parser)
   e.g. VP^S → VP PP^VP
 - estimated as hidden variables with EM: Petrov et al.'06 (Berkeley parser)
   e.g. VP-3 → VP-2 PP-1
11. Approach of this work
‣ Isn't it enough, for most disambiguation, to look at the surface
  around the span that a rule covers?
He eats sushi with chopsticks
[The same two candidate parse trees, each with surface features conjoined to its competing rule:]
[FIRSTWORD=eats × RULE=VP→V PP]
[SPANLENGTH=5 × RULE=VP→V PP]
[LASTWORD=chop.. × RULE=VP→V PP]
[FIRSTWORD=eats × RULE=VP→V NP]
[SPANLENGTH=5 × RULE=VP→V NP]
[LASTWORD=chop.. × RULE=VP→V NP]
We want negative weights to be learned
12. Result Overview
           Test 40  Test all
Berkeley     90.6     90.1
This work    89.9     89.2
Table 3: Final Parseval results for the v = 1, h = 0 parser on Section 23 of the Penn Treebank.
5.2 Lexical Annotation
Another commonly-used kind of structural annotation is lexicalization (Eisner, 1996; Collins, 1997; Charniak, 1997). By annotating grammar nonterminals with their headwords, the idea is to better model phenomena that depend heavily on the semantics of the words involved, such as coordination and PP attachment. Table 2 shows results from lexicalizing the X- …
              Arabic Basque French German Hebrew Hungarian Korean Polish Swedish   Avg
Dev, all lengths
Berkeley       78.24  69.17  79.74  81.74  87.83    83.90   70.97  84.11   74.50  78.91
Berkeley-Rep   78.70  84.33  79.68  82.74  89.55    89.08   82.84  87.12   75.52  83.28
Our work       78.89  83.74  79.40  83.28  88.06    87.44   81.85  91.10   75.95  83.30
Test, all lengths
Berkeley       79.19  70.50  80.38  78.30  86.96    81.62   71.42  79.23   79.18  78.53
Berkeley-Tags  78.66  74.74  79.76  78.28  85.42    85.22   78.56  86.75   80.64  80.89
Our work       78.75  83.39  79.70  78.43  87.18    88.25   80.18  90.66   82.00  83.17
Table 4: Results for the nine treebanks in the SPMRL 2013 Shared Task; all values are F-scores for sentences of all lengths using the version of evalb distributed with the shared task. Berkeley-Rep is the best single parser from (Björkelund et al., 2013); we only compare to this parser on the development …
Berkeley-Rep: the Berkeley parser with rare words replaced by
feature representations tuned per language
Multilingual data: SPMRL 2013 Shared Task
13. Model: CRF Parsing
Finkel et al.'08
Finkel et al. (2008) and Petrov and Klein (2008a).
Formally, we define the probability of a tree T conditioned on a sentence w as

p(T | w) ∝ exp( θ⊤ Σ_{r ∈ T} f(r, w) )   (1)

where the feature domains r range over the (anchored) rules used in the tree. An anchored rule r is the conjunction of an unanchored grammar rule rule(r) and the start, stop, and split indexes where that rule is anchored, which we refer to as span(r). It is important to note that the richness of the backbone grammar is reflected in the structure of the trees T, while the features that condition directly on the input enter the equation through the anchoring span(r). To optimize model parameters, we use the Adagrad algorithm of Duchi et al. (2010) with L2 regularization.
I eat sushi with chopsticks
[Parse tree showing the CRF factoring over the anchored rules of the tree]
Marginal probabilities computed with Inside-Outside
AdaGrad + L2 (online learning)
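Equation (1)'s unnormalized log-score can be sketched as follows (a minimal illustration with hypothetical feature names and weights, not the released parser): each anchored rule fires sparse indicator features, and the tree's log-score is θ · Σ_{r∈T} f(r, w).

```python
# Hypothetical learned weights over conjoined surface features.
theta = {
    "FIRSTWORD=eats x RULE=VP->VP PP": 1.5,
    "FIRSTWORD=eats x RULE=VP->V NP": 0.8,
    "LASTWORD=chopsticks x RULE=VP->VP PP": 0.7,
}

def rule_features(rule, span, words):
    """Indicator features for one anchored rule: surface properties of the
    span it covers, conjoined with the rule identity."""
    i, j = span
    return [
        f"FIRSTWORD={words[i]} x RULE={rule}",
        f"LASTWORD={words[j - 1]} x RULE={rule}",
    ]

def tree_log_score(anchored_rules, words):
    """Unnormalized log-score: theta dotted with the summed feature vector."""
    return sum(theta.get(feat, 0.0)
               for rule, span in anchored_rules
               for feat in rule_features(rule, span, words))

words = "He eats sushi with chopsticks".split()
# VP attachment: VP over (1,5) rewrites as VP(1,3) PP(3,5)
score = tree_log_score([("VP->VP PP", (1, 5)), ("VP->V NP", (1, 3))], words)
```

Normalizing exp(score) over all trees licensed by the backbone grammar (via Inside-Outside) gives p(T | w).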
14. Feature extraction
averted financial disaster
[Diagram: the rule VP → VBD NP applied over the anchored span "averted financial disaster" (indexes 5 6 7 8)]
Span properties: FIRSTWORD = averted, LASTWORD = disaster, LENGTH = 3
Rule backoffs: RULE = VP → VBD NP, PARENT = VP
Features: FIRSTWORD = averted × RULE = VP → VBD NP, FIRSTWORD = averted × LASTWORD = disaster, LASTWORD = disaster × PARENT = VP, ...
Figure 1: Features computed over the application of the rule VP → VBD NP over the anchored span averted financial disaster with the shown in- …
[Sparse binary feature vector (0 0 1 0 … 0 1 0 1) against a weight vector (10.3 -1.2 3.2 0.01 … 0.3 0.1 -20.1 10.1)]
Score computed as an inner product
Corresponds to the rule probabilities of a PCFG
Goes into the scores of the CKY chart
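The extraction in Figure 1 and the inner-product scoring can be sketched together (simplified; the real parser uses more span properties and backoffs, and the weight below is hypothetical):

```python
def extract_features(words, start, end, rule, parent):
    """Cross product of span properties and rule backoffs for one anchored rule."""
    span_props = [
        f"FIRSTWORD={words[start]}",
        f"LASTWORD={words[end - 1]}",
        f"LENGTH={end - start}",
    ]
    rule_backoffs = [f"RULE={rule}", f"PARENT={parent}"]
    return [f"{p} x {b}" for p in span_props for b in rule_backoffs]

words = ["averted", "financial", "disaster"]
feats = extract_features(words, 0, 3, "VP -> VBD NP", "VP")

# Features are binary indicators, so the inner product with the weight
# vector reduces to summing the weights of the features that fire.
weights = {"FIRSTWORD=averted x RULE=VP -> VBD NP": 3.2}  # hypothetical
score = sum(weights.get(f, 0.0) for f in feats)
```

This score plays the role the log rule probability plays in a PCFG, and is dropped into the CKY chart cell for the span.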
15. Which features are effective?
Features                                       Section   F1
RULE                                           4         73.0
+ SPAN FIRST WORD + SPAN LAST WORD + LENGTH    4.1       85.0
+ WORD BEFORE SPAN + WORD AFTER SPAN           4.2       89.0
+ WORD BEFORE SPLIT + WORD AFTER SPLIT         4.3       89.7
+ SPAN SHAPE                                   4.4       89.9
Table 1: Results for the Penn Treebank development set, reported in F1 on sentences of length ≤ 40 in Section 22, for a number of incrementally growing feature sets. Each feature type adds benefit over the previous, and in combination they produce a reasonably good yet simple parser.
Length ≤ 40, WSJ Sec. 22 (development)
The meaning of most features is intuitively clear
Below, concrete examples of the kinds of sentences each feature helps with
16. Word before/after span
no read messages in his inbox
[Figure 2: the span "read messages" analyzed as VP (VBP NNS) vs. NP (JJ NNS), with the preceding word "no" shown outside the span]
Figure 2: An example showing the utility of span context. The ambiguity about whether read is an adjective or a verb is resolved when we construct a VP and notice that the word preceding it is unlikely.
Is the POS of read VBP or JJ?
When deciding which rule spans "read messages", the information that "no"
does not come before a VP is a useful cue (we want a negative weight to be learned)
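This feature can be sketched as follows (assumed sentinel tokens for sentence boundaries; a simplification of the paper's template): the words immediately outside the span, conjoined with the rule.

```python
def context_features(words, start, end, rule):
    """Word-before-span and word-after-span features for one anchored rule."""
    before = words[start - 1] if start > 0 else "<S>"   # boundary sentinel
    after = words[end] if end < len(words) else "</S>"
    return [f"WORDBEFORESPAN={before} x RULE={rule}",
            f"WORDAFTERSPAN={after} x RULE={rule}"]

words = ["no", "read", "messages", "in", "his", "inbox"]
feats = context_features(words, 1, 3, "VP -> VBP NNS")
# "WORDBEFORESPAN=no x RULE=VP -> VBP NNS" should end up with a negative
# weight: "no" rarely precedes a VP, so the NP analysis of "read messages" wins
```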
17. Word before/after split
has an impact on the market
[Figure 3: the NP "an impact on the market" built by NP → (NP ... impact) PP, with the split point right after "impact"]
Figure 3: An example showing split point features disambiguating a PP attachment. Because impact is likely to take a PP, the monolexical indicator feature that conjoins impact with the appropriate rule will help us parse this example correctly.
PP attachment
impact is a noun that readily takes a modifier;
we want a large weight to be learned
Exploits the fact that the head of a phrase tends to sit at one of its two ends
(this holds in many languages; the head of a Japanese bunsetsu is at its right end)
18. Span shape
[Figure 4: "( CEO of Enron )" labeled PRN with shape (XxX); "said , “ Too bad , ”" labeled VP with shape x,“Xx,”]
Figure 4: Computation of span shape features on two examples. Parentheticals, quotes, and other punctuation-heavy, short constituents benefit from being explicitly modeled by a descriptor like this.
Extracts initial capitalization, parentheses, etc.
(In English) helps with identifying named entities, matching parentheses, and so on
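The shape descriptor in Figure 4 can be sketched like this (an assumed mapping, simplified from the paper's description): each token contributes one character determined by the shape of its first character.

```python
def span_shape(tokens):
    """Map a token span to a shape string: X for capitalized, x for
    lowercase, # for digits, punctuation kept verbatim."""
    chars = []
    for tok in tokens:
        c = tok[0]
        if c.isupper():
            chars.append("X")
        elif c.islower():
            chars.append("x")
        elif c.isdigit():
            chars.append("#")
        else:
            chars.append(c)  # parentheses, quotes, commas, ...
    return "".join(chars)

print(span_shape(["(", "CEO", "of", "Enron", ")"]))  # (XxX)
```

Conjoined with a rule or label, the shape lets the model recognize parentheticals and quoted fragments without memorizing their words.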
20. Aside: the direction of this research
‣ Looks like the same direction as their EMNLP 2013 coreference paper
 - coreference resolution reaches state-of-the-art accuracy with a discriminative
   model based only on features extracted from the surface around mentions
   (external knowledge such as WordNet is not needed)
 - the Berkeley coreference tool is publicly available, with higher accuracy
   than Stanford's (supposedly)
Easy Victories and Uphill Battles in Coreference Resolution
Greg Durrett and Dan Klein (Berkeley)
[Barack Obama]1 met with [David Cameron]2 . [He]1 said ...
[with X − . Y]
[with X − Y said]
...
Centering: with [X] . … . [X] said
Many NLP analysis tasks can reach high accuracy
if features are chosen well from the word surface
21. Sentiment analysis
‣ 5-level sentiment labels were annotated on top of tree structures via Mechanical Turk
‣ A neural net was shown to beat existing methods (at last year's EMNLP)
(Figure 1 of Socher et al.'13, repeated; see slide 2)
22. The method of this work applies directly
‣ Given the tree structure, classify each span into 5 classes
 - fix the structure and run Inside-Outside / CKY
While “ Gangs ” is never lethargic , it is hindered by its plot .
[Span labels: 4 over the first clause, 1 over the second, 2 at the root; rule 2 → (4 While...) 1]
Figure 5: An example of a sentence from the Stanford Sentiment Treebank which shows the utility of our span features for this task. The presence …
The first word of a span is often a logical connective, e.g. but
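The reuse of the parser's machinery can be sketched as follows (toy weights and feature names are hypothetical): the tree is fixed, and each span is scored against the 5 sentiment labels with the same surface features, e.g. the first word of the span.

```python
# Hypothetical weights: (feature, label) -> weight. "While A, B" typically
# lets the second clause override the first, as in Figure 5.
weights = {
    ("FIRSTWORD=While", 2): 1.0,
    ("FIRSTWORD=While", 4): -0.5,
}

def label_span(words, start, end, labels=range(5)):
    """Pick the best of the 5 sentiment labels for one span."""
    feats = [f"FIRSTWORD={words[start]}"]
    return max(labels, key=lambda y: sum(weights.get((f, y), 0.0) for f in feats))

words = "While Gangs is never lethargic , it is hindered by its plot .".split()
root_label = label_span(words, 0, len(words))  # 2 under these toy weights
```

In the full model the per-span scores feed the same Inside-Outside/CKY routines, with the chart restricted to the given tree.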
23. Higher performance than the neural net
                            Root   All Spans
Non-neutral Dev (872 trees)
Stanford CoreNLP current    50.7   80.8
This work                   53.1   80.5
Non-neutral Test (1821 trees)
Stanford CoreNLP current    49.1   80.2
Stanford EMNLP 2013         45.7   80.7
This work                   49.6   80.4
Table 5: Fine-grained sentiment analysis results on the Stanford Sentiment Treebank of Socher et al. (2013). We compare against the printed numbers in Socher et al. (2013) as well as the performance of the corresponding release, namely the sentiment component in the latest version of the Stanford CoreNLP at the time of this writing.
For reference: another paper at this year's ACL
Nal Kalchbrenner, Edward Grefenstette, Phil Blunsom:
A Convolutional Neural Network for Modelling Sentences
A neural net that classifies sentiment without assuming a tree structure
48.5 points on the test set (slightly below Stanford current)