2. Product information extraction
An Italian product. This is a fruity red wine made mainly from Sangiovese grapes from Tuscany.
Type | Red
Grape variety | Sangiovese
Region | Italy, Tuscany
5. Background
• Structured data play a crucial role in making Rakuten a more attractive service.
– Faceted navigation, recommendation, and market analysis.
• An unsupervised methodology is required.
– 100 million products / 40,000 categories.
Example product page: ベリンダ・コーリー キアンティ 2011 750ml
"One of Italy's representative red wines, made mainly from Sangiovese grapes from the Chianti district of Tuscany."
Attribute | Value
Type | Red
Region | Italy, Chianti district of Tuscany
Grape | Sangiovese
Vintage | 2011
6. A table is a useful clue, but…
Product page including a table:
WINE > CHILE
Montes Alpha M 2009
Type | Red
Region | Chile
Grape | Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
Year | 2009
38%
Product page consisting of sentences:
WINE > CHILE
Montes Alpha M 2009
Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with a very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a …
7. Product information extraction
Product page (unstructured):
WINE > CHILE
Montes Alpha M 2009
Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with a very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a very well defined character. …
Structured data:
Attribute | Value
Type | Red
Region | Chile
Grape | Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
Vintage | 2009
Company | Montes
• Issue 1: How do we know the attributes for a category?
• Issue 2: How do we extract attribute values from full texts?
8. Attribute name collection
Analyze a large amount of table data to collect the attributes of an object.
[Figure: table from a wine product page, with labels "Attribute names" and "Attribute values"]
Reference: http://item.rakuten.co.jp/redbox/odm3000728/
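One way to picture the collection step is a frequency count over table rows. The sketch below is only an illustration under stated assumptions: the function name `collect_attribute_names`, the `min_share` threshold, and the three sample tables are hypothetical and do not come from the slides, which do not say how attribute names are selected.

```python
from collections import Counter

def collect_attribute_names(tables, min_share=0.5):
    """Return attribute names found in at least min_share of the tables.

    tables: list of tables, each a list of (name, value) rows.
    min_share is a hypothetical cut-off chosen for this sketch.
    """
    counts = Counter()
    for table in tables:
        # Count each name once per table so a repeated row does not inflate it.
        counts.update({name.strip() for name, _ in table})
    threshold = min_share * len(tables)
    # Sort by descending count, then by name, to get a deterministic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, count in ranked if count >= threshold]

# Hypothetical (name, value) rows taken from three wine product pages.
tables = [
    [("Type", "Red"), ("Region", "Chile"), ("Year", "2009")],
    [("Type", "White"), ("Region", "France"), ("Shop code", "A-17")],
    [("Type", "Red"), ("Grape", "Merlot"), ("Region", "Italy")],
]
names = collect_attribute_names(tables)
print(names)
```

Names that appear in only one shop's table, such as the made-up "Shop code", fall below the threshold and are dropped.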
9. Attribute value database (wine)
ぶどう品種 (Grape variety) | 内容量 (Volume) | 産地 (Region) | 生産者 (Winery) | 味わい (Taste)
Chardonnay | 750ML | France | Farnese | Dry
Chardonnay 100% | 720ML | Italy | Mas de Monistrol | Full body
Merlot | 375ML | Spain | Leroy | Medium body
Riesling | 500ML | Chile | M. Chapoutier | Slightly sweet
Syrah | 1500ML | German | Mastroberardino | Sweet
Grenache | 360ML | Australia | Santero | Medium dry
Merlot | 200ML | America | Saltarelli | Extremely sweet
Tempranillo | 3000ML | Bordeaux | Cavicchioli | Medium dry
Sangiovese | 1800ML | Champagne | Fontodi | Red Full body
Syrah100% | 1000ML | Argentina | Ca'Rugate | Middle sweet
Precision is high, but coverage is low.
10. Product information extraction
Product page (unstructured):
WINE > CHILE
Montes Alpha M 2009
Montes Alpha M is a blend of Cabernet Sauvignon, Merlot, Cabernet Franc, and Petit Verdot. A powerful wine with a very good level of soft and rounded tannins. Intense dark red color. The wine is elegant and has a very well defined character. …
Structured data:
Attribute | Value
Type | Red
Region | Chile
Grape | Cabernet sauvignon, Merlot, Cabernet franc, Petit verdot
Vintage | 2009
Company | Montes
• Issue 1: How do we know the attributes for each category?
• Issue 2: How do we extract attribute values from product descriptions?
11. Unsupervised attribute value extraction
- distant supervision approach
• Construction: a database is built from semi-structured data.
– Database entries: <Region, Margaux>, <Color, White>, …
• Annotation: product pages that include entries in the database are annotated.
– Chateau d’Issan 1994: "This is a wine from Margaux. ..."
• Generation: rules are generated from the annotated pages.
– Rule: wine from x ⇒ x is a Region
– Rules are generated through a machine learning algorithm.
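The annotation step can be sketched as plain string matching of database values against page text. This is a minimal illustration, not the authors' implementation: the function `annotate`, the tag format, and the tie-breaking choices are assumptions made for the example. Only the two database entries and the sentence come from the slide.

```python
import re

def annotate(text, database):
    """Wrap each database value found in text with its attribute name.

    database: list of (attribute, value) pairs, e.g. ("Region", "Margaux").
    """
    if not database:
        return text
    # Longer values first, so a longer entry wins over a shorter one
    # that starts at the same position.
    values = sorted({value for _, value in database}, key=len, reverse=True)
    # Assumption of this sketch: if a value is listed under several
    # attributes, the last listed attribute is used.
    attribute_of = {value: attribute for attribute, value in database}
    pattern = re.compile("|".join(re.escape(value) for value in values))

    def tag(match):
        value = match.group(0)
        attribute = attribute_of[value]
        return f"<{attribute}>{value}</{attribute}>"

    return pattern.sub(tag, text)

database = [("Region", "Margaux"), ("Color", "White")]
annotated = annotate("This is a wine from Margaux.", database)
print(annotated)
```

Matching of this kind also tags a value when it occurs inside a longer, unrelated word, so the automatically annotated corpus can contain wrong labels.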
12. Corpus with attribute-value annotations (wine)
• J: <産地>アルザス</産地>で最も香り豊かと言われるスパイシーで華やかなワイン。
  E: A spicy and gorgeous wine said to be the most aromatic in <production_area> Alsace </production_area>.
• J: 最もお手頃で、<生産者>ドメーヌ・ペゴー</生産者>の美味しさを気軽に楽しめる、とっても嬉しい一本なのです
  E: A very welcome bottle that lets you easily enjoy the taste of <winery> Domaine Pegau </winery> at the most affordable price.
• J: <ぶどう品種>ソーヴィニヨン・ブラン</ぶどう品種>種の特長がよく表れたワイン。
  E: A wine that clearly expresses the characteristics of the <grape_variety> Sauvignon Blanc </grape_variety> grape.
• J: <タイプ>白</タイプ>身魚の塩焼きやシンプルな味付けのソテー、焼き牡蠣、豚のしょうが焼き、ボンゴレビアンコなどと。
  E: Pair it with salt-grilled <type> white </type>-fleshed fish, simply seasoned sautés, grilled oysters, ginger-fried pork, vongole bianco, and so on.
14. Extraction rule generation
• Algorithm: Conditional random fields [Lafferty+ 2001]
• Chunk tag: Start/End (IOBES) model [Sekine+ 1998]
• Features:
– Token: Surface form of the token.
– Base: Base form of the token.
– PoS: Part-of-speech tag of the token.
– Char. type: Types of characters in the token.
– Prefix: Two-character prefix of the token.
– Suffix: Two-character suffix of the token.
– The above features of the ±3 tokens surrounding the token.
These features are frequently employed in Japanese named entity recognition.
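The feature set and the IOBES tags listed above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names, the feature key format, and the character classes in `char_type` are hypothetical, and the base form and part-of-speech tag are assumed to come from a morphological analyzer that is not shown. Training the conditional random field itself is out of scope here.

```python
def char_type(token):
    """Name the character classes in a token (a simplified, assumed scheme)."""
    kinds = set()
    for ch in token:
        if "\u3040" <= ch <= "\u309f":
            kinds.add("hiragana")
        elif "\u30a0" <= ch <= "\u30ff":
            kinds.add("katakana")
        elif "\u4e00" <= ch <= "\u9fff":
            kinds.add("kanji")
        elif ch.isdigit():
            kinds.add("digit")
        elif ch.isalpha():
            kinds.add("alpha")
        else:
            kinds.add("other")
    return "+".join(sorted(kinds))

def token_features(tokens, i, window=3):
    """Features of token i and of its neighbours within +/- window tokens."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if not 0 <= j < len(tokens):
            continue  # no neighbour at this offset (sentence boundary)
        surface = tokens[j]["surface"]
        feats[f"{offset}:token"] = surface
        feats[f"{offset}:base"] = tokens[j]["base"]
        feats[f"{offset}:pos"] = tokens[j]["pos"]
        feats[f"{offset}:ctype"] = char_type(surface)
        feats[f"{offset}:prefix"] = surface[:2]
        feats[f"{offset}:suffix"] = surface[-2:]
    return feats

def iobes_tags(n_tokens, spans):
    """Encode (start, end_exclusive, label) token spans as IOBES tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"      # single-token value
        else:
            tags[start] = f"B-{label}"      # first token
            for k in range(start + 1, end - 1):
                tags[k] = f"I-{label}"      # inner tokens
            tags[end - 1] = f"E-{label}"    # last token
    return tags

# Hypothetical analyzer output for "wine from Margaux".
tokens = [
    {"surface": "wine", "base": "wine", "pos": "NOUN"},
    {"surface": "from", "base": "from", "pos": "ADP"},
    {"surface": "Margaux", "base": "Margaux", "pos": "PROPN"},
]
feats = token_features(tokens, 2)
tags = iobes_tags(len(tokens), [(2, 3, "Region")])
print(tags)
```

Each token's feature dictionary, paired with its tag, is the kind of input a sequence labeler such as a CRF is trained on.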
16. Unsupervised attribute value extraction
- distant supervision approach
• Apply: the rules are applied to a new product page.
– Terre di matraja Bianco 2012: "This is a wine from Tuscany. ..."
– Rule: wine from x ⇒ x is a Region
– Rule: 1800 < x <= 2013 ⇒ x is a Vintage
Attribute | Value
Region | Tuscany
Vintage | 2012
Grape | Chardonnay
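The two rules shown on this slide can be mimicked with hand-written patterns. The sketch below is only an illustration: in the talk the rules are produced by a machine learning algorithm, whereas the regular expressions, the function `extract`, and the choice to search the title together with the description are assumptions made here.

```python
import re

# "wine from x => x is a Region": x is assumed to be one capitalized word.
REGION_RULE = re.compile(r"wine from (?P<x>[A-Z][\w'-]*)")
# Candidate four-digit numbers for the Vintage rule.
FOUR_DIGITS = re.compile(r"\b(\d{4})\b")

def extract(title, description):
    """Apply the two illustrative rules to one product page."""
    record = {}
    match = REGION_RULE.search(description)
    if match:
        record["Region"] = match.group("x")
    for candidate in FOUR_DIGITS.findall(f"{title} {description}"):
        # "1800 < x <= 2013 => x is a Vintage"
        if 1800 < int(candidate) <= 2013:
            record["Vintage"] = candidate
            break
    return record

record = extract("Terre di matraja Bianco 2012",
                 "This is a wine from Tuscany. ...")
print(record)
```

The Grape value in the slide's result table does not appear in the quoted text, so this sketch does not reproduce it.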
18. Wine / Japanese
An Italian product. This is a fruity red wine made mainly from Sangiovese grapes from Tuscany.
Type | Red
Grape variety | Sangiovese
Region | Italy, Tuscany
19. Shampoo / Japanese
"MCH Natural shampoo 1000ml" is a shampoo containing cypress oil and charcoal.
Category | Shampoo
Product name | MCH Natural shampoo 1000ml
Ingredient | Cypress oil, Charcoal
20. Video game / French
Product type | Nintendo 64, Nintendo DS
Saga | Mario
21. Conclusion
• We developed a technique for extracting product information from unstructured data.
– Independent of category and language.
• Useful services can be built on structured product data.
• Our paper is available on the web.
– ACL anthology: http://aclweb.org/anthology//I/I13/