DIADEM data extraction methodology domain-centric intelligent automated
1. WELCOME 1
DIADEM data extraction methodology
domain-centric intelligent automated
Web data as you want it
2. TEAM 2
Georg Gottlob
Professor, FRS
Project lead
Scientific director
Tim Furche
Postdoc
Technical director
Giovanni Grasso
Postdoc
Extraction infrastructure
Giorgio Orsi
Postdoc
Knowledge modelling
Christian Schallhart
Postdoc
Software engineering
Xiaonan Guo
Postdoc
Forms and interaction
3. TEAM 3
Omer Gunes
D.Phil. student
Jinsong Guo
D.Phil. student
Andrew Sellers
Captain USAF
former D.Phil. student
Andrey Kravchenko
D.Phil. student
Stefano Ortona
D.Phil. student
Cheng Wang
D.Phil. student
7. 7
50-80%
Data scientists […] spend 50 to 80 percent of their time […]
collecting and preparing […] digital data […] from sensors,
documents, the web and conventional databases.
–STEVE LOHR
New York Times, Aug. 2014
8. INTRODUCTION 8
Data … is still a pain
○ Data exists, but getting and using it is hard
◗ For example, when you are making decisions
○ Tipping point: tech leaders leverage data to striking effect
◗ Amazon, Walmart, Google
○ What about the rest of the world?
9. 9
collect &
prepare
data
“You can’t do this manually, you’re never going to find
enough data scientists and analysts.”
– SHARMILA SHAHANI-MULLIGAN
CEO Clearstory
(New York Times, Aug 2014)
10. INTRODUCTION 10
… but there is a remedy
○ We can get you the data you need in the form you need
◗ from competitors
◗ from open sources
◗ from your intranet
○ At any scale, covering popular as well as long tail sources
○ Far more comprehensive than manual solutions
○ Far cheaper than even partial, manual solutions
17. HOW: TECHNOLOGY & TEAM 17
Technology: Our Strength
10,493 sites from the real-estate and used-car domains
92%: effective wrappers for more than 92% of sites on average
97%: precision of extracted primary attributes
20 2.1
Days on a 45 node
Amazon EC2 cluster
Days (one expert) to adjust
system to a new domain
18. HOW: TECHNOLOGY & TEAM 18
Technology: Our Strength
[Plot: time (seconds, 0 to 2000) against number of records (0 to 1000)]
19. HOW: TECHNOLOGY & TEAM 19
○ Phenomenology
◗ appearance of objects on the web
◗ reason for DIADEM’s high accuracy
◗ easily adapted to new domains
○ Self-organising AI
◗ adjusts itself to observations on the pages
◗ different sequence of tasks for every site
◗ strong isolation of components
○ Rule-based AI
◗ declarative rules instead of heuristics
◗ uniform query of pages, phenomenology, …
◗ all domain-independent
21. HOW: TECHNOLOGY & TEAM 21
Data extraction isn’t new …
○ Manual
◗ Scaling costly
◗ Very common
○ Supervised
◗ Human + algorithm
◗ Most commercial products
○ Automatic
◗ Fully algorithmic
◗ Active research
+ magic
22. HOW: TECHNOLOGY & TEAM 22
Competitors
Mozenda, Lixto, Connotate, BlackLocus, import.io, scrapinghub.com, promptcloud.com
○ massive human effort, continuously
○ low scale: one or few sources
○ low cost efficiency
DIADEM
○ small human effort, once
○ massive scale: thousands of sources
○ high cost efficiency
23. HOW: TECHNOLOGY & TEAM 23
What about Google & Co.?
○ Verticals are becoming ever more relevant for search
◗ the major change to Google’s result page in the last decade
◗ crucial for intelligent personal assistants (Siri, Google Now)
○ Revived interest in large-scale extraction of structured data
◗ as part of knowledge graph
◗ currently only good for common sense facts
○ Recent AI/deep learning acquisitions by Google, Facebook
24. HOW? INCUBATION PLAN 24
Data science—a huge market
$50 billion
Data science market 2017
*ACCORDING TO FORBES, WIKIBON FORECAST
$25 billion
Data collection & cleaning
*ACCORDING TO THE NEW YORK TIMES
29. HOW? INCUBATION PLAN 29
DIADEM Vision
“Suggest the best smart watch
for my preferences!”
“Suggest a great evening out!”
“Suggest a cheap
headphone with great
bass!”
“Suggest a great hotel in an area
with lots of bars and close to my
conference!”
30. HOW: TECHNOLOGY & TEAM 30
WWW 2014: Fallacies in DE
–KEVIN C. CHANG
Co-Founder Cazoodle, move.com, UIUC
DIADEM
✓ #1: Cannot start with ‘given a set of result pages’
✓ #2: Must not stop at 70% accuracy
✓ #3: Must be scalable to more than thousands of sources
✓ #4: Must leverage human feedback
31. DIADEM ANALYSIS 31
Table 3: Wrapper quality
                     effective wrapper   wrong or missing data   no data
UK real estate       91%                 7%                      2%
Oxford real estate   90%                 6%                      4%
ViNTs10              4%                  5%                      91%
UK used cars         93%                 4%                      3%
US real estate       90%                 5%                      5%
32. DIADEM ANALYSIS 32
Competition?
[Bar charts: record-level precision and recall (0% to 100%) of MDR, DEPTA, ViNTs and DIADEM on the RE-RND and UC-RND datasets]
CONCLUSION:
Do only a part of the job, and poorly
33. DIADEM ANALYSIS 33
Competition?
[Bar charts: attribute-level precision and recall (0% to 100%) of RoadRunner, DEPTA and DIADEM on the RE-RND and UC-RND datasets]
CONCLUSION:
Do only a part of the job, and poorly
34. DIADEM ANALYSIS 34
Competition?
[Chart: Attribute quality (axis 0% to 25%) for the attributes unit, beds, make, transmission, age, engine_size, period_baths, receptions, price, location, postcode, model, colour, body_type, fuel_type, registration, door_number, mileage]
CONCLUSION:
Do only a part of the job, and poorly
Table 3: Form labeling accuracy
ICQ dataset       HA [14]   ExQ [41]   StatParser [36]   DIADEM [17]
F1 for labeling   92%       96%        96%               98%
cars are more prominently placed on the site. There are about 3%
of sites where no wrapper can be induced, typically as they contain
no properties, all properties are on aggregators, or they contain
no pivot attribute. For these sites, DIADEM correctly detects that
there is no effective wrapper. The final case is that DIADEM fails
to produce an effective wrapper, yet one exists. The most common
reasons for these failures are dynamic forms (15%), result pages
35. DIADEM 35
DIADEM’s Components
1 ROSeAnn (VLDB’14)
World-best entity extraction from text and structure
36. DIADEM 36
DIADEM’s Components
1 ROSeAnn (VLDB’14)
World-best entity extraction from text and structure
2 OPAL (WWW’12, VLDBJ’13)
World-most-effective form understanding & filling
[Paper shown: “The Ontological Key: Automatically Understanding and Integrating Forms”]
Range widget ⟸ two fields + connected by “to” or other range connector + some clues in the annotations or classifications
TEMPLATE field_by_proper<C,A> { field<C>(N) ⇐ N@A{d,e,p} }
TEMPLATE field_by_segment<C,A> { field<C>(N) ⇐ N@A{e,p} }
TEMPLATE field_by_value<C,A> { field<C>(N) ⇐ N@A{m},
    ¬(A1 ≠ A, N@A1{d,e,p} ∨ N@A1{e,p}) }
TEMPLATE field_minmax<C,CM,A> {
  field<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
    N1@A{e,d}, (field<C>(N2) ∨ N2@A{e,d})
  field<C_range>(N2) ⇐ child(N1,G), child(N2,G), next(N2,N1),
    field<C>(N1), N2@range_connector{e,d}, ¬(A1$ C, N2@A1{d})
  field<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
37. DIADEM 37
DIADEM’s Components
1 ROSeAnn (VLDB’14)
World-best entity extraction from text and structure
2 OPAL (WWW’12, VLDBJ’13)
World-most-effective form understanding & filling
3 AMBER (TWeb’14)
World-most-accurate record identification for listing pages
[Figure: DOM tree of a data area whose record subtrees (a, div, p, span, b, em, strong and i nodes) carry PRICE, LOCATION and BEDS annotations]
38. DIADEM 38
DIADEM’s Components
1 ROSeAnn (VLDB’14)
World-best entity extraction from text and structure
2 OPAL (WWW’12, VLDBJ’13)
World-most-effective form understanding & filling
3 AMBER (TWeb’14)
World-most-accurate record identification for listing pages
4 OXPath (VLDB’11, VLDBJ’13)
World-most-efficient extraction language
[Paper shown: “Bitemporal Complex Event Processing of Web Event Advertisements”
Tim Furche1, Giovanni Grasso1, Michael Huemer2, Christian Schallhart1, and Michael Schrefl2
1 Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD, firstname.lastname@cs.ox.ac.uk
2 Department of Business Informatics – Data & Knowledge Engineering, Johannes Kepler University, Altenberger Str. 69, Linz, Austria, lastname@dke.uni-linz.ac.at]
doc('http://www.scottfraser.co.uk/')//select[@id='search-type']/{1 /}
  //input/{click /}/(//div[1]/table//td[4]/a/{click /})*{0,500}
  //div[@class='property-wrapper']:<record>
    [? .:<ORIGIN_URL=current-url()>]
39. DIADEM 39
DIADEM’s Components
1 ROSeAnn (VLDB’14)
World-best entity extraction from text and structure
2 OPAL (WWW’12, VLDBJ’13)
World-most-effective form understanding & filling
3 AMBER (TWeb’14)
World-most-accurate record identification for listing pages
4 OXPath (VLDB’11, VLDBJ’13)
World-most-efficient extraction language
5 DIADEM (VLDB’14)
World-first accurate, automatic full-site extraction system
40. FORM PHENOMENOLOGY 40
Example 1: Form
○ Task: classify and group form fields into semantic segments
◗ Problem: HTML structure is only an approximation
○ Phenomenology: Detect semantic segments, e.g.,
◗ if there is a continuous list of option fields (radio buttons, checkboxes ☑)
◗ with the same type
◗ and a parent that can’t be classified
41. FORM PHENOMENOLOGY 41
Example 1: Form
segment<C>(∃X) :- html-child(N1, P), html-child(N2, P), N1 ≠ N2,
    ¬segment(P),
    option-field(N1), option-field(N2),
    concept<C>(N1), concept<C>(N2),
    max-cont-list-of-fields-with-type<C>(N1, N2).
◗ ¬segment(P): parent cannot be classified
◗ option-field(N1), option-field(N2): both option fields
◗ concept<C>(N1), concept<C>(N2): same type C
◗ max-cont-list-of-fields-with-type<C>(N1, N2): end points of largest continuous list of type C
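The segment rule on this slide can be approximated in a few lines of Python. This is only a sketch under stated assumptions: `FormNode`, its attributes and `find_segments` are hypothetical names invented for illustration, not part of OPAL or DIADEM, and "parent cannot be classified" is modelled simply as a parent without a concept.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FormNode:
    """Simplified form node; every attribute name here is illustrative."""
    name: str
    is_option_field: bool = False        # radio button or checkbox
    concept: Optional[str] = None        # None means "could not be classified"
    children: List["FormNode"] = field(default_factory=list)


def find_segments(parent: FormNode) -> List[Tuple[str, List[FormNode]]]:
    """Return (concept, fields) for each maximal run of two or more adjacent
    option fields that share a concept under an unclassified parent."""
    if parent.concept is not None:       # assumed reading of "parent cannot be classified"
        return []
    segments = []
    run: List[FormNode] = []
    for child in parent.children + [None]:   # None is a sentinel that flushes the last run
        usable = (child is not None and child.is_option_field
                  and child.concept is not None)
        if usable and (not run or run[-1].concept == child.concept):
            run.append(child)
            continue
        if len(run) >= 2:                # N1 ≠ N2: a segment needs two distinct fields
            segments.append((run[0].concept, run))
        run = [child] if usable else []
    return segments
```

Each returned run corresponds to the end points N1 and N2 of the largest continuous list of one type; a single isolated option field produces no segment.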
42. RESULT PAGE PHENOMENOLOGY 42
Example 2: Data areas
○ Task: Finding areas on a page that contain relevant data
○ Idea: Use the regularity resulting from the DB templates
○ Problem: Distinguishing regular noise, e.g., featured properties
○ Solution: Maximisation problem over pivot elements
◗ occurrences of mandatory attributes such as price
43. RESULT PAGE PHENOMENOLOGY 43
[Figure 3: Data area identification, with labels D1, D2, D3, E and M1,1 to M1,4]
consistent_cluster_members(C, N1, N2, N3) :- pivot(N1), pivot(N2), ...
    similar_depth(N1, N2), similar_depth(N2, N3), similar_depth(N1,N3),
    similar_tree_distance(N1, N2, N3).
cluster(C,N) :- ... continuous, lca, contains at least one of all mandatories
[…] its of order dominance: The pivot nodes in E are organized rather
regularly, whereas the pivot nodes in D1 vary quite notably. However,
their variation is small enough that M1,1 to M1,4 are depth and […]
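The idea of grouping pivot nodes by similar depth and keeping the largest group can be sketched as follows. This is a deliberately simplified illustration, not the DIADEM rule set: `Pivot`, `largest_pivot_cluster` and `depth_tolerance` are assumed names, and the sketch ignores tree distance, the lowest common ancestor and the check for mandatory attributes.

```python
from typing import List, NamedTuple


class Pivot(NamedTuple):
    position: int   # index of the node in document order
    depth: int      # depth of the node in the DOM tree


def largest_pivot_cluster(pivots: List[Pivot], depth_tolerance: int = 1) -> List[Pivot]:
    """Split pivots (in document order) into continuous runs whose depths
    differ by at most depth_tolerance, and return the longest run."""
    best: List[Pivot] = []
    current: List[Pivot] = []
    for pivot in sorted(pivots, key=lambda p: p.position):
        depths = [p.depth for p in current] + [pivot.depth]
        if current and max(depths) - min(depths) > depth_tolerance:
            best = max(best, current, key=len)   # keep the longer run seen so far
            current = []
        current.append(pivot)
    return max(best, current, key=len)
```

With featured properties at one depth and the main listing at another, the longer run of regularly placed pivots wins, which is the intuition behind treating data area detection as a maximisation problem.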
44. RESULT PAGE PHENOMENOLOGY 44
Example 2: Record alignment
○ set of uniform, non-overlapping records
○ maximise regularity, minimise outliers
◗ pairwise edit distance with bias towards pivot nodes
[Figure 4: Record Segmentation, showing a data area with a, img, div and p nodes and the prices £860, £900 and £500]
Algorithm 2: Segmentation(DOM P, Data Area d)
L ← {n : child(f(d), n) ∈ P ∧ ∃n′ ∈ y(d) : desc-or-self(n, n′)};
sort L in document order;
foreach 1 ≤ k ≤ |L|−1 do Partition[k] ← {n : L[k] ( n ) L[k+1]};
Len ← min{|Partition[i]| : |{j : |Partition[j]| = |Partition[i]|}| maximal};
while L[1] −sibl L[2] < Len do delete L[1];
while L[|L|−1] −sibl L[|L|] < Len do delete L[|L|];
while 1 < k < |L| do
    if L[k] −sibl L[k+1] < Len then delete L[k+1] else k++;
StartCandidates ← {L} ∪ {{n : ∃l ∈ L : n −sibl l = i} : i ≤ Len};
OptimalSegmentation ← ∅; OptimalSim ← ∞;
foreach S ∈ StartCandidates do
    sort S in document order;
    foreach 1 ≤ k ≤ |L|−1 do
        Segmentation[k] ← {n : n −sibl S[k] ≤ Len};
    if ∀P ∈ Segmentation : |P| = Len then
        if irregularity(Segmentation) < OptimalSim then
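The assignment to Len in Algorithm 2 appears to choose the smallest partition size among those that occur most often. A minimal sketch of that single step, assuming this reading is correct (`most_common_partition_size` is a hypothetical helper name):

```python
from collections import Counter
from typing import Iterable


def most_common_partition_size(partition_sizes: Iterable[int]) -> int:
    """Return the smallest size among the most frequent partition sizes."""
    counts = Counter(partition_sizes)
    if not counts:
        raise ValueError("at least one partition is required")
    top_frequency = max(counts.values())
    # Several sizes may be equally frequent; the minimum breaks the tie.
    return min(size for size, freq in counts.items() if freq == top_frequency)
```

For partition sizes 3, 3, 4, 4 and 2, the sizes 3 and 4 are equally frequent and the function returns 3 as the expected record length.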
all text nodes. With the exception of a’s tag, all HTML tags are
annotated by the type of step.
For the leftmost a and its i descendant in Figure 5, e.g., the tag
path is a/first-child::p/first-child::span/next-sibl::i.
Based on the tag path, AMBER quantifies the fraction of records
that support the assumption that a node n is an attribute of type A
within record r with the support supp_r(n, A).
DEFINITION 9. Let E be an extraction instance on DOM P,
containing a node n within record r belonging to data area d, and
A ∈ 𝒜 an attribute type. Then supp_r(n, A) denotes the support of
n as attribute of type A within r, defined as the fraction of records
r′ ≠ r in d that contain a node n′ with tag-path_r(n) = tag-path_r′(n′)
that is annotated with A.
Consider a data area with 10 records, containing 1 PRICE-annotated
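The support measure of Definition 9 can be sketched directly from its wording. The data layout is an assumption made for illustration: each record is a dictionary from tag path to the set of annotation types found at that path, and `support` is a hypothetical function name rather than AMBER's API.

```python
from typing import Dict, List, Set

Record = Dict[str, Set[str]]   # tag path -> annotation types found at that path


def support(records: List[Record], record_index: int,
            tag_path: str, attribute_type: str) -> float:
    """Fraction of the other records in the data area that contain a node
    with the same tag path annotated with attribute_type."""
    others = [rec for i, rec in enumerate(records) if i != record_index]
    if not others:
        return 0.0               # a lone record has no other records to support it
    hits = sum(1 for rec in others if attribute_type in rec.get(tag_path, set()))
    return hits / len(others)
```

A node whose tag path carries the same annotation in most sibling records gets support close to 1, while an annotation seen at that path in few other records gets support close to 0.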
46. BLOCK PHENOMENOLOGY 46
Example 3: Pagination links
○ Machine learning on top of derived features
Description Type Predicate
Content
1 Annotated as NEXT bool plm::annotated_by<NEXT>
2 Annotated as PAGINATION bool plm::annotated_by<PAGINATION>
3 Annotated as NUMBER bool plm::annotated_by<NUMBER>
4 Number of characters int plm::char_num
Page position
5 Relative position on page int² plm::relative_position<css::page>
6 Relative position in first screen int² plm::relative_position<std::first_screen>
7 In first screen bool plm::contained_in<std::first_screen>
8 In last screen bool plm::contained_in<std::last_screen>
Visual proximity
9 Pagination annotation close to node bool plm::in_proximity<plm::annotated_by<PAGINATION>>
10 Number of close numeric nodes int plm::num_in_proximity<numeric>
11 Closest numeric node is a link bool plm::closest<std::left_proximity>_with<numeric>_is<non_link>
12 Closest numeric node has different style bool <numeric>_is<different_style>
13 Closest link annotated with NEXT bool <dom::clickable>_is<plm::annotated_by<NEXT>>
14 Ascending w. closest numeric left, right bool plm::ascending-numerics
Structural
15 Preceding numeric node is a link bool plm::closest<std::preceding>_with<numeric>_is<non_link>
16 Preceding numeric node has different style bool <numeric>_is<different_style>
17 Preceding link annotated with NEXT bool <dom::clickable>_is<plm::annotated_by<NEXT>>
Table 3: PLM: Pagination Link Model
47. BLOCK PHENOMENOLOGY 47
Example 3: Pagination links
○ Datalog± rules for deriving features
○ Lots of visual reasoning on the page
○ Rich template language to avoid duplication
TEMPLATE annotated_by<Model,AType> {
  <Model>::annotated_by<AType>(X) ⇐ node_of_interest(X),
    gate::annotation(X, <AType>, _). }
TEMPLATE in_proximity<Model,Property(Close)> {
  <Model>::in_proximity<Property>(X) ⇐ node_of_interest(X),
    std::proximity(Y,X), <Property(Close)>. }
TEMPLATE num_in_proximity<Model,Property(Close)> {
  <Model>::in_proximity<Property>(X,Num) ⇐ node_of_interest(X),
    std::proximity(Close,X), Num = #count(N: <Property(Close)>). }
TEMPLATE relative_position<Model,Within(Height,Width)> {
  <Model>::relative_position<Within>(X, (PosH, PosV)) ⇐ node_of_interest(X),
    css::box(X, LeftX, TopX, _, _), <Within(Height,Width)>,
    PosH = 100·LeftX / Width, PosV = 100·TopX / Height. }
TEMPLATE contained_in<Model,Container(Left,Top,Bottom,Right)> {
  <Model>::contained_in<Container>(X) ⇐ node_of_interest(X),
    css::box(X,LeftX,TopX,RightX,BottomX), <Container(Left,Top,Right,Bottom)>,
    Left < LeftX < RightX < Right, Top < TopX < BottomX < Bottom. }
TEMPLATE closest<Model,Relation(Closest,X),Property(Closest),Test(Closest)> {
  <Model>::closest<Relation>_with<Property>_is<Test>(X) ⇐ node_of_interest(X),
    <Relation(Closest,X)>, <Property(Closest)>, <Test(Closest)>,
    ¬(<Relation(Y,X)>, <Property(Y)>, <Relation(Y,Closest)>). }
Fig. 4: BERyL feature templates
In a similar way, the second template defines a boolean feature that holds for nodes
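Two of the feature templates, relative_position and contained_in, reduce to simple box arithmetic. The sketch below restates them in Python purely for illustration; `Box`, `relative_position` and `contained_in` are hypothetical Python names, and the pixel coordinates in the usage are made up.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """CSS box of a node in page pixels (illustrative layout)."""
    left: float
    top: float
    right: float
    bottom: float


def relative_position(box: Box, width: float, height: float) -> Tuple[float, float]:
    """(PosH, PosV): left and top edge as a percentage of the reference area."""
    return (100 * box.left / width, 100 * box.top / height)


def contained_in(box: Box, container: Box) -> bool:
    """True if the box lies strictly inside the container on both axes."""
    return (container.left < box.left < box.right < container.right
            and container.top < box.top < box.bottom < container.bottom)


link = Box(left=200, top=1500, right=260, bottom=1520)
first_screen = Box(left=0, top=0, right=1000, bottom=800)
last_screen = Box(left=0, top=1200, right=1000, bottom=2000)
features = (relative_position(link, width=1000, height=2000),
            contained_in(link, first_screen),
            contained_in(link, last_screen))
```

Features like these would then be fed to the machine-learning step described on the pagination slide; how they are encoded for the learner is not shown here.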