The document collects a conference talk, a poster, and paper excerpts describing WADaR, a method for jointly repairing web wrappers and the structured data they extract from web pages. WADaR first analyses the extracted relation to identify errors made by the wrapper, such as incorrectly segmented or misplaced values. It then uses sequence labelling and max-flow computations on a constructed network to recover the underlying correct structure of the data. Regular expressions are induced from the correctly structured data and used to repair both the extracted relations and the original wrappers. An evaluation on real-world datasets found that the approach improves the precision, recall, and F1-score of several existing wrapper-generation systems, boosting F1-score by 15% to 60% across different domains.
Joint Repairs for Web Wrappers
1. Joint Repairs for Web Wrappers
Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche
ICDE, Helsinki, May 19, 2016
2. Background: Web wrapping
refcode | postcode | bedrooms | bathrooms | available | price
33453 | OX2 6AR | 3 | 2 | 15/10/2013 | £1280 pcm
33433 | OX4 7DG | 2 | 1 | 18/04/2013 | £995 pcm
The process of turning semi-structured (templated) web data into structured form.
Hidden databases are actually a form of dark / dim data (ref. panel on Tuesday)
3. Background: Web wrapping
Manual / (semi-)supervised wrapping: accurate, but expensive and non-scalable.
Unsupervised wrapping: less accurate, but cheaper and scalable.
Wrapidity
4. Background: Web wrapping
From (manually or automatically) created examples to XPath-based wrappers. Even on templated websites, automatic wrapping can be inaccurate.
Pairs ⟨field, expression⟩ that, once applied to the DOM, return structured records:
field expression
listing //body
record //div[contains(@class,'movlist_wrap')]
title //span[contains(@class,'title')]/text()
rated .//span[.='rating:']/following-sibling::strong/text()
genre .//span[.='genre:']/following-sibling::strong/text()
releaseMo .//span[@class='release']/text()
releaseDy .//span[@class='release']/text()
releaseYr .//span[@class='release']/text()
image .//@src
runtime .//span[.='runtime:']/following-sibling::strong/text()
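To make the mechanics concrete, here is a minimal sketch (ours, not the authors' code) of applying such ⟨field, expression⟩ pairs to a page with Python and lxml; the field names and XPath expressions are the illustrative ones from the table above.

```python
# Minimal sketch: applying a <field, expression> wrapper to a page (illustrative).
from lxml import html

WRAPPER = {
    "record": "//div[contains(@class,'movlist_wrap')]",
    "fields": {
        "title":   ".//span[contains(@class,'title')]/text()",
        "rated":   ".//span[.='rating:']/following-sibling::strong/text()",
        "runtime": ".//span[.='runtime:']/following-sibling::strong/text()",
    },
}

def extract(page_source: str):
    """Apply the wrapper to one page and return one dict per record."""
    dom = html.fromstring(page_source)
    records = []
    for node in dom.xpath(WRAPPER["record"]):      # one DOM node per record
        row = {}
        for field, xpath in WRAPPER["fields"].items():
            values = node.xpath(xpath)
            row[field] = values[0].strip() if values else None  # None = missing
        records.append(row)
    return records
```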
5. Problems with wrapping
Inaccurate wrapping results in over- (or under-) segmented data.
Example extraction using RoadRunner (Crescenzi et al.); RS: source relation, Σ: target schema ⟨Title, Release, Genre, Rating, Runtime⟩:
Attribute_1 → Attribute_2
Ava's Possessions → Release Date: March 4, 2016 | Rated: R | Genre(s): Sci-Fi, Mystery, Thriller, Horror | Production Company: Off Hollywood Pictures | Runtime: 216 min
Camino → Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Action, Adventure, Thriller | Production Company: Bielberg Entertainment | Runtime: 103 min
Cemetery of Splendor → Release Date: March 4, 2016 | Rated: Not Rated | Genre(s): Drama | User Score: 4.6 | Production Company: Centre National de la Cinématographie (CNC) | Runtime: 122 min
6. Questions
The questions we want to answer are:
Can we fix the data, and use what we learn to repair the wrappers as well?
Are the solutions scalable?
Why do we care? Companies such as Facebook and Skyscanner spend millions of dollars of engineering time creating and maintaining wrappers; wrapper maintenance is a major cost of data acquisition from the web.
7. Fixing the data
The wrapper thinks it is filling this schema: MAKE | MODEL | PRICE
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
£10k Citroën C3
£22k Ford C-max Titanium X
If all instances looked like this (i.e., mis-segmentation only, no garbage, no shuffling), this would be a table-induction problem (TEGRA, WebTables, etc.). Moreover, we would still have no clue how to fix the wrapper afterwards.
…but instead it produces this instance:
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
8. What is a good relation?
The problem is that wrapper-generated relations really look like this…
First, we need a way to determine how "far" we are from a good relation:
ū = ⟨u1, u2, …, un⟩, a tuple generated by the wrapper
Σ = ⟨A1, A2, …, Am⟩, the (target) schema for the extraction
Ω = {ωA1, …, ωAm}, a set of oracles for Σ, with ωA(u)=1 if u ∈ dom(A) or u=null, and ωA(u)=0 otherwise
The fitness then quantifies how well ū (resp. the whole instance) "fits" Σ.
Example: Ω = {ωMAKE, ωMODEL, ωPRICE}, Σ = ⟨MAKE, PRICE, MODEL⟩, oracle columns ωMAKE | ωPRICE | ωMODEL:
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
f(R, Σ, Ω) = 1/2 = 50%
9. Problem Definition: Fitness
Σ = ⟨A1, A2, …, Am⟩, the attributes (fields) of the target schema of the relation
ū = ⟨u1, u2, …, un⟩, a tuple of the wrapper-generated relation R
Ω = {ωA1, …, ωAm}, a set of oracles for the fields of Σ, s.t. ωA(u)=1 if u ∈ dom(A) or u=null, and ωA(u)=0 otherwise
We define the fitness of a tuple ū (resp. relation R) w.r.t. a schema Σ as:
f(ū, Σ, Ω) = ( ∑ᵢ₌₁..c ωAᵢ(uᵢ) ) / d, where c = min{arity(Σ), arity(R)} and d = max{arity(Σ), arity(R)}
(resp. f(R, Σ, Ω) = ∑_{ū∈R} f(ū, Σ, Ω) / |R|)
Input: a wrapper W, a relation R such that W(P)=R for some set of pages P, and a schema Σ.
Example (MAKE | MODEL | PRICE):
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
Citroën £10k C3
Ford £22k C-max Titanium X
f(R, Σ, Ω) = 1/6 ≈ 17%
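As a sanity check of the definition, here is a small Python sketch (ours) that computes the fitness given oracle predicates standing in for the ωA; the toy oracles below are simple membership tests, not real NERs.

```python
# Sketch: fitness of a wrapper-generated relation w.r.t. a schema and oracles.
def tuple_fitness(tup, schema, oracles):
    c = min(len(schema), len(tup))
    d = max(len(schema), len(tup))
    # Sum oracle votes over the first c fields, normalise by the larger arity.
    return sum(oracles[schema[i]](tup[i]) for i in range(c)) / d

def relation_fitness(relation, schema, oracles):
    return sum(tuple_fitness(t, schema, oracles) for t in relation) / len(relation)

# Toy oracles for the running example (membership tests stand in for NERs).
oracles = {
    "MAKE":  lambda v: int(v is None or v in {"Audi", "Citroën", "Ford"}),
    "MODEL": lambda v: int(v is None or v in {"A3 Sportback", "A6 Allroad quattro",
                                              "C3", "C-max Titanium X"}),
    "PRICE": lambda v: int(v is None or (isinstance(v, str) and v.startswith("£"))),
}
R = [("£19k", "Audi", "A3 Sportback"),
     ("£43k", "Audi", "A6 Allroad quattro"),
     ("Citroën", "£10k", "C3"),
     ("Ford", "£22k", "C-max Titanium X")]
print(relation_fitness(R, ("MAKE", "MODEL", "PRICE"), oracles))  # 1/6 ≈ 0.17
```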
10. Problem Definition: Σ-repairs
A Σ-repair is a pair σ = ⟨Π, ρ⟩ where:
Π = (i, j, …, k) is a permutation of the fields of R
ρ = { ⟨A1, ƐA1⟩, ⟨A2, ƐA2⟩, …, ⟨Am, ƐAm⟩ } is a set of regexes, one for each attribute in Σ
Σ-repairs can be applied to a tuple ū in the following way:
σ(ū) = ⟨ ƐA1(Π(ū)), ƐA2(Π(ū)), …, ƐAm(Π(ū)) ⟩
The notion of applicability extends naturally to relations σ(R) (i.e., sets of tuples). Similarly, Σ-repairs can be applied to wrappers as well [details in the paper].
Output: a wrapper W' and a relation R' such that W'(P)=R' and R' is of maximum fitness w.r.t. Σ.
The goal is to find the Σ-repair that maximises the fitness.
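A minimal sketch of applying a Σ-repair, under the simplified reading that Π reorders the tuple's fields and each ƐA then extracts its value from the reordered, concatenated tuple; the regexes below are illustrative, not induced ones.

```python
# Sketch: applying a Σ-repair σ = ⟨Π, ρ⟩ to a tuple (illustrative reading).
import re

def apply_repair(tup, perm, rho):
    """perm: field permutation; rho: list of (attribute, regex with one group)."""
    reordered = " ".join(tup[i] for i in perm)       # Π(ū), fields concatenated
    repaired = []
    for attr, pattern in rho:                        # ƐA applied per attribute
        m = re.search(pattern, reordered)
        repaired.append(m.group(1) if m else None)   # null when Ɛ fails to match
    return tuple(repaired)

rho = [("MAKE",  r"Make:\s*(\S+)"),
       ("MODEL", r"Model:\s*(.+)$"),
       ("PRICE", r"(£\d+k?)")]
print(apply_repair(("£19k", "Make: Audi Model: A3 Sportback"), (0, 1), rho))
# -> ('Audi', 'A3 Sportback', '£19k')
```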
11. Computing Σ-repairs
Complexity [details in the paper]:
1. non-atomic misplacements: NP-complete (reduction from Weighted Set Packing)
2. atomic misplacements: polynomial (via Stars and Buckets)
We have an atomic misplacement when the correct value for an attribute is:
1. entirely misplaced, or
2. if it is over-segmented, its fragments are in adjacent fields of the relation.
Example over schema MAKE, MODEL, PRICE:
atomic misplacement: £22k Ford C-max Titanium X
non-atomic misplacement: C-max £22k X Ford Titanium
Naïve algorithm (sketched below), for each tuple:
1. permute the tuple's fields in all possible ways (only needed for non-atomic misplacements)
2. segment the tuple in all possible ways
3. ask the oracles and keep the segmentation of highest fitness
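A brute-force sketch of steps 2-3 for a single tuple (assuming atomic misplacements, so the permutation step is skipped, and treating the tuple as one token sequence). It illustrates the combinatorial blow-up and is not the authors' implementation.

```python
# Sketch: brute-force segmentation of one record into k = |Σ| contiguous parts.
from itertools import combinations

def best_segmentation(tokens, schema, oracles):
    n, k = len(tokens), len(schema)
    best, best_fit = None, -1.0
    # Choose k-1 cut points among n-1 gaps: exponentially many candidates.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        segs = [" ".join(tokens[bounds[i]:bounds[i + 1]]) for i in range(k)]
        fit = sum(oracles[a](s) for a, s in zip(schema, segs)) / k
        if fit > best_fit:
            best, best_fit = segs, fit
    return best, best_fit

oracles = {"MAKE":  lambda v: int(v in {"Audi", "Ford", "Citroën"}),
           "MODEL": lambda v: int(v in {"C-max Titanium X", "A3 Sportback"}),
           "PRICE": lambda v: int(v.startswith("£"))}
tokens = "£22k Ford C-max Titanium X".split()
print(best_segmentation(tokens, ("PRICE", "MAKE", "MODEL"), oracles))
# -> (['£22k', 'Ford', 'C-max Titanium X'], 1.0)
```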
12. Approximating Σ-repairs
The naïve algorithm has the following problems:
1. oracles do not (always) exist
2. it fixes one tuple at a time, while the wrapper needs a single fix for each attribute
3. even under the assumption of atomic misplacements, we still have to try O(nᵏ) different segmentations (worst case) before finding the one of maximum fitness
(1) Weak oracles: use noisy NERs in place of oracles; if none is available, it is easy to build one. In this work we use ROSeAnn (Chen et al., PVLDB '13).
(2 and 3) Approximate relation-wide repairs: wrappers are programs, and if they make a mistake they make it consistently, so there is hope of finding a common underlying attribute structure.
13. Finding the right structure
We have to solve two problems:
1. find the underlying structure(s) of the relation
2. find a segmentation that maximises the fitness
An obvious way is sequence labelling (e.g., Markov chains + Viterbi), where oracles are simulated by NERs (so they can make mistakes); a toy sketch follows below.
Example: annotated label rows with Ω = {ωA, ωB, ωC, ωD}:
a b c
a b c
a b c
a d
a d
b a d
b a d
[Figure: Markov chain over labels A, B, C, D between SOURCE and SINK, with transition weights 2-5]
The maximum-likelihood sequence is actually ⟨A, D⟩, which "fits" only ~28%. It looks like there is another sequence that fits better…
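A toy sketch (ours) of the Markov-chain encoding on exactly these rows: it learns first-order transition probabilities and scores the observed label sequences, reproducing the counter-intuitive winner ⟨A, D⟩.

```python
# Toy sketch: learn a memoryless Markov chain over annotation labels and
# score candidate label sequences; the most likely one can still "fit" poorly.
from collections import Counter
from math import prod

rows = [("A", "B", "C")] * 3 + [("A", "D")] * 2 + [("B", "A", "D")] * 2

counts, totals = Counter(), Counter()
for r in rows:
    path = ("START", *r, "END")
    for s, t in zip(path, path[1:]):
        counts[(s, t)] += 1
        totals[s] += 1

def likelihood(seq):
    path = ("START", *seq, "END")
    return prod(counts[(s, t)] / totals[s] for s, t in zip(path, path[1:]))

for seq in dict.fromkeys(rows):
    print(seq, round(likelihood(seq), 3))
# ('A','B','C') 0.184, ('A','D') 0.408, ('B','A','D') 0.065:
# the memoryless chain prefers <A,D>, which motivates the max-flow encoding.
```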
14. Finding the right structure
The problem is that Markov chains are memoryless: we have to remember the context and make sure our sequence satisfies the oracles more than any other. Ok… this sounds like a max-flow!
Same example, Ω = {ωA, ωB, ωC, ωD}:
a b c
a b c
a b c
a d
a d
b a d
b a d
[Figure: flow network with context-annotated nodes vA,(), vB,(A), vC,(A,B), vD,(A), vB,(), vA,(B), vD,(B,A) between SOURCE and SINK, with capacities 13, 9, and 4]
The sequence corresponding to the max flow is ⟨A, B, C⟩, which "fits" ~32%.
15. Finding the right structure
First, annotate the relation using NERs (surrogate oracles) and build the network.
Example, with Ω = {ωMAKE, ωMODEL, ωPRICE}:
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad quattro
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
[Figure: annotation network between SOURCE and SINK over MAKE, PRICE, and MODEL nodes, with edge capacities 8, 6, 3, and 2]
Then iteratively compute max flows on the network, i.e., likely sequences of high fitness (see the sketch below):
Iteration 0: flow of 6 along ⟨PRICE, MAKE, MODEL⟩
Iteration 1: flow of 3 along ⟨MAKE, PRICE, MODEL⟩
We stop when we have covered "enough" of the tuples in the relation.
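A compact sketch (ours, assuming networkx is available) of this loop: build the network from annotated label paths, where nodes are (label, context) pairs so that, unlike a Markov chain, the network remembers how a label was reached; then repeatedly take the max flow, read off the most-saturated path as a likely sequence, and remove its capacity.

```python
# Sketch: iterative max-flow extraction of likely attribute sequences.
import networkx as nx

def build_network(annotated_rows):
    """Add one unit of capacity per row along its (label, context) path."""
    g = nx.DiGraph()
    for row in annotated_rows:
        path, ctx = ["SOURCE"], ()
        for label in row:
            path.append((label, ctx))
            ctx += (label,)
        path.append("SINK")
        for u, v in zip(path, path[1:]):
            cap = g[u][v]["capacity"] + 1 if g.has_edge(u, v) else 1
            g.add_edge(u, v, capacity=cap)
    return g

def likely_sequences(annotated_rows, stop_at=1):
    """Yield (label sequence, #tuples supporting it), most supported first."""
    g = build_network(annotated_rows)
    while True:
        value, flow = nx.maximum_flow(g, "SOURCE", "SINK")
        if value < stop_at:
            return
        walk, node = [], "SOURCE"
        while node != "SINK":                      # follow most-saturated edges
            nxt = max(flow[node], key=flow[node].get)
            walk.append((node, nxt))
            node = nxt
        support = min(g[u][v]["capacity"] for u, v in walk)
        yield tuple(v[0] for _, v in walk if v != "SINK"), support
        for u, v in walk:                          # consume the used capacity
            g[u][v]["capacity"] -= support
            if g[u][v]["capacity"] == 0:
                g.remove_edge(u, v)

rows = [("PRICE", "MAKE", "MODEL")] * 3 + [("MAKE", "PRICE", "MODEL")] * 2
print(list(likely_sequences(rows)))
# [(('PRICE', 'MAKE', 'MODEL'), 3), (('MAKE', 'PRICE', 'MODEL'), 2)]
```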
16. Fixing the relation (and the wrapper)
Max flows represent likely sequences; we use them to eliminate unsound annotations. The remaining annotations can be used as examples for regex induction:
£19k Make: Audi Model: A3 Sportback → PRICE [0,4), MAKE [11,15), MODEL [24,36)
We can then use standard regex-induction algorithms to obtain robust expressions (a toy sketch follows below). The induced expressions recover missing (incomplete) annotations:
£19k Make: Audi Model: A3 Sportback
£43k Make: Audi Model: A6 Allroad quattro
Citroën £10k Model: C3
Ford £22k Model: C-max Titanium X
ρ = {
⟨MAKE, substring-before($, £) or substring-before(substring-after($, 'ke:␣'), '␣Mo')⟩,
⟨MODEL, substring-after($, 'el:␣')⟩,
⟨PRICE, substring-after(substring-before($, 'kMa␣' || 'kMo␣'), ␣)⟩
}
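A toy sketch (ours, stdlib only) of prefix/suffix-based induction in the spirit described here and on the poster (common prefixes, suffixes, non-content strings): from clean annotated spans it learns the longest common literal context around a value and turns it into a regex; with too little regularity it returns None, signalling the value-based fallback of the next slide.

```python
# Toy prefix/suffix induction from clean annotated examples (illustrative).
import os
import re

def induce(examples):
    """examples: list of (record, (start, end)) spans for one attribute."""
    prefixes = [rec[:s] for rec, (s, e) in examples]
    suffixes = [rec[e:] for rec, (s, e) in examples]
    # Longest common *ending* of the prefixes = literal left context.
    left = os.path.commonprefix([p[::-1] for p in prefixes])[::-1]
    # Longest common *beginning* of the suffixes = literal right context.
    right = os.path.commonprefix(suffixes)
    if not (left or right):
        return None  # not enough regularity: fall back to value-based expressions
    return re.compile(re.escape(left) + r"(.+?)" + (re.escape(right) or r"$"))

examples = [("£19k Make: Audi Model: A3 Sportback", (11, 15)),
            ("£10k Make: Ford Model: Focus", (11, 15))]
rx = induce(examples)                    # learns 'k Make: (.+?) Model: '
print(rx.search("Citroën £10k Model: C3"))              # no left context -> None
print(rx.search("£55k Make: Seat Model: Ibiza").group(1))  # -> 'Seat'
```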
17. Approximating Σ-repairs
When an expression fails to match a minimum number of tuples, we fall back to the NERs, using value-based expressions (sketched below).
Input relation (MAKE | MODEL | PRICE):
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
Citroën £10k C3
Ford £22k C-max Titanium X
Example (induction threshold 75%), value-based fallback for MAKE:
ρ = {
⟨MAKE, value-based($, [Audi, Ford])⟩,
⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩,
⟨PRICE, substring-after(substring-before($, k␣), ␣)⟩
}
Example (induction threshold 20%), induced expression for MAKE:
ρ = {
⟨MAKE, substring-before($, £) or substring-before(substring-after($, k␣), ␣)⟩,
⟨MODEL, substring-after(substring-after($, ␣), ␣)⟩,
⟨PRICE, substring-after(substring-before($, k␣), ␣)⟩
}
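A small sketch (ours) of the fallback rule these two examples illustrate: keep an induced expression only if it matches at least the threshold fraction of tuples, otherwise compile a value-based disjunction of the annotated values. The induced regex below is hypothetical.

```python
# Sketch: choose between an induced regex and a value-based disjunction.
import re

def choose_expression(induced, annotated_values, records, threshold=0.75):
    if induced is not None:
        hits = sum(bool(induced.search(r)) for r in records)
        if hits / len(records) >= threshold:
            return induced                       # enough regularity: keep it
    # Fall back to the NER output: a disjunction of annotated values.
    alternation = "|".join(re.escape(v)
                           for v in sorted(annotated_values, key=len, reverse=True))
    return re.compile(f"({alternation})")

records = ["£19k Audi A3 Sportback", "£43k Audi A6 Allroad quattro",
           "Citroën £10k C3", "Ford £22k C-max Titanium X"]
induced = re.compile(r"k (\w+) ")                # hypothetical induced MAKE regex
expr = choose_expression(induced, {"Audi", "Ford"}, records, threshold=0.75)
print(expr.pattern)   # matches only 2/4 records -> falls back to '(Audi|Ford)'
```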
18. Evaluation
Dataset: an enhanced version of the SWDE dataset (https://swde.codeplex.com);
10 domains, 100 websites, 78 attributes, ~100k pages, ~130k records.
Systems:
Wrapper-generation systems: DIADEM, DEPTA, ViNTs, RoadRunner.
Baseline wrapper induction/repair system: WEIR (Crescenzi et al., VLDB '13).
Implementation: WADaR (Wrapper and Data Repair), Java + SQL.
19. Evaluation: Highlights
0
0.2
0.4
0.6
0.8
1
ViN
Ts
(R
E)
ViN
Ts
(Auto)
D
IAD
EM
(R
E)
D
EPTA
(R
E)
D
IAD
EM
(Auto)
D
EPTA
(Auto)
R
R
(Auto)
R
R
(Book)
R
R
(C
am
era)
R
R
(Job)
R
R
(M
ovie)
R
R
(N
ba)R
R
(R
estaurant)R
R
(U
niversity)
Precision (Original) Precision (Repaired) Recall (Original)
Recall (Repaired) FScore (Original) FScore (Repaired)
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
FScore Original FScore Repaired
Fig. 2: Impact of repair.
[…] up to 30% in real estate, with an identical effect in almost all domains.
Attribute-level accuracy. Another question is whether there are substantial differences in attribute-level accuracy. The top of Table III shows attributes where the repair is very effective (F1-Score ≈ 1 after repair). These values appear as highly structured attributes on web pages, and the corresponding expressions repair almost all tuples. As an example, DOOR NUMBER is almost always followed by the suffix dr or door. In these cases, the wrapper induction under-segmented the text due to a lack of sufficient examples.
TABLE III: Attribute-level evaluation.
System | Domain | Attribute | Original F1-Score | Repaired F1-Score
DIADEM | real estate | POSTCODE | 0.304 | 0.947
DIADEM | auto | DOOR | 0 | 0.984
TABLE IV: Accuracy of large-scale evaluation (Precision and Recall computed on a sample; values above 0.9 highlighted in bold in the paper).
Attribute | Precision | Recall | % Modified values
LOCALITY | 0.993 | 0.993 | 11.34%
OPENING HOURS | 1.00 | 0.461 | 17.14%
LOCATED WITHIN | 1.00 | 0.224 | 29.75%
PHONE | 0.987 | 0.849 | 50.74%
POSTCODE | 0.999 | 0.989 | 9.4%
STREET ADDRESS | 0.983 | 0.98 | 83.78%
To estimate the impact of the repair, we computed, for each attribute, the percentage of values that differ before and after the repair step; these numbers are shown in the last column of Table IV. Clearly, the repair is beneficial in all of the cases. For OPENING HOURS and LOCATED WITHIN, where recall is very low, […]
WADaR increases F1-score between 15% and 60% (excluding ViNTs)
[Truncated paper column on scalability: running time grows linearly w.r.t. the number of records and polynomially w.r.t. the number of attributes; the largest network obtained contains 45,797 edges and is processed in ~3 seconds. The column then compares the approach with data-integration systems (WEIR, below).]
[Chart: WEIR vs. Repair, Precision/Recall/FScore per domain: Auto, Book, Camera, Job, Movie, Nba, Restaurant, University]
WADaR is 23% more accurate than WEIR on average
20. Evaluation: Robustness
We studied how the F1-score varies w.r.t. annotation noise.
The accuracy numbers are limited to those attributes where our approach induces regular expressions, since it is already clear that annotator errors directly reduce the accuracy of value-based expressions. This is still a significant number of attributes, i.e., 65% in all cases except for RoadRunner on book (35%) and RoadRunner on movie (46%).
Fig. 8: Annotator recall drop, fixed induction threshold of 75% (high dependence on annotation quality).
Figure 8 shows the impact of a drop in annotator recall (x-axis) on the F1-Score. As we can see, our approach is robust to a drop in recall until we reach an 80% loss; then performance decays rapidly. This is somewhat expected, since the induced regular expressions compensate for the missing recall up to the point where the max-flow sequences are no longer able to reliably determine the underlying attribute structure.
Fig. 9: F1-Score variation with a threshold value of 0.1, i.e., a fixed induction threshold of 10% (low dependence on annotation quality).
Figure 9 shows the effect on the F1-Score if we set a low regex-induction threshold (0.1) instead. Clearly, in this case our approach is highly robust to annotator inaccuracy, and we notice a loss in performance only after an 80-90% loss in recall. In summary, a lower regex-induction threshold is advisable when we know that annotators have low recall; even with a very inaccurate annotator, our approach remains robust.
Takeaways: the F1-score starts being affected only at ~80% recall loss; precision loss does not affect WADaR until ~300% injected random noise.
21. Evaluation: Scalability
Worst-case scenario: all tuples are annotated with all attribute types, i.e., each record contains k annotated tokens, each annotation has a different context, and each record produces a different path on the network. This results in a network with n·k+2 nodes and n·k+n edges. The chart on the left of Figure 3 plots the running time over an increasing number of records (with the number of attributes fixed), while the chart on the right plots it over an increasing number of attributes.
Fig. 3: Running time.
WADaR scales linearly w.r.t. the size of the relation and polynomially w.r.t. the number of attributes.
Oracles decouple the problem of finding similar instances from the segmentation:
£19k Audi A3 Sportback
£43k Audi A6 Allroad quattro
£10k Citroën C3
£22k Ford C-max Titanium X
Ω = {ωMAKE, ωMODEL, ωPRICE}
22. Open issues
Learning oracles: building oracles is not difficult but still requires engineering time. The IBM SystemT people did some good work in this direction; we can start there.
Missing attributes: right now, if the wrapper fails to recover data, we cannot repair it. It is possible to manipulate the wrapper to match more content.
Markov chains vs. max flows on wrapped relations: they seem to eventually compute the same sequences, but in a different order… proof? What we do know is that max flows best approximate the maximum fitness at every step.
23. Questions?
References:
L. Chen, S. Ortona, G. Orsi, M. Benedikt. Aggregating Semantic Annotators. PVLDB '13.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. Joint Repairs for Web Wrappers. ICDE '16.
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche. WADaR: Joint Wrapper and Data Repair. VLDB '15 (Demo).
T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, C. Schallhart, C. Wang. DIADEM: Thousands of websites to a single database. PVLDB '15.
T. Furche, G. Gottlob, L. Libkin, G. Orsi, N. Paton. Data Wrangling for Big Data. EDBT '16.
Web Data Extraction (RoadRunner, DEPTA) produces:
Attribute_1 → Attribute_2
Schindler's List → Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release) → Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release) → Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Joint Data and Wrapper Repair turns, e.g.,
Schindler's List → Director: Steven Spielberg Rating: R Runtime: 195 min
into
Title | Director | Rating | Runtime
Schindler's List | Steven Spielberg | R | 195 min
Maximal Repair is NP-complete
[Figure: an under-segmented value "Director: Steven Spielberg Rating: R Runtime: 195 min" with candidate partitions φ1…φ4 over Director, Rating, and Runtime; the correct one yields Steven Spielberg | R | 195 min]
OBSERVATIONS
Templated websites: data is published following a template.
Wrapper behaviour: wrappers rarely misplace and over-segment at the same time, and they make systematic errors.
Oracles: oracles can be implemented as (ensembles of) NERs; NERs are not perfect, i.e., they make mistakes.
Joint Wrapper and Data Repair (WADaR)
Stefano Ortona, stefano.ortona@cs.ox.ac.uk, University of Oxford, UK
Giorgio Orsi, giorgio.orsi@cs.ox.ac.uk, University of Oxford, UK
Marcello Buoncristiano, marcello.buoncrisitano@yahoo.it, Università della Basilicata, Italy
Tim Furche, tim.furche@cs.ox.ac.uk, University of Oxford, UK
http://diadem.cs.ox.ac.uk/wadar
When values are both misplaced and over-segmented, computing repairs of maximal fitness is hard; otherwise, just do the following:
(1) Compute all possible non-crossing k-partitions (k = |R|) of the tokens, i.e., assign to each attribute an element of the partition (O(nᵏ), a Narayana number).
(2) Discard tokens never accepted by the oracles in any of the partitions.
(3) Collapse identical partitions and choose the one with maximal fitness.
Without misplacement and over-segmentation, a solution can be found in polynomial time by computing non-crossing k-partitions.
NP-hardness: reduction from Weighted Set Packing. Membership in NP: guess a partition; deciding non-crossingness and computing the fitness are in PTIME.
Web data extraction (aka scraping/wrapping) uses wrappers to turn web pages into structured data.
Wrapper: a structure { ⟨R, ƐR⟩, { ⟨A1, ƐA1⟩, …, ⟨Am, ƐAm⟩ } } specifying the objects to be extracted (listings, records, attributes) and the corresponding XPath expressions, e.g.:
⟨RATING, //li[@class='second']/p⟩
⟨RUNTIME, //li[@class='third']/ul/li[1]⟩
Wrappers are often created algorithmically and in large numbers; tools capable of maintaining them over time are missing. Algorithmically-created wrappers generate data that is far from perfect: data can be badly segmented and misplaced.
Take a set Ω of oracles, where each ωA in Ω can say whether a value vA belongs to the domain of A. We define the fitness of a relation R w.r.t. Ω as f(R, Σ, Ω) = ∑_{ū∈R} f(ū, Σ, Ω) / |R| (see the definition in the talk above).
Repair: specifies regular expressions that, when applied to the original relation, produce a new relation with higher fitness, e.g.:
⟨TITLE, ⟨1⟩, string($)⟩
⟨DIRECTOR, ⟨2⟩, substring-before(substring-after($, tor:_), _Rat)⟩
⟨RATING, ⟨2⟩, substring-before(substring-after($, ing:_), _Run)⟩
⟨RUNTIME, ⟨2⟩, substring-after($, time:_)⟩
[Figure: candidate segmentations of badly segmented values, e.g., ⟨Director: Steven⟩⟨Spielberg Rating: R Runtime:⟩⟨195 min⟩, ⟨Director:⟩⟨Steven Spielberg⟩⟨Rating: R Runtime: 195⟩, ⟨Director: Steven Spielberg⟩⟨Rating: R⟩⟨Runtime: k195 min⟩]
WADaR:
⟨DIRECTOR, //li[@class=‘first’]/div/span⟩
APPROXIMATING JOINT REPAIRS
1. Annotation
Each record is interpreted as a string (a concatenation of attributes), which the NERs analyse to identify relevant attributes. Entity recognisers make mistakes; WADaR tolerates incorrect and missing annotations.
Attribute_1 → Attribute_2
Schindler's List → Director: Steven Spielberg Rating: R Runtime: 195 min
Lawrence of Arabia (re-release) → Director: David Lean Rating: PG Runtime: 216 min
Le cercle Rouge (re-release) → Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
The life of Jack Tarantino (coming soon) → Director: David R Lynch Rating: Not Rated Runtime: 123 min
[Figure: Title, Director, Rating, and Runtime annotation spans marked over the records]
2. Segmentation
Goal: understand the underlying structure of the relation.
Two possible ways of encoding the problem:
1. Max-flow sequence in a flow network. [Figure: flow network from START to SINK over TITLE, DIRECTOR, RATING, and RUNTIME nodes with capacities (e.g., 11, 8, 3); MAX-FLOW SEQUENCE: DIRECTOR RATING RUNTIME]
2. Most likely sequence in a memoryless Markov chain. [Figure: Markov chain from START to SINK over the same labels with transition probabilities (e.g., 3/4, 1/4, 1); MOST LIKELY SEQUENCE: DIRECTOR RATING RUNTIME]
The solutions often coincide. Markov chains: intuitive and faster to compute. Max flows: provably optimal.
3. Induction
Input: a set of clean annotations to be used as positive examples.
Schindler's List
Lawrence of Arabia (re-release)
Le cercle Rouge (re-release)
SUFFIX = substring-before("_(")
Director: Steven Spielberg Rating: R Runtime: 195 min
Director: David Lean Rating: PG Runtime: 216 min
Director: Jean-Pierre Melville Rating: Not Rated Runtime: 140 min
Director: David R Lynch Rating: Not Rated Runtime: 123 min
PREFIX = substring-after("tor:_"), SUFFIX = substring-before("_Rat")
PREFIX = substring(string-length()-7)
WADaR induces regular expressions by looking, e.g., at common prefixes, suffixes, non-content strings, and character length. Induced expressions improve recall:
token value1 token token token token
token token value2 token token token
token token token value3 token token
When WADaR cannot induce regular expressions (not enough regularity), the data is repaired directly with the annotators. Wrappers are instead repaired with value-based expressions, i.e., a disjunction of the annotated values:
ATTRIBUTE = string-contains("value1"|"value2"|"value3")
Empirical Evaluation
[Chart data and charts: Precision, Recall, and FScore (Original vs. Repaired) for ViNTs (RE/Auto), DIADEM (RE/Auto), DEPTA (RE/Auto), and RoadRunner on Auto, Book, Camera, Job, Movie, Nba, Restaurant, and University]
5.1 Setting
Datasets. The dataset consists of 100 websites from 10 domains and is an enhanced version of SWDE [20], a benchmark commonly used in web data extraction. SWDE's data is sourced from 80 sites and 8 domains: auto, book, camera, job, movie, NBA player, restaurant, and university. For each website, SWDE provides collections of 400 to 2k detail pages (i.e., where each page corresponds to a single record). We complemented SWDE with collections of listing pages (i.e., pages with multiple records) from 20 websites of the real estate (RE) and auto domains. Table 1 summarises the characteristics of the dataset.
Table 1: Dataset characteristics.
Domain Type Sites Pages Records Attributes
Real Estate listing 10 271 3,286 15
Auto listing 10 153 1,749 27
Auto detail 10 17,923 17,923 4
Book detail 10 20,000 20,000 5
Camera detail 10 5,258 5,258 3
Job detail 10 20,000 20,000 4
Movie detail 10 20,000 20,000 4
Nba Player detail 10 4,405 4,405 4
Restaurant detail 10 20,000 20,000 4
University detail 10 16,705 16,705 4
Total - 100 124,715 129,326 78
SWDE comes with ground-truth data created under the assumption that wrapper-generation systems could only generate extraction rules with DOM-element granularity, i.e., without segmenting text nodes. Modern wrapper-generation systems support text-node segmentation, and we therefore refined the ground truth accordingly. As an example, in the camera domain, the original ground-truth values for MODEL consisted of the entire product title; besides the model, the text node includes COLOR, PIXELS, and MANUFACTURER. The ground truth for the real estate and auto domains has been created following the SWDE format. The final dataset consists of more than 120k pages, for almost 130k records containing more than 500k attribute values.
Wrapper-generation systems. We generated input relations for our evaluation using four wrapper-generation systems: DIADEM [19], DEPTA [36], and ViNTs [39] for listing pages, and RoadRunner [12] for detail pages.¹ The output of DIADEM, DEPTA, and RoadRunner can be readily used in the evaluation, since these are full-fledged data-extraction systems supporting the segmentation of both records and attributes within listings or (sets of) detail pages. ViNTs, on the other hand, segments rows into records within a search-result listing and, as such, has no concept of attribute; instead, it segments rows within a record. We therefore post-processed its output, typing the content of lines from different records that are likely to have the same semantics, using a naïve similarity heuristic based on relative position in the record and string-edit distance of the row's content. This is a very simple version of the more advanced alignment methods based on instance-level redundancy used by, e.g., WEIR and TEGRA [7].
Metrics. The performance of the repair is evaluated by comparing wrapper-generated relations against the SWDE ground truth before and after the repair. The metrics used for the evaluation are Precision, Recall, and F1-Score computed at attribute level. Both the ground truth and the extracted values are normalised, and exact matching between the extracted values and the ground truth is required for a hit. For space reasons, this paper presents only the most relevant results. The results of the full evaluation, together with the dataset, gold standard, extracted relations, and the code of the normaliser and of the scorer, are available at the online appendix [1].
¹RoadRunner can be configured for listings, but it performs better on detail pages.
All experiments were run on a desktop with an Intel quad-core i7 at 3.40GHz, 16 GB RAM, and Linux Mint 17.
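The metrics paragraph above (attribute-level Precision/Recall/F1 with exact matching on normalised values) can be captured in a few lines; this is our illustrative scorer, not the one released in the online appendix.

```python
# Sketch: attribute-level Precision/Recall/F1 with exact matching on
# normalised values (illustrative; the released scorer is in the appendix).
def normalise(v):
    return " ".join(v.lower().split()) if v is not None else None

def prf1(extracted, gold):
    """extracted/gold: lists of values for one attribute, aligned by record."""
    pairs = [(normalise(e), normalise(g)) for e, g in zip(extracted, gold)]
    hits = sum(e == g for e, g in pairs if e is not None and g is not None)
    extracted_n = sum(e is not None for e, _ in pairs)
    gold_n = sum(g is not None for _, g in pairs)
    p = hits / extracted_n if extracted_n else 0.0
    r = hits / gold_n if gold_n else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(prf1(["195 min", "216  MIN", None], ["195 min", "216 min", "140 min"]))
# -> (1.0, 0.666..., 0.8): normalisation makes '216  MIN' an exact match.
```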
5.2 Repair performance
Relation-level accuracy. The first two questions we want to answer are whether joint repairs are necessary and what their impact is in terms of quality. Table 2 reports, for each system, the percentage of: (i) Correctly extracted values. (ii) Under-segmentations, i.e., when values for an attribute are extracted together with values of other attributes or spurious content; websites often publish multiple attribute values within the same text node, and the involved extraction systems are not able to split values into multiple attributes. (iii) Over-segmentations, i.e., when attribute values are split over multiple fields. As anticipated in Section 2, this rarely happens, since an attribute value is often contained in a single text node; in this setting an attribute value can be over-segmented only if the extraction system is capable of splitting single text nodes (DIADEM), and even then the splitting happens only when the system can identify a strong regularity within the text node. (iv) Misplacements, i.e., values placed in, or labeled as, the wrong attribute; this is mostly due to a lack of semantic knowledge and to confusion introduced by overlapping attribute domains. (v) Missing values, due to lack of regularity and optionality in the web source (RoadRunner, DEPTA, ViNTs) or to missing values in the domain knowledge (DIADEM). Note that the numbers do not add up to 100%, since errors may fall into multiple categories.
Table 2: Wrapper generation system errors.
System | Correct (%) | Under Segmented (%) | Over Segmented (%) | Misplaced (%) | Missing (%)
DIADEM | 60.9 | 34.6 | 0 | 23.2 | 3.5
DEPTA | 49.7 | 44 | 0 | 25.3 | 6
ViNTs | 23.9 | 60.8 | 0 | 36.4 | 15.2
RoadRunner | 46.3 | 42.8 | 0 | 18.6 | 10.4
These numbers clearly show that there is a quality problem in wrapper-generated relations, and they also support the atomic-misplacement assumption.
Figure 2 shows, for each system and each domain, the impact of the joint repair on our metrics. Light-colored (resp. dark-colored) bars denote the quality of the relation before (resp. after) the repair. A first conclusion that can be drawn is that a repair is always beneficial. Of 697 extracted attributes, 588 (84.4%) require some form of repair, and the average pre-repair F1-Score produced by the systems is 50%. We are able to induce a correct regular expression for 335 (57%) attributes, while for the remaining 253 (43%) the approach produces value-based expressions. We can repair at least one attribute in each of the wrappers in all of the cases, and we can repair more than 75% of attributes in more than 80% of the cases.
Among the considered systems, DIADEM delivers, on average, the highest pre-repair F1-Score (~60%), but it never exceeds 65%. RoadRunner is on average worse than DIADEM, but it reaches a better 70% F1-Score on restaurant: websites in this domain are in fact highly structured, and individual attribute values are contained in a dedicated text node. When attributes are less structured, e.g., on book, camera, and movie, RoadRunner has a significant drop in performance. As expected, ViNTs delivers the worst pre-cleaning results.
In terms of accuracy, our approach delivers a boost in F1-Score between 15% and 60%. Performance is consistently close to or above 80% across domains and, except for ViNTs, across systems, with a peak of 91% for RoadRunner on NBA player.
The following are the remaining causes of errors: (i) missing values cannot be repaired, as we can only use the data available in the extracted relation. […]
[Chart: WEIR vs. Repair, Precision/Recall/FScore per domain: Auto, Book, Camera, Job, Movie, Nba, Restaurant, University]
Evaluation
100 websites, 10 domains, 4 wrapper-generation systems. Precision, Recall, and F1-Score computed before and after repair; metrics consider exact matches only.
WADaR boosts the F1-Score between 15% and 60%. Performance is consistently close to or above 80%.
WADaR against WEIR: see the chart above.
WADaR is highly robust to errors of the NERs.
WADaR scales linearly with the size of the input relation; optimal joint-repair approximations are computed in polynomial time.
Optimality
WADaR provably produces relations of maximum fitness, provided that the fraction of correctly annotated tuples exceeds the maximum error rate of the annotators.
More questions? Come to the poster later!!!