Overview of the ExAlg algorithm for Extracting Structured Data from Web Pages. Original work by Arvind Arasu & Hector Garcia-Molina {arvinda,hector}@cs.stanford.edu. Presentation by Alessandro Manfredi & Marco Bontempi {a.n0on3,bontempi}@gmail.com
ExAlg Overview
1. Overview of
ExAlg
Extracting Structured Data from Web Pages
Original work by
Arvind Arasu & Hector Garcia-Molina
{arvinda,hector}@cs.stanford.edu
Presentation by
Alessandro Manfredi & Marco Bontempi & Marco Giannone
{a.n0on3,bontempi,marco.giannone}@gmail.com
2. Introduction
WWW: billions of unstructured HTML pages
Many dynamically generated from underlying
structured source (e.g. relational database)
Information is hard to query
5. The goal
Fully automatic extraction of structured data
... in order to pose complex queries over data
Semantic meaning for extracted data is not an aim
Avoid human input ( error prone and time consuming,
especially for frequent template changes )
6. Challenges
Distinguish template from data
Syntactically, there is nothing that tells
template text apart from data
Deduce the schema for the encoded information:
Complex schemas, arbitrary levels of nesting
8. Data
Structure conformed to a common Schema or Type
Types:
β (Basic Type: string of tokens)
Token: word or HTML tag ( or bit, or char )
Tuple < T₁,T₂,...,Tn >, where element ei is an instance
of type Ti
Set { T }, where all elements are instances of type T
9. Template
Function that maps each type constructor τ
of S into an ordered set of strings T(τ), e.g.
A <html><body><b> Book :</b>
B by
C <b>Cost :</b>
D </body></html>
E,G ε (empty string)
F _ (space)
H and
Resulting page:
<html><body><b> Book :</b> C Programming Language by Brian Kernighan and Dennis Ritchie <b>Cost :</b> $30.00 </body></html>
x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>
λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD
where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’
16. Equivalence classes
OV = <1,2,1,0> (the same occurrence vector, shown on the slide for four different tokens)
17. Assumption
For real pages, an equivalence class of large size and
support is usually derived from the same template
portion
Size: # of tokens in the equivalence class
Support (of a token): # of pages in which the
token occurs ( i.e. where its OV entry is not 0 )
Tokens generated by data usually occur infrequently
and therefore do not occur in an EQ that occurs very
often (well ...)
19. LFEQs
Equivalence Class : A maximal set of tokens having the
same occurrence-vector
LFEQ: Large and Frequently occurring EQuivalence
classes
Large = Size ≥ SizeThres, Frequently occurring = Support ≥ SupThres
SizeThres & SupThres are user-given parameters
Intuition : Almost always, LFEQs are formed by tokens
associated with the same type constructor in the
template used to create the input pages
22. Schema, Template and ExAlg steps (diagram)
Legend: β = basic type; < .., .., ..>τ = tuple; { T }τ = set; τ = type constructor
Page generation: an instance of the Schema is encoded with the (unknown) Template, λ(TS,x), producing the input pages
Steps applied to the input pages:
(1) Differentiating token roles using the parse-tree path
Find LFEQs based on τ-association and token roles
Handle invalid LFEQs (either not ordered or not correctly nested)
(2) Differentiating token roles using equivalence classes
Result: LFEQs & pages with differentiated tokens, from which the (constructed) Template is built
23. Template Construction
Recursively consider the EQ ‘e’ between consecutive
tokens i and j from starting and trailing empty strings
Def. e is ‘empty’ iff i and j always occur contiguously
If e is empty, i and j are single template tokens
If e is not empty, there are either basic types (coming
from the database) or a more complex unknown type between them
25. Heuristic Assumption
Summary
A1 - A large number of template-tokens have unique
roles, allowing the formation of EQ and subsequent
differentiation
A2 - A large number of tokens derive from type
constructors. Each τ is instantiated many times in Ps,
so LFEQs can be recognized
A3 - There is no ‘regularity’ in encoded data that
leads to the formation of invalid EQs
A4 - There are separators around data values
26. Experiments
ExAlg has been tested both on input page sets
that fit the previous assumptions and on sets that do not.
Input collections have been gathered from the RISE
repository, from related-work sites such as
ROADRUNNER and IEPAD, and from various well-known
sites, e.g. eBay, DBLP, Google, and the Sigmod
Anthology.
27. Procedure
Manually generate the schemas
Values encoded in tag attributes have not been
considered (e.g. urls, image src, etc.)
Aggregated data has been treated as a single
token (e.g. dates such as “Nov 8 2008”)
28. Evaluation
For each leaf, the system's output is:
Correct , if the extracted value is the same as the one
extracted by the manually defined schema
Partially Correct , if the extracted value matches
two or more leaves in the manually defined schema
Incorrect , if the extracted value is neither
correct nor partially correct, i.e. the extracted
value is a portion of the corresponding value in the manually
defined schema, or some other misaligned form
30. Results
When A3 holds, none of the schema attributes
is classified as incorrect
When A1 and A2 also hold, all of the schema
attributes are classified as correct
Elements classified as partially correct can be
refined with very little human interaction.
31. Heuristic Assumption
Summary
A1 - A large number of template-tokens have unique
roles, allowing the formation of EQ and subsequent
differentiation
A2 - A large number of tokens derive from type
constructors. Each τ is instantiated many times in Ps,
so LFEQs can be recognized
A3 - There is no ‘regularity’ in encoded data that
leads to the formation of invalid EQs
A4 - There are separators around data values
32. Some Numbers
Acc. = Accuracy of the system for all of the collections
Norm. Acc. = Normalized Accuracy with respect to the # of
attributes for each collection.
33. Conclusions
EQ and Token Roles to discover the template
Limited failures when some assumptions do not hold
34. Open Problems
Lower bounds for the complexity of some modules'
algorithms are not given.
Failure for some given configurations
e.g.
<ol> <li>β</li> <li>β</li> <li>β</li> </ol>
<ol> <li>β</li> <li>β</li> </ol>
... do not belong to the same EQ, so the list
elements (all of them, for each list) are extracted as
different attributes.
37. Recall on Structured Data
Conformed to a common Schema or Type
Types:
β (Basic Type: string of tokens)
Token: word or HTML tag ( or bit, or char )
Tuple < T₁,T₂,...,Tn >, where element ei is an instance
of type Ti
Set { T }, where all elements are instances of type T
38. Example
Set of pages describing books: each page contains book
title, set of authors ( first and last name ), cost of the
book
Schema of the encoded data:
S₁ = < β , { <β,β>t₃ }t₂ , β >t₁
t₁ and t₃ are tuple constructors, t₂ is a set constructor
39. Schemas and Trees
S₁ = < β , { <β,β>t₃ }t₂ , β >t₁
Tree of S₁:
< >t₁
  β
  { }t₂
    < >t₃
      β
      β
  β
Sub-trees are also schemas, called sub-schemas of the original schema
40. Recall: Template
Function that maps each type constructor τ
of S into an ordered set of strings T(τ), e.g.
A <html><body><b> Book :</b>
B by
C <b>Cost :</b>
D </body></html>
E,G ε (empty string)
F _ (space)
H and
{A..H} are strings
T₁(τ₁) = <A,B,C,D>
T₁(τ₂) = H
T₁(τ₃) = <E,F,G>
x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>
λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD
where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’
<html><body><b> Book :</b> C Programming Language by Brian Kernighan and Dennis Ritchie
<b>Cost :</b>$30.00 </body></html>
41. Encoding
The encoding λ(T,x) of an instance x of a schema S,
given a template T for S, is defined recursively in terms of
the encodings of the sub-values of x:
if x is a basic type β, λ(T,x) = x
elif x is a tuple < x1, ..., xn >τt,
λ(T,x) = C1 λ(T,x1) C2 ... Cn λ(T,xn) Cn+1
elif x is a set { e1, ..., em }τs,
λ(T,x) = λ(T,e1) ς λ(T,e2) ς ... ς λ(T,em)
where T(τt) = < C1 , ..., Cn+1 > and T(τs) = ς
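The recursive definition above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names `Tup` and `SetVal`, the dictionary layout of the template, and the exact whitespace inside the strings A..H are all choices made for this sketch, not something given in the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Tup:
    """A tuple value <x1, ..., xn> built by the tuple constructor `tau` (hypothetical class)."""
    tau: str
    items: List["Value"]


@dataclass
class SetVal:
    """A set value {e1, ..., em} built by the set constructor `tau` (hypothetical class)."""
    tau: str
    items: List["Value"]


Value = Union[str, Tup, SetVal]
# A template maps a tuple constructor to its n+1 strings and a set constructor to one separator.
Template = Dict[str, Union[List[str], str]]


def encode(template: Template, x: Value) -> str:
    """Return the encoding lambda(T, x), following the three cases of the definition."""
    if isinstance(x, str):                       # basic type: the value itself
        return x
    if isinstance(x, Tup):                       # C1 enc(x1) C2 ... Cn enc(xn) Cn+1
        strings = template[x.tau]
        assert len(strings) == len(x.items) + 1
        parts = [strings[0]]
        for sub, following in zip(x.items, strings[1:]):
            parts.append(encode(template, sub))
            parts.append(following)
        return "".join(parts)
    if isinstance(x, SetVal):                    # enc(e1) S enc(e2) S ... S enc(em)
        return template[x.tau].join(encode(template, e) for e in x.items)
    raise TypeError(f"unsupported value: {x!r}")


# Book example; the spaces inside A, B, C, D, H are assumed for readability.
T1: Template = {
    "t1": ["<html><body><b> Book :</b> ", " by ", " <b>Cost :</b> ", " </body></html>"],
    "t2": " and ",
    "t3": ["", " ", ""],
}
x1 = Tup("t1", [
    "C Programming Language",
    SetVal("t2", [Tup("t3", ["Brian", "Kernighan"]), Tup("t3", ["Dennis", "Ritchie"])]),
    "$30.00",
])
page = encode(T1, x1)
```

With these assumed strings, `page` is the book page from the template example, i.e. the concatenation A t B E f₁ F l₁ G H E f₂ F l₂ G C c D.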
43. Simpler (infix) notation
A <html><body><b> Book :</b>
B by
C <b>Cost :</b>
D </body></html>
E,G ε (empty string)
F _ (space)
H and
Template:
< A * B { < E * F * G > }H C * D >
Where
* is a wildcard for β (Basic Type)
{ § }H is ‘ § (H §)+ ’
Assumption: the same values in a tuple occur in the same relative position
with respect to the values of other attributes in all the pages. (e.g. the
book name always occurs before the list of the authors and the price)
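One way to read the infix template is as a pattern that can be matched against a page to recover the values. The sketch below turns the book template into a regular expression. It is an illustration only, not how ExAlg itself extracts values: the exact whitespace in A..H, the names `WORD`, `TEXT` and `extract`, and the simplification that each author name is a single word are all assumptions. It also uses `(H §)*` rather than the slide's `(H §)+`, so that a single-author page would match too.

```python
import re

# Template strings A..H; the exact whitespace is an assumption of this sketch.
A = "<html><body><b> Book :</b> "
B = " by "
C = " <b>Cost :</b> "
D = " </body></html>"
E = G = ""
F = " "
H = " and "

WORD = r"\S+"   # simplification: first/last name is one whitespace-free word
TEXT = r".+?"   # any other basic value, matched lazily

# < E * F * G >  ->  one author
author = re.escape(E) + WORD + re.escape(F) + WORD + re.escape(G)

# < A * B { author }H C * D >
pattern = re.compile(
    re.escape(A) + f"(?P<title>{TEXT})" + re.escape(B)
    + f"(?P<authors>{author}(?:{re.escape(H)}{author})*)"
    + re.escape(C) + f"(?P<cost>{TEXT})" + re.escape(D)
)


def extract(page: str):
    """Return (title, [(first, last), ...], cost), or None if the page does not fit."""
    m = pattern.fullmatch(page)
    if m is None:
        return None
    authors = [tuple(a.split(F, 1)) for a in m["authors"].split(H)]
    return m["title"], authors, m["cost"]


page = ("<html><body><b> Book :</b> C Programming Language by "
        "Brian Kernighan and Dennis Ritchie <b>Cost :</b> $30.00 </body></html>")
result = extract(page)
```

The match relies on the positional assumption stated above: the title always comes before the authors and the cost.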
44. Optionals and Disjunctions
( T )? = { T }τ such that | { T }τ | is 0 or 1
( T1 | T2 ) = T1 OR T2 = < { T1 }τ1 , { T2 }τ2 >τ
46. EXTRACT Problem
“Given a set of n pages pi = λ(T,xi) (1 ≤ i ≤ n), created
from some unknown template T and values {x1,...,xn},
deduce the template T and the values from the pages alone.”
It is an instance of the regular grammar inference problem
from positive examples. N.B. Inference “in the
limit” ( not Σ* ) with positive examples only is undecidable.
However, humans rarely have any ambiguity in picking the
right template from real template-generated web pages,
and the goal is to solve EXTRACT problem for real web
pages, i.e. produce the template and values that would be
considered correct by a human.
47. Tokens
A template-token is an occurrence of a token in a
template. Each page-token is created from either a
template-token or a value-token (a token coming from the data).
A token itself is different from its occurrences
Two page-tokens are said to have the same role iff
they are generated from the same template-token.
49. ExAlg Stages
Stage 1 (ECGM: Equivalence Class Generation Module):
discover tokens associated with the same type
constructor by finding equivalence classes
Occurrence Vector : The occurrence-vector of a token t, is defined as
the vector < f1, ..., fn >, where fi is the number of occurrences of t in pi.
Equivalence Class : An equivalence class is a maximal set of tokens
having the same occurrence-vector
Stage 2 (Analysis Module): using the above set to
deduce the template
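The two definitions used by Stage 1 can be illustrated with a short sketch. It assumes each page has already been split into a list of tokens; the function names and the toy pages are made up for this example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Page = List[str]
OccurrenceVector = Tuple[int, ...]


def occurrence_vectors(pages: List[Page]) -> Dict[str, OccurrenceVector]:
    """Map each token t to <f1, ..., fn>, where fi is the count of t in page i."""
    vocabulary = {token for page in pages for token in page}
    return {t: tuple(page.count(t) for page in pages) for t in vocabulary}


def equivalence_classes(pages: List[Page]) -> Dict[OccurrenceVector, Set[str]]:
    """Group tokens by occurrence vector; each group is a maximal set, i.e. an EQ."""
    groups: Dict[OccurrenceVector, Set[str]] = defaultdict(set)
    for token, ov in occurrence_vectors(pages).items():
        groups[ov].add(token)
    return dict(groups)


# Toy input: two already-tokenized pages (hypothetical data).
pages = [
    ["<b>", "Book", "</b>", "Dune", "<b>", "Cost", "</b>", "9"],
    ["<b>", "Book", "</b>", "Emma", "<b>", "Cost", "</b>", "7"],
]
eqs = equivalence_classes(pages)
```

Here the template tokens `<b>` and `</b>` share the vector <2,2>, `Book` and `Cost` share <1,1>, and the data tokens fall into classes whose vectors contain zeros.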
52. LFEQs
Large and Frequently occurring EQuivalence classes
Large: high number of tokens ( > SizeThres )
Frequently occurring: whose tokens occur in a large
number of input pages ( > SupThres )
Intuition : Almost always, LFEQs are formed by tokens
associated with the same type constructor in the
(unknown) template used to create the input pages
53. Towards FindEq
Tokens associated with the same type constructor
tend to occur in the same equivalence class.
A token has a unique role if all the occurrences of
the token in the pages are generated by a single
template-token.
Tokens associated with the same type constructor τj
in T that have unique roles occur in the same
equivalence class.
54. Valid EQs
If all tokens of an equivalence class ε have unique
roles and are associated with the same type
constructor τj of S, ExAlg assumes that ε is
derived from τj
Only in this case the equivalence class is valid
( You don’t know the type constructors ex ante,
ExAlg has to find them )
55. Assumption
For real pages, an equivalence class of large size and
support is usually valid (*)
Size: # of tokens in the Equivalence class
Support (of a token): # of pages in which the
token occurs.
E.g. for a token with o.v. <0,1,1,1,0,0,2> the support is 4
Tokens generated by value-tokens usually occur
infrequently and therefore do not occur in LFEQs
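Size and support can be computed directly from the occurrence vectors. The sketch below assumes equivalence classes are given as a mapping from occurrence vector to token set, and it uses strict "greater than" comparisons for both thresholds; the function names and the sample classes are made up for this example.

```python
from typing import Dict, Set, Tuple

OccurrenceVector = Tuple[int, ...]


def support(ov: OccurrenceVector) -> int:
    """Number of pages in which the token occurs (non-zero entries of its vector)."""
    return sum(1 for count in ov if count > 0)


def find_lfeqs(
    eq_classes: Dict[OccurrenceVector, Set[str]],
    size_thres: int,
    sup_thres: int,
) -> Dict[OccurrenceVector, Set[str]]:
    """Keep only the large and frequently occurring equivalence classes."""
    return {
        ov: tokens
        for ov, tokens in eq_classes.items()
        if len(tokens) > size_thres and support(ov) > sup_thres
    }


# Hypothetical equivalence classes over three pages.
eq_classes = {
    (1, 1, 1): {"<html>", "<body>", "</body>"},   # size 3, support 3
    (1, 0, 0): {"Dune"},                          # size 1, support 1
}
lfeqs = find_lfeqs(eq_classes, size_thres=2, sup_thres=2)
```

With thresholds of 2, only the class with vector <1,1,1> survives; the data token's class is dropped for both its size and its support.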
56. Deducing template
Since typically LFEQs consist only of tokens associated
with the same type constructor in the (unknown) input
template, ExAlg uses LFEQs to deduce the template
and the schema
Two main obstacles
Base assumption (*) is heuristic ( no guarantee )
Even if an LFEQ is valid, it contains only tokens with
unique roles, and therefore carries only partial
information about the template used to generate
the pages
57. EQs Properties
An equivalence class is ordered
The span of each occurrence of ε (with tokens t1, ..., tm) is
subdivided into (m-1) positions (Pos).
Posk is the sequence of page-tokens that may occur between tk
and tk+1
58. EQs Properties
A pair of equivalence classes ε1 and ε2 is
nested if the span of all occurrences of ε2 is
within the Pos(p) of some occurrence of ε1
A set of equivalence classes {ε1,...,εn} is
nested if every pair of equivalence classes
of the set is nested.
60. HandInv
Handling invalid LFEQs, if any:
Typically, invalid LFEQs are either not ordered or
not nested with respect to other LFEQs.
HandInv detects the existence of invalid LFEQs
using violations of the ordered and nesting properties
It discards some of the LFEQs completely and
breaks others into smaller LFEQs
Output: ordered set of nested (with high
probability valid) LFEQs
62. Differentiating roles
Typically an LFEQ only contains tokens with unique roles
so not all template-tokens can be discovered using LFEQs alone
“Differentiating roles of tokens” is used in EXALG to
discover a greater number of template-tokens
Different token roles are identified through different contexts:
occurrences of a token in different contexts
necessarily have different roles
64. Differentiating roles: First Technique
This technique uses the html formatting of input pages
An occurrence-path of a page-token is the path
from the root to the page token, in the parse-tree
In practice, two page-tokens with different
occurrence path have different roles
Equivalently, all page-tokens generated by a template-token
have the same occurrence-path
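The occurrence-path idea can be sketched with Python's standard `html.parser`. This is an illustration under assumptions: the class name `PathTokenizer`, the "/"-joined path format, and splitting text on whitespace are choices made here, and the sketch assumes every opened tag is explicitly closed (no unclosed void elements such as a bare `<br>`).

```python
from html.parser import HTMLParser
from typing import List, Tuple


class PathTokenizer(HTMLParser):
    """Collect (token, occurrence-path) pairs; the path is the stack of open tags."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.tokens: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.tokens.append((f"<{tag}>", "/".join(self.stack)))

    def handle_endtag(self, tag):
        self.tokens.append((f"</{tag}>", "/".join(self.stack)))
        if self.stack and self.stack[-1] == tag:   # assumes properly nested tags
            self.stack.pop()

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():                  # each word is one token
            self.tokens.append((word, path))


parser = PathTokenizer()
parser.feed("<html><body><b>Name</b><i>Name</i></body></html>")
parser.close()

# The same word under different paths is treated as having different roles.
name_paths = [path for token, path in parser.tokens if token == "Name"]
```

In this toy page the word `Name` occurs once under `html/body/b` and once under `html/body/i`, so the two occurrences would be differentiated.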
65. Differentiating roles: Second Technique
Use valid equivalence classes ε
The role of an occurrence of a token t outside the
span of any occurrence of ε is different from the role
of an occurrence within the span of some occurrence
of ε
Further, the role of an occurrence of t within the
Pos(l) of some occurrence of ε, is different from the
role of an occurrence of t within the Pos(m) of some
occurrence of ε, with l≠m
66. DTokens
Different roles mean different contexts
Def. dtoken: refers to both a token and a
context, identified by differentiation
A dtoken is a token in a given context
EXALG aims to work with equivalence classes
defined using dtokens
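68. Schema, Template and ExAlg steps (diagram, with dtokens)
Legend: β = basic type; < .., .., ..>τ = tuple; { T }τ = set; τ = type constructor
Page generation: an instance of the Schema is encoded with the (unknown) Template, λ(TS,x), producing the input pages
Steps applied to the input pages:
(1) Differentiating token roles using the parse-tree path
Find LFEQs based on τ-association and token roles
Handle invalid LFEQs (either not ordered or not correctly nested)
(2) Differentiating token roles using equivalence classes
Result: dtokens, from which the (constructed) Template is built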
69. ECGM Summary
Input: a set of pages P
Output: a set of LFEQs (made of dtokens)
and pages P represented as string of dtokens
NB: dtokens are marked
70. Modules
DiffForm sub-module differentiates roles of tokens
in P (first technique)
Sub-modules FindEq, HandInv and DiffEq iterate in
a loop, until no new dtokens are formed
FindEq needs two parameters: SizeThres and
SupThres. Equivalence classes with size and
support greater than SizeThres and SupThres
are considered LFEQs.
73. PosString(ε,p)
For any EQ ε, Def. Pos(p) is ‘empty’ if dtokens dp
and dp+1 always occur contiguously.
An equivalence class is defined to be ‘empty’ if all
of its Pos() are empty.
For an occurrence of EQ ε and a non-empty Pos(p)
of ε, PosString(ε,p) is the concatenation of the tokens
and EQs ε’ in Pos(p), but not the EQs within the span of ε’.
74. Construct Template
For every non-empty EQ εi, ConstTemp recursively
constructs a template, Tεi, and the template Tεi,p,
corresponding to each non-empty Pos p of εi.
Its output is the template for the root-EQ.
The root-EQ occurrence vector is always <1,1,...,1>;
this is ensured by prepending and appending more than
SizeThres dummy tokens to the beginning
and end of each page.
75. Template Tε
Tεi = < Tεis1, ..., Tεisn ><Ci1,Ci2,...,Cin+1>
To construct the template Tεi,p ConstTemp checks if the set of
strings PosString(εi,p) has some recognizable pattern
Patterns used by the ConstTemp prototype:
#  Pattern                            Tεi,p
1  εj εj ...                          { Tεj }ε
2  εj S εj S ... εj                   { Tεj }S
3  εj OR εk                           Tεj | Tεk
4  ε OR εj                            (Tεj)?
5  string of dtokens and empty EQs    Tβ
6  -unknown-                          Tβ
(in rows 1 and 4, a bare ε denotes the empty string)
78. END