ExAlg Overview

Overview of

ExAlg
Extracting Structured Data from Web Pages

Original work by
Arvind Arasu & Hector Garçia-Molina
{arvinda,hector}@cs.stanford.edu
Presentation by
Alessandro Manfredi & Marco Bontempi & Marco Giannone
{a.n0on3,bontempi,marco.giannone}@gmail.com

Introduction

WWW: billions of unstructured HTML pages

Many dynamically generated from underlying

structured source (e.g. relational database)

Information is hard to query

The goal
Full automatic extraction of structured data

... in order to pose complex queries over data

Semantic meaning for extracted data is not an aim

Avoid human input ( error prone and time consuming,
especially for frequent template changes )

Challenges

Distinguish template from data

Syntactically, there is nothing that tells
text part of the template apart from data

Deduce the schema for encoded information:

Complex schemas, arbitrary levels nesting

Data
Structure conformed to a common Schema or Type

Types:

β (Basic Type: string of tokens)

Token: word or HTML tag ( or bit, or char )

Tuple < T₁,T₂,...,Tn >, where element ei is an instance
of type Ti

Set { T }, where all elements are instance of type T

Template
Function that maps each type constructor τ
of S into an ordered set of strings T(τ), e.g.
A <html><body> Book :
B by <html><body> Book :
C Cost : C Programming Language
D </body></html> by Brian Kernighan and
E,G ε (empty string)
Dennis Ritchie Cost :
F _ (space)
$30.00 </body></html>
H and

x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>
λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD
where t is ‘title’, f is ‘ﬁrst name’, l is ‘last name’ and c is ‘cost’

Token Occurrence Vectors

OV = <1,1,1,1>

Token Occurrence Vectors

OV = <1,2,1,0>

Equivalence classes

OV = <1,2,1,0>
OV = <1,2,1,0>
OV = <1,2,1,0>
OV = <1,2,1,0>

Assumption
For real pages, an equivalence class of large size and
support is usually derived from the same template
portion

Size: # of tokens in the Equivalence class

Support (of a token); # of pages in which the
token occurs. ( where OV is not 0 )

Tokens generated by data usually occur infrequently
and therefore do not occurr in an EQ that occur very
often (well ...)

EQs abstraction

 Reviewer Name John 

 Reviewer Name β

LFEQs
Equivalence Class : A maximal set of tokens having the
same occurrence-vector

LFEQ: Large and Frequently occurring EQuivalence
classes

Large = Size ≥ SizeThres, F(o) = Support ≥ SupThres

SizeThres & SupThres are user-given parameters

Intuition : Almost always, LFEQs are formed by tokens
associated with the same type constructor in the
template used to create the input pages

ExAlg: Modules
Stage 1 Stage 2

DiffFormat
( Differentiate Roles using format )
ConstTemp
( Construct Template )
Input
FindEquiv
( Find Equivalence Classes )

HandleInv
( Handle Invalid Equivalence
Classes ) Template,
ExVal
( Extract Values ) Schema,
DiffEq Values
( Differentiate Roles Using Eq
Classes )

Schema β

Type < .., .., ..>τ

τ : Type Constructor { T }τ
Differentiating token
roles using parse-tree
Instance path (1)
of Schema

Encoding Find LFEQs based on
τ-association and
λ(TS,x) Input token roles
Pages
Handle Invalid LFEQs
(unknown) either not ordered or
Template not correctly nested

LFEQs &
(constructed) Differentiating token
pages with
Template roles using Equivalence
differentiated
Classes (2)
tokens

Template Construction

Recursively consider the EQ ‘e’ between consecutive
tokens i and j from starting and trailing empty strings

Def. e ‘empty’ iff i and j always occur contiguously

if e is empty i and j are single template tokens

if e is not empty there are either basic types (coming
from database) or a more complex unknown type

Data Extraction

Deduced the template, data extraction is
trivial and is therefore not covered

Heuristic Assumption
Summary
A1 - A large number of template-tokens have unique
roles, allowing the formation of EQ and subsequent
differentiation

A2 - A large number of tokens derive from type
constructors. Each τ is instantiated many times in Ps,
so LFEQs can be recognized

A3 - There is no ‘regularity’ in encoded data that
leads to the formation of invalid EQs

A4 - There are separators around data values

Experiments

ExAlg has been tested on both input pages sets
that ﬁts the previous assumptions or not.

Input collections has been collected from RISE
repository, from related works sites such as
ROADRUNNER and IEPAD, and from various well-
known sites i.e. eBay, DBLP, Google, and Sigmod
Anthology.

Procedure

Manually generate the schemas

No values encoded in tag attributes has been
considered (e.g. urls, images src, etc.)

Aggregated data has been considered a single
token (e.g. dates such as “Nov 8 2008”)

Evaluation
For each leaf, the system behave:

Correct , if the extracted value is the same
extracted by the manually defined schema

Partially Correct , if the extracted value match
two or more leaf in the manually defined schema

Incorrect , if the extracted value is neither
correct nor partially correct. I.e. the extracted
value is a portion of the same in manually defined
schema or another misaligned form

Evaluation
Correct

Correct

Partially Correct

Results

Holding A3, none of the schema attributes
are classified as incorrect

Holding also A1 and A2, all of the schema
attribute are classified as correct

Partially correct classified elements can be
refined with very small human interaction.

Some Numbers

Acc. = Accuracy of the system for all of the collections

Norm. Acc. = Normalized Accuracy respect to the # of
attributes for each collection.

Conclusions

EQ and Token Roles to discover the template

Limited Failures when missing some assumptions

Open Problems
Lower bounds for complexity of some modules
algorithms are not given.

Failure for some given conﬁgurations

eg

<ol> <li>β</li> <li>β</li> <li>β</li> </ol>

<ol> <li>β</li> <li>β</li> </ol>

... does not belong to the same EQ, so the list
elements (all foreach list) are extracted as
different attributes.

Deﬁnitions
It’s a long way to the top
( if you want rock ‘n roll )

Recall on Structured
Data
Conformed to a common Schema or Type

Types:

β (Basic Type: string of tokens)

Token: word or HTML tag ( or bit, or char )

Tuple < T₁,T₂,...,Tn >, where element ei is an instance
of type Ti

Set { T }, where all elements are instance of type T

Example
Set of pages describing books: each page contains book
title, set of authors ( ﬁrst and last name ), cost of the
book

Schema of the encoded data:

S₁ = < β , { <β,β>t₃ }t₂ , β >t₁

t₁ and t₃ are tuple constructors, t₂ is a set constructor

Schemas and Trees
< > t₁

β β
{ }t₂ S₁ = < β , { <β,β>t₃ }t₂ , β >t₁

Sub-Trees are also schemas,
< > t₃ called sub-schema of the
original schema

β β

Recall: Template
Function that maps each type constructor τ
of S into an ordered set of strings T(τ), e.g.
A <html><body> Book :
T₁(τ₁) = <A,B,C,D>
B by
C Cost : T₁(τ₂) = H
D </body></html> T₁(τ₃) = <E,F,G>
E,G ε (empty string)
F _ (space)
H and {A..H} are strings
x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>
λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD
where t is ‘title’, f is ‘ﬁrst name’, l is ‘last name’ and c is ‘cost’
<html><body> Book : C Programming Language by Brian Kernighan and Dennis Ritchie
Cost :$30.00 </body></html>

Encoding
The encoding λ(T,x) of an instance x of a schema S,
given a template TS is deﬁned recursively in terms of
encoding of sub-values of x, e.g.:

if x is a basic type β, λ(T,x) = x

elif x is a tuple < x1, ..., xn >τt,

λ(T,x) = < C1 , λ(T,x1), ..., Cn , λ(T,xn) >

elif x is a set { e1, ..., em }τs,

λ(T,x) = < ς λ(T,e1) ς ... ς λ(T,em) >

T(τt) = < C1 , ..., Cn+1>, T(τs) = ς

Simpler (inﬁx) notation
A <html><body> Book : Template:
B by
< A * B { < E * F * G > }H C * D >
C Cost :
D </body></html>
E,G ε (empty string) Where
F _ (space) * is a wildcard for β (Basic Type)
H and { § }H is ‘ § (H §)+ ’

Assumption: the same values in a tuple occur in the same relative position
with respect to the values of other attributes in all the pages. (e.g. the
book name always occurs before the list of the authors and the price)

Optionals and Disjunctions

( T )? = { T }τ such as | { T }τ | is 0 or 1

( T1 | T2 ) = T1 OR T2 = < { T1 }τ1 , { T2 }τ2 >τ

Problem Model &
Formulation

EXTRACT Problem
“ Given a set of n pages pi = λ(T,xi) (1 ≤ i ≤ n), created
from some unknown template T and values {x1,...,xn}
deduce the template T and the values alone. “

It is an instance of regular grammar inference problem
from positive examples. N.B. Inference “in the
limit” ( not Σ* ) with positive example is undecidable.

However, humans rarely have any ambiguity in picking the
right template from real template-generated web pages,
and the goal is to solve EXTRACT problem for real web
pages, i.e. produce the template and values that would be
considered correct by a human.

Tokens
A template-token is an occurrence of a token in a
template. Each page token is is created from either a
template-token or a page-token.

A token itself is different from its occurences

Two page-tokens are said to have the same role iff
they are generated from the same template-token.

ExAlg Stages
Stage 1 (ECGM: Equivalence Class Generation Module):
discover tokens associated with the same type
constructor ﬁnding equivalence classes
Occurrence Vector : The occurrence-vector of a token t, is deﬁned as
the vector < f1, ..., fn >, where fi is the number of occurrences of t in pi.

Equivalence Class : An equivalence class is a maximal set of tokens
having the same occurrence-vector

Stage 2 (Analysis Module): using the above set to
deduce the template

LFEQs
Large and Frequently occurring EQuivalence classes

Large: high number of tokens ( gt SizeThres )

Frequently occurring: whose tokens occur in a large
number of input pages ( gt SupThres )

Intuition : Almost always, LFEQs are formed by tokens
associated with the same type constructor in the
(unknown) template used to create the input pages

Towards FindEq
Token associated with the same type constructor
tend to occur in the same equivalence class.

A token has a unique role if all the occurrences of
the token in the pages, is generated by a single
template-token.

Tokens associated with the same type constructor τj
in T that have unique-roles occurs in the same
equivalence class.

Valid EQs
If all tokens of an equivalence class ε have unique
roles and are associated with the same type
constructor τj of S, ExAlg assume that ε is
derived from τj

Only in this case the equivalence class is valid

( You don’t know the type constructors ex ante,
ExAlg has to ﬁnd them )

Assumption
For real pages, an equivalence class of large size and
support is usually valid (*)

Size: # of tokens in the Equivalence class

Support (of a token); # of pages in which the
token occurs.
E.g. for token with o.v. <0,1,1,1,0,0,2> support is 4

Tokens generated by value-tokens usually occur
infrequently and therefore do not occur in a LFEQs

Deducing template
Since typically LFEQs consist only of tokens associated
with the same type constructor in the (unknown) input
template, ExAlg uses LFEQs to deduce the template
and the schema

Two main obstacles

Base assumption(*) is heuristic ( no guarantee )

Even if an LFEQ is valid, its tokens have unique
roles, and therefore only contains partially
information about the template used to generate
the pages

EQs Properties

An equivalence class is ordered

The span of each occurrence of ε is sub-
divided into (m-1) positions (Pos).

Posk is page-token that may occur between tk
and tk+1

EQs Properties

A pair of equivalence classes ε1 and ε2 is
nested if the span of all occurrences of ε2 is
within the Pos(p) of some occurrence of ε1

A set of equivalence classes {ε1,...,εn} is
nested if every pair of equivalence classes
of the set is nested.

Invalid EQs
Stage 1 Stage 2

DiffFormat
ConstTemp
Input
FindEquiv

HandleInv
Classes ) Template,
ExVal
DiffEq Values
Classes )

HandInv
Handling invalid LFEQs, if any:

Typically, invalid LFEQs are either not ordered or
not nested with respect to other LFEQs.

HandInv detects the existence of invalid LFEQs
using violations of ordered and nesting properites

It discards some of the LFEQs completely and
breaks others into smaller LFEQs

Outuput: ordered set of nested (with high
probability valid) LFEQs

Differentiating roles

Typically an LFEQ only contains tokens with unique roles

template-tokens can’t be discovered just using LFEQs

“Differentiating roles of tokens” is used in EXALG to
discover a greater number of template-tokens

Different token roles are driven by different contexts
such that the occurrences of a token in different
contexts above necessarily have different roles

Differentiating roles:
First Technique
This technique uses the html formatting of input pages

An occurrence-path of a page-token is the path
from the root to the page token, in the parse-tree

In practice, two page-tokens with different
occurrence path have different roles

Equivalently, all page-token generated by a template
token have the same occurrence-path

Differentiating roles:
Second Technique
Use valid equivalence classes ε

The role of an occurrence of a token t outside the
span of any occurence of ε is different from the role
of an occurence within the span of some occurrence
of ε

Further, the role of an occurrence of t within the
Pos(l) of some occurrence of ε, is different from the
role of an occurrence of t within the Pos(m) of some
occurrence of ε, with l≠m

DTokens

Different roles mean different contexts

Def. dtoken: refer to both a token and a
context, identiﬁed by differentiation

A dtoken is a token in a given context

EXALG aims to work with equivalence classes
deﬁned using dtokens

Schema β

Type < .., .., ..>τ

τ : Type Constructor { T }τ
Differentiating token
roles using parse-tree
Instance path (1)
of Schema

Encoding Find LFEQs based on
τ-association and
λ(TS,x) Input token roles
Pages
Handle Invalid LFEQs
(unknown) either not ordered or
Template not correctly nested

(constructed) Differentiating token
Template dtokens roles using Equivalence
Classes (2)

ECGM Summary

Input: a set of pages P

Output: a set of LFEQs (made of dtonkens)
and pages P represented as string of dtokens

NB: dtokens are marked

Modules
DiffForm sub-module differentiates roles of tokens
in P (ﬁrst technique)

Sub-modules FindEq, HandInv and DiffEq iterate in
a loop, until no new dtokens are formed

FindEq need two parameters: SizeThres and
SupThres. Equivalence classes with size and
support greater than SizeThres and SupThres
are considered LFEQs.

Modules
Stage 1 Stage 2

DiffFormat
ConstTemp
Input
FindEquiv

HandleInv
Classes ) Template,
ExVal
DiffEq Values
Classes )

PosString(ε,p)

For any EQs ε, Def. Pos(p) ‘empty’ if dtokens dp
and dp+1 always occur contiguously.

An equivalence class is deﬁned to be ‘empty’ if all
of its Pos() are empty.

For an occurrence of EQ ε and non-empty POS(p)
of ε, PosString(ε,p) is the concatenation of tokens
and EQs ε’ in Pos(P) but not EQs in the span of ε’.

Construct Template
For every non-empty EQ εi, ConstTemp recursively
constructs a template, Tεi, and the template Tεi,p,
corresponding to each non-empty Pos p of εi.

Its output is the template for the root-EQ.

The root-EQ occurrence vector is always <1,1,⋄⋄⋄>,
at least prepending and appending greater than
SizeThres number of dummy tokens to beginning
and end of each page.

Template Tε
Tεi = < Tεis1, ..., Tεisn ><Ci1,Ci2,...,Cin+1>
To construct the template Tεi,p ConstTemp checks if the set of
strings PosString(εi,p) has some recognizable pattern
Pattern used by ConstTemp prototype:
Pattern Tεi,p
1 εj εj ... { T εj }∈
2 εj S εj S ... εj { T εj }S
3 εj OR εk Tεj | Tεk
4 ∈ OR εj (Tεj)?
string of dtokens and
5 Tβ
empty EQs
6 -unknown- Tβ

Original work

http://infolab.stanford.edu/~arvind/papers/
extract-sigmod03.pdf

END

Overview of ExAlg
Extracting Structured Data from Web Pages
Original work by
Arvind Arasu & Hector Garçia-Molina
{arvinda,hector}@cs.stanford.edu
Presentation by
Alessandro Manfredi & Marco Bontempi & Marco Giannone
{a.n0on3,bontempi,marco.giannone}@gmail.com

ExAlg Overview

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a ExAlg Overview

Semelhante a ExAlg Overview (20)

Mais de Alessandro Manfredi

Mais de Alessandro Manfredi (9)

Último

Último (20)

ExAlg Overview

Notas do Editor