SlideShare uma empresa Scribd logo
1 de 78
Overview of

                     ExAlg
Extracting Structured Data from Web Pages


                       Original work by
             Arvind Arasu & Hector Garçia-Molina
               {arvinda,hector}@cs.stanford.edu
                        Presentation by
   Alessandro Manfredi & Marco Bontempi & Marco Giannone
         {a.n0on3,bontempi,marco.giannone}@gmail.com
Introduction

WWW: billions of unstructured HTML pages


Many dynamically generated from underlying

structured source (e.g. relational database)


Information is hard to query
Real pages
Pages Structure
The goal
Full automatic extraction of structured data

... in order to pose complex queries over data



Semantic meaning for extracted data is not an aim

Avoid human input ( error prone and time consuming,
especially for frequent template changes )
Challenges

Distinguish template from data

  Syntactically, there is nothing that tells
  text part of the template apart from data



Deduce the schema for encoded information:

  Complex schemas, arbitrary levels nesting
Page generation
Data
Structure conformed to a common Schema or Type

Types:

  β (Basic Type: string of tokens)

     Token: word or HTML tag ( or bit, or char )

  Tuple < T₁,T₂,...,Tn >, where element ei is an instance
  of type Ti

  Set { T }, where all elements are instance of type T
Template
      Function that maps each type constructor τ
      of S into an ordered set of strings T(τ), e.g.
A      <html><body><b> Book :</b>
B                  by                     <html><body><b> Book :</b>
C            <b>Cost :</b>                 C Programming Language
D           </body></html>                  by Brian Kernighan and
E,G         ε (empty string)
                                          Dennis Ritchie <b>Cost :</b>
F               _ (space)
                                            $30.00 </body></html>
H                  and


                   x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>
       λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD
        where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’
Values to extract
ExAlg aim to ...
Template extracted
Roles
  Different Roles!
Token Occurrence Vectors


          OV = <1,1,1,1>
Token Occurrence Vectors



       OV = <1,2,1,0>
Equivalence classes


    OV = <1,2,1,0>
       OV = <1,2,1,0>
           OV = <1,2,1,0>
              OV = <1,2,1,0>
Assumption
For real pages, an equivalence class of large size and
support is usually derived from the same template
portion

  Size: # of tokens in the Equivalence class

  Support (of a token); # of pages in which the
  token occurs. ( where OV is not 0 )

Tokens generated by data usually occur infrequently
and therefore do not occurr in an EQ that occur very
often (well ...)
EQs abstraction


<b> Reviewer Name </b> <em> John </em>




<b> Reviewer Name </b> <em>   β   </em>
LFEQs
Equivalence Class : A maximal set of tokens having the
same occurrence-vector

LFEQ: Large and Frequently occurring EQuivalence
classes

Large = Size ≥ SizeThres, F(o) = Support ≥ SupThres

SizeThres & SupThres are user-given parameters

Intuition : Almost always, LFEQs are formed by tokens
associated with the same type constructor in the
template used to create the input pages
How ExAlg Works
ExAlg: Modules
                   Stage 1                       Stage 2

                  DiffFormat
        ( Differentiate Roles using format )
                                                  ConstTemp
                                               ( Construct Template )
Input
                   FindEquiv
            ( Find Equivalence Classes )



                  HandleInv
           ( Handle Invalid Equivalence
                    Classes )                                           Template,
                                                      ExVal
                                                 ( Extract Values )     Schema,
                     DiffEq                                              Values
          ( Differentiate Roles Using Eq
                     Classes )
Schema                              β

      Type                          < .., .., ..>τ

      τ : Type Constructor                 { T }τ
                                                      Differentiating token
                                                     roles using parse-tree
                Instance                                     path (1)
               of Schema

Encoding                                             Find LFEQs based on
                                                       τ-association and
             λ(TS,x)            Input                     token roles
                                Pages
                                                     Handle Invalid LFEQs
            (unknown)                                either not ordered or
             Template                                not correctly nested

                                LFEQs &
           (constructed)                              Differentiating token
                               pages with
            Template                                 roles using Equivalence
                             differentiated
                                                           Classes (2)
                                 tokens
Template Construction

Recursively consider the EQ ‘e’ between consecutive
tokens i and j from starting and trailing empty strings

Def. e ‘empty’ iff i and j always occur contiguously

if e is empty i and j are single template tokens

if e is not empty there are either basic types (coming
from database) or a more complex unknown type
Data Extraction



Deduced the template, data extraction is
trivial and is therefore not covered
Heuristic Assumption
          Summary
A1 - A large number of template-tokens have unique
roles, allowing the formation of EQ and subsequent
differentiation

A2 - A large number of tokens derive from type
constructors. Each τ is instantiated many times in Ps,
so LFEQs can be recognized

A3 - There is no ‘regularity’ in encoded data that
leads to the formation of invalid EQs

A4 - There are separators around data values
Experiments

ExAlg has been tested on both input pages sets
that fits the previous assumptions or not.

Input collections has been collected from RISE
repository, from related works sites such as
ROADRUNNER and IEPAD, and from various well-
known sites i.e. eBay, DBLP, Google, and Sigmod
Anthology.
Procedure

Manually generate the schemas

  No values encoded in tag attributes has been
  considered (e.g. urls, images src, etc.)

  Aggregated data has been considered a single
  token (e.g. dates such as “Nov 8 2008”)
Evaluation
For each leaf, the system behave:

  Correct , if the extracted value is the same
  extracted by the manually defined schema

  Partially Correct , if the extracted value match
  two or more leaf in the manually defined schema

  Incorrect , if the extracted value is neither
  correct nor partially correct. I.e. the extracted
  value is a portion of the same in manually defined
  schema or another misaligned form
Evaluation
        Correct




               Correct




       Partially Correct
Results

Holding A3, none of the schema attributes
are classified as incorrect

Holding also A1 and A2, all of the schema
attribute are classified as correct

Partially correct classified elements can be
refined with very small human interaction.
Heuristic Assumption
          Summary
A1 - A large number of template-tokens have unique
roles, allowing the formation of EQ and subsequent
differentiation

A2 - A large number of tokens derive from type
constructors. Each τ is instantiated many times in Ps,
so LFEQs can be recognized

A3 - There is no ‘regularity’ in encoded data that
leads to the formation of invalid EQs

A4 - There are separators around data values
Some Numbers




Acc. = Accuracy of the system for all of the collections

Norm. Acc. = Normalized Accuracy respect to the # of
attributes for each collection.
Conclusions


EQ and Token Roles to discover the template


Limited Failures when missing some assumptions
Open Problems
Lower bounds for complexity of some modules
algorithms are not given.

Failure for some given configurations

  eg

       <ol> <li>β</li> <li>β</li> <li>β</li> </ol>

       <ol> <li>β</li> <li>β</li> </ol>

... does not belong to the same EQ, so the list
elements (all foreach list) are extracted as
different attributes.
More Details
Definitions
It’s a long way to the top
( if you want rock ‘n roll )
Recall on Structured
            Data
Conformed to a common Schema or Type

Types:

  β (Basic Type: string of tokens)

     Token: word or HTML tag ( or bit, or char )

  Tuple < T₁,T₂,...,Tn >, where element ei is an instance
  of type Ti

  Set { T }, where all elements are instance of type T
Example
Set of pages describing books: each page contains book
title, set of authors ( first and last name ), cost of the
book

Schema of the encoded data:

              S₁ = < β , { <β,β>t₃ }t₂ , β >t₁



t₁ and t₃ are tuple constructors, t₂ is a set constructor
Schemas and Trees
    < > t₁


β            β
    { }t₂        S₁ = < β , { <β,β>t₃ }t₂ , β >t₁

                 Sub-Trees are also schemas,
    < > t₃       called sub-schema of the
                 original schema

β            β
Recall: Template
          Function that maps each type constructor τ
          of S into an ordered set of strings T(τ), e.g.
     A       <html><body><b> Book :</b>
                                                       T₁(τ₁) = <A,B,C,D>
     B                    by
     C              <b>Cost :</b>                          T₁(τ₂) = H
     D             </body></html>                       T₁(τ₃) = <E,F,G>
    E,G            ε (empty string)
     F                 _ (space)
     H                   and                           {A..H} are strings
                       x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c>
           λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD
            where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’
<html><body><b> Book :</b> C Programming Language by Brian Kernighan and Dennis Ritchie
                           <b>Cost :</b>$30.00 </body></html>
Encoding
The encoding λ(T,x) of an instance x of a schema S,
given a template TS is defined recursively in terms of
encoding of sub-values of x, e.g.:

  if x is a basic type β, λ(T,x) = x

     elif x is a tuple < x1, ..., xn >τt,

        λ(T,x) = < C1 , λ(T,x1), ..., Cn , λ(T,xn) >

     elif x is a set { e1, ..., em }τs,

        λ(T,x) = < ς λ(T,e1) ς ... ς λ(T,em) >

  T(τt) = < C1 , ..., Cn+1>, T(τs) = ς
Page generation
Simpler (infix) notation
   A     <html><body><b> Book :</b>   Template:
   B                by
                                      < A * B { < E * F * G > }H C * D >
   C           <b>Cost :</b>
   D          </body></html>
  E,G         ε (empty string)        Where
   F             _ (space)            * is a wildcard for β (Basic Type)
   H                and               { § }H is ‘ § (H §)+ ’

Assumption: the same values in a tuple occur in the same relative position
with respect to the values of other attributes in all the pages. (e.g. the
book name always occurs before the list of the authors and the price)
Optionals and Disjunctions


  ( T )? = { T }τ such as | { T }τ | is 0 or 1



  ( T1 | T2 ) = T1 OR T2 = < { T1 }τ1 , { T2 }τ2 >τ
Problem Model &
   Formulation
EXTRACT Problem
“ Given a set of n pages pi = λ(T,xi) (1 ≤ i ≤ n), created
from some unknown template T and values {x1,...,xn}
deduce the template T and the values alone. “

  It is an instance of regular grammar inference problem
  from positive examples. N.B. Inference “in the
  limit” ( not Σ* ) with positive example is undecidable.

However, humans rarely have any ambiguity in picking the
right template from real template-generated web pages,
and the goal is to solve EXTRACT problem for real web
pages, i.e. produce the template and values that would be
considered correct by a human.
Tokens
A template-token is an occurrence of a token in a
template. Each page token is is created from either a
template-token or a page-token.

A token itself is different from its occurences

Two page-tokens are said to have the same role iff
they are generated from the same template-token.
What the Algorithm
      Does
ExAlg Stages
Stage 1 (ECGM: Equivalence Class Generation Module):
discover tokens associated with the same type
constructor finding equivalence classes
  Occurrence Vector : The occurrence-vector of a token t, is defined as
  the vector < f1, ..., fn >, where fi is the number of occurrences of t in pi.

  Equivalence Class : An equivalence class is a maximal set of tokens
  having the same occurrence-vector

Stage 2 (Analysis Module): using the above set to
deduce the template
ExAlg: Modules
                   Stage 1                       Stage 2

                  DiffFormat
        ( Differentiate Roles using format )
                                                  ConstTemp
                                               ( Construct Template )
Input
                   FindEquiv
            ( Find Equivalence Classes )



                  HandleInv
           ( Handle Invalid Equivalence
                    Classes )                                           Template,
                                                      ExVal
                                                 ( Extract Values )     Schema,
                     DiffEq                                              Values
          ( Differentiate Roles Using Eq
                     Classes )
Stage 1
LFEQs
Large and Frequently occurring EQuivalence classes

  Large: high number of tokens ( gt SizeThres )

  Frequently occurring: whose tokens occur in a large
  number of input pages ( gt SupThres )

Intuition : Almost always, LFEQs are formed by tokens
associated with the same type constructor in the
(unknown) template used to create the input pages
Towards FindEq
Token associated with the same type constructor
tend to occur in the same equivalence class.

A token has a unique role if all the occurrences of
the token in the pages, is generated by a single
template-token.

Tokens associated with the same type constructor τj
in T that have unique-roles occurs in the same
equivalence class.
Valid EQs
If all tokens of an equivalence class ε have unique
roles and are associated with the same type
constructor τj of S, ExAlg assume that ε is
derived from τj

Only in this case the equivalence class is valid



( You don’t know the type constructors ex ante,
ExAlg has to find them )
Assumption
For real pages, an equivalence class of large size and
support is usually valid (*)

  Size: # of tokens in the Equivalence class

  Support (of a token); # of pages in which the
  token occurs.
     E.g. for token with o.v. <0,1,1,1,0,0,2> support is 4

Tokens generated by value-tokens usually occur
infrequently and therefore do not occur in a LFEQs
Deducing template
Since typically LFEQs consist only of tokens associated
with the same type constructor in the (unknown) input
template, ExAlg uses LFEQs to deduce the template
and the schema

Two main obstacles

  Base assumption(*) is heuristic ( no guarantee )

  Even if an LFEQ is valid, its tokens have unique
  roles, and therefore only contains partially
  information about the template used to generate
  the pages
EQs Properties


An equivalence class is ordered

  The span of each occurrence of ε is sub-
  divided into (m-1) positions (Pos).

  Posk is page-token that may occur between tk
  and tk+1
EQs Properties

A pair of equivalence classes ε1 and ε2 is
nested if the span of all occurrences of ε2 is
within the Pos(p) of some occurrence of ε1

  A set of equivalence classes {ε1,...,εn} is
  nested if every pair of equivalence classes
  of the set is nested.
Invalid EQs
                   Stage 1                       Stage 2

                  DiffFormat
        ( Differentiate Roles using format )
                                                  ConstTemp
                                               ( Construct Template )
Input
                   FindEquiv
            ( Find Equivalence Classes )



                  HandleInv
           ( Handle Invalid Equivalence
                    Classes )                                           Template,
                                                      ExVal
                                                 ( Extract Values )     Schema,
                     DiffEq                                              Values
          ( Differentiate Roles Using Eq
                     Classes )
HandInv
Handling invalid LFEQs, if any:

  Typically, invalid LFEQs are either not ordered or
  not nested with respect to other LFEQs.

  HandInv detects the existence of invalid LFEQs
  using violations of ordered and nesting properites

     It discards some of the LFEQs completely and
     breaks others into smaller LFEQs

  Outuput: ordered set of nested (with high
  probability valid) LFEQs
Core Concept:
 Token Roles
Differentiating roles

Typically an LFEQ only contains tokens with unique roles

  template-tokens can’t be discovered just using LFEQs

“Differentiating roles of tokens” is used in EXALG to
discover a greater number of template-tokens

Different token roles are driven by different contexts
such that the occurrences of a token in different
contexts above necessarily have different roles
Roles
  Different Roles!
Differentiating roles:
       First Technique
This technique uses the html formatting of input pages

  An occurrence-path of a page-token is the path
  from the root to the page token, in the parse-tree

  In practice, two page-tokens with different
  occurrence path have different roles

  Equivalently, all page-token generated by a template
  token have the same occurrence-path
Differentiating roles:
      Second Technique
Use valid equivalence classes ε

The role of an occurrence of a token t outside the
span of any occurence of ε is different from the role
of an occurence within the span of some occurrence
of ε

Further, the role of an occurrence of t within the
Pos(l) of some occurrence of ε, is different from the
role of an occurrence of t within the Pos(m) of some
occurrence of ε, with l≠m
DTokens

Different roles mean different contexts

Def. dtoken: refer to both a token and a
context, identified by differentiation

A dtoken is a token in a given context

EXALG aims to work with equivalence classes
defined using dtokens
Stage 1 Summary
Schema                          β

      Type                      < .., .., ..>τ

      τ : Type Constructor             { T }τ
                                                  Differentiating token
                                                 roles using parse-tree
                Instance                                 path (1)
               of Schema

Encoding                                         Find LFEQs based on
                                                   τ-association and
             λ(TS,x)          Input                   token roles
                              Pages
                                                 Handle Invalid LFEQs
            (unknown)                            either not ordered or
             Template                            not correctly nested


           (constructed)                          Differentiating token
            Template         dtokens             roles using Equivalence
                                                       Classes (2)
ECGM Summary

Input: a set of pages P

Output: a set of LFEQs (made of dtonkens)
and pages P represented as string of dtokens

  NB: dtokens are marked
Modules
DiffForm sub-module differentiates roles of tokens
in P (first technique)

Sub-modules FindEq, HandInv and DiffEq iterate in
a loop, until no new dtokens are formed

  FindEq need two parameters: SizeThres and
  SupThres. Equivalence classes with size and
  support greater than SizeThres and SupThres
  are considered LFEQs.
Modules
                   Stage 1                       Stage 2

                  DiffFormat
        ( Differentiate Roles using format )
                                                  ConstTemp
                                               ( Construct Template )
Input
                   FindEquiv
            ( Find Equivalence Classes )



                  HandleInv
           ( Handle Invalid Equivalence
                    Classes )                                           Template,
                                                      ExVal
                                                 ( Extract Values )     Schema,
                     DiffEq                                              Values
          ( Differentiate Roles Using Eq
                     Classes )
Stage 2
PosString(ε,p)

For any EQs ε, Def. Pos(p) ‘empty’ if dtokens dp
and dp+1 always occur contiguously.

An equivalence class is defined to be ‘empty’ if all
of its Pos() are empty.

For an occurrence of EQ ε and non-empty POS(p)
of ε, PosString(ε,p) is the concatenation of tokens
and EQs ε’ in Pos(P) but not EQs in the span of ε’.
Construct Template
For every non-empty EQ εi, ConstTemp recursively
constructs a template, Tεi, and the template Tεi,p,
corresponding to each non-empty Pos p of εi.

Its output is the template for the root-EQ.

The root-EQ occurrence vector is always <1,1,⋄⋄⋄>,
at least prepending and appending greater than
SizeThres number of dummy tokens to beginning
and end of each page.
Template Tε
        Tεi = < Tεis1, ..., Tεisn ><Ci1,Ci2,...,Cin+1>
To construct the template Tεi,p ConstTemp checks if the set of
    strings PosString(εi,p) has some recognizable pattern
              Pattern used by ConstTemp prototype:
                     Pattern                Tεi,p
          1            εj εj ...          { T εj }∈
          2        εj S εj S ... εj       { T εj }S
          3          εj OR εk             Tεj | Tεk
          4          ∈ OR εj               (Tεj)?
               string of dtokens and
          5                                  Tβ
                     empty EQs
          6         -unknown-                Tβ
Template extracted
Original work



http://infolab.stanford.edu/~arvind/papers/
extract-sigmod03.pdf
END

         Overview of ExAlg
Extracting Structured Data from Web Pages
                     Original work by
           Arvind Arasu & Hector Garçia-Molina
             {arvinda,hector}@cs.stanford.edu
                      Presentation by
 Alessandro Manfredi & Marco Bontempi & Marco Giannone
       {a.n0on3,bontempi,marco.giannone}@gmail.com

Mais conteúdo relacionado

Mais procurados

Theory of computing
Theory of computingTheory of computing
Theory of computing
Ranjan Kumar
 
Context free grammars
Context free grammarsContext free grammars
Context free grammars
Ronak Thakkar
 
Teknik Komunikasi Data Digital
Teknik Komunikasi Data DigitalTeknik Komunikasi Data Digital
Teknik Komunikasi Data Digital
guest995d750
 
Teori bahasa formal dan Otomata
Teori bahasa formal dan OtomataTeori bahasa formal dan Otomata
Teori bahasa formal dan Otomata
Risal Fahmi
 

Mais procurados (20)

Semantic analysis in Compiler Construction
Semantic analysis in Compiler ConstructionSemantic analysis in Compiler Construction
Semantic analysis in Compiler Construction
 
Lecture 5
Lecture 5Lecture 5
Lecture 5
 
Theory of computing
Theory of computingTheory of computing
Theory of computing
 
FInite Automata
FInite AutomataFInite Automata
FInite Automata
 
Lesson 08
Lesson 08Lesson 08
Lesson 08
 
Automata
AutomataAutomata
Automata
 
Language
LanguageLanguage
Language
 
Lesson 09
Lesson 09Lesson 09
Lesson 09
 
Lesson 11
Lesson 11Lesson 11
Lesson 11
 
technik kompilasi
technik kompilasitechnik kompilasi
technik kompilasi
 
Types of Load distributing algorithm in Distributed System
Types of Load distributing algorithm in Distributed SystemTypes of Load distributing algorithm in Distributed System
Types of Load distributing algorithm in Distributed System
 
Deterministic Finite Automata
Deterministic Finite AutomataDeterministic Finite Automata
Deterministic Finite Automata
 
Removing ambiguity-from-cfg
Removing ambiguity-from-cfgRemoving ambiguity-from-cfg
Removing ambiguity-from-cfg
 
van Emde Boas trees
van Emde Boas treesvan Emde Boas trees
van Emde Boas trees
 
Lecture 6
Lecture 6Lecture 6
Lecture 6
 
9.kompresi teks
9.kompresi teks9.kompresi teks
9.kompresi teks
 
Context free grammars
Context free grammarsContext free grammars
Context free grammars
 
Teknik Komunikasi Data Digital
Teknik Komunikasi Data DigitalTeknik Komunikasi Data Digital
Teknik Komunikasi Data Digital
 
Teori bahasa formal dan Otomata
Teori bahasa formal dan OtomataTeori bahasa formal dan Otomata
Teori bahasa formal dan Otomata
 
Lecture 3,4
Lecture 3,4Lecture 3,4
Lecture 3,4
 

Semelhante a ExAlg Overview

Csharp4 strings and_regular_expressions
Csharp4 strings and_regular_expressionsCsharp4 strings and_regular_expressions
Csharp4 strings and_regular_expressions
Abed Bukhari
 
1212 regular meeting
1212 regular meeting1212 regular meeting
1212 regular meeting
marxliouville
 
Introduction to Java Object Oiented Concepts and Basic terminologies
Introduction to Java Object Oiented Concepts and Basic terminologiesIntroduction to Java Object Oiented Concepts and Basic terminologies
Introduction to Java Object Oiented Concepts and Basic terminologies
TabassumMaktum
 

Semelhante a ExAlg Overview (20)

Sedna XML Database System: Internal Representation
Sedna XML Database System: Internal RepresentationSedna XML Database System: Internal Representation
Sedna XML Database System: Internal Representation
 
Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)
 
C# programming
C# programming C# programming
C# programming
 
02python.ppt
02python.ppt02python.ppt
02python.ppt
 
02python.ppt
02python.ppt02python.ppt
02python.ppt
 
Csharp4 strings and_regular_expressions
Csharp4 strings and_regular_expressionsCsharp4 strings and_regular_expressions
Csharp4 strings and_regular_expressions
 
CC Week 11.ppt
CC Week 11.pptCC Week 11.ppt
CC Week 11.ppt
 
I1
I1I1
I1
 
Metaprogramming in Scala 2.10, Eugene Burmako,
Metaprogramming  in Scala 2.10, Eugene Burmako, Metaprogramming  in Scala 2.10, Eugene Burmako,
Metaprogramming in Scala 2.10, Eugene Burmako,
 
Testing for share
Testing for share Testing for share
Testing for share
 
Introduction à Scala - Michel Schinz - January 2010
Introduction à Scala - Michel Schinz - January 2010Introduction à Scala - Michel Schinz - January 2010
Introduction à Scala - Michel Schinz - January 2010
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
 
A Survey of Concurrency Constructs
A Survey of Concurrency ConstructsA Survey of Concurrency Constructs
A Survey of Concurrency Constructs
 
FivaTech
FivaTechFivaTech
FivaTech
 
1212 regular meeting
1212 regular meeting1212 regular meeting
1212 regular meeting
 
Statistics lab 1
Statistics lab 1Statistics lab 1
Statistics lab 1
 
Introduction to Java Object Oiented Concepts and Basic terminologies
Introduction to Java Object Oiented Concepts and Basic terminologiesIntroduction to Java Object Oiented Concepts and Basic terminologies
Introduction to Java Object Oiented Concepts and Basic terminologies
 
scala-101
scala-101scala-101
scala-101
 
Meet scala
Meet scalaMeet scala
Meet scala
 
Scala Bootcamp 1
Scala Bootcamp 1Scala Bootcamp 1
Scala Bootcamp 1
 

Mais de Alessandro Manfredi (9)

Hey Cloud, it’s the user calling, he says he wants the security back
Hey Cloud, it’s the user calling, he says he wants the security backHey Cloud, it’s the user calling, he says he wants the security back
Hey Cloud, it’s the user calling, he says he wants the security back
 
WhyMCA HappyHour - EUHackathon Part II
WhyMCA HappyHour - EUHackathon Part IIWhyMCA HappyHour - EUHackathon Part II
WhyMCA HappyHour - EUHackathon Part II
 
Connect (4|n)
Connect (4|n)Connect (4|n)
Connect (4|n)
 
LUG - Ricompilazione kernel
LUG - Ricompilazione kernelLUG - Ricompilazione kernel
LUG - Ricompilazione kernel
 
LUG - Logical volumes management
LUG - Logical volumes managementLUG - Logical volumes management
LUG - Logical volumes management
 
LUG - Install Fest 2008
LUG - Install Fest 2008LUG - Install Fest 2008
LUG - Install Fest 2008
 
Find me a roof!
Find me a roof!Find me a roof!
Find me a roof!
 
Advanced Shell Scripting
Advanced Shell ScriptingAdvanced Shell Scripting
Advanced Shell Scripting
 
The "vi" Text Editor
The "vi" Text EditorThe "vi" Text Editor
The "vi" Text Editor
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

ExAlg Overview

  • 1. Overview of ExAlg Extracting Structured Data from Web Pages Original work by Arvind Arasu & Hector Garçia-Molina {arvinda,hector}@cs.stanford.edu Presentation by Alessandro Manfredi & Marco Bontempi & Marco Giannone {a.n0on3,bontempi,marco.giannone}@gmail.com
  • 2. Introduction WWW: billions of unstructured HTML pages Many dynamically generated from underlying structured source (e.g. relational database) Information is hard to query
  • 5. The goal Full automatic extraction of structured data ... in order to pose complex queries over data Semantic meaning for extracted data is not an aim Avoid human input ( error prone and time consuming, especially for frequent template changes )
  • 6. Challenges Distinguish template from data Syntactically, there is nothing that tells text part of the template apart from data Deduce the schema for encoded information: Complex schemas, arbitrary levels nesting
  • 8. Data Structure conformed to a common Schema or Type Types: β (Basic Type: string of tokens) Token: word or HTML tag ( or bit, or char ) Tuple < T₁,T₂,...,Tn >, where element ei is an instance of type Ti Set { T }, where all elements are instance of type T
  • 9. Template Function that maps each type constructor τ of S into an ordered set of strings T(τ), e.g. A <html><body><b> Book :</b> B by <html><body><b> Book :</b> C <b>Cost :</b> C Programming Language D </body></html> by Brian Kernighan and E,G ε (empty string) Dennis Ritchie <b>Cost :</b> F _ (space) $30.00 </body></html> H and x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c> λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’
  • 13. Roles Different Roles!
  • 14. Token Occurrence Vectors OV = <1,1,1,1>
  • 15. Token Occurrence Vectors OV = <1,2,1,0>
  • 16. Equivalence classes OV = <1,2,1,0> OV = <1,2,1,0> OV = <1,2,1,0> OV = <1,2,1,0>
  • 17. Assumption For real pages, an equivalence class of large size and support is usually derived from the same template portion Size: # of tokens in the Equivalence class Support (of a token); # of pages in which the token occurs. ( where OV is not 0 ) Tokens generated by data usually occur infrequently and therefore do not occurr in an EQ that occur very often (well ...)
  • 18. EQs abstraction <b> Reviewer Name </b> <em> John </em> <b> Reviewer Name </b> <em> β </em>
  • 19. LFEQs Equivalence Class : A maximal set of tokens having the same occurrence-vector LFEQ: Large and Frequently occurring EQuivalence classes Large = Size ≥ SizeThres, F(o) = Support ≥ SupThres SizeThres & SupThres are user-given parameters Intuition : Almost always, LFEQs are formed by tokens associated with the same type constructor in the template used to create the input pages
  • 21. ExAlg: Modules Stage 1 Stage 2 DiffFormat ( Differentiate Roles using format ) ConstTemp ( Construct Template ) Input FindEquiv ( Find Equivalence Classes ) HandleInv ( Handle Invalid Equivalence Classes ) Template, ExVal ( Extract Values ) Schema, DiffEq Values ( Differentiate Roles Using Eq Classes )
  • 22. Schema β Type < .., .., ..>τ τ : Type Constructor { T }τ Differentiating token roles using parse-tree Instance path (1) of Schema Encoding Find LFEQs based on τ-association and λ(TS,x) Input token roles Pages Handle Invalid LFEQs (unknown) either not ordered or Template not correctly nested LFEQs & (constructed) Differentiating token pages with Template roles using Equivalence differentiated Classes (2) tokens
  • 23. Template Construction Recursively consider the EQ ‘e’ between consecutive tokens i and j from starting and trailing empty strings Def. e ‘empty’ iff i and j always occur contiguously if e is empty i and j are single template tokens if e is not empty there are either basic types (coming from database) or a more complex unknown type
  • 24. Data Extraction Deduced the template, data extraction is trivial and is therefore not covered
  • 25. Heuristic Assumption Summary A1 - A large number of template-tokens have unique roles, allowing the formation of EQ and subsequent differentiation A2 - A large number of tokens derive from type constructors. Each τ is instantiated many times in Ps, so LFEQs can be recognized A3 - There is no ‘regularity’ in encoded data that leads to the formation of invalid EQs A4 - There are separators around data values
  • 26. Experiments ExAlg has been tested on both input pages sets that fits the previous assumptions or not. Input collections has been collected from RISE repository, from related works sites such as ROADRUNNER and IEPAD, and from various well- known sites i.e. eBay, DBLP, Google, and Sigmod Anthology.
  • 27. Procedure Manually generate the schemas No values encoded in tag attributes has been considered (e.g. urls, images src, etc.) Aggregated data has been considered a single token (e.g. dates such as “Nov 8 2008”)
  • 28. Evaluation For each leaf, the system behave: Correct , if the extracted value is the same extracted by the manually defined schema Partially Correct , if the extracted value match two or more leaf in the manually defined schema Incorrect , if the extracted value is neither correct nor partially correct. I.e. the extracted value is a portion of the same in manually defined schema or another misaligned form
  • 29. Evaluation Correct Correct Partially Correct
  • 30. Results Holding A3, none of the schema attributes are classified as incorrect Holding also A1 and A2, all of the schema attribute are classified as correct Partially correct classified elements can be refined with very small human interaction.
  • 31. Heuristic Assumption Summary A1 - A large number of template-tokens have unique roles, allowing the formation of EQ and subsequent differentiation A2 - A large number of tokens derive from type constructors. Each τ is instantiated many times in Ps, so LFEQs can be recognized A3 - There is no ‘regularity’ in encoded data that leads to the formation of invalid EQs A4 - There are separators around data values
  • 32. Some Numbers Acc. = Accuracy of the system for all of the collections Norm. Acc. = Normalized Accuracy respect to the # of attributes for each collection.
  • 33. Conclusions EQ and Token Roles to discover the template Limited Failures when missing some assumptions
  • 34. Open Problems Lower bounds for complexity of some modules algorithms are not given. Failure for some given configurations eg <ol> <li>β</li> <li>β</li> <li>β</li> </ol> <ol> <li>β</li> <li>β</li> </ol> ... does not belong to the same EQ, so the list elements (all foreach list) are extracted as different attributes.
  • 36. Definitions It’s a long way to the top ( if you want rock ‘n roll )
  • 37. Recall on Structured Data Conformed to a common Schema or Type Types: β (Basic Type: string of tokens) Token: word or HTML tag ( or bit, or char ) Tuple < T₁,T₂,...,Tn >, where element ei is an instance of type Ti Set { T }, where all elements are instance of type T
  • 38. Example Set of pages describing books: each page contains book title, set of authors ( first and last name ), cost of the book Schema of the encoded data: S₁ = < β , { <β,β>t₃ }t₂ , β >t₁ t₁ and t₃ are tuple constructors, t₂ is a set constructor
  • 39. Schemas and Trees < > t₁ β β { }t₂ S₁ = < β , { <β,β>t₃ }t₂ , β >t₁ Sub-Trees are also schemas, < > t₃ called sub-schema of the original schema β β
  • 40. Recall: Template Function that maps each type constructor τ of S into an ordered set of strings T(τ), e.g. A <html><body><b> Book :</b> T₁(τ₁) = <A,B,C,D> B by C <b>Cost :</b> T₁(τ₂) = H D </body></html> T₁(τ₃) = <E,F,G> E,G ε (empty string) F _ (space) H and {A..H} are strings x₁ = <t,{<f₁,l₁>,<f₂,l₂>},c> λ(T₁,x₁) = AtBEf₁Fl₁GHEf₂Fl₂GCcD where t is ‘title’, f is ‘first name’, l is ‘last name’ and c is ‘cost’ <html><body><b> Book :</b> C Programming Language by Brian Kernighan and Dennis Ritchie <b>Cost :</b>$30.00 </body></html>
  • 41. Encoding The encoding λ(T,x) of an instance x of a schema S, given a template TS is defined recursively in terms of encoding of sub-values of x, e.g.: if x is a basic type β, λ(T,x) = x elif x is a tuple < x1, ..., xn >τt, λ(T,x) = < C1 , λ(T,x1), ..., Cn , λ(T,xn) > elif x is a set { e1, ..., em }τs, λ(T,x) = < ς λ(T,e1) ς ... ς λ(T,em) > T(τt) = < C1 , ..., Cn+1>, T(τs) = ς
  • 43. Simpler (infix) notation A <html><body><b> Book :</b> Template: B by < A * B { < E * F * G > }H C * D > C <b>Cost :</b> D </body></html> E,G ε (empty string) Where F _ (space) * is a wildcard for β (Basic Type) H and { § }H is ‘ § (H §)+ ’ Assumption: the same values in a tuple occur in the same relative position with respect to the values of other attributes in all the pages. (e.g. the book name always occurs before the list of the authors and the price)
  • 44. Optionals and Disjunctions ( T )? = { T }τ such as | { T }τ | is 0 or 1 ( T1 | T2 ) = T1 OR T2 = < { T1 }τ1 , { T2 }τ2 >τ
  • 45. Problem Model & Formulation
  • 46. EXTRACT Problem “ Given a set of n pages pi = λ(T,xi) (1 ≤ i ≤ n), created from some unknown template T and values {x1,...,xn} deduce the template T and the values alone. “ It is an instance of regular grammar inference problem from positive examples. N.B. Inference “in the limit” ( not Σ* ) with positive example is undecidable. However, humans rarely have any ambiguity in picking the right template from real template-generated web pages, and the goal is to solve EXTRACT problem for real web pages, i.e. produce the template and values that would be considered correct by a human.
  • 47. Tokens A template-token is an occurrence of a token in a template. Each page token is is created from either a template-token or a page-token. A token itself is different from its occurences Two page-tokens are said to have the same role iff they are generated from the same template-token.
  • 49. ExAlg Stages Stage 1 (ECGM: Equivalence Class Generation Module): discover tokens associated with the same type constructor finding equivalence classes Occurrence Vector : The occurrence-vector of a token t, is defined as the vector < f1, ..., fn >, where fi is the number of occurrences of t in pi. Equivalence Class : An equivalence class is a maximal set of tokens having the same occurrence-vector Stage 2 (Analysis Module): using the above set to deduce the template
  • 50. ExAlg: Modules Stage 1 Stage 2 DiffFormat ( Differentiate Roles using format ) ConstTemp ( Construct Template ) Input FindEquiv ( Find Equivalence Classes ) HandleInv ( Handle Invalid Equivalence Classes ) Template, ExVal ( Extract Values ) Schema, DiffEq Values ( Differentiate Roles Using Eq Classes )
  • 52. LFEQs Large and Frequently occurring EQuivalence classes Large: high number of tokens ( gt SizeThres ) Frequently occurring: whose tokens occur in a large number of input pages ( gt SupThres ) Intuition : Almost always, LFEQs are formed by tokens associated with the same type constructor in the (unknown) template used to create the input pages
  • 53. Towards FindEq Token associated with the same type constructor tend to occur in the same equivalence class. A token has a unique role if all the occurrences of the token in the pages, is generated by a single template-token. Tokens associated with the same type constructor τj in T that have unique-roles occurs in the same equivalence class.
  • 54. Valid EQs If all tokens of an equivalence class ε have unique roles and are associated with the same type constructor τj of S, ExAlg assume that ε is derived from τj Only in this case the equivalence class is valid ( You don’t know the type constructors ex ante, ExAlg has to find them )
  • 55. Assumption For real pages, an equivalence class of large size and support is usually valid (*) Size: # of tokens in the Equivalence class Support (of a token); # of pages in which the token occurs. E.g. for token with o.v. <0,1,1,1,0,0,2> support is 4 Tokens generated by value-tokens usually occur infrequently and therefore do not occur in a LFEQs
  • 56. Deducing template Since typically LFEQs consist only of tokens associated with the same type constructor in the (unknown) input template, ExAlg uses LFEQs to deduce the template and the schema Two main obstacles Base assumption(*) is heuristic ( no guarantee ) Even if an LFEQ is valid, its tokens have unique roles, and therefore only contains partially information about the template used to generate the pages
  • 57. EQs Properties An equivalence class is ordered The span of each occurrence of ε is sub- divided into (m-1) positions (Pos). Posk is page-token that may occur between tk and tk+1
  • 58. EQs Properties A pair of equivalence classes ε1 and ε2 is nested if the span of all occurrences of ε2 is within the Pos(p) of some occurrence of ε1 A set of equivalence classes {ε1,...,εn} is nested if every pair of equivalence classes of the set is nested.
  • 59. Invalid EQs Stage 1 Stage 2 DiffFormat ( Differentiate Roles using format ) ConstTemp ( Construct Template ) Input FindEquiv ( Find Equivalence Classes ) HandleInv ( Handle Invalid Equivalence Classes ) Template, ExVal ( Extract Values ) Schema, DiffEq Values ( Differentiate Roles Using Eq Classes )
  • 60. HandInv Handling invalid LFEQs, if any: Typically, invalid LFEQs are either not ordered or not nested with respect to other LFEQs. HandInv detects the existence of invalid LFEQs using violations of ordered and nesting properites It discards some of the LFEQs completely and breaks others into smaller LFEQs Outuput: ordered set of nested (with high probability valid) LFEQs
  • 62. Differentiating roles Typically an LFEQ only contains tokens with unique roles template-tokens can’t be discovered just using LFEQs “Differentiating roles of tokens” is used in EXALG to discover a greater number of template-tokens Different token roles are driven by different contexts such that the occurrences of a token in different contexts above necessarily have different roles
  • 63. Roles Different Roles!
  • 64. Differentiating roles: First Technique This technique uses the html formatting of input pages An occurrence-path of a page-token is the path from the root to the page token, in the parse-tree In practice, two page-tokens with different occurrence path have different roles Equivalently, all page-token generated by a template token have the same occurrence-path
  • 65. Differentiating roles: Second Technique Use valid equivalence classes ε The role of an occurrence of a token t outside the span of any occurence of ε is different from the role of an occurence within the span of some occurrence of ε Further, the role of an occurrence of t within the Pos(l) of some occurrence of ε, is different from the role of an occurrence of t within the Pos(m) of some occurrence of ε, with l≠m
  • 66. DTokens Different roles mean different contexts Def. dtoken: refer to both a token and a context, identified by differentiation A dtoken is a token in a given context EXALG aims to work with equivalence classes defined using dtokens
  • 68. Schema β Type < .., .., ..>τ τ : Type Constructor { T }τ Differentiating token roles using parse-tree Instance path (1) of Schema Encoding Find LFEQs based on τ-association and λ(TS,x) Input token roles Pages Handle Invalid LFEQs (unknown) either not ordered or Template not correctly nested (constructed) Differentiating token Template dtokens roles using Equivalence Classes (2)
  • 69. ECGM Summary Input: a set of pages P Output: a set of LFEQs (made of dtonkens) and pages P represented as string of dtokens NB: dtokens are marked
  • 70. Modules DiffForm sub-module differentiates roles of tokens in P (first technique) Sub-modules FindEq, HandInv and DiffEq iterate in a loop, until no new dtokens are formed FindEq need two parameters: SizeThres and SupThres. Equivalence classes with size and support greater than SizeThres and SupThres are considered LFEQs.
  • 71. Modules Stage 1 Stage 2 DiffFormat ( Differentiate Roles using format ) ConstTemp ( Construct Template ) Input FindEquiv ( Find Equivalence Classes ) HandleInv ( Handle Invalid Equivalence Classes ) Template, ExVal ( Extract Values ) Schema, DiffEq Values ( Differentiate Roles Using Eq Classes )
  • 73. PosString(ε,p) For any EQs ε, Def. Pos(p) ‘empty’ if dtokens dp and dp+1 always occur contiguously. An equivalence class is defined to be ‘empty’ if all of its Pos() are empty. For an occurrence of EQ ε and non-empty POS(p) of ε, PosString(ε,p) is the concatenation of tokens and EQs ε’ in Pos(P) but not EQs in the span of ε’.
  • 74. Construct Template For every non-empty EQ εi, ConstTemp recursively constructs a template, Tεi, and the template Tεi,p, corresponding to each non-empty Pos p of εi. Its output is the template for the root-EQ. The root-EQ occurrence vector is always <1,1,⋄⋄⋄>, at least prepending and appending greater than SizeThres number of dummy tokens to beginning and end of each page.
  • 75. Template Tε Tεi = < Tεis1, ..., Tεisn ><Ci1,Ci2,...,Cin+1> To construct the template Tεi,p ConstTemp checks if the set of strings PosString(εi,p) has some recognizable pattern Pattern used by ConstTemp prototype: Pattern Tεi,p 1 εj εj ... { T εj }∈ 2 εj S εj S ... εj { T εj }S 3 εj OR εk Tεj | Tεk 4 ∈ OR εj (Tεj)? string of dtokens and 5 Tβ empty EQs 6 -unknown- Tβ
  • 78. END Overview of ExAlg Extracting Structured Data from Web Pages Original work by Arvind Arasu & Hector Garçia-Molina {arvinda,hector}@cs.stanford.edu Presentation by Alessandro Manfredi & Marco Bontempi & Marco Giannone {a.n0on3,bontempi,marco.giannone}@gmail.com

Notas do Editor