Web Information Extraction Learning based on Probabilistic Graphical Models

Wai Lam, joint work with Tak-Lam Wong, The Chinese University of Hong Kong

Introduction

Wrapper Adaptation Problem (1)

Wrapper Adaptation Problem (2) Learned wrapper Wrapper learning

Product Attribute Extraction and Resolution Problem (1)

Product Attribute Extraction and Resolution Problem (2)

Product Attribute Extraction and Resolution Problem (3)

Our Approach

Motivating Example (Source: http://www.superwarehouse.com) (Source: http://www.crayeon3.com)

Product Attribute Extraction

Product Attribute Resolution

Existing Works (Supervised Learning)

Existing Works (Unsupervised Learning)

Our Framework

Problem Definition (1)

Problem Definition (2) <TR> <TD> White balance </TD> <TD> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </TD> </TR> <TR> Line separator Line separator
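The row above suggests that a page is cut into text fragments at line-separator tags. A minimal sketch of that splitting step follows. The set of separator tags and the function name `split_text_fragments` are assumptions made for illustration; the slide itself only shows `<TR>` acting as a separator.

```python
import re
from html import unescape

# Assumption: these tags act as line separators; the slide only shows <TR>.
LINE_SEPARATOR_TAGS = ("tr", "br", "p", "div", "li")

_SEPARATOR_RE = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(LINE_SEPARATOR_TAGS), re.IGNORECASE
)
_ANY_TAG_RE = re.compile(r"<[^>]+>")


def split_text_fragments(html):
    """Split an HTML string into text fragments at line-separator tags."""
    fragments = []
    for chunk in _SEPARATOR_RE.split(html):
        text = _ANY_TAG_RE.sub(" ", chunk)       # drop remaining tags such as <TD>
        text = " ".join(unescape(text).split())  # collapse runs of whitespace
        if text:                                 # skip empty chunks between separators
            fragments.append(text)
    return fragments


page = ("<TR> <TD> White balance </TD> <TD> Auto, daylight, cloudy </TD> </TR> "
        "<TR> <TD> View larger image </TD> </TR>")
fragments = split_text_fragments(page)
print(fragments)
```

A regular expression is enough for this illustration; a real page with malformed markup would likely need a proper HTML parser.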

Problem Definition (3) Content information: White balance Auto, daylight, … …; Layout information: boldface, in-table; Target information: 1 (related to attribute); Attribute information: white balance

Problem Definition (4) Content information: View larger image; Layout information: boldface, underline; Target information: 0 (irrelevant); Attribute information: not-an-attribute
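The two examples can be captured in a small record type with one field per kind of information. This is only an illustrative sketch: the class name `TextFragment`, the helper `make_fragment`, and the whitespace tokenisation are assumptions, not the authors' representation. The target and attribute fields default to `None` for fragments where they are not observed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class TextFragment:
    content: Tuple[str, ...]         # content information: the fragment's tokens
    layout: FrozenSet[str]           # layout information: binary features that are on
    target: Optional[int] = None     # target information: 1 = related to an attribute, 0 = irrelevant
    attribute: Optional[str] = None  # attribute information: the attribute the fragment refers to


def make_fragment(text, layout_features, target=None, attribute=None):
    """Build a fragment from raw text using naive lowercase whitespace tokenisation."""
    return TextFragment(
        content=tuple(text.lower().split()),
        layout=frozenset(layout_features),
        target=target,
        attribute=attribute,
    )


related = make_fragment("White balance Auto, daylight",
                        {"boldface", "in-table"}, 1, "white balance")
irrelevant = make_fragment("View larger image",
                           {"boldface", "underline"}, 0, "not-an-attribute")
print(related)
print(irrelevant)
```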

Problem Definition (5) Attribute information Target information Layout information Content information

Graphical Models (1)

Graphical Models (2)

Graphical Models (3) Figure nodes: Z_1, Z_2, Z_3, Z_N, θ

Graphical Models (4) Figure nodes: Z_n, θ; plate N

Graphical Models (5) Finite Mixture Model

Graphical Models (6) Finite Mixture Model

Graphical Models (7) Finite Mixture Model

Graphical Models (8) Finite Mixture Model. Figure nodes: G, θ_i, x_i; plate N
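As a rough illustration of the finite mixture model in the figure, the sketch below draws each θ_i from a discrete distribution G over K component parameters and then draws x_i given θ_i. The Gaussian component distribution, the fixed standard deviation, and the function name `sample_finite_mixture` are assumptions chosen only to make the example concrete.

```python
import random


def sample_finite_mixture(weights, means, n, stddev=1.0, seed=0):
    """Draw n points from a finite mixture.

    G places mass weights[k] on the parameter means[k]. For each point,
    theta_i is drawn from G and x_i is drawn from Normal(theta_i, stddev).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        theta_i = means[k]                # theta_i ~ G
        x_i = rng.gauss(theta_i, stddev)  # x_i ~ F(theta_i), Gaussian by assumption
        samples.append((k, x_i))
    return samples


draws = sample_finite_mixture(weights=[0.5, 0.3, 0.2],
                              means=[-5.0, 0.0, 5.0], n=10)
for component, value in draws:
    print(component, round(value, 3))
```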

Graphical Models (9) Dirichlet Process Mixture. Figure nodes: α, π_k, ψ_k, G_0, Z_i, x_i; plate N
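One common way to realise the π_k and ψ_k in a Dirichlet process mixture is the stick-breaking construction, sketched below with a finite truncation. The slide does not say which construction the authors use, so treat this as an assumption. The truncation level, the Gaussian likelihood, and the names `stick_breaking_weights` and `sample_dp_mixture` are likewise illustrative.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking: pi_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha)."""
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component takes the leftover stick, so weights sum to 1
    return weights


def sample_dp_mixture(alpha, base_sampler, n, truncation=50, seed=0):
    """Sample pi_k by stick-breaking, psi_k ~ G_0, Z_i ~ pi, x_i ~ Normal(psi_{Z_i}, 1)."""
    rng = random.Random(seed)
    pi = stick_breaking_weights(alpha, truncation, rng)
    psi = [base_sampler(rng) for _ in range(truncation)]  # psi_k ~ G_0
    data = []
    for _ in range(n):
        z = rng.choices(range(truncation), weights=pi, k=1)[0]
        data.append((z, rng.gauss(psi[z], 1.0)))          # Gaussian likelihood by assumption
    return pi, psi, data


# G_0 is taken to be Normal(0, 10) purely for illustration.
pi, psi, data = sample_dp_mixture(alpha=1.0,
                                  base_sampler=lambda rng: rng.gauss(0.0, 10.0),
                                  n=20)
print(len({z for z, _ in data}), "components used by 20 points")
```

A smaller α concentrates the mass on fewer components, while a larger α spreads it over more of them.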

Our Model (1)

Our Model (2) Attribute information Target information Layout information Content information

Our Model (3) Dirichlet Process Prior (Infinite Mixture Model). Plates: N text fragments, S different Web sites

Our Model (4) Dirichlet Process Prior (Infinite Mixture Model). Plate: N text fragments, each with target information, layout information, and content information. Labels: the proportion of the k-th component in the mixture; the content information parameter of the k-th component; the target information parameter of the k-th component

Our Model (5) Plate: S different Web sites, each with a site-dependent layout format

Our Model (6) Dirichlet Process Prior (Infinite Mixture Model). Labels: concentration parameter for the DP; base distribution for content information; base distribution for target information

Generation Process (2)

Variational Method (1)

Variational Method (3)

Variational Method (4)

Variational Method (5)

Variational Method (6) Mixture of tokens Binary A set of binary features Conjugate priors
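To make the role of conjugate priors concrete, the sketch below shows the two standard conjugate updates that would match the labels on this slide: a Beta prior for a binary variable and a Dirichlet prior for a distribution over tokens. Which priors the authors actually use is not stated here, so both pairings, the symmetric Dirichlet parameter `eta`, and the function names are assumptions.

```python
from collections import Counter


def beta_posterior(a, b, observations):
    """Beta(a, b) prior plus 0/1 observations gives Beta(a + #ones, b + #zeros)."""
    ones = sum(observations)
    zeros = len(observations) - ones
    return a + ones, b + zeros


def dirichlet_posterior(eta, vocabulary, tokens):
    """Symmetric Dirichlet(eta) prior plus token counts gives Dirichlet(eta + count_w).

    Tokens outside the vocabulary are ignored in this sketch.
    """
    counts = Counter(tokens)
    return {w: eta + counts.get(w, 0) for w in vocabulary}


# Binary variable observed as 1, 0, 1, 1 under a uniform Beta(1, 1) prior.
a_post, b_post = beta_posterior(1, 1, [1, 0, 1, 1])
print(a_post, b_post)

# Token counts under a symmetric Dirichlet(0.5) prior.
post = dirichlet_posterior(0.5, ["white", "balance", "auto"],
                           ["white", "balance", "white"])
print(post)
```

Because the posterior stays in the same family as the prior, each update reduces to adding counts to the prior parameters, which is what keeps variational updates for such models in closed form.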

Initialization

EM Algorithm for Layout Parameters

Experiments

Evaluation on Attribute Resolution

Results of Attribute Resolution

Visualize the Resolved Attributes

Evaluation on Attribute Extraction

Conclusions

Variational Inference (4) Mixture of tokens Binary A set of binary features Conjugate priors

Variational Method (7)

Unsupervised Approach
