The document describes a graphical model for jointly extracting and resolving product attributes from web pages. The model uses a Dirichlet process prior to handle an unlimited number of attributes. Variational inference is used to approximate the intractable posterior distribution. Experimental results on four domains show the model achieves good performance on attribute extraction and resolution without supervision.
Web Information Extraction Learning based on Probabilistic Graphical Models
1. Web Information Extraction Learning based on Probabilistic Graphical Models. Wai Lam, joint work with Tak-Lam Wong, The Chinese University of Hong Kong.
16. Problem Definition (2) <TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> <TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </SPAN> </P> </TD> </TR> <TR> (each <TR> tag is marked as a line separator)
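The line-separator idea on this slide can be sketched in Python with the standard-library `html.parser`. This is only an illustration, not the authors' preprocessing: the class name `FragmentSplitter`, the helper `split_fragments`, and the choice to treat each text node as one fragment are all assumptions.

```python
from html.parser import HTMLParser


class FragmentSplitter(HTMLParser):
    """Collect text fragments, treating every <tr> tag as a line separator.

    Hypothetical helper: one text node is assumed to be one text fragment.
    """

    def __init__(self):
        super().__init__()
        self.lines = []      # finished lines, each a list of text fragments
        self._current = []   # fragments of the line currently being built

    def flush(self):
        # Close the current line if it holds any fragments.
        if self._current:
            self.lines.append(self._current)
            self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":      # HTMLParser reports tag names in lowercase
            self.flush()

    def handle_endtag(self, tag):
        if tag == "tr":
            self.flush()

    def handle_data(self, data):
        text = data.strip()
        if text:             # skip whitespace-only nodes between tags
            self._current.append(text)


def split_fragments(html):
    """Return a list of lines; each line is a list of text fragments."""
    parser = FragmentSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()           # keep any trailing fragments after the last <tr>
    return parser.lines


if __name__ == "__main__":
    row = ("<TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> "
           "<TD> <P> <SPAN> Auto, daylight, cloudy </SPAN> </P> </TD> </TR>")
    print(split_fragments(row))
```

Each inner list groups the fragments that share a line, so an attribute name and its value stay together.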
17. Problem Definition (3) Content information: White balance Auto, daylight, … … ; Layout information: boldface, in-table; Target information: 1 (related to attribute); Attribute information: white balance
18. Problem Definition (4) Content information: View larger image; Layout information: boldface, underline; Target information: 0 (irrelevant); Attribute information: not-an-attribute
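The four kinds of information attached to a text fragment can be written down as a small record type. The sketch below is illustrative only: the class name `TextFragment`, its field names, and the helper `related_fragments` are assumptions, and the two records simply restate the examples from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TextFragment:
    """One text fragment with the four kinds of information from the slides."""
    content: str                     # content information (observed text)
    layout: Tuple[str, ...]          # layout information (observed features)
    target: Optional[int] = None     # 1 = related to an attribute, 0 = irrelevant
    attribute: Optional[str] = None  # attribute information, unknown before inference


def related_fragments(fragments: List[TextFragment]) -> List[TextFragment]:
    """Keep only the fragments marked as related to an attribute."""
    return [f for f in fragments if f.target == 1]


# The two examples from the slides, with the elided value list left elided.
examples = [
    TextFragment("White balance Auto, daylight, …",
                 ("boldface", "in-table"), 1, "white balance"),
    TextFragment("View larger image",
                 ("boldface", "underline"), 0, "not-an-attribute"),
]

if __name__ == "__main__":
    for fragment in related_fragments(examples):
        print(fragment.attribute, "<-", fragment.content)
```

Content and layout are what a page exposes directly, while target and attribute are the quantities the model has to infer, which is why they default to `None` here.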
31. Our Model (3) Dirichlet Process Prior (Infinite Mixture Model); plate over N text fragments; plate over S different Web sites
32. Our Model (4) Plate over N text fragments, each with target information, layout information, and content information. The Dirichlet Process Prior (Infinite Mixture Model) supplies, for each k: the proportion of the k-th component in the mixture; the content information parameter of the k-th component; the target information parameter of the k-th component
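One common way to obtain the mixture proportions of an infinite mixture is the stick-breaking construction, truncated at a finite number of components. The slides do not say which representation is used, so the sketch below is an assumption; the function name `stick_breaking` and its parameters `alpha` (the DP concentration parameter) and `truncation` are hypothetical.

```python
import random


def stick_breaking(alpha, truncation, rng):
    """Draw `truncation` mixture proportions by truncated stick-breaking.

    Each v_k ~ Beta(1, alpha) breaks off a share of the remaining stick;
    the last component takes whatever is left so the weights sum to 1.
    """
    if truncation < 1:
        raise ValueError("truncation must be at least 1")
    remaining = 1.0
    weights = []
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)   # proportion of the k-th component
        remaining *= 1.0 - v            # stick left for later components
    weights.append(remaining)
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (0.5, 5.0):
        weights = stick_breaking(alpha, truncation=10, rng=rng)
        print(alpha, [round(w, 3) for w in weights])
```

Smaller values of `alpha` tend to concentrate the mass on the first few components, while larger values spread it over more of them.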
33. Our Model (5) Plate over S different Web sites: site-dependent layout format
34. Our Model (6) Dirichlet Process Prior (Infinite Mixture Model) hyperparameters: concentration parameter for the DP; base distribution for content information; base distribution for target information
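The role of the concentration parameter can be illustrated with the Chinese restaurant process, a standard sequential view of a Dirichlet process that assigns items to an open-ended set of components. This is a generic sketch rather than the model on the slides: the function name `crp_assignments` is hypothetical, and the draws of per-component parameters from the base distributions are omitted because the slides do not specify those distributions.

```python
import random


def crp_assignments(n, alpha, rng):
    """Assign n items to components using the Chinese restaurant process.

    Item i (0-based) joins existing component k with probability
    counts[k] / (i + alpha) and opens a new one with probability
    alpha / (i + alpha).
    """
    counts = []        # number of items in each component so far
    assignments = []
    for i in range(n):
        r = rng.random() * (i + alpha)
        acc = 0.0
        k = len(counts)            # default: open a new component
        for j, c in enumerate(counts):
            acc += c
            if r < acc:
                k = j
                break
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (0.5, 5.0):
        labels = crp_assignments(200, alpha, rng)
        print(alpha, "components used:", len(set(labels)))
```

The number of components is not fixed in advance; it grows with the data, at a rate governed by `alpha`, which is what allows an unbounded set of attributes.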