4. INTRODUCTION
Web information
Unstructured
Non-semantic
Designed for humans not for crawlers
Problems
Representation (HTML vs XML)
Extract, filter and reuse data
Share information
Volatility
Fault tolerance
5. INTRODUCTION
Information Extraction techniques
Machine learning
Pattern recognition
Wrappers technologies
Tools for automatic and semi-automatic
Web data extraction
This work presents
A rule-based method for data identification
An approach to Web data extraction
A particular implementation of the previous
method
7. SEMANTIC GENERATORS
Def: A Semantic Generator (Sg) is a non-empty
set of rules (HTML2XML) that can be
used to translate HTML documents into XML
documents
A Semantic Generator (Sg) is built from several
rules, which transform a set of non-semantic
HTML tags into a set of semantic XML tags
HTML2XML rule format
HTML2XML_i = <header> IS <body> #num
8. SEMANTIC GENERATORS
HTML2XML: <table.tr.td> IS <my-xml-tag>
Tags (<table>, <tr>, <td>, <A href…>, etc.)
will be removed; only data will be extracted
#num: provides the number of cells to be processed
<my-xml-tag> Madrid </my-xml-tag>
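The rule format above can be read mechanically. The following is a minimal sketch, assuming the textual form `<header> IS <body> #num` with an optional `#num`; the class `Html2XmlRule` and the functions `parse_rule` and `wrap` are hypothetical names chosen for illustration, not part of WebMantic.

```python
import re
from dataclasses import dataclass
from typing import Optional


# Hypothetical in-memory form of one HTML2XML rule (field names are illustrative).
@dataclass(frozen=True)
class Html2XmlRule:
    html_path: tuple          # e.g. ("table", "tr", "td")
    xml_tag: str              # e.g. "my-xml-tag"
    num_cells: Optional[int]  # value of #num, or None when it is absent


# Assumed textual syntax: <header> IS <body> with an optional trailing #num.
RULE_RE = re.compile(r"^\s*<([^<>]+)>\s+IS\s+<([^<>]+)>(?:\s+#(\d+))?\s*$")


def parse_rule(text: str) -> Html2XmlRule:
    """Split a rule string into its HTML path, XML tag and cell count."""
    match = RULE_RE.match(text)
    if match is None:
        raise ValueError(f"not an HTML2XML rule: {text!r}")
    path, tag, num = match.groups()
    return Html2XmlRule(tuple(path.split(".")), tag, int(num) if num else None)


def wrap(rule: Html2XmlRule, data: str) -> str:
    """Wrap extracted cell data in the rule's semantic XML tag."""
    return f"<{rule.xml_tag}>{data.strip()}</{rule.xml_tag}>"


rule = parse_rule("<table.tr.td> IS <my-xml-tag> #3")
print(wrap(rule, " Madrid "))
```

Keeping the HTML path as a tuple makes it easy to compare a rule against the position of a cell in the document tree.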
13. WEBMANTIC ARCHITECTURE
Tidy HTML parser (http://tidy.sourceforge.net). It
translates HTML documents into well-formed
HTML documents
The HTML Tidy program (HTML parser and
pretty printer) has been integrated as the first
preprocessing module in WebMantic.
Tree generator module. Once the HTML page is
preprocessed by the Tidy parser, a tree representation
of the structures stored in the page is built
In this representation any table or list tag
generates a node, and the leaves of the tree are cells
for tables (th,td,tr) or items for lists (li,lo)
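A tree of this kind could be built as follows. This is only a sketch using Python's standard `html.parser`, assuming the input is already well-formed (as it would be after Tidy); the names `Node`, `StructureTreeBuilder`, `build_tree` and `leaves` are hypothetical, and treating `tr` as an intermediate node is a choice of this sketch rather than something the slides specify.

```python
from html.parser import HTMLParser

STRUCTURE_TAGS = {"table", "tr", "ul", "ol"}  # tags that become inner nodes
LEAF_TAGS = {"td", "th", "li"}                # cells and list items


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []


class StructureTreeBuilder(HTMLParser):
    """Keeps only table/list structure; every other tag is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURE_TAGS or tag in LEAF_TAGS:
            node = Node(tag)
            self._stack[-1].children.append(node)
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed input, so the closing tag matches the stack top.
        if len(self._stack) > 1 and self._stack[-1].tag == tag:
            self._stack.pop()

    def handle_data(self, data):
        top = self._stack[-1]
        if top.tag in LEAF_TAGS:
            top.text += data


def build_tree(html):
    builder = StructureTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaves(node, path=()):
    """Yield (dotted path, text) for every cell or item without nested structure."""
    here = path if node.tag == "document" else path + (node.tag,)
    if not node.children:
        if node.tag in LEAF_TAGS:
            yield ".".join(here), node.text.strip()
        return
    for child in node.children:
        yield from leaves(child, here)


page = "<table><tr><td><a href='#'>Madrid</a></td><td>21</td></tr></table><ul><li>Spain</li></ul>"
print(list(leaves(build_tree(page))))
```

Formatting tags such as `<a>` are dropped while their text is kept, which matches the idea that only data is extracted.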
15. WEBMANTIC ARCHITECTURE
HTML2XML Rule generator module. The tree
representation obtained is used by this module
to generate a set of rules (Sg) that represent
the information to be translated
HTML2XML rules
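One way rule generation might look is sketched below. It assumes the tree has already been flattened into a list of dotted leaf paths in document order, and that XML tag names come from a caller-supplied mapping; the function `generate_rules`, the `tag_for` mapping and the fallback tag `unknown` are all hypothetical.

```python
from itertools import groupby


def generate_rules(leaf_paths, tag_for):
    """Emit one rule string per run of consecutive leaves sharing a path.

    leaf_paths: dotted paths of the leaves, in document order.
    tag_for:    hypothetical mapping from a path to the XML tag to use.
    """
    rules = []
    for path, run in groupby(leaf_paths):
        count = sum(1 for _ in run)        # becomes the #num part of the rule
        tag = tag_for.get(path, "unknown")
        rules.append(f"<{path}> IS <{tag}> #{count}")
    return rules


paths = ["table.tr.td", "table.tr.td", "ul.li"]
for rule in generate_rules(paths, {"table.tr.td": "city", "ul.li": "country"}):
    print(rule)
```

Grouping consecutive leaves is an assumption of this sketch; the slides only state that a rule is generated for each structure to be translated.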
17. WEBMANTIC ARCHITECTURE
Subsumption module. The previous module generates a
rule for each structure to be translated. However,
some of those rules can be generalized if the
XML-tag represents the same concept (e.g. the
rules in the previous example that represent the
concepts of <data-record> and <country>)
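The generalization step could be sketched as a merge of rules that share the same HTML path and the same XML tag. The slides do not say how the merged `#num` is computed, so summing the cell counts is purely an assumption of this sketch, and the function name `subsume` is hypothetical.

```python
def subsume(rules):
    """Merge rules with identical (path, tag) into a single rule.

    rules: iterable of (path, tag, num) tuples.
    The merged num is the sum of the individual counts (an assumption).
    """
    merged = {}
    for path, tag, num in rules:
        key = (path, tag)
        merged[key] = merged.get(key, 0) + num
    # dict preserves insertion order, so first appearance decides rule order
    return [(path, tag, num) for (path, tag), num in merged.items()]


rules = [
    ("table.tr.td", "country", 1),
    ("table.tr.td", "data-record", 4),
    ("table.tr.td", "country", 1),
]
print(subsume(rules))
```

The effect is a smaller Semantic Generator, which is what the experimental parameter "rules once the subsumption process has finished" would measure.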
19. WEBMANTIC ARCHITECTURE
XML Parser module. This module receives both
the Semantic Generator obtained in the previous
module and the (well-formed) HTML document
[Figure: XML Parser, with inputs labelled Semantic Generator and Yahoo! Weather]
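The translation step can be sketched as follows, assuming rules are given as `(path, tag, num)` tuples and that the page contains flat (non-nested) tables and lists. The names `CellCollector` and `translate`, the root element `document`, the tag names and the sample values are all made up for illustration.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

TRACKED = {"table", "tr", "td", "th", "ul", "ol", "li"}
CELLS = {"td", "th", "li"}


class CellCollector(HTMLParser):
    """Collects (dotted path, text) for each cell or item, in document order."""

    def __init__(self):
        super().__init__()
        self._path = []
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            self._path.append(tag)
            if tag in CELLS:
                self.cells.append([".".join(self._path), ""])

    def handle_endtag(self, tag):
        if self._path and self._path[-1] == tag:
            self._path.pop()

    def handle_data(self, data):
        # Only text directly inside a cell is kept (flat structures assumed).
        if self._path and self._path[-1] in CELLS:
            self.cells[-1][1] += data


def translate(html, rules, root="document"):
    """Apply each rule to the next `num` unconsumed cells on its path."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    pending = list(collector.cells)
    out = [f"<{root}>"]
    for path, tag, num in rules:
        taken, rest = 0, []
        for cell_path, text in pending:
            if cell_path == path and taken < num:
                out.append(f"  <{tag}>{escape(text.strip())}</{tag}>")
                taken += 1
            else:
                rest.append((cell_path, text))
        pending = rest
    out.append(f"</{root}>")
    return "\n".join(out)


page = "<table><tr><td>Madrid</td><td>21</td></tr></table>"
print(translate(page, [("table.tr.td", "city", 1), ("table.tr.td", "temperature", 1)]))
```

Cells not claimed by any rule are simply dropped, which is one possible reading of "only data will be extracted" for the structures a Semantic Generator covers.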
35. EXPERIMENTAL RESULTS
Several parameters have been evaluated:
1. Number of pages tested from each Web site
2. Number of accessible structures
3. Maximum nested structure
4. Average number of HTML2XML rules for each Semantic
Generator (Sg), once the subsumption process has
finished
5. Average time (seconds) to generate the Sg (Time Sg)
6. Average time (seconds) to translate from HTML to
XML for the set of training pages (transformation time)
38. CONCLUSIONS AND FUTURE WORK
Conclusions:
We define a technique that provides a
semantic representation (using XML-tags) for semi-
structured (tables and lists) Web pages through a set of
rules (encapsulated in a Semantic Generator)
Rules are created and automatically generalized
These rules can be used to preprocess Web pages with a
similar structure, and convert them into XML
documents with semantic tags
These can be integrated into information agents
39. CONCLUSIONS AND FUTURE WORK
In the near future:
Other Web technologies such as DOM
Ontologies
Machine learning algorithms to automatically
learn new (similar) Web pages
Statistical knowledge extraction