UNIBA: http://www.uniba.it DIB: http://www.di.uniba.it KDDE: http://kdde.di.uniba.it
Discovering Structured Information from
Web Sites
PhD Candidate Fabiana Lanotte,
Supervisor Michelangelo Ceci
Web. . . From homogeneous information
network
● Homogeneous nodes (i.e. hypertextual documents) and relations (i.e. hyperlinks)
● Node content is encoded in HTML
● Node content is unstructured
Web. . . From homogeneous information
network
HTML is a markup language:
● A description for a browser to render
● It describes how the data should be displayed
● It was never meant to describe the data.
Web. . . To heterogeneous information network
However, the Web is full of data:
● Structured in some way
Web. . . To heterogeneous information network
However, the Web is full of data:
● Structured in some way
● Represent different kinds of real-world entities
Web. . . To heterogeneous information network
However, the Web is full of data:
● Structured in some way
● Represent different kinds of real-world entities
● Interact via various kinds of relationships.
[Figure: example heterogeneous network with nodes such as staff, professor, course, news and papers, connected by relations such as "provides" (professor provides course)]
What could we do?
Search
● Show structured information in response to a query
● Automatically rank and cluster web pages
● Reason on the Web
○ Who are the people at some company? What are the courses in some college department?
Analysis
● Expand the known information about an entity
○ What is a professor’s phone number, email, courses taught, research, etc.?
Contributions of this thesis
1. Extract structured data in the form of web lists split across multiple web pages: Logical lists
Contributions of this thesis
1. Extract structured data in the form of web lists split across multiple web pages: Logical lists
2. Automatic extraction of sitemaps
Contributions of this thesis
1. Extract structured data in the form of web lists split across multiple web pages: Logical lists
2. Automatic extraction of sitemaps
3. Clustering of web pages based on their intra- and extra-page features
Automatic Extraction of Logical Web
Lists
Pasqua Fabiana Lanotte, F. Fumarola, M. Ceci, D. Malerba, Automatic
Extraction of Logical Web Lists. ISMIS 2014: 365-374
Introduction
A large amount of structured data on the Web
exists in several forms:
– HTML lists, tables, and back-end Deep Web databases
Table tags, list tags and much more
● Cafarella et al. [1] estimated that there are more than one billion instances of relational data expressed using the HTML table tag;
● Elmeleegy et al. [2] suggested an equal number coming from HTML lists;
… but much structured data is not represented with table or list tags (e.g. BBC news).
[1] M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data on the web. Commun. ACM, Feb. 2011
[2] H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. VLDB Journal, Apr. 2011
Web Pages as Database views
• Moreover, many websites (e.g. Amazon, Trulia, AbeBooks, etc.) present their listings across multiple web pages;
• Similar to a database:
– The whole listing can represent a database table (i.e. a Logical list);
– Each list contained in a Web Page can represent a view of the logical list.
Web Pages as Database views
• Logical list: listing of all
computer products;
• View list: top six products;
• There are web lists that
– allow us to extend
a view list,
– filter a logical list.
Goal
Our goal is to define a novel unsupervised and domain-independent approach that is able to:
• Identify web lists in a Web Page that are not necessarily represented as HTML lists and tables;
• Extract logical lists.
The approach can be used in several application domains (Entity Page discovery, query answering, etc.).
Related Works
• Existing approaches for web list extraction can be classified into:
– Structural methods: focus on automatically extracting rules using common DOM structures. They often fail to handle more complicated or noisy structures (e.g. RoadRunner);
– Visual methods: use visual information from rendered Web Pages (e.g. ViDE);
– Hybrid methods: combine structural and visual features (e.g. HyLiEn, VENTex).
Related Works
• Existing approaches focus only on the extraction of web lists from a single Web Page, and fail to detect logical lists.
• Furche et al. presented a supervised method, based on structural and visual features, to extract pagination lists from Web Pages:
– The accuracy of the model depends strongly on the size of the training set and on the complexity of the feature model used.
– It is not applicable at Web scale.
Definitions
A web page is characterized by multiple representations:
• Textual: composed of web page terms;
• Structural: composed of HTML tags;
• Visual: composed of rendering information.
Definitions
A web page is characterized by multiple representations:
• Textual: composed of web page terms;
• Visual: composed of the Rendered Box Tree;
[Figure: a rendered box tree, where each box is annotated with its HTML tag and its coordinates (x, y, h, w)]
Definitions
Web List:
• A collection of two or more web elements (called data records), codified as rendered boxes, that have a similar HTML structure and are visually adjacent and aligned.
Definitions
Logical List: a list whose data records are distributed over more than one web page.
Methodology
Three-step strategy
1. Web List extraction: given a web page P, extract the set L = {l1, l2, ..., ln} of web lists contained in P;
2. Dominant list identification
– Dominant list: the list of interest, containing the data records of the logical list we want to extract;
3. Logical List Discovery.
Methodology
2. Dominant list identification
The set L = {l1, l2, ..., ln} is used to extract the dominant list li
Methodology
3. Logical List discovery
• The dominant list li is used to discover the logical list LL containing li
• Idea: links that are grouped in collections with a uniform layout and presentation usually lead to similar pages;
• Two lists belong to the same Logical List LL if their elements satisfy a structural similarity (e.g. Levenshtein distance, normalized tree edit distance) and a visual similarity (a minimal sketch of the structural check follows below).
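As an illustration of the structural-similarity check only (the thesis also uses visual similarity and may rely on tree edit distance instead), here is a minimal sketch based on a normalized Levenshtein distance between the HTML tag sequences of two data records; the flat tag sequences and the sample records are assumptions for the example.

```python
def levenshtein(a, b):
    """Classic edit distance between two sequences of HTML tag names."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def structural_similarity(record_a, record_b):
    """Normalized similarity in [0, 1] between two data records,
    each given as a flat sequence of HTML tag names."""
    if not record_a and not record_b:
        return 1.0
    dist = levenshtein(record_a, record_b)
    return 1.0 - dist / max(len(record_a), len(record_b))

# Hypothetical records from candidate lists on two different pages.
rec1 = ["div", "a", "img", "span", "span"]
rec2 = ["div", "a", "img", "span"]
print(structural_similarity(rec1, rec2))  # 0.8, above an assumed acceptance threshold
```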
Experiments
• Dataset: 66061 list elements manually extracted from 4405 Web pages belonging to 40 different websites (music shops, web journals, movie information, home listings, computer accessories, etc.);
• For the deep-web databases, we performed a query for each of them and collected the first page of the results list; for the others we manually selected a Web page;
• Measures used: Precision, Recall, F-measure.
Results
– On average, it achieves 100% Precision, 95% Recall and 97% F-Measure;
– There are no false positives;
– The false negatives are caused by high variance among data records.
Conclusions and Future Works
• Our method solves the problem of
– discovering data records organized in web lists,
– extracting logical lists
• The experimental results show that our algorithm is able to discover logical lists in a heterogeneous set of Web sites (i.e. it is domain independent);
• Logical list extraction can be used to improve the extraction of Entity Pages (e.g. Web pages of books, professors, etc.), to improve query answering using lists, and for disambiguation.
Automatic Generation of Sitemaps
Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Automatic Generation
of Sitemaps Based on Navigation Systems. MOD 2016: 216-223
Pasqua Fabiana Lanotte, M. Ceci. Closed Sequential Pattern Mining for Sitemap Generation. To be submitted to TKDE
Automatic generation of Sitemaps
• CloFAST: Closed Sequential Pattern Mining
using Sparse and Vertical Id-Lists
• Generation of sitemaps through navigation systems
CloFAST: Closed Sequential Pattern
Mining using Sparse and Vertical Id-Lists
F. Fumarola, Pasqua Fabiana Lanotte, M. Ceci, D. Malerba. CloFAST: closed
sequential pattern mining using sparse and vertical id-lists. Knowl. Inf. Syst.48(2):
429-463 (2016)
Sequential Pattern Mining
• Sequential pattern mining provides approaches and techniques to mine knowledge from sequence data.
• Given a sequence database and a support threshold, find the complete list of frequent subsequences.
• Example: a customer who buys a Canon camera is likely to buy an HP color printer within a month.
Given the support threshold minSupp = 2, <(ab)c> is a frequent sequential pattern.
State of the art: limitations
• There are different types of patterns: frequent, closed, maximal, with constraints, etc.
• However, the main limits of the state-of-the-art algorithms are:
1. The need for multiple scans of the dataset
2. The generation of huge sets of candidate sequences
3. The inefficiency in handling very long sequences
4. (All resulting in) the inefficiency in handling massive data
• Our solution is the CloFAST algorithm for closed sequential pattern mining
Problem Definition
Let I = {i1, i2, …, in} be a set of different items:
• An itemset is a non-empty subset of items, denoted as (i1, i2, …, ik).
• A sequence is an ordered list of elements, denoted as ‹s1 → s2 → … → sm›, where each si can be an itemset or a single item.
• A sequence α = ‹a1 → a2 → … → an› is called a sub-sequence of another sequence β = ‹b1 → b2 → … → bm›, and β a super-sequence of α, if there exist integers 1 ≤ j1 ≤ j2 ≤ … ≤ jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, and an ⊆ bjn.
Example
• The sequence A → BC is a sub-sequence of the sequence AB → E → BCD,
• AC → AD is not a sub-sequence of A → C → AD.
Problem Definition
• A sequence database D is a set of tuples (sid, S), where sid is a sequence id and S is a sequence.
• The support of a sequence α in D, denoted as σ(α), is the number of tuples containing α.
• Given a user-defined threshold δ, a sequence α is said to be frequent in D if σ(α) ≥ δ.
• Given two sequences β and α, if β is a super-sequence of α and their supports are equal, we say that β is closed and absorbs α.
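To make these definitions concrete, the following minimal sketch represents itemsets as Python frozensets and sequences as lists of itemsets, checks sub-sequence containment, and computes σ(α) as the number of containing tuples; the toy database reuses the example sequences shown later in the Sparse Id-List slide.

```python
def is_subsequence(alpha, beta):
    """True if every itemset of alpha is contained (as a subset) in a distinct,
    order-preserving itemset of beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

def support(alpha, database):
    """sigma(alpha): number of tuples (sid, S) in D whose sequence contains alpha."""
    return sum(1 for _sid, seq in database if is_subsequence(alpha, seq))

# Example sequence database D (the four sequences used in the id-list examples).
D = [(1, [frozenset("a"), frozenset("abc"), frozenset("ac"), frozenset("d"), frozenset("cf")]),
     (2, [frozenset("ad"), frozenset("c"), frozenset("bc"), frozenset("ae")]),
     (3, [frozenset("ef"), frozenset("ab"), frozenset("df"), frozenset("c"), frozenset("b")]),
     (4, [frozenset("e"), frozenset("g"), frozenset("af"), frozenset("c"), frozenset("b"), frozenset("c")])]

alpha = [frozenset("ab"), frozenset("c")]   # the pattern <(ab) -> c>
print(support(alpha, D))                    # 2, so alpha is frequent for delta = 2
```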
CloFAST
• CloFAST combines:
– Two data structures to support search space exploration
for closed itemsets and sequences
– A new data representation of sequence databases which
is based on sparse id-lists and vertical id-lists
– A novel one-step technique to both check backward closure and prune the search space
• CloFAST works in two steps:
– Closed itemset mining using sparse id-lists
– Closed sequence discovery using vertical id-lists
Closed Itemset Enumeration Tree (CIET)
• It enumerates the complete set of closed frequent itemsets.
– The root node is the empty set,
– The first level enumerates the 1-length itemsets according to a lexicographic order,
– The other levels store the itemsets with length > 1
• Each node in the CIET can be labeled as intermediate, unpromising, or closed.
CIET: example
• Nodes with heavy-line
borders represent closed
itemsets.
• Nodes with dashed-line
borders represent
unpromising nodes.
• Remaining nodes
represent intermediate
nodes.
Closed Sequence Enumeration
Tree (CSET)
• It enumerates the complete search space of
closed sequences.
• It has the following properties:
1. Each node corresponds to a sequence
2. If a node corresponds to a sequence s, its children
are obtained by a sequence extension of s
CSET: Example
• Nodes with continuous-line borders represent closed sequences.
• Nodes with dashed-line borders represent pruned nodes.
Sparse Id-List
Given an itemset t, its sparse id-list (SILt) is a vector of size n (with |D| = n) such that, for each j = 1, …, n:
– SILt[j] = the list of ids of the transactions of Sj that contain t, if Sj contains t;
– SILt[j] = ∅, otherwise.
• It can be efficiently used to support itemset-extension
Sparse Id-List: Example
Sequence_ID | Sequence (transaction ids 1–6)
1 | a (abc) (ac) d (cf)
2 | (ad) c (bc) (ae)
3 | (ef) (ab) (df) c b
4 | e g (af) c b c
[Figure: the corresponding sparse id-lists SILa, SILb, SILc, SILd, SILe, SILf]
Itemset-extension using SILs
• Itemset-extension (I-Step): an item is added to the last transaction of a sequence
• In CloFAST, itemset extensions are executed during the construction of the CIET.
• The I-Step is achieved by iterating over the elements of SILa and SILb, and returning a new SIL(ab) containing the transaction ids that are found in both SILs for the same sequence id (a minimal sketch follows below).
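A minimal sketch of the I-Step intersection described above, assuming each sparse id-list is stored as a dict mapping a sequence id to the sorted transaction ids containing the itemset (a concrete representation chosen only for illustration); the SILs below are the ones induced by items a and b in the example database of the previous slide.

```python
def i_step(sil_a, sil_b):
    """Itemset-extension: build SIL_(ab) keeping, for each sequence id,
    only the transaction ids present in both SIL_a and SIL_b."""
    sil_ab = {}
    for sid, ids_a in sil_a.items():
        if sid in sil_b:
            common = sorted(set(ids_a) & set(sil_b[sid]))
            if common:
                sil_ab[sid] = common
    return sil_ab

# SILs of items a and b in the example database (transaction ids are 1-based).
sil_a = {1: [1, 2, 3], 2: [1, 4], 3: [2], 4: [3]}
sil_b = {1: [2], 2: [3], 3: [2, 5], 4: [5]}
sil_ab = i_step(sil_a, sil_b)     # {1: [2], 3: [2]}
support_ab = len(sil_ab)          # sigma((ab)) = 2
```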
Vertical Id-List
Given a sequence α whose last itemset is i, its vertical id-list (VILα) is a vector of size n (with |D| = n) such that, for each j = 1, …, n:
– VILα[j] = the id of the transaction of Sj that matches i in the first occurrence of α, if Sj contains α;
– VILα[j] = null, otherwise.
It can be efficiently used to support sequence-extension
Vertical Id-List: Example
[Figure: the sparse id-lists SILa … SILf and the corresponding vertical id-lists VILa … VILf for the example sequence database]
Sequence_ID | Sequence
1 | a (abc) (ac) d (cf)
2 | (ad) c (bc) (ae)
3 | (ef) (ab) (df) c b
4 | e g (af) c b c
Algorithm
1. With a first database scan, CloFAST finds the frequent 1-length itemsets and builds their sparse id-lists.
2. It mines closed frequent itemsets using a CIET (to explore the search space) and itemset extension to verify the support of the discovered itemsets.
3. The closed frequent itemsets are used to fill the first level of a CSET.
4. It discovers closed sequential patterns by performing sequence extension on candidates generated from the CSET. For each candidate:
– its VIL and its support are computed;
– it is evaluated for sequence closure and pruning.
5. Finally, the list of closed sequential patterns is returned.
Backward Closure and Pruning
• It is inspired by the bidirectional checking of BIDE [1].
• The intuition is that it is useless to further explore a node if this node and its descendants can be absorbed by a node stored in another path of the tree.
• It is divided into:
– Forward Closure: performed by exploring the search space during the S-Step.
– Backward Closure: performed on the CSET by alternating itemset-closure and sequence-closure checking for nodes that are not closed forward.
• The pruning allows us to avoid exploring nodes that have the same search space (their VILs are used as markers).
[1] Jianyong Wang, Jiawei Han, Chun Li. "Frequent Closed Sequence Mining without Candidate Maintenance." IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 8, pp. 1042-1056, Aug. 2007
Backward Closure and Pruning: Example
• Example of itemset closure for the sequence α = <a → d>
• α is not explored further, since β (a node in another path of the tree) has the same VIL
Experiments
• Datasets: both synthetic and real
• Competitors: CloSpan, BIDE, ClaSP, FAST
• Evaluation:
– Efficiency (varying dataset density)
– Scalability (varying dataset size)
– Effectiveness of CloFAST’s optimization technique
[Figures: CloFAST experimental results varying dataset density and varying database size]
Results
• The performance study shows that CloFAST significantly outperforms the state-of-the-art algorithms;
• This is more evident in the case of:
– dense datasets;
– small values of the support threshold;
– datasets with frequent long sequences.
Automatic Generation of Sitemaps
Based on Navigation Systems
Sitemaps
• A sitemap represents an explicit specification of the design concept and knowledge organization of a website.
• It helps users and search bots by providing a hierarchical overview of a website’s content:
– to improve the user experience of the website;
– to provide a complementary tool to keyword-based search for the information retrieval process.
“one of the oldest hypertext usability principles is to offer a visual representation of the information space in order to help users understand where they can go”
Jakob Nielsen
State of the Art
• Before the Google Sitemap Protocol (2005), sitemaps were mostly manually generated
– As websites grew bigger, it was hard to keep sitemaps updated, and content was missed (e.g. blogs, forums)
– Most existing tools extract a flat list of URLs and do not output the hierarchical structure of websites [1]
• It is not a simple process, especially for websites with a great amount of content
[1] http://slickplan.com/, http://www.screamingfrog.co.uk, https://www.xml-sitemaps.com/
State of the Art
• The most prominent works are mainly based on the textual content of the web pages
• HDTM (Weninger et al., CIKM 2012):
1. Uses random walks with restart from the homepage to sample a distribution to assign to each page of the website.
2. An iterative algorithm based on Gibbs sampling discovers hierarchies. Final hierarchies are obtained by selecting, for each node, the most probable parent.
– Limitations: ineffective in at least two cases: 1) when there is not enough information in the text of a page; 2) when web pages have different content but actually refer to the same semantic class.
State of the Art
• Other works are based on the analysis of hyperlinks, URL structure and heuristics (or a combination of these features)
• Limitations:
– use of supervised solutions;
– assume websites are homogeneous;
– unable to recover the original hierarchical organization codified in embedded navigation systems.
Automatic Generation of Sitemaps
• In this work we present a solution to extract deeper sitemaps by using the website’s Navigation System:
– It provides a local view of the website hierarchy
• Idea: combine these local views to extract sitemaps that describe the global view of the website hierarchy
Proposed Solution
We develop an unsupervised and domain-independent
method which is able to discover the hierarchical
structure of a website using frequent navigation paths
through web lists
Sitemap Definition
Given:
• A web graph G(V, E) rooted at the homepage h ∈ V;
• A sequence database which enumerates a subset of all possible paths in G starting from h and having length t ∈ ℕ;
• A threshold k ∈ ℕ.
A sitemap is a tree T(V’, E’) ⊆ G such that:
• V’ ⊆ V, E’ ⊆ E
• ∀ (i, j) ∈ E’, j ∈ weblist(i)
• ∀ e = (i, j) ∈ E’, ∃ w: E → ℕ such that α = pathT(<h, ..., i, j>), w(e) = σ(α) and w(e) > k
Methodology
The algorithm employs a three-step strategy:
1. Sequence dataset generation (using Random Walks; a minimal sketch follows below);
2. Sequential pattern mining phase (CloFAST);
3. Post-pruning.
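A minimal sketch of step 1 under stated assumptions: the web graph is represented as an adjacency dict already restricted to links appearing in web lists (here called weblist_links), and the walk length, number of walks and toy URLs are illustrative values, not the parameters used in the thesis.

```python
import random

def random_walk_dataset(weblist_links, homepage, walk_length, n_walks, seed=0):
    """Generate a sequence database of random walks starting from the homepage.
    weblist_links: dict mapping a page URL to the URLs reachable via its web lists."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        walk = [homepage]
        for _ in range(walk_length - 1):
            neighbors = weblist_links.get(walk[-1], [])
            if not neighbors:          # dead end: stop this walk early
                break
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks

# Hypothetical toy graph restricted to navigation-system (web list) links.
weblist_links = {
    "/": ["/people", "/courses", "/news"],
    "/people": ["/people/prof-a", "/people/prof-b"],
    "/courses": ["/courses/ml", "/courses/db"],
}
D = random_walk_dataset(weblist_links, "/", walk_length=4, n_walks=1000)
# D is then mined with CloFAST; frequent paths with support > k become sitemap edges.
```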
Methodology
Post-Pruning:
[Figure: example of post-pruning on a sitemap tree whose edges are weighted by path support (weights such as 50, 30, 20, 10 and 5)]
Experiments
• Datasets: 3 Computer Science websites
(cs.illinois.edu, cs.princeton.edu, cs.ox.ac.uk), 1
organization website (www.enel.it)
websites       | #pages in G | #edges in G | #levels | #pages L1 | #pages L2 | #pages L3 | #pages L4
cs.illinois    | 951         | 18615       | 4       | 12        | 42        | 41        | 40
cs.ox.ac.uk    | 3335        | 26860       | 2       | 8         | 33        | -         | -
cs.princeton   | 2298        | 72598       | 3       | 6         | 38        | 23        | -
enel.it/it-it  | 124         | 3476        | 3       | 7         | 18        | 46        | -
Experiments
• Competitor: HDTM
• Evaluation: Precision, Recall, F-Measure
• Parameters: length of random walks, minimum
support threshold, sequence db size
Results
[Figure: statistical comparison of the models using the Friedman test with Nemenyi correction]
Results
• Models based on the sequential pattern mining approach outperform the competitor HDTM
• Comparisons are based on web page sitemaps which are poor in content; this implies that the generated models have high precision but low recall.
Exploiting Websites Structural and Content
Features for Web Pages Clustering
Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Exploiting Web Sites
Structural and Content Features for Web Pages Clustering. ISMIS 2017.
Introduction
• The process of automatically organizing web pages and websites has always been an important task in Web Mining
• Since a web page is characterized by several representations, existing clustering algorithms differ in their ability to use these representations.
• Most existing works analyze these features almost independently, mainly because different sources of information use different data representations.
Related Works
• Clustering based on textual representation:
– Considers web pages as plain texts
– Turns out to be ineffective when there is not enough information in the text or when pages have different content but actually refer to the same semantic class.
Related Works
• Clustering based on HTML structure representation:
– typically consider the HTML formatting (i.e. HTML tags
and visual information rendered by a web browser)
– Idea: similar web pages are generated using a common
HTML template
Related Works
• Clustering based on the hyperlink structure:
– Web pages are nodes in a graph where hyperlinks enable
the information to be split in multiple and
interdependent nodes.
– Clustering is based on the analysis of the topological
structure of a network
Goal
Performing clustering of web pages in a website by combining
information about content, web page structure and hyperlink
structure of web pages.
Idea: two web pages are similar if they have common terms (i.e.
Bag of words hypothesis) and they share the same reachability
properties in the website’s graph (i.e. Distributional hypothesis).
Methodology
The proposed solution implements a four-step strategy:
1. Website crawling;
2. Link vectors generation;
3. Content vectors generation;
4. Content-Link coupled Clustering
Methodology
1. Website crawling:
Crawling uses web pages’ structural information and exploits web lists in order to:
• Mitigate problems coming from noisy links which may not be relevant to the clustering process (advertisement links, shortcuts, etc.)
• Include only links belonging to the navigational system
Methodology
2. Link vectors generation
- Starting from the homepage, we apply random walk theory to extract a sequence database of random walks
- Inspired by the field of IR, we consider each random walk a sentence and each web page a word
Methodology
2. Link vectors generation
- Starting from the homepage, we apply random walk theory to extract a sequence database of random walks
- Inspired by the field of IR, we consider each random walk a sentence and each web page a word
- We can apply distributional-based algorithms (e.g. Word2Vec) to extract a vector space representation for each word (e.g. [0.4, 1.2, 3.1, ...], [1.3, 0.4, 4.9, ...])
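As an illustration, the random walks can then be fed to an off-the-shelf distributional model. The sketch below uses gensim's Word2Vec (assuming gensim 4.x), treating every walk as a sentence and every page URL as a word; the toy walks and all hyperparameter values are assumptions for the example, not the settings used in the thesis.

```python
from gensim.models import Word2Vec

# Each walk is a list of page URLs, e.g. produced by the random-walk step above.
walks = [["/", "/people", "/people/prof-a"],
         ["/", "/courses", "/courses/ml"],
         ["/", "/people", "/people/prof-b"]]

# Skip-gram model: pages that occur in similar walk contexts get similar vectors.
model = Word2Vec(sentences=walks, vector_size=64, window=3,
                 min_count=1, sg=1, epochs=50, seed=0)

link_vector = model.wv["/people"]      # distributional (link) vector of a page
print(link_vector.shape)               # (64,)
```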
Methodology
3. Content vectors generation
- We consider web pages as plain texts (i.e. the bag-of-words hypothesis)
- We apply the traditional TF-IDF weighting schema to obtain a content-vector representation
4. Content-Link coupled Clustering
- Normalize both content and link vectors by their Euclidean norm
- Concatenate the content and link vectors of each web page
- The generated vectors can be used in traditional clustering algorithms based on the vector space model (a minimal sketch follows below)
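A minimal sketch of steps 3 and 4 under stated assumptions: scikit-learn is used for TF-IDF and k-means, link_vectors stands in for the Word2Vec page vectors of the previous step, and the toy texts and the number of clusters are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

texts = ["professor home page research teaching",
         "course page machine learning syllabus",
         "news article department event"]            # one text per crawled page
link_vectors = np.random.RandomState(0).rand(3, 64)  # stand-in for Word2Vec vectors

# 3. Content vectors: TF-IDF over the page texts (bag-of-words hypothesis).
content = TfidfVectorizer().fit_transform(texts).toarray()

# 4. Content-Link coupled clustering: L2-normalize each view, concatenate, cluster.
coupled = np.hstack([normalize(content), normalize(link_vectors)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coupled)
print(labels)
```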
Experiments
• Datasets: 4 Computer Science websites (cs.illinois.edu, cs.princeton.edu, cs.ox.ac.uk, cs.stanford.edu);
• Measures used: homogeneity, completeness, V-Measure, AMI, ARI, silhouette
Experiments
• Research questions:
1. What is the real contribution of combining content and hyperlink structure in a single vector space representation?
2. What is the role of using web lists to reduce noise and improve clustering results?
Results
[Figures: clustering results for the Illinois and Princeton websites]
Results
[Figures: clustering results for the Oxford and Stanford websites]
Results
• The best results are obtained by combining textual information with the hyperlink structure.
• The results do not show a statistically significant contribution of web lists for clustering purposes.
Publications
• Pasqua Fabiana Lanotte, F. Fumarola, M. Ceci, D. Malerba, Automatic Extraction of
Logical Web Lists. ISMIS 2014: 365-374
• G. Pio, Pasqua Fabiana Lanotte, M. Ceci, D. Malerba. Mining Temporal Evolution of
Entities in a Stream of Textual Documents. ISMIS 2014: 50-60
• M. Ceci, Pasqua Fabiana Lanotte, F. Fumarola, D. Cavallo, D. Malerba. Completion Time
and Next Activity Prediction of Processes Using Sequential Pattern Mining. Discovery
Science 2014: 49-61
• Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Automatic Generation of
Sitemaps Based on Navigation Systems. MOD 2016: 216-223
• F. Fumarola, Pasqua Fabiana Lanotte, M. Ceci, D. Malerba. CloFAST: closed sequential
pattern mining using sparse and vertical id-lists. Knowl. Inf. Syst.48(2): 429-463 (2016)
• Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Exploiting Web Sites
Structural and Content Features for Web Pages Clustering. ISMIS 2017.
Thanks for your attention
Sequence-extension using VILs
Given two sibling nodes α, β in the CSET,
– the sequence extension (S-Step) of α using β aims at generating a new sequence γ;
– γ is obtained by appending to α the last itemset in β.
Example:
‹e → a› -- S-Step -- ‹e → d› = ‹e → a → d›
• The S-Step is achieved by:
1. verifying, for each position j of VILα and VILβ, the condition that the value of VILα[j] is lower than VILβ[j];
2. and/or using a shift-right function to move to the next transaction id of the last itemset in β that has a value greater than the transaction id of the last itemset in α (a minimal sketch follows below).
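As an illustration only (not the CloFAST implementation itself), the sketch below assumes each VIL is a dict mapping a sequence id to the transaction id at which the sequence's last itemset is matched (absent if the sequence is not contained), each SIL maps a sequence id to the sorted transaction ids of an itemset, and the shift-right step is a simple linear scan; the example values are consistent with the example sequence database shown earlier.

```python
def s_step(vil_alpha, sil_last_beta, n_sequences):
    """Sequence-extension of alpha with the last itemset of beta.
    Returns VIL_gamma: for each sequence, the first transaction id of beta's last
    itemset that comes strictly after the position where alpha ends."""
    vil_gamma = {}
    for sid in range(1, n_sequences + 1):
        pos_alpha = vil_alpha.get(sid)
        if pos_alpha is None:
            continue
        # shift-right: next occurrence of beta's last itemset after pos_alpha
        for tid in sil_last_beta.get(sid, []):
            if tid > pos_alpha:
                vil_gamma[sid] = tid
                break
    return vil_gamma

# <e -> a> extended with the last itemset of <e -> d>, on the example database.
vil_e_a = {3: 2, 4: 3}            # VIL of <e -> a>: where 'a' is first matched
sil_d = {1: [4], 2: [1], 3: [3]}  # SIL of item d
vil_gamma = s_step(vil_e_a, sil_d, n_sequences=4)   # {3: 3}
support_gamma = len(vil_gamma)                       # sigma(<e -> a -> d>) = 1
```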
[Figure: example of an S-Step on the sequence b → a, combining VILa and VILb and applying the shift-right function on SILa]
More Related Content

What's hot

WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedWWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedStefan Dietze
 
ESWC SS 2013 - Tuesday Tutorial 1 Maribel Acosta and Barry Norton: Providing ...
ESWC SS 2013 - Tuesday Tutorial 1 Maribel Acosta and Barry Norton: Providing ...ESWC SS 2013 - Tuesday Tutorial 1 Maribel Acosta and Barry Norton: Providing ...
ESWC SS 2013 - Tuesday Tutorial 1 Maribel Acosta and Barry Norton: Providing ...eswcsummerschool
 
The SPARQL Anything project
The SPARQL Anything projectThe SPARQL Anything project
The SPARQL Anything projectEnrico Daga
 
Standardizing for Open Data
Standardizing for Open DataStandardizing for Open Data
Standardizing for Open DataIvan Herman
 
An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...IJCSIS Research Publications
 
Querying Linked Data on Android
Querying Linked Data on AndroidQuerying Linked Data on Android
Querying Linked Data on AndroidEUCLID project
 
Linked Data for Czech Legislation
Linked Data for Czech LegislationLinked Data for Czech Legislation
Linked Data for Czech LegislationMartin Necasky
 
CS6010 Social Network Analysis Unit II
CS6010 Social Network Analysis   Unit IICS6010 Social Network Analysis   Unit II
CS6010 Social Network Analysis Unit IIpkaviya
 
DBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, DublinDBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, Dublinm_ackermann
 
Introduction to RDF & SPARQL
Introduction to RDF & SPARQLIntroduction to RDF & SPARQL
Introduction to RDF & SPARQLOpen Data Support
 
Semantic Data Architecture and Ontological Infrastructure (ASIO)
Semantic Data Architecture and Ontological Infrastructure (ASIO)Semantic Data Architecture and Ontological Infrastructure (ASIO)
Semantic Data Architecture and Ontological Infrastructure (ASIO)AdrinSaavedraSerrano
 
From the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking upFrom the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking upDavide Palmisano
 
Registry Technical Training
Registry Technical TrainingRegistry Technical Training
Registry Technical TrainingDave Reynolds
 

What's hot (17)

WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedWWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
 
ESWC SS 2013 - Tuesday Tutorial 1 Maribel Acosta and Barry Norton: Providing ...
ESWC SS 2013 - Tuesday Tutorial 1 Maribel Acosta and Barry Norton: Providing ...ESWC SS 2013 - Tuesday Tutorial 1 Maribel Acosta and Barry Norton: Providing ...
ESWC SS 2013 - Tuesday Tutorial 1 Maribel Acosta and Barry Norton: Providing ...
 
The SPARQL Anything project
The SPARQL Anything projectThe SPARQL Anything project
The SPARQL Anything project
 
Standardizing for Open Data
Standardizing for Open DataStandardizing for Open Data
Standardizing for Open Data
 
An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...
 
Querying Linked Data on Android
Querying Linked Data on AndroidQuerying Linked Data on Android
Querying Linked Data on Android
 
Linked Data for Czech Legislation
Linked Data for Czech LegislationLinked Data for Czech Legislation
Linked Data for Czech Legislation
 
CS6010 Social Network Analysis Unit II
CS6010 Social Network Analysis   Unit IICS6010 Social Network Analysis   Unit II
CS6010 Social Network Analysis Unit II
 
DBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, DublinDBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, Dublin
 
Introduction to RDF & SPARQL
Introduction to RDF & SPARQLIntroduction to RDF & SPARQL
Introduction to RDF & SPARQL
 
PhD Defense
PhD DefensePhD Defense
PhD Defense
 
Semantic Data Architecture and Ontological Infrastructure (ASIO)
Semantic Data Architecture and Ontological Infrastructure (ASIO)Semantic Data Architecture and Ontological Infrastructure (ASIO)
Semantic Data Architecture and Ontological Infrastructure (ASIO)
 
Mods0210
Mods0210Mods0210
Mods0210
 
From the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking upFrom the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking up
 
Registry Technical Training
Registry Technical TrainingRegistry Technical Training
Registry Technical Training
 
Metadata is back!
Metadata is back!Metadata is back!
Metadata is back!
 
Web Spa
Web SpaWeb Spa
Web Spa
 

Similar to Phd presentation

Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation FinalEr. Jagrat Gupta
 
A review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic WebA review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic WebSimon Price
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Anja Jentzsch
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...IOSR Journals
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsJiaheng Lu
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
 
Web Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research DomainWeb Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research Domainliat_kakun
 
On building a search interface discovery system
On building a search interface discovery systemOn building a search interface discovery system
On building a search interface discovery systemDenis Shestakov
 
The Anatomy of a Large-Scale Hypertextual Web Search Engine
The Anatomy of a Large-Scale Hypertextual Web Search EngineThe Anatomy of a Large-Scale Hypertextual Web Search Engine
The Anatomy of a Large-Scale Hypertextual Web Search EngineMehul Boricha
 
A Study On Web Structure Mining
A Study On Web Structure MiningA Study On Web Structure Mining
A Study On Web Structure MiningNicole Heredia
 
Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Enrico Daga
 
A Study on Web Structure Mining
A Study on Web Structure MiningA Study on Web Structure Mining
A Study on Web Structure MiningIRJET Journal
 
"Data mining и информационный поиск проблемы, алгоритмы, решения"_Краковецкий...
"Data mining и информационный поиск проблемы, алгоритмы, решения"_Краковецкий..."Data mining и информационный поиск проблемы, алгоритмы, решения"_Краковецкий...
"Data mining и информационный поиск проблемы, алгоритмы, решения"_Краковецкий...GeeksLab Odessa
 
A Schema-Based Approach To Modeling And Querying WWW Data
A Schema-Based Approach To Modeling And Querying WWW DataA Schema-Based Approach To Modeling And Querying WWW Data
A Schema-Based Approach To Modeling And Querying WWW DataLisa Garcia
 
Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Futurefeiwin
 

Similar to Phd presentation (20)

H017554148
H017554148H017554148
H017554148
 
Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation Final
 
A review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic WebA review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic Web
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
 
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing Paradigms
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Web mining
Web miningWeb mining
Web mining
 
Web Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research DomainWeb Information Extraction for the DB Research Domain
Web Information Extraction for the DB Research Domain
 
On building a search interface discovery system
On building a search interface discovery systemOn building a search interface discovery system
On building a search interface discovery system
 
The Anatomy of a Large-Scale Hypertextual Web Search Engine
The Anatomy of a Large-Scale Hypertextual Web Search EngineThe Anatomy of a Large-Scale Hypertextual Web Search Engine
The Anatomy of a Large-Scale Hypertextual Web Search Engine
 
A Study On Web Structure Mining
A Study On Web Structure MiningA Study On Web Structure Mining
A Study On Web Structure Mining
 
Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.Data integration with a façade. The case of knowledge graph construction.
Data integration with a façade. The case of knowledge graph construction.
 
A Study on Web Structure Mining
A Study on Web Structure MiningA Study on Web Structure Mining
A Study on Web Structure Mining
 
A1303060109
A1303060109A1303060109
A1303060109
 
A1303060109
A1303060109A1303060109
A1303060109
 
"Data mining и информационный поиск проблемы, алгоритмы, решения"_Краковецкий...
"Data mining и информационный поиск проблемы, алгоритмы, решения"_Краковецкий..."Data mining и информационный поиск проблемы, алгоритмы, решения"_Краковецкий...
"Data mining и информационный поиск проблемы, алгоритмы, решения"_Краковецкий...
 
A Schema-Based Approach To Modeling And Querying WWW Data
A Schema-Based Approach To Modeling And Querying WWW DataA Schema-Based Approach To Modeling And Querying WWW Data
A Schema-Based Approach To Modeling And Querying WWW Data
 
Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Future
 

Recently uploaded

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 

Recently uploaded (20)

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 

Phd presentation

  • 1. UNIBA: http://www.uniba.it DIB: http://www.di.uniba.it KDDE: http://kdde.di.uniba.it Discovering Structured Information from Web Sites PhD Candidate Fabiana Lanotte, Supervisor Michelangelo Ceci
  • 2. Web. . . From homogeneous information network ● Homogeneous nodes (i.e. hypertextual documents) and relations (i.e. hyperlinks) ● Nodes content is encoded in HTML ● Nodes content is unstructured
  • 3. Web. . . From homogeneous information network HTML is a markup language: ● A description for a browser to render ● describes how the data should be displayed ● It never meant to describe the data.
  • 4. Web. . . To heterogeneous information network However the Web is full of data: ● Structured in some way
  • 5. Web. . . To heterogeneous information network However the Web is full of data: ● Structured in some way ● Represent different kinds of real world entities
  • 6. Web. . . To heterogeneous information network However the Web is full of data: ● Structured in some way ● Represent different kinds of real world entities ● Interact via various kinds of relationships. staff professor course news professor papers Provided course
  • 7. What we could do? Search ● Show structured information in response to query ● Automatically rank and cluster web pages ● Reasoning on the Web ○ Who are the people at some company? What are the courses in some college department? Analysis ● Expand the known information of an entity ○ What is a professor’s phone number, email, courses taught, research, etc?
  • 8. Contributions of this thesis 1. Extract structured data in form of web lists splitted on multiple web pages: Logical list
  • 9. Contributions of this thesis 1. Extract structured data in form of web lists splitted on multiple web pages: Logical list 2. Automatic extraction of sitemaps
  • 10. Contributions of this thesis 1. Extract structured data in form of web lists splitted on multiple web pages: Logical list 2. Automatic extraction of sitemaps 3. Clustering of web pages based on their intra and extra page features
  • 11. Automatic Extraction of Logical Web Lists Pasqua Fabiana Lanotte, F. Fumarola, M. Ceci, D. Malerba, Automatic Extraction of Logical Web Lists. ISMIS 2014: 365-374
  • 12. Introduction A large amount of structured data on the Web exists in several forms: – HTML lists, tables, and back-end Deep Web databases
  • 13. Table tags, list tags and much more ● Cafarella et al. [1] estimated that there are more than one billion of relational data, expressed using the HTML table tag; ● Elmeleegy et al.[2] suggested an equal number from HTML lists; … but many structured data are not represented with table or list tags (e.g. BBC news). [1] M.J.Cafarella, A. Halevy, and J. Madhavan. Structured data on the web. Commun.ACM, Feb.2011 [2] H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. VLDB Journal, Apr.2011
  • 14. Web Pages as Database views • Moreover, many websites (e.g. Amazon, Trulia, AbeBooks, etc.), present their listings in multiple web pages; • Similar to a database: – The whole listing can represent a Database table (i.e. Logical list); – Each list contained in a Web Page can represent a view of a logical list.
  • 15. Web Pages as Database views • Logical list: listing of all computer products; • View list: top six products; • There are web lists that – allow us to extend a view list, – filter a logical list.
  • 16. Goal Our goal is to define a novel unsupervised and domain independent approach that is able to: • Identify web lists from a Web Page, that are not necessarily represented as HTML lists and tables; • Extract logical lists. • Can be used in several domain application ( Entity Pages discovery, query answering, etc.)
  • 17. Related Works • Existing approaches for web lists extraction can be classified in: – Structural methods: focus on the automatically extraction rules, using common DOM-structures. They often fail to handle more complicated or noisy structures (e.g. Road runner); – Visual methods: use visual information from rendered Web Pages (e.g. ViDE); – Hybrid methods: combine structural and visual features (e.g. Hylien, Ventex).
  • 18. Related Works • Existing approaches focus only on the extraction of Web lists in a single Web Page, and fail to detect logical lists. • Furche et al. presented a supervised method, based on structural and visual features, to extract pagination lists from Web Pages: – The accuracy of the model depends strongly on the size of the training set and on the complexity of feature model used. – It is not applicable at a Web scale.
  • 19. Definitions A web page is characterized by multiple representations: • Textual: composed by web page terms; • Structural: composed by HTML tags; • Visual: composed by rendering information;
  • 20. Definitions A web page is characterized by multiple representations: • Textual: composed by web page terms; • Visual: composed by the Rendered Box Tree; HTML tag, (x, y, h, w)
  • 21. Definitions A web page is characterized by multiple representations: • Textual: composed by web page terms; • Structural: composed by HTML tags • Visual: composed by rendering information;
  • 22. Definitions A web page is characterized by multiple representations: • Textual: composed by web page terms; • Structural: composed by HTML tags • Visual: composed by rendering information; •
  • 23. Definitions Web List: • Collection of two or more web elements (called data record) codified as rendered boxes having a similar HTML structure and visually adjacent and aligned.
  • 24. Definitions Logical List: List whose data records are distributed on more than one web page.
  • 25. Methodology Three-steps strategy 1. Web Lists extraction: Given a web page P extract the set L = {l1 , l2 ,...ln } of web lists contained in P; 2. Dominant list identification – Dominant list: list of interest, containing the data records of the logical list we want extract; 3. Logical List Discovery;
  • 26. Methodology 2. Dominant list identification The set L = {l1 , l2 ,...ln } is used to extract the dominant list li
  • 27. Methodology 3. Logical List discovery • The dominant list li is used to discovery the logical list LL containing li • Idea: links that are grouped in collection with an uniform layout and presentation usually lead to similar pages; • Two lists belong to a same Logical List LL if their elements satisfy a structural similarity (e.g. Levenshtein distance, normalized tree edit distance) and visual similarity.
  • 28. Experiments • Dataset: 66061 list elements manually extracted from 4405 Web pages belonging to 40 different websites (music shops, web journals, movies information, home listings, computer accessories, etc.); • For the deep-web databases, we performed a query for each of them and collected the first page of the results list, and for others we manually select a Web page; • Used measure: Precision, Recall, F-measure.
  • 29. Results – In average, it achieves 100% Precision, 95% Recall and 97% F-Measure; – There are no False positive; – The presence of false negative is caused by high variance of Data Record;
  • 30. Conclusions and Future Works • Our method solves the problem of – discovering data records organized in web lists, – extracting logical lists • The experimental results show that our algorithm is able to discover logical lists in a heterogeneous set of Web sites (i.e. domain independent); • The logical list extraction can be used to improve the extraction of Entity Pages (e.g. Web pages of books, professors, etc.), improve query answering using lists and disambiguation.
  • 31. Automatic Generation of Sitemaps Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Automatic Generation of Sitemaps Based on Navigation Systems. MOD 2016: 216-223 Pasqua Fabiana Lanotte, M. Ceci. Closed Sequential Pattern Mining for Sitemap Generation. Ready to be submitted at TKDE
  • 32. Automatic generation of Sitemaps • CloFAST: Closed Sequential Pattern Mining using Sparse and Vertical Id-Lists • Generation of sitemap through navigation systems
  • 33. CloFAST: Closed Sequential Pattern Mining using Sparse and Vertical Id-Lists F. Fumarola, Pasqua Fabiana Lanotte, M. Ceci, D. Malerba. CloFAST: closed sequential pattern mining using sparse and vertical id-lists. Knowl. Inf. Syst.48(2): 429-463 (2016)
  • 34. Sequential Pattern Mining • Sequential pattern mining provides approaches and techniques to mine knowledge from sequence data. • Given a sequence database and a support threshold, find the complete list of frequent subsequences. • Example: customer who buy a Canon camera is likely to buy an HP color printer within a month. Given the support threshold minSupp=2 <(ab)c> is a frequent sequential pattern
  • 35. State of the art: limitations • There are different types of patterns: frequent, closed, maximal, with constraints, etc. • However, the main limits of state-of-the-art algorithms are: 1. The need for multiple scans of the dataset; 2. The generation of huge sets of candidate sequences; 3. Inefficiency in handling very long sequences; 4. (All resulting in) inefficiency in handling massive data. • Our solution is the CloFAST algorithm for closed sequential pattern mining.
  • 36. Problem Definition Let I = {i1, i2, …, in} be a set of distinct items: • An itemset is a non-empty subset of items, denoted (i1, i2, …, ik). • A sequence is an ordered list of elements, denoted ‹s1 → s2 → … → sm›, where each si can be an itemset or a single item. • A sequence α = ‹a1 → a2 → … → an› is called a sub-sequence of another sequence β = ‹b1 → b2 → … → bm›, and β a super-sequence of α, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn. Example: • the sequence A → BC is a sub-sequence of the sequence AB → E → BCD; • AC → AD is not a sub-sequence of A → C → AD.
  • 37. Problem Definition • A sequence database D is a set of tuples (sid, S), where sid is a sequence id and S is a sequence. • The support of a sequence α in D, denoted σ(α), is the number of tuples containing α. • Given a user-defined threshold δ, a sequence α is said to be frequent in D if σ(α) ≥ δ. • Given two sequences β and α, if β is a super-sequence of α and they have the same support, we say that β absorbs α; a sequence is closed if it is not absorbed by any of its super-sequences.
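A small sketch that makes the definitions above concrete (sub-sequence containment and support), modelling itemsets as Python sets; the tiny database at the end is only illustrative.

```python
def is_subsequence(alpha, beta):
    """True if sequence alpha (a list of itemsets) is a sub-sequence of beta."""
    j = 0
    for itemset in alpha:
        while j < len(beta) and not itemset <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

def support(alpha, db):
    """Number of database sequences containing alpha."""
    return sum(is_subsequence(alpha, s) for s in db)

# The two examples from the sub-sequence definition:
print(is_subsequence([{'A'}, {'B', 'C'}], [{'A', 'B'}, {'E'}, {'B', 'C', 'D'}]))  # True
print(is_subsequence([{'A', 'C'}, {'A', 'D'}], [{'A'}, {'C'}, {'A', 'D'}]))       # False

db = [[{'A', 'B'}, {'E'}, {'B', 'C', 'D'}],
      [{'A'}, {'C'}, {'A', 'D'}]]
print(support([{'A'}, {'D'}], db))   # 2, so <A -> D> is frequent for any delta <= 2
```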
  • 38. CloFAST • CloFAST combines: – Two data structures to support the search-space exploration for closed itemsets and closed sequences; – A new representation of the sequence database based on sparse id-lists and vertical id-lists; – A novel one-step technique to both check backward closure and prune the search space. • CloFAST works in two steps: – Closed itemset mining using sparse id-lists; – Closed sequence discovery using vertical id-lists.
  • 39. Closed Itemset Enumeration Tree (CIET) • It enumerates the complete set of closed frequent itemsets: – The root node is the empty set; – The first level enumerates the 1-length itemsets according to a lexicographic order; – The other levels store the itemsets of length > 1. • Each node in the CIET can be labeled as intermediate, unpromising or closed.
  • 40. CIET: example • Nodes with heavy-line borders represent closed itemsets. • Nodes with dashed-line borders represent unpromising nodes. • Remaining nodes represent intermediate nodes.
  • 41. Closed Sequence Enumeration Tree (CSET) • It enumerates the complete search space of closed sequences. • It has the following properties: 1. Each node corresponds to a sequence 2. If a node corresponds to a sequence s, its children are obtained by a sequence extension of s
  • 42. CSET: Example • Nodes with continuous-line borders represent closed sequences. • Nodes with dashed-line borders represent pruned nodes.
  • 43. Sparse Id-List Given an itemset t, its sparse id-list (SILt) is a vector of size n (n = |D|) such that, for each j = 1, …, n, SILt[j] contains the ids of the transactions of Sj that contain t if Sj contains t, and is null otherwise. • It can be efficiently used to support itemset extension.
  • 44. Sparse Id-List: Example Example sequence database (transaction ids 1–6):
  Sequence 1: a, (abc), (ac), d, (cf)
  Sequence 2: (ad), c, (bc), (ae)
  Sequence 3: (ef), (ab), (df), c, b
  Sequence 4: e, g, (af), c, b, c
  (The slide figure shows the resulting sparse id-lists SILa, SILb, SILc, SILd, SILe, SILf.)
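A minimal sketch of how sparse id-lists for single items could be built from the example database above; the null entries of the definition are rendered here as None.

```python
db = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],   # sequence 1
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],                # sequence 2
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],         # sequence 3
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],            # sequence 4
]

def sparse_id_list(item, db):
    """SIL of an item: per database sequence, the (1-based) transaction ids
    that contain the item, or None when the sequence does not contain it."""
    sil = []
    for seq in db:
        ids = [t for t, itemset in enumerate(seq, start=1) if item in itemset]
        sil.append(ids if ids else None)
    return sil

print(sparse_id_list('a', db))   # [[1, 2, 3], [1, 4], [2], [3]]
print(sparse_id_list('b', db))   # [[2], [3], [2, 5], [5]]
```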
  • 45. Itemset-extension using SILs • Itemset extension (I-Step): an item is added to the last transaction of a sequence. • In CloFAST, itemset extensions are executed during the construction of the CIET. • The I-Step is achieved by iterating over the elements of SILa and SILb and returning a new SIL(a,b) containing, for each sequence id, only the transaction ids found in both SILs.
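Continuing the sketch, the I-Step can be illustrated as a per-sequence intersection of two sparse id-lists; the SIL literals below are those of items a and b in the example database.

```python
def i_step(sil_a, sil_b):
    """SIL of the itemset extension (a, b): for each database sequence, keep
    only the transaction ids present in both input SILs."""
    result = []
    for ids_a, ids_b in zip(sil_a, sil_b):
        if ids_a is None or ids_b is None:
            result.append(None)
            continue
        common = sorted(set(ids_a) & set(ids_b))
        result.append(common if common else None)
    return result

# SILs of items a and b in the example database of the previous slide:
sil_a = [[1, 2, 3], [1, 4], [2], [3]]
sil_b = [[2], [3], [2, 5], [5]]
print(i_step(sil_a, sil_b))   # [[2], None, [2], None] -> (ab) has support 2
```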
  • 46. Vertical Id-List Given a sequence α, whose last itemset is i, its vertical id-list (VILα) is a vector of size n (n = |D|) such that, for each j = 1, …, n, VILα[j] stores the transaction id at which the first occurrence of α in Sj ends (i.e., the position of i) if Sj contains α, and is null otherwise. It can be efficiently used to support sequence extension.
  • 47. Vertical Id-List: Example (The slide figure shows, for the example sequence database above, the sparse id-lists SILa–SILf and the vertical id-lists VILa–VILf derived from them.)
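For a 1-length sequence ‹t›, the VIL can be read off the SIL of t as the first transaction id per database sequence; a tiny sketch under the assumption stated in the definition above (the VIL records the earliest ending position).

```python
def vil_of_item(sil):
    """VIL of the 1-length sequence <t> read from SIL_t: per database
    sequence, the transaction id of the first occurrence of t (None if absent)."""
    return [ids[0] if ids else None for ids in sil]

sil_a = [[1, 2, 3], [1, 4], [2], [3]]   # SIL of item a in the example database
print(vil_of_item(sil_a))               # [1, 1, 2, 3]
```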
  • 48. Algorithm 1. With a first database scan, CloFAST finds the frequent 1-length itemsets and builds their sparse id-lists. 2. It mines closed frequent itemsets using a CIET (to explore the search space) and itemset extension to verify the support of the discovered itemsets. 3. The closed frequent itemsets are used to fill the first level of a CSET. 4. It discovers closed sequential patterns by applying sequence extension to the candidates generated from the CSET. For each candidate: – its VIL and its support are computed; – it is checked for sequence closure and pruning. 5. Finally, the list of closed sequential patterns is returned.
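To make the input/output contract of the algorithm concrete, here is a deliberately naive brute-force reference, not CloFAST itself: no id-lists, no enumeration trees, single-item sequences only and a bounded length. It enumerates frequent sequences over the example database and keeps only the closed ones.

```python
def is_subsequence(alpha, beta):
    j = 0
    for itemset in alpha:
        while j < len(beta) and not itemset <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

def support(alpha, db):
    return sum(is_subsequence(alpha, s) for s in db)

def closed_patterns_bruteforce(db, min_sup, max_len=3):
    """Frequent single-item sequences up to max_len, filtered to the closed
    ones (not absorbed by an equally frequent super-sequence). Exponential:
    a reference for tiny examples only, not how CloFAST works internally."""
    items = sorted({i for seq in db for itemset in seq for i in itemset})
    frequent, frontier = {}, [()]
    for _ in range(max_len):
        nxt = []
        for prefix in frontier:
            for it in items:
                cand = prefix + (frozenset([it]),)
                sup = support(list(cand), db)
                if sup >= min_sup:
                    frequent[cand] = sup
                    nxt.append(cand)
        frontier = nxt
    return {a: s for a, s in frequent.items()
            if not any(a != b and s == t and is_subsequence(list(a), list(b))
                       for b, t in frequent.items())}

db = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],
]
for pattern, sup in closed_patterns_bruteforce(db, min_sup=3).items():
    print([sorted(x) for x in pattern], sup)
```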
  • 49. Backward Closure and Pruning • It is inspired by the bidirectional checking of BIDE1. • The intuition is that it is useless to further explore a node if the node and its descendants can be absorbed by a node stored in another path of the tree. • It is divided into: – Forward closure: performed by exploring the search space during the S-Step; – Backward closure: performed on the CSET by alternating itemset-closure and sequence-closure checking for nodes that are not closed in the forward direction. • The pruning allows us to avoid exploring nodes that span the same search space (their VILs are used as markers). 1. Jianyong Wang, Jiawei Han, Chun Li, "Frequent Closed Sequence Mining without Candidate Maintenance," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 8, pp. 1042-1056, Aug. 2007.
  • 50. Backward Closure and Pruning: Example • Example of itemset closure for the sequence α = <a → d>: • α is not explored since β has the same VIL.
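An illustrative simplification of the VIL-based pruning idea (the labels and VIL tuples below are made up for illustration, not taken from the running example): a candidate whose VIL coincides with that of an already-explored node spans the same search space and can be skipped.

```python
def prune_by_vil(candidates):
    """candidates: list of (label, vil) pairs, vil given as a tuple
    (None = sequence absent). Keeps one representative per distinct VIL."""
    seen, kept = set(), []
    for label, vil in candidates:
        if vil in seen:
            continue               # pruned: same search space as an explored node
        seen.add(vil)
        kept.append(label)
    return kept

print(prune_by_vil([("alpha", (4, None, 3, None)),
                    ("beta",  (4, None, 3, None)),   # pruned
                    ("gamma", (2, 1, None, 3))]))
# ['alpha', 'gamma']
```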
  • 51. Experiments • Datasets: both synthetic and real; • Competitors: CloSpan, BIDE, ClaSP, FAST; • Evaluation: – Efficiency (varying dataset density); – Scalability (varying dataset size); – Effectiveness of CloFAST's optimization technique.
  • 54. Results • The performance study shows that CloFAST significantly outperforms the state-of-the-art algorithms; • This is more evident for: – dense datasets; – small values of the support threshold; – datasets with long frequent sequences.
  • 55. Automatic Generation of Sitemaps Based on Navigation Systems
  • 56. Sitemaps • A sitemap represents an explicit specification of the design concept and knowledge organization of a website. • It helps users and search bots by providing a hierarchical overview of the website's content: – it increases the user experience of the website; – it provides a complementary tool to keyword-based search for the information retrieval process. "one of the oldest hypertext usability principles is to offer a visual representation of the information space in order to help users understand where they can go" – Jakob Nielsen
  • 57. State of the Art • Before the Google Sitemap Protocol (2005), sitemaps were mostly generated manually: – as websites grew bigger, it became hard to keep sitemaps updated without missing content (e.g., blogs, forums); – most existing tools1 extract a flat list of URLs and do not output the hierarchical structure of the website. • It is not a simple process, especially for websites with a large amount of content. 1. http://slickplan.com/, http://www.screamingfrog.co.uk, https://www.xml-sitemaps.com/
  • 58. State of the Art • The most prominent works are mainly based on the textual content of web pages. • HDTM (Weninger et al., CIKM 2012): 1. uses random walks with restart from the homepage to sample a distribution to assign to each website page; 2. an iterative algorithm based on Gibbs sampling discovers hierarchies; the final hierarchy is obtained by selecting, for each node, the most probable parent. – Limitations: ineffective in at least two cases: 1) when there is not enough information in the text of a page; 2) when web pages have different content but actually refer to the same semantic class.
  • 59. State of the Art • Other works are based on the analysis of hyperlinks, URL structure and heuristics (or a combination of these features). • Limitations: – they rely on supervised solutions; – they assume websites are homogeneous; – they are unable to recover the original hierarchical organization codified in the embedded navigation systems.
  • 60. Automatic Generation of Sitemaps • In this work we present a solution to extract deeper sitemaps by using the website navigation system: – it provides a local view of the website hierarchy. • Idea: combine these local views to extract sitemaps that describe the global view of the website hierarchy.
  • 61. Proposed Solution We develop an unsupervised and domain-independent method which is able to discover the hierarchical structure of a website using frequent navigation paths through web lists
  • 62. Sitemap Definition Given: • a web graph G(V, E) rooted at the homepage h ∈ V; • a sequence database which enumerates a subset of all possible paths in G starting from h and having length t ∈ ℕ; • a threshold k ∈ ℕ. A sitemap is a tree T(V', E') ⊆ G such that: • V' ⊆ V, E' ⊆ E; • ∀ (i, j) ∈ E': j ∈ weblist(i); • there exists a weight function w: E' → ℕ such that, for every edge e = (i, j) ∈ E', w(e) = σ(α), where α = pathT(⟨h, ..., i, j⟩) is the path in T from the homepage h to j, and w(e) > k.
  • 63. Methodology The algorithm employs a three-step strategy: 1. Sequence dataset generation (using random walks); 2. Sequential pattern mining phase (CloFAST); 3. Post-pruning.
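A minimal end-to-end sketch of step 1 on a toy web-list graph; the graph, walk length and number of walks are illustrative, not the thesis settings. Step 2 would feed the resulting walks to CloFAST, and step 3 would keep, for each page, a single frequent parent so that the selected edges form a tree.

```python
import random

weblist_links = {                      # hypothetical page -> web-list links
    "home": ["people", "courses", "research"],
    "people": ["prof_a", "prof_b"],
    "courses": ["course_1", "course_2"],
    "research": ["group_x"],
}

def random_walks(graph, start, n_walks, length, seed=0):
    """Step 1: random walks of bounded length from the homepage, restricted
    to links that appear in web lists."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        walk, node = [start], start
        for _ in range(length):
            neighbours = graph.get(node, [])
            if not neighbours:
                break
            node = rng.choice(neighbours)
            walk.append(node)
        walks.append(walk)
    return walks

walks = random_walks(weblist_links, "home", n_walks=5, length=3)
for w in walks:                        # each walk is one sequence for CloFAST
    print(" -> ".join(w))
```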
  • 68. Experiments • Datasets: 3 Computer Science websites (cs.illinois.edu, cs.princeton.edu, cs.ox.ac.uk) and 1 organization website (www.enel.it).
  website         #pages in G   #edges in G   #levels   #pages L1   #pages L2   #pages L3   #pages L4
  cs.illinois          951          18615         4          12          42          41          40
  cs.ox.ac.uk         3335          26860         2           8          33           -           -
  cs.princeton        2298          72598         3           6          38          23           -
  enel.it/it-it        124           3476         3           7          18          46           -
  • 69. Experiments • Competitor: HDTM • Evaluation: Precision, Recall, F-Measure • Parameters: length of the random walks, minimum support threshold, sequence database size
  • 71. Results • The models based on the sequential pattern mining approach outperform the competitor HDTM. • The comparisons are based on sitemap web pages which are poor in content; this implies that the generated models have high precision but low recall.
  • 72. Exploiting Websites Structural and Content Features for Web Pages Clustering Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Exploiting Web Sites Structural and Content Features for Web Pages Clustering. ISMIS 2017.
  • 73. Introduction • The process of automatically organizing web pages and websites has always been an important task in Web Mining. • Since a web page is characterized by several representations, the existing clustering algorithms differ in their ability to use these representations. • Most existing works analyze these features almost independently, mainly because different sources of information use different data representations.
  • 74. Related Works • Clustering based on the textual representation: – considers web pages as plain text; – turns out to be ineffective when there is not enough information in the text or when pages have different content but actually refer to the same semantic class.
  • 75. Related Works • Clustering based on the HTML structure representation: – typically considers the HTML formatting (i.e., HTML tags and visual information rendered by a web browser); – Idea: similar web pages are generated from a common HTML template.
  • 76. Related Works • Clustering based on the hyperlink structure: – web pages are nodes in a graph where hyperlinks allow the information to be split across multiple, interdependent nodes; – clustering is based on the analysis of the topological structure of the network.
  • 77. Goal Perform clustering of the web pages of a website by combining information about the content, the page structure and the hyperlink structure of the web pages. Idea: two web pages are similar if they have common terms (i.e., bag-of-words hypothesis) and they share the same reachability properties in the website's graph (i.e., distributional hypothesis).
  • 78. Methodology The proposed solution implements a four-step strategy: 1. Website crawling; 2. Link vector generation; 3. Content vector generation; 4. Content-Link coupled clustering.
  • 79. Methodology 1. Website crawling: the crawler uses web page structure information and exploits web lists in order to: • mitigate problems coming from noisy links which may not be relevant to the clustering process (advertisement links, shortcuts, etc.); • include only links belonging to the navigation system.
  • 81. Methodology 2. Link vector generation – Starting from the homepage, we apply random walk theory to extract a sequence database of random walks; – Inspired by the field of IR, we consider each random walk as a sentence and each web page as a word; – We can then apply distributional algorithms (e.g., Word2Vec) to extract a vector-space representation for each word.
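A sketch of this step, assuming gensim (version 4 or later) is available for Word2Vec; the walks and hyper-parameters are illustrative.

```python
from gensim.models import Word2Vec

walks = [                                # random walks play the role of sentences
    ["home", "people", "prof_a"],
    ["home", "people", "prof_b"],
    ["home", "courses", "course_1"],
    ["home", "courses", "course_2"],
]

model = Word2Vec(sentences=walks, vector_size=32, window=2,
                 min_count=1, sg=1, seed=0)      # skip-gram over page "words"
link_vector = model.wv["prof_a"]                 # distributional page embedding
print(link_vector.shape)                         # (32,)
```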
  • 82. Methodology 3. Content vector generation – We consider web pages as plain text (i.e., bag-of-words hypothesis); – We apply the traditional TF-IDF weighting scheme to obtain a content-vector representation. 4. Content-Link coupled clustering – Normalize both the content and the link vector by their Euclidean norm; – Concatenate the content and link vectors of each web page; – The generated vectors can be used by traditional clustering algorithms based on the vector space model.
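A sketch of steps 3 and 4, assuming scikit-learn; the page texts, the placeholder link vectors and the number of clusters are illustrative (in the full pipeline the link vectors come from the Word2Vec step above).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

pages = ["prof_a", "prof_b", "course_1", "course_2"]
texts = ["professor machine learning office hours",
         "professor data mining publications",
         "course syllabus exam machine learning",
         "course syllabus exam databases"]
# Placeholder link vectors standing in for the Word2Vec embeddings:
link_vectors = np.random.RandomState(0).rand(len(pages), 32)

content_vectors = TfidfVectorizer().fit_transform(texts).toarray()   # step 3: TF-IDF
coupled = np.hstack([normalize(content_vectors),                     # step 4: L2-normalize
                     normalize(link_vectors)])                       # and concatenate

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coupled)
print(dict(zip(pages, labels)))
```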
  • 83. Experiments • Datasets: 4 Computer Science websites (cs.illinois.edu, cs.princeton.edu, cs.ox.ac.uk, cs.stanford.edu); • Measures used: homogeneity, completeness, V-Measure, AMI, ARI, silhouette.
  • 84. Experiments • Research questions: 1. What is the real contribution of combining content and hyperlink structure in a single vector-space representation? 2. What is the role of web lists in reducing noise and improving the clustering results?
  • 87. Results • The best results are obtained by combining textual information with the hyperlink structure. • The results do not show a statistically significant contribution of web lists for clustering purposes.
  • 88. Publications • Pasqua Fabiana Lanotte, F. Fumarola, M. Ceci, D. Malerba. Automatic Extraction of Logical Web Lists. ISMIS 2014: 365-374 • G. Pio, Pasqua Fabiana Lanotte, M. Ceci, D. Malerba. Mining Temporal Evolution of Entities in a Stream of Textual Documents. ISMIS 2014: 50-60 • M. Ceci, Pasqua Fabiana Lanotte, F. Fumarola, D. Cavallo, D. Malerba. Completion Time and Next Activity Prediction of Processes Using Sequential Pattern Mining. Discovery Science 2014: 49-61 • Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Automatic Generation of Sitemaps Based on Navigation Systems. MOD 2016: 216-223 • F. Fumarola, Pasqua Fabiana Lanotte, M. Ceci, D. Malerba. CloFAST: closed sequential pattern mining using sparse and vertical id-lists. Knowl. Inf. Syst. 48(2): 429-463 (2016) • Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Exploiting Web Sites Structural and Content Features for Web Pages Clustering. ISMIS 2017.
  • 89. Thanks for your attention
  • 90. Sequence-extension using VILs Given two sibling nodes α, β in the CSET: – The sequence extension (S-Step) of α using β aims at generating a new sequence γ; – γ is obtained by adding to α the last itemset of β. Example: ‹e → a› --S-Step-- ‹e → d› = ‹e → a → d› • The S-Step is achieved by: 1. verifying, for each position j of VILα and VILβ, the condition that the value of VILα[j] is lower than that of VILβ[j]; 2. and/or using a shift-right function to move to the next transaction id of the last itemset of β that is greater than the transaction id of the last itemset of α.
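A sketch of the S-Step on vertical id-lists, under the assumption that each VIL entry stores the earliest position at which the sequence ends in that database sequence; the VIL and SIL literals refer to the example database used earlier.

```python
def s_step(vil_alpha, sil_last_itemset):
    """VIL of the sequence extension: for each database sequence, the first
    transaction id of beta's last itemset strictly after VIL_alpha[j]
    ("shift-right"); None when no such transaction exists."""
    vil_gamma = []
    for pos, ids in zip(vil_alpha, sil_last_itemset):
        if pos is None or ids is None:
            vil_gamma.append(None)
            continue
        later = [t for t in ids if t > pos]
        vil_gamma.append(later[0] if later else None)
    return vil_gamma

# <e -> a> extended with <e -> d> gives <e -> a -> d> (example database above):
vil_e_a = [None, None, 2, 3]        # VIL of <e -> a>
sil_d   = [[4], [1], [3], None]     # SIL of item d
print(s_step(vil_e_a, sil_d))       # [None, None, 3, None] -> support 1
```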
  • 91. Example of an S-Step (The slide figure illustrates the S-Step on the sequence ‹b → a›, showing VILa, VILb, SILa and the shift-right operation.)