Building and Using Knowledge Bases
1. WeST – Web Science & Technologies
University of Koblenz Landau, Germany
Steffen Staab
Saqib Mir – European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo – Univ. Calabria, Italy
& WeST Team
2. Institut WeST – Web Science & Technologies
Semantic Web, Web Retrieval, Social Web, Multimedia Web, Software Web, GESIS
WeST – Web Science & Steffen Staab Slide 2
Technologies staab@uni-koblenz.de
3. PhD thesis trauma 17 years ago
„Nach dem Auspacken der LPS 105 präsentiert sich dem
Betrachter ein stabiles Laufwerk, das genauso geringe
Außenmaße besitzt wie die Maxtor.“
"Having unwrapped the LPS 105 – reveals itself to the
onlooker – a stable disk drive, which has similarly small
volume as the Maxtor."
4. GENERAL MOTIVATION
General motivation is not information extraction,
but it is solving tasks!
5. General objective: Extracting to LOD
useAsExample hasLivedIn
Crucial to know: Ontologies nowadays reflect this structure
Ontologies are
• Modular (vs one to rule them all)
• Distributed (vs defined in one place)
• Connected (vs isolated templates)
• Extensible (vs claimed to be finished)
• Lightweight (vs computationally intractable)
• Popular ones are used more often (vs people disagreeing)
Ontologies – LEGO style
6. Most famous applications
Steve Macbeth (Microsoft): - discussion wrt Schema.org -
“about 7% of pages we crawl have mark-up”
http://www.w3.org/2012/06/06-schema-minutes.html
LOD Cloud
Google Knowledge Graph
Bing gets its own knowledge graph
http://searchengineland.com/bing-britannica-partnership-123930
7. Example ontology-based application 1:
ANALYSIS OF
URBAN PARAMETERS
8. General objective: Analysing LOD
useAsExample hasLivedIn
11. Example ontology-based application:
FACETED MULTIMEDIA
EXPLORATION
12. Making Web 2.0 More Accessible
[Schenk et al; JoWS 2009]
[Figure: Web 2.0 content linked via GeoNames, links, locations, persons, knowledge, tags, and low- to midlevel features]
13. Choosing between Koblenz – and Koblenz
Video at: http://vimeo.com/2057249
16. A tag view of "Koblenz" & "Castle"
17. Semantic Identity – Festung Ehrenbreitstein
18. Persons – Celebrities, FOAFers & Flickr Users
Billion Triples Challenge 2008, 1st Prize
[Schenk et al; JoWS 2009]
19. Now on to information extraction:
OBSERVATIONS ON
INFORMATION EXTRACTION
20. Challenges & Opportunities for IE
Not all web pages are created equal
21. Challenges & Opportunities for IE
Some challenges are the same, e.g. finding type instances
22. Challenges & Opportunities for IE
Some challenges are the same, e.g. finding relation instances
23. Challenges & Opportunities for IE
Some contain concepts and their descriptions, some don't
No types here,
few relation types
24. Challenges & Opportunities for IE
Knowing that they are instances and of which type
Textual indication
Positional indication
25. Challenges & Opportunities for IE
To some extent
positional and layout
indications work across
languages and sites
26. Challenges & Opportunities for IE
owl:sameAs
We should not only think about
Web pages, but about Web sites
27. Challenges & Opportunities for IE
We should not only think about
Web pages, but about Web sites
owl:sameAs
28. Comparing related work to our objectives
Related work objectives            Our objectives
IE on Web pages                    IE on Web sites
Acquiring instances and            Acquiring items;
relationship instances             classifying items in instances, concepts,
                                   relation instances, relationships
IE based on linear text            IE also based on spatial position
There is overlap and of course there are
exceptions in related work
29. Outline
The Social Media-Case The Bio-Case
Motivation
State-of-the-Art
Core idea of SXPath
Implementation
Evaluation
[Oro et al; VLDB 2010]
31. Presentation-oriented documents
• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query
for the user (or a tool!)
32. Related Work
Web Query languages
XPath 1.0 and XQuery 1.0
Established
Too difficult to use for scraping from intricate DOM structures
Visual languages
Spatial Graph Grammars [Kong et al.] are quite complex in
terms of both usability and efficiency
Algebras for creating and querying multimedia interactive
presentations (e.g. ppt) [Subrahmanian et al.]
Web wrapper induction exploiting visual interface
[Gottlob et al.] [Sahuguet et al.]
generate XPath location paths of DOM nodes
can benefit from using Spatial XPath
33. Outline
The Social Media-Case The Bio-Case
Motivation
State-of-the-Art
Core idea of SXPath
Implementation
Evaluation
34. Representing Spatial Relations between DOM Nodes
[Figure: two rectangles x and y in the relation x (b,e) y]
35. Idea: Use Spatial Relations among DOM Nodes
36. Spatial DOM (SDOM)
38. Querying for Relations Among Nodes
Rectangular Cardinal Relations (RCR)
r1 E:NE r2
Spatial models allow for expressing
disjunctive relations among regions
Topological Relations
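A minimal sketch of how a rectangular cardinal relation such as r1 E:NE r2 could be computed from two bounding rectangles. It assumes screen coordinates (y grows downward, so north means smaller y) and rectangles given as (left, top, right, bottom). The function name `cardinal_tiles` and the set-based result are illustrative choices, not SXPath's own API.

```python
def cardinal_tiles(r1, r2):
    """Return the set of tiles of r2's 3x3 grid that r1 overlaps.

    Rectangles are (left, top, right, bottom) in screen coordinates.
    The reference rectangle r2 splits the plane into nine tiles:
    NW, N, NE, W, B (r2's own bounding box), E, SW, S, SE.
    """
    l1, t1, rt1, b1 = r1
    l2, t2, rt2, b2 = r2
    inf = float("inf")
    # Column and row bands induced by the reference rectangle r2.
    cols = {"W": (-inf, l2), "": (l2, rt2), "E": (rt2, inf)}
    rows = {"N": (-inf, t2), "": (t2, b2), "S": (b2, inf)}

    def overlaps(lo, hi, a, b):
        # Open-interval overlap: merely touching borders does not count.
        return max(lo, a) < min(hi, b)

    tiles = set()
    for rname, (rlo, rhi) in rows.items():
        for cname, (clo, chi) in cols.items():
            if overlaps(clo, chi, l1, rt1) and overlaps(rlo, rhi, t1, b1):
                tiles.add((rname + cname) or "B")
    return tiles


# r1 lies to the right of r2 and sticks out above it: r1 E:NE r2.
relation = cardinal_tiles((12, -5, 20, 5), (0, 0, 10, 10))
```

A multi-tile result such as {"E", "NE"} is how one relation name can cover a region that spans several directions at once.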
39. XPath Example
40. SXPath Example
42. From XPath 1.0 towards Spatial Querying with SXPath
SXPath features
adopts intuitive path notation:
axis::nodetest [pred]*
adds to XPath
spatial axes
spatial position functions
natural semantics for spatial querying
44. Complexity Results
Formal model defined in the paper
[Oro et al; VLDB 2010]
45. Outline
The Social Media-Case The Bio-Case
Motivation
State-of-the-Art
Core idea of SXPath
Implementation
Evaluation
46. SXPath System
50. Outline
The Social Media Case
  Motivation
  State-of-the-Art
  Core idea of SXPath
  SXPath Language
    Spatial Data Model
    Syntax & Semantics
    Complexity
  Implementation
  Evaluation
The Bio-Case
  Motivation
  The (Biochemical) Deep Web
  Contributions
    Page-level wrapper induction
    Site-wide wrapper generation
    Error Correction by Mutual Reinforcement
  Conclusions and Future Directions
51. >1000 Life Science DBs, number growing quickly
52. Biochemical Web Sites: Observations - 1
Labeled Data
Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)
Total   Labeled   Unlabeled   Unlabeled (Redundant)
754     719       19          16
Table 1: Data fields across 20 Biochemical Web sites
53. Biochemical Web Sites: Observations - 2
Dynamic Web Pages
54. Biochemical Web Sites: Observations - 3
Rich Site Structure
55. Biochemical Web Sites: Observations - 4
Semantics is often only in the report,
not in the underlying relational database
Web Services
Survey: 11 of 100 databases [1] provide APIs
Incomplete coverage
Varying granularity
No semantics in the service description
[1] Databases indexed by the Nucleic Acids Research journal
(http://www3.oup.co.uk/nar/database/). The complete survey was available at
http://sabiork.villa-bosch.de/index.html/survey.html
56. Biochemical Web Sites: Extraction Tasks
[Mir et al; DILS 2009]
[Mir et al; ESWC 2010]
Induce Wrapper
Induce Wrapper
Induce Wrapper
57. Contributions
Unsupervised Page-Level Wrapper Induction
Unsupervised Site-Wide Wrapper Induction
(Site Structure Discovery)
(Acquiring the Schema/Ontology)
Automatic Error Detection and Correction by
Mutual Reinforcement
65. Site-Wide Wrapper Induction: Observations
Not all pages contain data (e.g. Legal disclaimers,
contact pages, navigational menus)
An efficient approach should ignore these pages
We don't need to learn the entire site structure
66. Site-Wide Wrapper Induction: Observations - 2
Classified Link-Collections point to data-intensive
pages of the same class.
67. Site-Wide Wrapper Induction: Observations - 3
Pages belonging to the same class describe the same
concepts
Some concepts are sometimes omitted
Ordering is always the same
68. Site-Wide Wrapper Induction
1. Start with C0; S = {C0}
2. Follow all classified link-collections
3. Generate wrappers for each set of target pages
4. Determine if a new class is formed
5. Add navigation step; if C0 != Ci (i > 0), S = S + Ci
6. Repeat 2–5 for each new class formed in 4
Navigation steps:
W = {(C0 → L1 → C0),
(C0 → L2 → C2),
(C0 → L3 → C3)}
[Figure: page classes C0 to C3 connected by link-collections L1 to L3]
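The numbered steps on this slide can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `link_collections` and `classify` are hypothetical callables standing in for link-collection discovery and for page-level wrapper induction plus class comparison, and the toy site at the bottom is invented to reproduce the navigation steps W shown on the slide.

```python
from collections import deque


def induce_site_wrappers(start_class, link_collections, classify):
    """Breadth-first discovery of page classes and navigation steps.

    start_class      -- identifier of the entry class C0
    link_collections -- callable: class id -> {link-collection id: target pages}
    classify         -- callable: target pages -> class id (hypothetical
                        stand-in for wrapper induction and class comparison)
    """
    classes = [start_class]      # S = {C0}
    navigation = []              # W, as (source, link collection, target)
    queue = deque([start_class])
    while queue:
        source = queue.popleft()
        for link_id, pages in link_collections(source).items():  # step 2
            target = classify(pages)                             # steps 3-4
            step = (source, link_id, target)
            if step not in navigation:
                navigation.append(step)                          # step 5
            if target not in classes:                            # new class
                classes.append(target)
                queue.append(target)                             # step 6
    return classes, navigation


# Invented toy site: L1 leads back to pages of class C0, L2 and L3 to new classes.
site = {"C0": {"L1": ["p1"], "L2": ["p2"], "L3": ["p3"]}, "C2": {}, "C3": {}}
page_class = {"p1": "C0", "p2": "C2", "p3": "C3"}

classes, navigation = induce_site_wrappers(
    "C0",
    lambda cls: site[cls],
    lambda pages: page_class[pages[0]],
)
```

Each class is queued at most once, so the loop terminates even when link-collections point back to classes that are already known.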
69. Site-Wide Wrapper Induction – Evaluation
SOURCE    #C   #C'   #D     TP     FN     FP    P      R
MSDChem   1    1     N/A    N/A    N/A    N/A   N/A    N/A
ChEBI     3    1     1711   1195   516    0     100    69.8
KEGG      10   7     6223   5044   1179   188   97     81.1
Average                                         98.5   75.5
Table 3: Site-wide wrapper induction results, 20 test pages for each class
(C = classes, C' = classes discovered, D = data entries)
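As a worked check of the table, the sketch below recomputes P and R from the raw counts. It assumes the columns use the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), which the slide does not state explicitly.

```python
def precision_recall(tp, fn, fp):
    """Return (precision, recall) in percent from raw counts."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall


# (TP, FN, FP) as printed in Table 3.
rows = {"ChEBI": (1195, 516, 0), "KEGG": (5044, 1179, 188)}
for source, counts in rows.items():
    p, r = precision_recall(*counts)
    print(f"{source}: P={p:.1f} R={r:.1f}")
```

Under these definitions the recall values match the table (69.8 and 81.1), and TP + FN equals #D in both rows. The KEGG precision computed from the printed counts comes out at about 96.4, slightly below the 97 shown.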
70. Error Detection and Correction:
Mutual Reinforcement
Observation: Certain data reappear on more
than one class of pages
71. Error Detection and Correction:
Mutual Reinforcement
Reinforcement if reappearing data correctly classified as
Data
Otherwise it points to misclassification
Label-Data Mismatch
• Correction: Introduce more samples
Label-Label Mismatch
• Cannot be detected
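A minimal sketch of the cross-check described above, assuming each page class yields a mapping from extracted strings to the role ("label" or "data") the wrapper assigned. The function name `cross_check` and the sample strings are invented for illustration and do not come from the papers.

```python
from collections import defaultdict


def cross_check(extractions):
    """Compare how the same string was classified on different page classes.

    extractions -- {page class: {string: "label" or "data"}}
    Returns (reinforced, mismatched) as sets of strings.
    """
    seen = defaultdict(set)
    for page_class, items in extractions.items():
        for text, kind in items.items():
            seen[text].add((page_class, kind))

    reinforced, mismatched = set(), set()
    for text, hits in seen.items():
        if len(hits) < 2:
            continue                  # appears on one page class only
        kinds = {kind for _, kind in hits}
        if kinds == {"data"}:
            reinforced.add(text)      # consistently data: reinforcement
        elif kinds == {"label", "data"}:
            mismatched.add(text)      # label-data mismatch
        # kinds == {"label"} is left alone: a label-label mismatch
        # looks identical to a genuine shared label.
    return reinforced, mismatched


# Invented sample: "C6H12O6" is data on one class but was taken for a label on another.
extractions = {
    "compound": {"Formula": "label", "C6H12O6": "data", "glucose": "data"},
    "reaction": {"glucose": "data", "C6H12O6": "label"},
}
reinforced, mismatched = cross_check(extractions)
```

The untouched `{"label"}` case mirrors the slide's remark that label-label mismatches cannot be detected this way.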
72. Where to go next?
Reverse engineering production
1. LOD emitting: RDF & RDFS
2. Navigation model: what belongs to what
3. Interaction model (not treated at all by us so far)
4. Layout model: spatial positioning
Capture this generative model using machine learning
Relational learning
• Markov logic programmes?
• …?
73. Bibliography
Ermelinda Oro, Massimo Ruffolo, Steffen Staab. SXPath –
Extending XPath towards Spatial Querying on Web
Documents. In: PVLDB – Proceedings of the VLDB
Endowment, 4(2): 129-140, 2010.
S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for
Life Science Deep Web Databases. In: DILS-2009 – Proc.
of the Data Integration in the Life Sciences Workshop,
Manchester, UK, July 20-22, LNCS, Springer, 2009.
Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised
Approach for Acquiring Ontologies and RDF Data from
Online Life Science Databases. In: 7th Extended Semantic
Web Conference (ESWC2010), Heraklion, Greece, May
30-June 3, 2010, pp. 319-333.
74. WeST – Web Science & Technologies
University of Koblenz Landau, Germany
Thank you for your attention!
Editor's Notes
Layout engines of Web browsers assign a rectangle to each DOM element.
This is the internal code of a page. How can we query the page using spatial information? When the browser renders a page, it places each piece of information in a rectangle that we can call its minimum bounding rectangle. The layout engine assigns one to each node, based on the stylesheet, that is, on what the web designer specified. [Animation: show the DOM in parallel with what you see; "Coldplay" is written in this node and lights up, then the img lights up.]
Pages are presentation-oriented: style is also used to give emphasis so that a human notices the important information, for example the name in bold. (Future work: exploit these style cues.)
As the figure shows, the complex, nested structure of the DOM has a clear visual presentation that enables users to read and understand the meaning of the information presented in the Web page.
The rectangle algebra (RA) is an extension of Allen's interval algebra to the two-dimensional case. For example, in this case the relation x (b,e) y is obtained intuitively by applying the interval algebra to both sides of the rectangle.
So we can use the spatial models of geospatial databases to represent the mutual relationships between objects. [Show RA.] The rectangle algebra defines 169 relations, all the possible relations between rectangles. [Show the big figure.] Between this rectangle and this one, the algebra gives the relation this name. [Highlight; crop a single rectangle.]
Summary: models from the geospatial world represent the mutual relations (RA). [Highlight 2.] The tree is based not on nesting but on containment and relations.
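A minimal sketch of the idea in this note: compute the Allen relation on each axis and pair the two results. It assumes rectangles are given as (left, top, right, bottom) with non-zero width and height. The short relation codes (b, m, o, s, d, f, e and their inverses) are the usual Allen abbreviations, while the function names are illustrative.

```python
# The 13 Allen interval relations: before, meets, overlaps, starts, during,
# finishes, equals, and the inverses of the first six.
ALLEN = ("b", "m", "o", "s", "d", "f", "e", "fi", "di", "si", "oi", "mi", "bi")


def allen(a, b):
    """Allen relation of interval a=(a1, a2) to b=(b1, b2), with a1 < a2, b1 < b2."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "b"
    if a2 == b1:
        return "m"
    if b2 < a1:
        return "bi"
    if b2 == a1:
        return "mi"
    # From here on the two intervals share more than a single point.
    if a1 == b1 and a2 == b2:
        return "e"
    if a1 == b1:
        return "s" if a2 < b2 else "si"
    if a2 == b2:
        return "f" if a1 > b1 else "fi"
    if a1 > b1 and a2 < b2:
        return "d"
    if a1 < b1 and a2 > b2:
        return "di"
    return "o" if a1 < b1 else "oi"


def ra_relation(r, s):
    """Rectangle-algebra relation: one Allen relation per axis (x, then y)."""
    rl, rt, rr, rb = r
    sl, st, sr, sb = s
    return allen((rl, rr), (sl, sr)), allen((rt, rb), (st, sb))


# x lies entirely to the left of y and spans the same vertical extent: x (b,e) y.
x, y = (0, 0, 10, 5), (20, 0, 30, 5)
relation = ra_relation(x, y)
```

Pairing 13 relations per axis gives the 13 × 13 = 169 relations mentioned in the note.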
No comment; everything is already on the slide. The model also has very interesting properties, such as invertibility, that enable optimized evaluation of the SXPath language.
By representing the RA (spatial) relations we obtain the SDOM, where continuous arrows represent spatial containment and dotted arrows represent RA relations. This gives us a model of a Web page that represents all spatial relations existing between each pair of DOM nodes. Spatial relations also enable the definition of a spatial ordering along the four main directions North, South, East, and West, as shown in the figure.
Intuition behind the SDOM: we can build a tree of the page that is based not on the nesting of tags but on spatial containment and spatial relations. [Bring up the SDOM; keep animating, always showing the two chosen elements.] Between the image and "Radiohead" there is the spatial relation (s, bi). This data model does not capture the simple nesting of tags; it captures the spatial arrangement of the objects on the page. [With the animations.]
This is the new data model that I use, called Spatial DOM: the Document Object Model with the objects of the DOM, where the dark arrows are containment relations and the dashed arrows are RA relations. Using this model we can introduce a spatial ordering in the page.
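A small sketch of the two ingredients named in this note, a tree based on spatial containment rather than tag nesting and a spatial ordering along one direction. The rule "parent = smallest rectangle that contains the node" and the sample boxes are assumptions made for illustration; the paper's formal SDOM definition may differ.

```python
def contains(outer, inner):
    """True if rectangle outer spatially contains inner and differs from it."""
    ol, ot, o_right, ob = outer
    il, it, i_right, ib = inner
    return (outer != inner and ol <= il and ot <= it
            and i_right <= o_right and ib <= ob)


def area(rect):
    left, top, right, bottom = rect
    return (right - left) * (bottom - top)


def containment_parents(boxes):
    """Map each node to the smallest node whose rectangle contains it."""
    parents = {}
    for name, rect in boxes.items():
        candidates = [(area(other_rect), other)
                      for other, other_rect in boxes.items()
                      if contains(other_rect, rect)]
        parents[name] = min(candidates)[1] if candidates else None
    return parents


# Invented bounding boxes (left, top, right, bottom) for a band listing.
boxes = {
    "page": (0, 0, 800, 600),
    "band": (10, 10, 400, 200),
    "img": (20, 20, 120, 120),
    "name": (140, 20, 380, 60),
}
parents = containment_parents(boxes)

# Spatial ordering from west to east among the children of "band".
west_to_east = sorted(["name", "img"], key=lambda node: boxes[node][0])
```

Note that the resulting tree depends only on the rendered rectangles, so two pages with different tag nesting but the same layout would yield the same containment structure.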
The architecture of the system consists of a parser for SXPath expressions (query parser), a builder of the SDOM, and an engine that efficiently evaluates SXPath queries.
The RA relations are too fine-grained and verbose, and difficult for a human to use. So we also introduce Rectangular Cardinal Relations (RCR) and topological relations, two of the most intuitive and widespread spatial models, and map them onto RA relations so that users can query spatial relations more intuitively.
RCR divides the plane into regional tiles, which makes it simple to use.
This slide shows a comparison between XPath and SXPath. Suppose a user needs to extract the details of a music band. Using XPath, the user needs to know the intricate DOM structure. Using SXPath, the user can exploit the visual pattern adopted by the Web designers for organizing the details of the music bands.
SXPath expressions are also resilient: a given visual pattern can be queried in the same way on different Web pages that have different internal encodings.
Another advantage is generality. A single query can match different DOMs because their spatial representation is the same, so the language captures visual patterns on Web pages in a general way.
Example 2: a single data record can be split across different sub-trees. Wrapper induction techniques like DEPTA [Zhai et al.] recognize data records when they are encoded in the DOM as consecutive similar subtrees.
The study of the combined computational complexity of different SXPath fragments shows that SXPath retains polynomial-time complexity. SXPath has a greater exponent in the polynomial because of the quadratic number of relations stored in the SDOM that need to be explored during the evaluation of spatial axes. We compute spatial axes using the same dynamic programming approach suggested by Gottlob et al., but we have to explore a quadratic number of further relations in the SDOM.
Core SXPath queries can be evaluated in time O(|D|^2 · |Q|), where |D| is the size of the XML document and |Q| is the size of the query Q.
Proof sketch: there are O(|V_v|^2) many spatial relations to be considered in addition to the O(|V|) many relations of the DOM, incurring a higher polynomial worst-case complexity.
To obtain a polynomial-time combined complexity bound for SXPath query evaluation, we use dynamic programming, adopting the Context-Value Table (CV-Table) principle introduced by Gottlob et al.
Position and size are computed on demand; we compute all spatial position functions in a loop over all pairs of previous … current nodes.
Full SXPath computational costs are dominated by the string operations belonging to XPath 1.0.
In SWF, the computation of spatial ordering generates a higher polynomial worst case than XPath 1.0.
The GUI shows the DOM, lets the user write queries, and displays the query results on the screen.
In the second experiment we evaluated the effectiveness of SXPath with respect to XPath. We found that exploiting the visual appearance of Web pages lets users write queries in fewer attempts than with XPath, that SXPath location paths are more concise, and that SXPath is resilient (the same query can be used on different Web sites that have very different internal encodings in terms of DOM trees).