Mining SPARQL queries to understand the behavior of
automated programs (or machine agents) is an important step
in designing systems for the semantic web. We present
techniques that differ from state-of-the-art SPARQL mining
techniques in two ways: 1. we move away from a
one-query-at-a-time view to a SPARQL user-session view;
2. we look at the results of SPARQL queries in addition to
the queries themselves. With these two approaches, we find
two new patterns in SPARQL queries that help us reason better
about the underlying program that generated the queries.
Through a variety of experiments, we show that the patterns
found have significant support in all four datasets provided
by the USEWOD committee.
2. Introduction: LOD Users
The LOD cloud has two types of users:
- Humans (browsers).
- Programs / machine agents.
Yahoo! Confidential
3. Introduction: LOD Access Methods
The data on the LOD cloud can be accessed
in multiple ways.
For this work, we categorize them into two
buckets:
- SPARQL : A powerful declarative graph query
language
- Non-SPARQL: Direct linked data requests.
4. Motivation: User Behavior Understanding
A deep understanding of client behavior can help
build “better” serving systems
Better:
- Secure
- Scalable
- Available
Prior Work:
- Moller et al., WebSci 2010
- Picalausa et al., SWIM 2011
- Kirchberg et al., USEWOD 2011
- Mario et al., USEWOD 2011
5. Summarizing...
              Human Users    Machine Agents
Non-SPARQL
SPARQL                       This paper’s focus
6. What is this paper about?
Mining of the USEWOD query log dataset to
identify:
- Two Trends in Machine Agent Querying
- Two Patterns in Machine Agent Querying
7. The USEWOD dataset
Query logs of servers hosting a part of LOD cloud data.
Dataset   Type                   # records (million)   % SPARQL
bio2rdf   Life sciences          ~0.2                  100%
lgd       Geo                    ~1.9                  100%
SWDF      Conference             ~16.7                 43.38%
dbpedia   Structured wikipedia   ~36.2                 46.9%
8. Part-1: Two Trends in Machine Agent
Querying
The Theme
“What are the overarching trends for
SPARQL queries?”
9. Trend-1: SPARQL is here to stay!
[Figure: SPARQL query volume for SWDF and Dbpedia; labeled range 0.1 – 1 million]
Take-away: SPARQL query volume is pretty
significant
10. Trend-2: SPARQL is heavily used by machine
agents.
Took 17 million user agents from SPARQL queries from dbpedia
and..
11. Part-2: Two Patterns in Machine Agent
Querying
The Theme
“Looking at SPARQL query logs, can we reason
about the program that generated the queries?”
12. Salient aspects of proposed Query Mining
Techniques
- Move from per-query analysis to query-session
  analysis
- Move from query analysis to query-result analysis
13. Pattern-1: Loops in Programs
Take-away
• Through per-user, temporal mining of the logs, we
discover patterns that are caused by loops in the
program.
• Significant support in all 4 datasets
14. Per-user Temporal mining
[Figure: Original Logs along a TIME axis, regrouped by User level Session Analysis into User-1, User-2, User-3, User-4; a Loop is marked]
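A minimal sketch of the per-user, temporal regrouping this slide describes. The record layout `(user_id, timestamp, query)` and the function name `sessions_by_user` are assumptions made for illustration; the slides do not say how a user is identified in the logs.

```python
from collections import defaultdict


def sessions_by_user(records):
    """Regroup interleaved log records into one time-ordered query list per user.

    records: iterable of (user_id, timestamp, query) tuples (assumed layout).
    Timestamps only need to be mutually comparable (numbers, datetimes, ...).
    """
    per_user = defaultdict(list)
    for user_id, timestamp, query in records:
        per_user[user_id].append((timestamp, query))
    # sorted() is stable, so records with equal timestamps keep their log order.
    return {
        user_id: [query for _, query in sorted(entries, key=lambda e: e[0])]
        for user_id, entries in per_user.items()
    }


log = [
    ("User-1", 1, "q1"),
    ("User-2", 2, "q2"),
    ("User-1", 3, "q3"),
    ("User-2", 0, "q0"),
]
sessions = sessions_by_user(log)
```

Each value in `sessions` is one user's query sequence, which is the unit the loop patterns on the following slides are mined from.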
15. Intra Pattern Loop
Successive queries from the same user use the
same “template”.
Example: Two successive queries:
SELECT * WHERE { <http://bio2rdf.org/dr:D00332>
<http://bio2rdf.org/ns/bio2rdf#xRef>
<http://bio2rdf.org/cas:54-47-7> }
SELECT * WHERE { <http://bio2rdf.org/dr:D00333>
<http://bio2rdf.org/ns/bio2rdf#xRef>
<http://bio2rdf.org/cas:54-47-7> }
Only the subject (D00332, D00333) varies.
16. Detecting Intra Pattern Loop
We convert a query to its canonical form by
replacing variables, URIs and literals with
“keywords”.
SELECT * WHERE { <http://bio2rdf.org/dr:D00332>
<http://bio2rdf.org/ns/bio2rdf#xRef>
<http://bio2rdf.org/cas:54-47-7> }
Canonical form of the previous queries:
SELECT * WHERE { _URI_ _URI_ _URI_ }
Queries generated by the same template will have
the same canonical form.
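The canonicalization step could be sketched with a few regular expressions, as below. This is an illustration under stated assumptions, not the authors' implementation: only the `_URI_` keyword appears on the slide, so `_LITERAL_`, `_VAR_` and `_NUM_` are assumed names, and the regexes cover only simple queries (no language tags, typed literals or prefixed names).

```python
import re

# Applied in order: literals first, so a URL inside a string is not mistaken for a URI.
_RULES = [
    (re.compile(r'"(?:[^"\\]|\\.)*"'), "_LITERAL_"),         # assumed keyword
    (re.compile(r"<[^<>\s]*>|https?://[^\s{}]+"), "_URI_"),  # keyword from the slide
    (re.compile(r"[?$][A-Za-z_]\w*"), "_VAR_"),              # assumed keyword
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "_NUM_"),             # assumed keyword
]


def canonical_form(query: str) -> str:
    """Replace URIs, literals, variables and numbers by keywords, then normalize spacing."""
    for pattern, keyword in _RULES:
        query = pattern.sub(keyword, query)
    # Pad braces and collapse whitespace so layout differences do not matter.
    return " ".join(query.replace("{", " { ").replace("}", " } ").split())


q1 = ("SELECT * WHERE {<http://bio2rdf.org/dr:D00332> "
      "<http://bio2rdf.org/ns/bio2rdf#xRef> <http://bio2rdf.org/cas:54-47-7>}")
q2 = q1.replace("D00332", "D00333")
```

Here `canonical_form(q1)` and `canonical_form(q2)` are both `SELECT * WHERE { _URI_ _URI_ _URI_ }`, so the two example queries map to the same template.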
17. Salient Aspects of Intra Pattern loops
- Iterate over a dictionary of values (categorical)
- Iterate over a numerical range (for example, the LIMIT and
  OFFSET parameters in SPARQL queries)
- Multiple levels of nested loops with the same
  intra-loop pattern
4 parameters quantify the above (in paper)
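One way to flag candidate intra-pattern loops is to look for runs of identical canonical forms within a single user's session. The sketch below assumes the session has already been reduced to a list of canonical-form strings; the function name and the `min_length` threshold are hypothetical, and the four parameters mentioned above are not modeled here.

```python
from itertools import groupby


def intra_pattern_loops(canonical_forms, min_length=2):
    """Return (form, start_index, run_length) for each run of identical forms.

    canonical_forms: one user's session, as a time-ordered list of strings.
    min_length: assumed threshold for calling a run a loop.
    """
    loops = []
    start = 0
    for form, run in groupby(canonical_forms):
        length = sum(1 for _ in run)
        if length >= min_length:
            loops.append((form, start, length))
        start += length
    return loops


session = ["A", "A", "A", "B", "A", "A"]
found = intra_pattern_loops(session)
```

For the example session this reports two runs of template `A`, separated by a single `B` query that is too short to count as a loop.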
18. Inter Pattern Loops
Found loops that iterate over a set of patterns:
P1, P2, P3, P1, P2, P3, P1, P2, P3
Typically used when the output of the first query
goes as a parameter to the second query.
(examples in paper)
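A sequence such as P1, P2, P3 repeated several times can be detected by searching a session for a short block of patterns that repeats back to back. The following is a simple sketch under assumptions: the function name, the `max_period` bound and the `min_repeats` threshold are all hypothetical, and the paper may use a different procedure.

```python
def inter_pattern_loops(patterns, max_period=5, min_repeats=2):
    """Find blocks of 2..max_period distinct-ish patterns that repeat consecutively.

    patterns: one user's session as a list of pattern labels (e.g. canonical forms).
    Returns a list of (start_index, block, repeat_count).
    """
    loops = []
    i, n = 0, len(patterns)
    while i < n:
        match = None
        for period in range(2, max_period + 1):
            block = patterns[i:i + period]
            # Skip truncated blocks and single-pattern blocks (those are intra-pattern loops).
            if len(block) < period or len(set(block)) < 2:
                continue
            repeats = 1
            while patterns[i + repeats * period:i + (repeats + 1) * period] == block:
                repeats += 1
            if repeats >= min_repeats:
                match = (i, block, repeats)
                break  # prefer the shortest repeating block
        if match:
            loops.append(match)
            i += len(match[1]) * match[2]
        else:
            i += 1
    return loops


cycle = inter_pattern_loops(["P1", "P2", "P3"] * 3)
```

For the slide's example, `cycle` contains one loop: the block `["P1", "P2", "P3"]` starting at index 0 and repeated 3 times.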
19. Results
Take-away: Significant support for loops!
[Figure: loop support per dataset; panels: bio2rdf, lgd, swdf, dbpedia; values: 86%, 32%, 40%, 16% (panel assignment not recoverable)]
20. Pattern-2: Querying for dbpedia Linkage
Take-away:
• By executing each query and analyzing its results, we
find that a portion of queries “look” for dbpedia links
• Results:
- 20 months of SWDF queries: on average, 8% look
for dbpedia URLs
- 2 days' worth of lgd queries: 26.5% look
for dbpedia URLs
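To illustrate result analysis, the sketch below checks whether the result set of an executed query contains any dbpedia URL. It assumes results arrive in the standard SPARQL 1.1 Query Results JSON layout (`results.bindings`, each term with `type` and `value`); the function name and the host-based test are assumptions, not necessarily the criterion used in the paper.

```python
from urllib.parse import urlparse


def returns_dbpedia_links(results: dict) -> bool:
    """True if any URI bound in the result set is hosted under dbpedia.org."""
    for binding in results.get("results", {}).get("bindings", []):
        for term in binding.values():
            if term.get("type") != "uri":
                continue
            host = urlparse(term.get("value", "")).hostname or ""
            if host == "dbpedia.org" or host.endswith(".dbpedia.org"):
                return True
    return False


example = {
    "head": {"vars": ["same"]},
    "results": {"bindings": [
        {"same": {"type": "uri", "value": "http://dbpedia.org/resource/Berlin"}},
    ]},
}
hit = returns_dbpedia_links(example)
```

Counting the queries for which this returns true, divided by the number of queries executed, would give a percentage comparable in spirit to the figures above.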
21. Summary & Conclusions
Proposed 2 new ways of SPARQL query mining:
- Session view
- Analyze results in addition to the query
Showed that machine agents look for dbpedia using the
owl:sameAs annotation.
Influence on system design:
- Can we pre-fetch elements in a loop beforehand?
- Prioritize dbpedia attributes for caching
Influence on log collection & analysis:
- Stratified random sampling to remove the effect of loops.
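As a rough illustration of the last point, stratified sampling can cap how many queries any one template contributes, so a long loop does not dominate the sample. The stratum definition (for instance, a query's canonical form), the function name and the `per_stratum` cap are assumptions for this sketch.

```python
import random
from collections import defaultdict


def stratified_sample(queries, stratum_of, per_stratum=1, seed=0):
    """Draw up to per_stratum queries at random from each stratum.

    stratum_of: function mapping a query to a sortable stratum key
                (assumed here to be something like its canonical form).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for query in queries:
        strata[stratum_of(query)].append(query)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# 50 queries from one looping template plus two one-off queries.
log = ["A1"] * 50 + ["B1", "C1"]
picked = stratified_sample(log, stratum_of=lambda q: q[0])
```

With one query per stratum, the looping template `A` contributes a single query to `picked`, the same weight as each one-off query.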
22. For the great data !!
For the great feedback & comments
For listening!
23. The famous LOD Cloud . . .
7 billion triples and counting!!