SlideShare uma empresa Scribd logo
1 de 26
Baixar para ler offline
FUJITSU RESTRICTED - UK & IRELAND EYES ONLY © Copyright 2014 Fujitsu (Ireland) Limited
Traffic Analytics for
Linked Data Publishers
Luca Costabello, Pierre-Yves Vandenbussche, Gofran Shukair
Fujitsu Ireland
Corine Deliot, Neil Wilson
British Library
2
The Problem: Measuring Traffic on RDF Datasets
Linked Data publishers have limited awareness of how
datasets are accessed by visitors.
No tool to mine Linked Data servers access logs
Why is this such a big deal?
Justify investment in Linked Data IT infrastructure
Cost control
Identify abuses
Interpret access peaks
3
Which traffic metrics?
 Adapt conventional web analytics metrics
 Define Linked-Data specific extensions
How to extract and compute such metrics?
 Which data sources? (client tracking? server access logs mining? both?)
 Need to support dual data access protocol (HTTP operations + SPARQL)
 How to filter noise? (i.e. robots, search engines crawlers)
 How to detect client sessions? (no client tracking, dual data access protocol)
 How to detect SPARQL activity peaks?
Challenges
4
Existing tools do not include Linked Data-specific metrics:
Linked data-specific metrics, but no platform [Moller et al, WebScience 2010]
Filling a Gap in Prior Art
5
• Traffic analytics platform for LD servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
Our Contribution / Agenda
6
• Traffic analytics platform for LD servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
7
8
9
• Traffic Analytics Platform for LD Servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
10
Metrics
* Linked Data-specific
11
Metrics
* Linked Data-specific
12
Metrics
* Linked Data-specific
13
• Traffic Analytics Platform for LD Servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
14
Visitor Session Detection
Session: sequence of requests issued with no significant
interruptions by a uniquely identified visitor. Expires after a
period of inactivity.
We use the HAC variant by [Murray et al. 2006, Mehrzadi et al. 2012]
Unsupervised, gap-based session boundary detection
Traditional web logs analysis
Benefit: visitor-specific temporal cut-off
Two-step procedure:
Set visitor-specific session cut-off as time interval that significantly
increases the variance.
Group HTTP/SPARQL requests into sessions according to the cut-off
15
• Traffic Analytics Platform for LD Servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
16
Heavy/Light SPARQL Queries Binary Classifier
 Rough estimate of heavy and light queries with supervised
binary classification.
 Heavy SPARQL Query: if it requires considerable computational
and memory resources.
Light
Heavy
17
Heavy/Light SPARQL Queries Binary Classifier
 Feature vectors: SPARQL 1.1 syntactic features only:
18
• Traffic Analytics Platform for LD Servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
19
British National Bibliography access logs
bnb.data.bl.uk (access logs are not public)
13 months
~ 10M HTTP requests/month
DBpedia 3.9 access logs
USEWOD 2015 Dataset
Datasets
20
Visitor Session Detection: Results
How well do we detect the beginning of a new session?
Dataset
British National Bibliography access logs (3 consecutive days)
~16k HTTP/SPARQL requests
• 32% Desktop browsers (115 visitors)
• 68% Software libraries (10 visitors)
Manually annotated records
• 1=session_start | 0=internal
Baseline: fixed-length cut-offs
HAC outperforms fixed-length cut-offs
21
Random distinct queries from DBpedia 3.9 access logs
Run the queries multiple times on local clone of DBpedia
Kept ~3.7k queries with low variance (3.1k light, 600 heavy)
Cut-off threshold: 100ms
Naïve Bayes and SVM
Grid search & randomized search w/ 10-fold CV
Heavy/Light SPARQL: Experiment Protocol
22
Heavy/Light SPARQL: Results
23
Genuine calls account for 0.6% of total traffic!
+30% of HTTP/SPARQL traffic over the observed 13 months
Sharp increase in requests from Software Libraries (95x)
SPARQL accounts for 29% of traffic
6% of heavy SPARQL queries
37 days have unusual traffic spikes
Bounce rate: 48%
Software Libraries have bigger, deeper, and longer sessions.
Some Insights on BL Traffic Logs
24
We relieve publishers from manual and time-consuming
access log mining
Support Linked Data-specific metrics
 Break down traffic by RDF content
 Capture SPARQL insights
 Properly interpret 303 patterns
Reconstruction of Linked Data visitors sessions
Heavy/light SPARQL classifier w/ SPARQL syntax +
supervised learning
Revealed hidden insights on 13 months of access logs of the
British Library
Summary
25
Statistics on noise (i.e. web crawlers)
Heavy/light classifier
 Feature set refinements
 Does it generalize to other datasets?
Enhance session detection with content-based heuristics
 Relatedness of subsequent SPARQL queries
 Structure and type of requested RDF entities
Future Work
26
Public Demo: bit.ly/ld-traffic
innovation.ie.fujitsu.com/kedi

Mais conteúdo relacionado

Semelhante a Traffic Analytics for Linked Data Publishers

Data monstersrealtimeetl new
Data monstersrealtimeetl newData monstersrealtimeetl new
Data monstersrealtimeetl newGreenM
 
Southwest Power Pool big data case study
Southwest Power Pool big data case study Southwest Power Pool big data case study
Southwest Power Pool big data case study Seeling Cheung
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreHPCC Systems
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
 
Building Fast Applications for Streaming Data
Building Fast Applications for Streaming DataBuilding Fast Applications for Streaming Data
Building Fast Applications for Streaming Datafreshdatabos
 
Improving Reporting Performance
Improving Reporting PerformanceImproving Reporting Performance
Improving Reporting PerformanceDhiren Gala
 
Introducing the IRUSdataUK pilot webinar
Introducing the IRUSdataUK pilot webinarIntroducing the IRUSdataUK pilot webinar
Introducing the IRUSdataUK pilot webinarJisc
 
HUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation SlidesHUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation SlidesJohn Mulhall
 
Cloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow AnalysisCloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow AnalysisAlex Henthorn-Iwane
 
Free Netflow analyzer training - diagnosing_and_troubleshooting
Free Netflow analyzer  training - diagnosing_and_troubleshootingFree Netflow analyzer  training - diagnosing_and_troubleshooting
Free Netflow analyzer training - diagnosing_and_troubleshootingManageEngine, Zoho Corporation
 
NetFlow Analyzer Training Part II : Diagnosing and troubleshooting traffic is...
NetFlow Analyzer Training Part II : Diagnosing and troubleshooting traffic is...NetFlow Analyzer Training Part II : Diagnosing and troubleshooting traffic is...
NetFlow Analyzer Training Part II : Diagnosing and troubleshooting traffic is...ManageEngine, Zoho Corporation
 
Tufts Research: Strategies from Data Management Leaders to Speed Clinical Trials
Tufts Research: Strategies from Data Management Leaders to Speed Clinical TrialsTufts Research: Strategies from Data Management Leaders to Speed Clinical Trials
Tufts Research: Strategies from Data Management Leaders to Speed Clinical TrialsVeeva Systems
 
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Maya Lumbroso
 
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Dataconomy Media
 
Web Performance – die effektivsten Techniken aus der Praxis
Web Performance – die effektivsten Techniken aus der PraxisWeb Performance – die effektivsten Techniken aus der Praxis
Web Performance – die effektivsten Techniken aus der PraxisFelix Gessert
 

Semelhante a Traffic Analytics for Linked Data Publishers (20)

Data monstersrealtimeetl new
Data monstersrealtimeetl newData monstersrealtimeetl new
Data monstersrealtimeetl new
 
Southwest Power Pool big data case study
Southwest Power Pool big data case study Southwest Power Pool big data case study
Southwest Power Pool big data case study
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
Building Fast Applications for Streaming Data
Building Fast Applications for Streaming DataBuilding Fast Applications for Streaming Data
Building Fast Applications for Streaming Data
 
nstitutional repositories, item and research data metrics
nstitutional repositories, item and research data metricsnstitutional repositories, item and research data metrics
nstitutional repositories, item and research data metrics
 
Introduction to Tideways
Introduction to TidewaysIntroduction to Tideways
Introduction to Tideways
 
Improving Reporting Performance
Improving Reporting PerformanceImproving Reporting Performance
Improving Reporting Performance
 
Introducing the IRUSdataUK pilot webinar
Introducing the IRUSdataUK pilot webinarIntroducing the IRUSdataUK pilot webinar
Introducing the IRUSdataUK pilot webinar
 
HUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation SlidesHUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation Slides
 
Psicquic
PsicquicPsicquic
Psicquic
 
Qo comparision
Qo comparisionQo comparision
Qo comparision
 
Cloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow AnalysisCloud-Scale BGP and NetFlow Analysis
Cloud-Scale BGP and NetFlow Analysis
 
Free Netflow analyzer training - diagnosing_and_troubleshooting
Free Netflow analyzer  training - diagnosing_and_troubleshootingFree Netflow analyzer  training - diagnosing_and_troubleshooting
Free Netflow analyzer training - diagnosing_and_troubleshooting
 
NetFlow Analyzer Training Part II : Diagnosing and troubleshooting traffic is...
NetFlow Analyzer Training Part II : Diagnosing and troubleshooting traffic is...NetFlow Analyzer Training Part II : Diagnosing and troubleshooting traffic is...
NetFlow Analyzer Training Part II : Diagnosing and troubleshooting traffic is...
 
Tufts Research: Strategies from Data Management Leaders to Speed Clinical Trials
Tufts Research: Strategies from Data Management Leaders to Speed Clinical TrialsTufts Research: Strategies from Data Management Leaders to Speed Clinical Trials
Tufts Research: Strategies from Data Management Leaders to Speed Clinical Trials
 
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
 
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
 
Web Performance – die effektivsten Techniken aus der Praxis
Web Performance – die effektivsten Techniken aus der PraxisWeb Performance – die effektivsten Techniken aus der Praxis
Web Performance – die effektivsten Techniken aus der Praxis
 
PhD Defense
PhD DefensePhD Defense
PhD Defense
 

Mais de Luca Costabello

Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings
Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph EmbeddingsMachine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings
Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph EmbeddingsLuca Costabello
 
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...Luca Costabello
 
Context-Aware Access Control and Presentation of Linked Data
Context-Aware Access Control and Presentation of Linked DataContext-Aware Access Control and Presentation of Linked Data
Context-Aware Access Control and Presentation of Linked DataLuca Costabello
 
Access Control for HTTP Operations on Linked Data
Access Control for HTTP Operations on Linked DataAccess Control for HTTP Operations on Linked Data
Access Control for HTTP Operations on Linked DataLuca Costabello
 
Linked Data Access Goes Mobile: Context Aware Authorization for Graph Stores
Linked Data Access Goes Mobile: Context Aware Authorization for Graph StoresLinked Data Access Goes Mobile: Context Aware Authorization for Graph Stores
Linked Data Access Goes Mobile: Context Aware Authorization for Graph StoresLuca Costabello
 
PRISSMA, Towards Mobile Adaptive Presentation of the Web of Data
PRISSMA,Towards Mobile Adaptive Presentation of the Web of DataPRISSMA,Towards Mobile Adaptive Presentation of the Web of Data
PRISSMA, Towards Mobile Adaptive Presentation of the Web of DataLuca Costabello
 
Time Based Cluster Analysis for Automatic Blog Generation
Time Based Cluster Analysis for Automatic Blog GenerationTime Based Cluster Analysis for Automatic Blog Generation
Time Based Cluster Analysis for Automatic Blog GenerationLuca Costabello
 

Mais de Luca Costabello (7)

Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings
Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph EmbeddingsMachine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings
Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings
 
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...
 
Context-Aware Access Control and Presentation of Linked Data
Context-Aware Access Control and Presentation of Linked DataContext-Aware Access Control and Presentation of Linked Data
Context-Aware Access Control and Presentation of Linked Data
 
Access Control for HTTP Operations on Linked Data
Access Control for HTTP Operations on Linked DataAccess Control for HTTP Operations on Linked Data
Access Control for HTTP Operations on Linked Data
 
Linked Data Access Goes Mobile: Context Aware Authorization for Graph Stores
Linked Data Access Goes Mobile: Context Aware Authorization for Graph StoresLinked Data Access Goes Mobile: Context Aware Authorization for Graph Stores
Linked Data Access Goes Mobile: Context Aware Authorization for Graph Stores
 
PRISSMA, Towards Mobile Adaptive Presentation of the Web of Data
PRISSMA,Towards Mobile Adaptive Presentation of the Web of DataPRISSMA,Towards Mobile Adaptive Presentation of the Web of Data
PRISSMA, Towards Mobile Adaptive Presentation of the Web of Data
 
Time Based Cluster Analysis for Automatic Blog Generation
Time Based Cluster Analysis for Automatic Blog GenerationTime Based Cluster Analysis for Automatic Blog Generation
Time Based Cluster Analysis for Automatic Blog Generation
 

Último

IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 

Último (17)

IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 

Traffic Analytics for Linked Data Publishers

  • 1. FUJITSU RESTRICTED - UK & IRELAND EYES ONLY © Copyright 2014 Fujitsu (Ireland) Limited Traffic Analytics for Linked Data Publishers Luca Costabello, Pierre-Yves Vandenbussche, Gofran Shukair Fujitsu Ireland Corine Deliot, Neil Wilson British Library
  • 2. 2 The Problem: Measuring Traffic on RDF Datasets Linked Data publishers have limited awareness of how datasets are accessed by visitors. No tool to mine Linked Data servers access logs Why is this such a big deal? Justify investment in Linked Data IT infrastructure Cost control Identify abuses Interpret access peaks
  • 3. 3 Which traffic metrics?  Adapt conventional web analytics metrics  Define Linked-Data specific extensions How to extract and compute such metrics?  Which data sources? (client tracking? server access logs mining? both?)  Need to support dual data access protocol (HTTP operations + SPARQL)  How to filter noise? (i.e. robots, search engines crawlers)  How to detect client sessions? (no client tracking, dual data access protocol)  How to detect SPARQL activity peaks? Challenges
  • 4. 4 Existing tools do not include Linked Data-specific metrics: Linked data-specific metrics, but no platform [Moller et al, WebScience 2010] Filling a Gap in Prior Art
  • 5. 5 • Traffic analytics platform for LD servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight Our Contribution / Agenda
  • 6. 6 • Traffic analytics platform for LD servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 7. 7
  • 8. 8
  • 9. 9 • Traffic Analytics Platform for LD Servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 13. 13 • Traffic Analytics Platform for LD Servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 14. 14 Visitor Session Detection Session: sequence of requests issued with no significant interruptions by a uniquely identified visitor. Expires after a period of inactivity. We use the HAC variant by [Murray et al. 2006, Mehrzadi et al. 2012] Unsupervised, gap-based session boundary detection Traditional web logs analysis Benefit: visitor-specific temporal cut-off Two-step procedure: Set visitor-specific session cut-off as time interval that significantly increases the variance. Group HTTP/SPARQL requests into sessions according to the cut-off
  • 15. 15 • Traffic Analytics Platform for LD Servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 16. 16 Heavy/Light SPARQL Queries Binary Classifier  Rough estimate of heavy and light queries with supervised binary classification.  Heavy SPARQL Query: if it requires considerable computational and memory resources. Light Heavy
  • 17. 17 Heavy/Light SPARQL Queries Binary Classifier  Feature vectors: SPARQL 1.1 syntactic features only:
  • 18. 18 • Traffic Analytics Platform for LD Servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 19. 19 British National Bibliography access logs bnb.data.bl.uk (access logs are not public) 13 months ~ 10M HTTP requests/month DBpedia 3.9 access logs USEWOD 2015 Dataset Datasets
  • 20. 20 Visitor Session Detection: Results How well do we detect the beginning of a new session? Dataset British National Bibliography access logs (3 consecutive days) ~16k HTTP/SPARQL requests • 32% Desktop browsers (115 visitors) • 68% Software libraries (10 visitors) Manually annotated records • 1=session_start | 0=internal Baseline: fixed-length cut-offs HAC outperforms fixed-length cut-offs
  • 21. 21 Random distinct queries from DBpedia 3.9 access logs Run the queries multiple times on local clone of DBpedia Kept ~3.7k queries with low variance (3.1k light, 600 heavy) Cut-off threshold: 100ms Naïve Bayes and SVM Grid search & randomized search w/ 10-fold CV Heavy/Light SPARQL: Experiment Protocol
  • 23. 23 Genuine calls account for 0.6% of total traffic! +30% of HTTP/SPARQL traffic over the observed 13 months Sharp increase in requests from Software Libraries (95x) SPARQL accounts for 29% of traffic 6% of heavy SPARQL queries 37 days have unusual traffic spikes Bounce rate: 48% Software Libraries have bigger, deeper, and longer sessions. Some Insights on BL Traffic Logs
  • 24. 24 We relieve publishers from manual and time-consuming access log mining Support Linked Data-specific metrics  Break down traffic by RDF content  Capture SPARQL insights  Properly interpret 303 patterns Reconstruction of Linked Data visitors sessions Heavy/light SPARQL classifier w/ SPARQL syntax + supervised learning Revealed hidden insights on 13 months of access logs of the British Library Summary
  • 25. 25 Statistics on noise (i.e. web crawlers) Heavy/light classifier  Feature set refinements  Does it generalize to other datasets? Enhance session detection with content-based heuristics  Relatedness of subsequent SPARQL queries  Structure and type of requested RDF entities Future Work