Fostering Serendipity through Big Linked Data 
Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo 
Semantic Web Challenge at ISWC 2013, October 21-25, 2013, Sydney, Australia
Agenda 
• Motivation 
• Datasets 
• Architecture 
• Evaluation 
• Requirements 
• Demo 
• Conclusion and Future Work
Motivation 
Fostering Serendipity through Big Data 
Triplification, Continuous Integration, 
and Visualization
Triplification: Linked TCGA 
• The Cancer Genome Atlas (TCGA) is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI) 
– 9,000 patients 
– 33 cancer types 
– 147,645 raw data files 
– 12.7 TB 
• This is only 46% of the total expected data; new data is submitted every day 
• The goal is to enable cancer researchers to make and validate important discoveries 
• Linked TCGA totals more than 30 billion triples (the largest dataset in LOD)
Triplification: PubMed 
• Collection of publications from the biomedical domain 
• Large amount of metadata (MeSH terms) 
• 23+ million publications 
• 10,000 new publications per month
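Triplification, sketched in Python: a minimal illustration of turning one publication record with its MeSH terms into N-Triples lines. This is not the authors' RDFizer. The `example.org` namespaces, the record layout (`pmid`, `title`, `mesh_terms`) and the function names are assumptions made for this sketch; only the Dublin Core `title` and `subject` property IRIs are real vocabulary terms.

```python
from typing import Dict, List
from urllib.parse import quote

# Hypothetical namespaces, chosen only for this illustration.
PUB_NS = "http://example.org/pubmed/"
MESH_NS = "http://example.org/mesh/"
# Real Dublin Core Terms properties.
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def triplify(record: Dict) -> List[str]:
    """Return one N-Triples line for the title and one per MeSH term."""
    subject = "<" + PUB_NS + quote(str(record["pmid"])) + ">"
    title = escape_literal(record["title"])
    triples = [subject + " <" + DCT_TITLE + '> "' + title + '" .']
    for term in record.get("mesh_terms", []):
        # Percent-encode the term so it is safe inside an IRI.
        obj = "<" + MESH_NS + quote(term) + ">"
        triples.append(subject + " <" + DCT_SUBJECT + "> " + obj + " .")
    return triples


if __name__ == "__main__":
    example = {"pmid": "12345", "title": "An example title",
               "mesh_terms": ["Breast Neoplasms"]}
    print("\n".join(triplify(example)))
```

A production converter would more likely use an RDF library and the official identifiers; string building is used here only to keep the sketch dependency-free.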
Big Data Continuous Integration 
[Architecture diagram. Labels shown: TopFed (Parser, Federator, Optimizer, Integrator, Index); SPARQL Query; Sub-query; Results; PubMed Entrez Utilities; TCGA Data Portal; RDFizer; Auto Loader; three SPARQL endpoints, each over RDF data.]
[Diagram: tree of tumour SPARQL endpoints grouped by data category, labelled "Highly Scalable".] 
For each query triple t(s, p, o) ∈ T 
Sets 
• A = {chromosome, result, bcr_patient_barcode} 
• G = {start, stop} 
• D = {seg_mean, rpmmm, scaled_est, p_exp_val} 
• E = {RPKM} 
• M = {beta_value, position} 
• C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical} 
• F = {Expression-Exon} 
• B = {DNA-Methylation} 
Conditions 
• C-1 = {{p ∈ {D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}} 
• C-2 = {{p ∈ {E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}} 
• C-3 = {{p ∈ {M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}} 
Endpoint tree (tumour numbers served by each SPARQL endpoint) 
• C-1 ∨ Category (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical), colour = blue 
– b1: 1-16, b2: 17-33 
• C-2 ∨ Category (Exon-Expression), colour = pink 
– p1: 1-5, p2: 6-11, p3: 12-16, p4: 17-22, p5: 23-27, p6: 28-33 
• C-3 ∨ Category (Methylation), colour = green 
– g1: 1-4, g2: 5-8, g3: 9-12, g4: 13-16, g5: 17-20, g6: 21-24, g7: 25-27, g8: 28-30, g9: 31-33 
Tumour lookup 
• If the tumour lookup is successful, forward to the corresponding leaf 
• Else broadcast to every leaf
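Source selection, sketched in Python: a simplified illustration of the routing above, not the TopFed implementation. It covers only the first half of each condition (membership of the predicate in the sets, or rdf:type with a matching class) and deliberately omits the S-Join and P-Join tests, so it can select more endpoints than the full rules would. The names `RULES`, `LEAVES`, `categories` and `select_endpoints` are assumptions of this sketch, and the pairing of endpoint names with tumour ranges is read off the order of the labels on the slide.

```python
from typing import List, Optional, Set

# Predicate and class sets as given on the slide.
A = {"chromosome", "result", "bcr_patient_barcode"}
G = {"start", "stop"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
M = {"beta_value", "position"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
F = {"Expression-Exon"}
B = {"DNA-Methylation"}

# colour -> (admissible predicates, admissible rdf:type objects).
# Simplification: the S-Join / P-Join parts of C-1..C-3 are left out.
RULES = {
    "blue": (D | A | G, C),
    "pink": (E | A | G, F),
    "green": (M | A, B),
}

# colour -> leaves as (endpoint, first tumour, last tumour).
LEAVES = {
    "blue": [("b1", 1, 16), ("b2", 17, 33)],
    "pink": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
             ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "green": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12), ("g4", 13, 16),
              ("g5", 17, 20), ("g6", 21, 24), ("g7", 25, 27),
              ("g8", 28, 30), ("g9", 31, 33)],
}


def categories(p: str, o: Optional[str] = None) -> Set[str]:
    """Colours whose predicate/class test accepts the triple pattern (?, p, o)."""
    matched = set()
    for colour, (predicates, classes) in RULES.items():
        if p in predicates or (p == "rdf:type" and o in classes):
            matched.add(colour)
    return matched


def select_endpoints(p: str, o: Optional[str] = None,
                     tumour: Optional[int] = None) -> List[str]:
    """Forward to the matching leaf if the tumour is known, else broadcast."""
    selected = []
    for colour in sorted(categories(p, o)):
        for endpoint, first, last in LEAVES[colour]:
            if tumour is None or first <= tumour <= last:
                selected.append(endpoint)
    return selected


if __name__ == "__main__":
    print(select_endpoints("beta_value", tumour=10))
    print(select_endpoints("RPKM"))
```

Predicates in A (for example `chromosome`) match all three categories in this sketch; in the conditions on the slide, the join tests are what narrow such patterns down to one category.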
Evaluation: Number of Sub-Query Submissions 
[Bar chart: number of sub-queries submitted by FedX and TopFed for queries 1-10 and their average; y-axis from 0 to 60.] 
• TopFed submits one third as many sub-queries as FedX 
• Number of ASK requests 
– FedX 480 
– TopFed 10
Evaluation: Query Runtime 
[Chart: query execution time (msec, log scale from 1 to 100,000) of FedX and TopFed for queries 1-10 and their average.] 
• TopFed significantly outperforms FedX on 90% of the queries 
• On average, the query runtime of TopFed is about one third of that of FedX 
• TopFed's best runtimes (query 2, query 3) are more than 75 times smaller than those of FedX
Big Data Track Requirements 
• Data Volume 
– 7.36 billion triples from Linked TCGA 
– 23 million publications from PubMed 
• Data Variety 
– The Linked TCGA data was extracted from raw text files of different structures 
– The metadata associated with PubMed publications was processed and transformed into RDF 
– Unstructured data (publication abstracts) was processed to extract mentions of gene names and cancers 
• Data Velocity 
– TCGA data doubles every 2 months 
– PubMed grows by 10,000 publications per month
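Mention extraction, sketched in Python: the slides do not say how gene and cancer mentions were extracted from abstracts, so this is only a generic dictionary lookup to illustrate the idea. The function name `find_mentions` and the vocabularies in the usage lines are assumptions of this sketch.

```python
import re
from typing import Iterable, List


def find_mentions(abstract: str, vocabulary: Iterable[str]) -> List[str]:
    """Return the vocabulary terms that occur as whole words in the abstract."""
    found = []
    for term in vocabulary:
        # Look-arounds keep "TP53" from matching inside a longer token.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, abstract, flags=re.IGNORECASE):
            found.append(term)
    return found


if __name__ == "__main__":
    text = "Mutations in TP53 are frequent in glioblastoma."
    print(find_mentions(text, ["TP53", "BRCA1"]))
    print(find_mentions(text, ["glioblastoma", "breast cancer"]))
```

Case-insensitive matching is convenient for cancer names but can produce false positives for short gene symbols; a real pipeline would need disambiguation beyond this lookup.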
Big Data Visualization
Tumor-wise Visualization
PubMed Paper-wise Visualization
Genome-wise Patients Results Visualization
Everything is Public 
• Demo: http://srvgal78.deri.ie/tcga-pubmed/ 
• TopFed: https://code.google.com/p/topfed/ 
• TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ 
• Utilities: http://goo.gl/kNrFdI 
• Linked TCGA : http://tcga.deri.ie/ 
saleem@informatik.uni-leipzig.de 
AKSW, University of Leipzig, Germany

 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 

Fostering Serendipity through Big Linked Data

  • 1. Fostering Serendipity through Big Linked Data. Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo. Semantic Web Challenge at ISWC 2013, October 21-25, 2013, Sydney, Australia
  • 2. Agenda: Motivation, Datasets, Architecture, Evaluation, Requirements, Demo, Conclusion and Future Work
  • 3. Motivation: Fostering Serendipity through Big Data: Triplification, Continuous Integration, and Visualization
  • 4. Triplification: Linked TCGA
      • TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
          – 9000 patients
          – 33 cancer types
          – 147,645 raw data files
          – 12.7 TB
      • This is only 46% of the total expected data, with new data being submitted every day
      • The goal is to enable cancer researchers to make and validate important discoveries
      • Total Linked TCGA > 30 billion triples (largest dataset of LOD)
  • 5. Triplification: PubMed
      • Collection of publications from the biomedical domain
      • Large amount of metadata (MeSH terms)
      • 23+ million publications
      • 10,000 new publications/month
  • 6. Big Data Continuous Integration [Architecture diagram. Labelled components: TopFed (Parser, Federator, Optimizer, Integrator); Index; SPARQL query, sub-query, and results; three SPARQL endpoints over RDF data; PubMed, Entrez Utilities, RDFizer, Auto Loader, TCGA Data Portal]
  • 7. Source selection (Highly Scalable). For each query triple t(s, p, o) ∈ T:
      • Sets
          – C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
          – D = {seg_mean, rpmmm, scaled_est, p_exp_val}
          – A = {chromosome, result, bcr_patient_barcode}
          – G = {start, stop}
          – E = {RPKM}
          – F = {Expression-Exon}
          – M = {beta_value, position}
          – B = {DNA-Methylation}
      • Rules
          – C-1 = {{p ∈ {D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}}
          – C-2 = {{p ∈ {E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}}
          – C-3 = {{p ∈ {M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}}
      • Category colours
          – C-1 ∨ Category Colour = blue (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical)
          – C-2 ∨ Category Colour = pink (Exon-Expression)
          – C-3 ∨ Category Colour = green (Methylation)
      • If the tumour lookup is successful, forward to the corresponding leaf; else broadcast to every one
      • Tumours SPARQL endpoints
          – Endpoint labels: b1 b2; p1 p2 p3 p4 p5 p6; g1 g2 g3 g4 g5 g6 g7 g8 g9
          – Tumour ranges: 1-16, 17-33; 1-5, 6-11, 12-16, 17-22, 23-27, 28-33; 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-27, 28-30, 31-33
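The rules on slide 7 can be sketched in code, with two loud caveats. First, the slide does not define S-Join and P-Join, so the sketch below evaluates only the first conjunct of each rule (the predicate and `rdf:type` membership test) and should be read as a coarse pre-filter, not as TopFed's actual source selection. Second, the `LEAVES` table assumes that the labels b1-b2, p1-p6, and g1-g9 are the blue, pink, and green endpoints and that they pair with the tumour ranges in the order listed; the slide text does not state this. The function names `candidate_categories` and `select_endpoints` are hypothetical.

```python
RDF_TYPE = "rdf:type"

# Predicate and class sets as listed on the slide.
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
A = {"chromosome", "result", "bcr_patient_barcode"}
G = {"start", "stop"}
E = {"RPKM"}
F = {"Expression-Exon"}
M = {"beta_value", "position"}
B = {"DNA-Methylation"}

# First conjunct of each rule only: (allowed predicates, allowed rdf:type objects).
# The S-Join / P-Join conjunct is omitted because the slide does not define it.
RULES = {
    "C-1": (D | A | G, C),
    "C-2": (E | A | G, F),
    "C-3": (M | A, B),
}

# ASSUMPTION: endpoint labels paired with tumour ranges in the order they appear.
LEAVES = {
    "C-1": [("b1", 1, 16), ("b2", 17, 33)],
    "C-2": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
            ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "C-3": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12),
            ("g4", 13, 16), ("g5", 17, 20), ("g6", 21, 24),
            ("g7", 25, 27), ("g8", 28, 30), ("g9", 31, 33)],
}


def candidate_categories(p, o=None):
    """Return the rule names whose predicate/type test accepts triple (?, p, o)."""
    matched = set()
    for name, (predicates, classes) in RULES.items():
        if p in predicates or (p == RDF_TYPE and o in classes):
            matched.add(name)
    return matched


def select_endpoints(category, tumour=None):
    """Forward to the matching leaf if the tumour lookup succeeds, else broadcast."""
    leaves = LEAVES[category]
    if tumour is not None:
        hits = [label for label, low, high in leaves if low <= tumour <= high]
        if hits:
            return hits
    # Lookup failed or no tumour given: broadcast to every leaf of the category.
    return [label for label, _, _ in leaves]
```

For instance, `candidate_categories("chromosome")` matches all three rules because `chromosome` is in the shared set A, which is presumably why the omitted join conditions are needed to disambiguate such triples.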
  • 8. Evaluation: Number of Sub-Query Submissions [Chart: number of sub-queries submitted by FedX and TopFed for queries 1-10 and their average]
      • TopFed submits one third as many sub-queries as FedX
      • Number of ASK requests
          – FedX: 480
          – TopFed: 10
  • 9. Evaluation: Query Runtime [Chart: average query execution time (msec, log scale) of FedX and TopFed for queries 1-10]
      • TopFed outperforms FedX significantly on 90% of the queries
      • On average, the query runtime of TopFed is about one third of that of FedX
      • TopFed's best runtime (query 2, query 3) is more than 75 times smaller than that of FedX
  • 10. Big Data Track Requirements
      • Data Volume
          – 7.36 billion triples from Linked TCGA
          – 23 million publications from PubMed
      • Data Variety
          – The Linked TCGA data was extracted from raw text files of different structures
          – The metadata associated with PubMed publications was processed and transformed into RDF
          – Unstructured data (publication abstracts) is processed to extract mentions of gene names and cancers
      • Data Velocity
          – TCGA data doubles every 2 months
          – PubMed publications: 10k/month
  • 15. Everything is Public
      • Demo: http://srvgal78.deri.ie/tcga-pubmed/
      • TopFed: https://code.google.com/p/topfed/
      • TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ
      • Utilities: http://goo.gl/kNrFdI
      • Linked TCGA: http://tcga.deri.ie/
      • Contact: saleem@informatik.uni-leipzig.de, AKSW, University of Leipzig, Germany