The Progress on Sagace and
Data Integration

Maori Ito
Two main topics
• Sagace
–Cross Search

• RDF
–Data Integration

Sagace
• Search for Biomedical Data &
Resources in Japan
Features
• Focus on biomedical databases
• Manual Semi-automated Ranking
• Refining search results with facets
• More informative search results with metadata
Mechanism of Search Engine
1. Crawling
2. Indexing
3. Query Processing
4. Scoring
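
The four stages above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not Sagace's implementation: the in-memory `databases` dictionary stands in for crawled sites, the function names are hypothetical, and the score is simply the number of query words a document contains.

```python
from collections import defaultdict

def crawl(databases):
    """Crawling stand-in: yield (doc_id, text) pairs from in-memory 'databases'."""
    for db_name, records in databases.items():
        for i, text in enumerate(records):
            yield f"{db_name}:{i}", text

def build_index(documents):
    """Indexing: map each lowercased token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents:
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Query processing and scoring: score = number of query tokens matched."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sample data, not taken from Sagace.
databases = {
    "drugs": ["acetaminophen pain relief", "imiglucerase enzyme replacement"],
    "cells": ["human cell line", "mouse cell line pain model"],
}
index = build_index(crawl(databases))
results = search(index, "pain relief")
print(results)
```

A real engine would add tokenization for Japanese text, persistent storage, and a much richer scoring function.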
Crawling

Databases

Crawling Program
Indexing
• Split the data into chunks of a convenient size and store
them on our own server
Indexing Data

Internal Server
Query Processing and Scoring
Search System
NIBIO

NBDC / DBCLS

AgriTogo

MEDALS

Collaborate by
using P2P
architecture

JCGGDB

What is the most important
thing in cross search?

Speed and accuracy!
Features

• Focus on biomedical database
• Semi-automated Ranking
Log analysis reflected in
search results
• The top 8 databases are
almost always the same:
– Patents
– KEGG MEDICUS
– Medicine and pharmaceutical proceedings
– Drug emergency call
– Ingredients information of health food
– Merck Manual
– Medical Information Network Distribution Service
– The Encyclopedia of Psychoactive Drugs

Comparison of databases
• Popular databases are medical or
pharmaceutical “literal rich”
databases.
• Top databases run away with the
winnings!
• More than half of the databases have never
been clicked!
Log data has been reflected in
ranking.
• Original scores -> A: 12,000, B: 8,000
• Gather click data
• Remove duplicate clicks on the same database within the same day
and keep the lowest denotative rank.
– If the database score is lower than 12,400, add 200.
– Otherwise, add 100 by default. But if the
database's denotative rank is lower than 10, add 200.

• The Patents score is fixed at 8,100.
• The maximum score is 30,000.
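
The update rules above could be sketched roughly as follows. The slide leaves several details open, so this block rests on loudly stated assumptions: "lowest rank" and "rank lower than 10" are both read as a position further down the result list (a larger rank number), each deduplicated click applies one boost, and the function names, sample dates, and database "C" are hypothetical.

```python
MAX_SCORE = 30_000
FIXED_SCORES = {"Patents": 8_100}  # scores that click logs never change

def dedupe_clicks(clicks):
    """Keep one click per (database, day).

    Assumption: the "lowest" rank is the position furthest down the
    result list, i.e. the largest rank number.
    """
    best = {}
    for db, day, rank in clicks:
        key = (db, day)
        if key not in best or rank > best[key]:
            best[key] = rank
    return best

def boost(score, rank):
    """Return the increment for one deduplicated click.

    Assumption: "rank lower than 10" means a rank number above 10.
    """
    if score < 12_400 or rank > 10:
        return 200
    return 100

def apply_clicks(scores, clicks):
    """Apply click boosts, then enforce fixed scores and the maximum."""
    updated = dict(scores)
    for (db, _day), rank in sorted(dedupe_clicks(clicks).items()):
        if db in FIXED_SCORES:
            continue
        updated[db] = min(MAX_SCORE, updated[db] + boost(updated[db], rank))
    for db, fixed in FIXED_SCORES.items():
        if db in updated:
            updated[db] = fixed
    return updated

# Hypothetical scores and click log; A and B use the slide's starting values.
scores = {"A": 12_000, "B": 8_000, "C": 29_950, "Patents": 8_100}
clicks = [
    ("A", "2013-01-10", 3),
    ("A", "2013-01-10", 5),   # same database, same day: counted once
    ("A", "2013-01-11", 1),
    ("C", "2013-01-10", 2),
    ("Patents", "2013-01-10", 1),
]
new_scores = apply_clicks(scores, clicks)
print(new_scores)
```

With this reading, database A gains 200 on each of two days, C is capped at the maximum, and Patents stays fixed.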
Unpopular databases
• Sagace started its service in
March 2012.
• Some databases have never been clicked
since then.
• Eliminate these databases.
• Databases
– 272 DB -> 122 DB
Results
• Accuracy for users must have
improved.
• Reducing the number of databases also
sped up the search.
Specific databases in life
science
• Some life science databases
lack “literal information”.
• A cross search engine is well suited to
showing literal information.
• Metadata will help these databases.
Metadata
• If the developers mark up data with
metadata…

Metadata
• Literal information can be added to
search results!

Results Image
How to mark up and reflect the
results?
【HTML】

Declare scope itemtype with normal html tag

<div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry">
<span
>2012-10-24</span>
</div>

Select property
【Result】

Content
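
How a crawler might pick up such markup can be sketched with Python's standard `html.parser` module. This is an illustrative sketch, not the Sagace crawler: the sample page, the values `EXAMPLE0001` and `Homo sapiens`, and the class name are hypothetical, and the `itemprop` names are borrowed from the property list defined for Sagace (entryID, taxon). The sketch does not handle nested item scopes or nested tags of the same name.

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Collect itemtype values and itemprop -> text pairs from a page."""

    def __init__(self):
        super().__init__()
        self.itemtypes = []
        self.properties = {}
        self._open_props = []  # stack of (tag, itemprop) currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # itemscope is a boolean attribute, so only its presence matters.
        if "itemscope" in attrs and "itemtype" in attrs:
            self.itemtypes.append(attrs["itemtype"])
        if "itemprop" in attrs:
            self._open_props.append((tag, attrs["itemprop"]))
            self.properties.setdefault(attrs["itemprop"], "")

    def handle_data(self, data):
        # Text inside an element with itemprop is that property's content.
        if self._open_props:
            _tag, prop = self._open_props[-1]
            self.properties[prop] += data

    def handle_endtag(self, tag):
        if self._open_props and self._open_props[-1][0] == tag:
            self._open_props.pop()

# Hypothetical page fragment for illustration only.
page = """
<div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="entryID">EXAMPLE0001</span>
  <span itemprop="taxon">Homo sapiens</span>
</div>
"""
parser = MicrodataParser()
parser.feed(page)
print(parser.itemtypes, parser.properties)
```

The extracted properties could then be stored alongside the indexed page and shown in the search result.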
Win Win Win!
• Database developers can showcase rich
database information.
• Users can find valuable information
easily.
• The crawler program can find this
metadata reliably.
What is schema.org?
• “Schema.org is a set of extensible schemas that
enables webmasters to embed structured data on
their web pages for use by search engines and other
applications.”
• “Search engines including Bing, Google, Yahoo! and
Yandex rely on this markup to improve the display of
search results, making it easier for people to find the
right web pages.”
(http://schema.org/)
Microdata
“You use the schema.org vocabulary, along
with the microdata format, to add information to
your HTML content.”
(http://schema.org/docs/gs.html)

• The proposed schema.org extension must be
finalized before major search engines can show
“rich” results.
Current Situation
• We defined original properties
(entryID, isEntryOf, taxon, seeAlso, reference).
• For details, please refer to
– http://sagace.nibio.go.jp/press/metadata/markup/
6 DBs, 1 catalog, and 1 DB
archive have applied microdata!
• DoBISCUIT (Database Of BIoSynthesis clusters CUrated and InTegrated)
• JCRB Cell Bank
• Functional Glycomics with KO mice database
• Glyco-Disease Genes Database
• JCGGDB Report
• MEDALS
• Integbio Database Catalog
• Life Science Database Archive
To add biological database
vocabularies to schema.org
• “Need more people who think it is a good
idea.” (by the organizers at schema.org)
– Mailing list: public-vocabs@w3.org (let’s join!)

• We need more databases and web pages
that are marked up with microdata.
• I want your opinion on microdata.
• Let's talk!
Data Integration with RDF

http://www.mkbergman.com/968/a-new-best-friend-gephi-for-large-scale-networks/

http://www.cytoscape.org/what_is_cytoscape.html
What is RDF?
• Resource Description Framework
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
drugbank:DB00316 rdfs:label "Acetaminophen" .

[Diagram: Subject (drugbank:DB00316) -> Predicate (rdfs:label) -> Object ("Acetaminophen")]
RDF

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
@prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .
# subject predicate object
drugbank:DB00316 rdfs:label "Acetaminophen" ;
    drugbank_vocab:target drugbank_target:290 .
drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2" .

[Diagram: Subject (drugbank:DB00316) -> Predicate (rdfs:label) -> Object ("Acetaminophen");
Subject (drugbank:DB00316) -> Predicate (drugbank_vocab:target) -> Object / Subject (drugbank_target:290);
Subject (drugbank_target:290) -> Predicate (rdfs:label) -> Object ("Prostaglandin G/H synthase 2")]
SPARQL (SPARQL Protocol and
RDF Query Language)
• “SPARQL (pronounced "sparkle", a
recursive acronym for SPARQL
Protocol and RDF Query Language)
is an RDF query language, that is, a
query language for databases, able
to retrieve and manipulate data
stored in Resource Description
Framework format.”
(http://en.wikipedia.org/wiki/SPARQL)
How to use?
RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
@prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .
drugbank:DB00316 rdfs:label "Acetaminophen" ;
drugbank_vocab:target drugbank_target:290 .
drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2".
SPARQL
# What is the target of "Acetaminophen"?
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
SELECT DISTINCT ?v WHERE {  # DISTINCT excludes duplicates
  ?s rdfs:label "Acetaminophen" ;
     drugbank:target ?t .
  ?t rdfs:label ?v .
}

Results!
"Prostaglandin G/H synthase 2"
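
To make the matching behind such a query concrete, here is a small pure-Python sketch that treats the three triples from the slide as tuples and chains three pattern matches, one per line of the WHERE clause. It is a simplification for illustration, not how a SPARQL engine is actually implemented; the function names `match` and `target_labels` are hypothetical.

```python
# The three triples from the slide, written as (subject, predicate, object).
TRIPLES = [
    ("drugbank:DB00316", "rdfs:label", "Acetaminophen"),
    ("drugbank:DB00316", "drugbank_vocab:target", "drugbank_target:290"),
    ("drugbank_target:290", "rdfs:label", "Prostaglandin G/H synthase 2"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None is a wildcard, like a ?variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def target_labels(triples, drug_label):
    """Follow label -> target -> label, mirroring the three query patterns."""
    labels = set()  # a set plays the role of DISTINCT
    for s, _, _ in match(triples, p="rdfs:label", o=drug_label):
        for _, _, t in match(triples, s=s, p="drugbank_vocab:target"):
            for _, _, v in match(triples, s=t, p="rdfs:label"):
                labels.add(v)
    return sorted(labels)

answer = target_labels(TRIPLES, "Acetaminophen")
print(answer)
```

Each nested loop binds one more variable (?s, then ?t, then ?v), which is the same join the query expresses declaratively.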
SPARQL Endpoint
e.g. http://drugbank.bio2rdf.org/sparql

What is the target of “Acetaminophen”?

Results
• You can get results from the
endpoint.
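
Programmatic access usually follows the SPARQL Protocol, where a query is sent as the `query` parameter of an HTTP GET request. The sketch below only builds the request URL with the standard library and does not contact the server; the endpoint address is the one from the slides and may no longer be online, and the helper name `build_request_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the slides; its availability today is an assumption.
ENDPOINT = "http://drugbank.bio2rdf.org/sparql"

QUERY = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
SELECT DISTINCT ?v WHERE {
  ?s rdfs:label "Acetaminophen" ;
     drugbank:target ?t .
  ?t rdfs:label ?v .
}
"""

def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The URL could then be fetched with `urllib.request`, typically with an `Accept: application/sparql-results+json` header to request JSON results. The labels stored at a live endpoint may differ from the simplified ones on the slide.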

RDFization in life science
• Much data has already been RDFized.
• Affymetrix, DrugBank, GO, OMIM,
KEGG, PDB, UniProt, PubMed...

Let’s try!
• Bio2RDF
– http://bio2rdf.org/

• EBI RDF Platform
– http://www.ebi.ac.uk/rdf/

• SPARQL endpoint
– e.g. http://drugbank.bio2rdf.org/sparql

• How to learn?
– Learning SPARQL
Pros of RDF
• Works well with life science data
• Comparison to RDB
– Easily expanded
– RDB  RDF

• Works well with NoSQL too
– key value
Cons of RDF
• Creating RDF is a bit hard
• Setting up a development
environment is a bit hard
• SPARQL query speed

Current situation in NIBIO
• Toxygates
– Johan-san and Igarashi-san have been
developing it.

• Orphan Drug Data

Toxygates
• RDFization of Open TG-GATEs data.
– microarray data, pathological data
(kidney, liver, grade, ...)

• Linked to other databases
using RDF
– KEGG pathway
– GO terms
– ChEMBL
– DrugBank

http://toxygates.nibio.go.jp/

Orphan Drug
• RDFize orphan drug information in
NIBIO.

<http://www.nibio.go.jp/orphanDrugTarget#80> drgn:designationFiscalYear "1996";
drgn:designationDate "1996/4/1";
drgn:number "(8yaku A) No. 81";
drgb:name "Imiglucerase";
dc:description "Improvement of symptoms of anaemia, thrombocytopenia, hepatosp
drgn:designationApplicant "Genzyme Japan K.K.";
drgb:pharmacology "Improvement of symptoms of anaemia, thrombocytopenia, hep
drgb:manufacturer "Genzyme Japan K.K.";
eob:approvalDate "1998/3/6";
drgb:product "Cerezyme injection 200U";
drgb:brand "CEREZYME_ injection";
drgn:approvedName "Imiglucerase (Genetical Recombination)";
drgn:status "Approved".
Let’s try it, and give me your
ideas!
• RDF data will enlarge many kinds of
data in life science.
• NBDC encouraged this movement.

Future Perspective
• RDFize other databases in NIBIO
– E.g. bioresource

• Examine the benefit
• Spread RDF to many scientists
• Build useful environments for those who
are not familiar with computers
