The Progress on Sagace and Data Integration

Maori Ito

Two main topics
• Sagace
– Cross Search

• RDF
– Data Integration


Sagace
• Search for Biomedical Data & Resources in Japan

Features
• Focus on biomedical databases
• Manual / semi-automated ranking
• Refining search results with facets
• More informative search results with metadata

Mechanism of Search Engine
1. Crawling
2. Indexing
3. Query Processing
4. Scoring
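
The four steps above can be sketched in a few lines of Python. This is only a toy illustration, not Sagace's implementation: the crawled pages, their identifiers, and the term-count scoring are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for step 1 (crawling): page identifier -> page text.
# A real crawler would fetch these pages from the databases over HTTP.
CRAWLED = {
    "db-a/entry1": "acetaminophen drug target information",
    "db-b/entry7": "cell bank catalog of human cell lines",
    "db-a/entry2": "drug interaction data for acetaminophen and warfarin",
}

def build_index(pages):
    """Step 2 (indexing): map each term to the pages containing it, with counts."""
    index = defaultdict(Counter)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term][page_id] += 1
    return index

def search(index, query):
    """Steps 3 and 4 (query processing, scoring): sum term counts per page."""
    scores = Counter()
    for term in query.lower().split():
        for page_id, count in index.get(term, {}).items():
            scores[page_id] += count
    # Highest score first; ties are broken by page identifier.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_index(CRAWLED)
results = search(index, "acetaminophen drug")
```

Here `results` lists the two pages that mention both query terms; the unrelated page does not appear.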

Crawling

[Figure: databases and the crawling program]

Indexing
• Split data into units of a convenient size and store them on our own server
[Figure: indexing data, internal server]

Search System
• Collaborate using a P2P architecture
– NIBIO
– NBDC / DBCLS
– AgriTogo
– MEDALS
– JCGGDB

What is the most important thing in cross search?

Speed and accuracy!

Features

• Focus on biomedical databases
• Semi-automated Ranking

Log analysis reflected in search results
• The members of the top 8 databases are almost the same.
– Patents
– KEGG MEDICUS
– Medicine and pharmaceutical proceedings
– Drug emergency call
– Ingredients information of health food
– Merck Manual
– Medical Information Network Distribution Service
– The Encyclopedia of Psychoactive Drugs

Comparison of databases
• Popular databases are medical or pharmaceutical "literal-rich" (text-rich) databases.
• The top databases take most of the clicks!
• More than half of the databases have never been clicked!

Log data has been reflected in the ranking.
• Original scores -> A: 12,000, B: 8,000
• Gather click data
• Eliminate duplicate clicks on a database within the same day and keep the lowest displayed rank.
– If the database score is lower than 12,400, add 200.
– Otherwise, add 100 as a rule. But if the database's displayed rank is lower than 10, add 200.

• The Patents score is fixed at 8,100.
• The maximum score is 30,000.
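
The update rules above could be sketched roughly as follows. This is one reading of the slide, not the actual Sagace code. The assumptions are: a "lower" rank means a numerically larger position in the result list, each deduplicated (day, database) click triggers one score update, and the click log is a list of hypothetical `(day, database, displayed_rank)` tuples.

```python
MAX_SCORE = 30_000
FIXED_SCORES = {"Patents": 8_100}  # the Patents score never changes

def dedupe_clicks(clicks):
    """Keep one click per (day, database), with the lowest displayed rank.

    Assumption: "lowest" means the numerically largest position.
    """
    kept = {}
    for day, db, rank in clicks:
        key = (day, db)
        kept[key] = max(rank, kept.get(key, rank))
    return kept

def apply_clicks(scores, clicks):
    """Return new scores after applying the click-based bonuses."""
    updated = dict(scores)
    for (day, db), rank in sorted(dedupe_clicks(clicks).items()):
        if db in FIXED_SCORES:
            continue
        current = updated[db]
        # Add 200 for low-scoring databases or for clicks far down the list.
        bonus = 200 if current < 12_400 or rank > 10 else 100
        updated[db] = min(current + bonus, MAX_SCORE)
    for db, fixed in FIXED_SCORES.items():
        if db in updated:
            updated[db] = fixed
    return updated

# Hypothetical click log for illustration only.
scores = {"A": 12_000, "B": 8_000, "Patents": 8_100}
clicks = [
    ("2013-01-10", "A", 3), ("2013-01-10", "A", 12),  # duplicates on one day
    ("2013-01-10", "B", 2),
    ("2013-01-11", "A", 1),
    ("2013-01-11", "Patents", 1),
    ("2013-01-12", "A", 1),
]
new_scores = apply_clicks(scores, clicks)
```

With this log, database A gains 200 twice while it is below 12,400 and then 100 once, B gains 200, and Patents stays fixed.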

Unpopular databases
• Sagace started its service in March 2012.
• Some databases have never been clicked since then.
• These databases were eliminated.
• Databases
– 272 DB -> 122 DB

Results
• Accuracy for users must have improved.
• Reducing the number of databases also sped up the search.

Specific databases in life science
• Some databases in life science lack "literal information".
• A cross-search engine is well suited to showing literal information.
• Metadata will help these databases.

Metadata
• If the developers mark up data with
metadata…


Metadata
• Literal information can be added to search results!

[Figure: image of search results]

How to mark up and reflect the results?
【HTML】
<div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry">
<span>2012-10-24</span>
</div>
• Declare the scope and itemtype with a normal HTML tag (the div).
• Select the property (on the span).
【Result】
• Content

Win Win Win!
• Database developers can showcase rich database information.
• Users can find valuable information easily.
• Crawler programs can find this metadata reliably.
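
As a rough illustration of how a crawler might pick up such markup, the sketch below reads microdata with Python's standard `html.parser`. It is not the Sagace crawler. It handles only flat items (no nested scopes), the `entryID` property name comes from the property list in these slides, and the sample value `EXAMPLE-0001` is made up.

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Collect itemprop text values inside elements that declare itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []    # one dict per itemscope element
        self._prop = None  # name of the itemprop currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.items.append({"itemtype": attrs.get("itemtype"), "properties": {}})
        elif "itemprop" in attrs and self.items:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        # Store the first non-empty text found after an itemprop start tag.
        if self._prop and data.strip():
            self.items[-1]["properties"][self._prop] = data.strip()
            self._prop = None

    def handle_endtag(self, tag):
        self._prop = None

# Hypothetical page fragment; the entry ID is invented for the example.
SAMPLE = """
<div itemscope itemtype="http://schema.org/BiologicalDatabaseEntry">
  <span itemprop="entryID">EXAMPLE-0001</span>
</div>
"""

parser = MicrodataParser()
parser.feed(SAMPLE)
items = parser.items
```

After parsing, `items` holds one entry with its item type and an `entryID` property, which a search engine could show next to the hit.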


What is schema.org?
• "Schema.org is a set of extensible schemas that enables webmasters to embed structured data on their web pages for use by search engines and other applications."
• "Search engines including Bing, Google, Yahoo! and Yandex rely on this markup to improve the display of search results, making it easier for people to find the right web pages."
(http://schema.org/)

Microdata
“You use the schema.org vocabulary, along
with the microdata format, to add information to
your HTML content.”
(http://schema.org/docs/gs.html)

• Finalizing the proposed schema.org extension is required before major search engines can show "rich" results.

Current Situation
• We defined our own properties (entryID, isEntryOf, taxon, seeAlso, reference).
• Please refer to
– http://sagace.nibio.go.jp/press/metadata/markup/

6 DBs, 1 catalog and 1 DB archive have applied microdata!
• DoBISCUIT (Database Of BIoSynthesis clusters CUrated and InTegrated)
• JCRB Cell Bank
• Functional Glycomics with KO mice database
• Glyco-Disease Genes Database
• JCGGDB Report
• MEDALS
• Integbio Database Catalog
• Life Science Database Archive

To add biological database
vocabularies into schema.org,
• “Need more people who think it is a good
idea.” (by organizers @ schema.org)
– public-vocabs@w3.org (mailing list; let's join!)

• We need more databases and web pages
that are marked up with microdata.
• I want your opinion on microdata.
• Let's talk!

Data Integration with RDF

http://www.mkbergman.com/968/a-new-best-friend-gephi-for-large-scale-networks/

http://www.cytoscape.org/what_is_cytoscape.html

What is RDF?
• Resource Description Framework
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
drugbank:DB00316 rdfs:label "Acetaminophen" .

RDF

[Figure: one triple. Subject: drugbank:DB00316, Predicate: rdfs:label, Object: "Acetaminophen"]

RDF

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
@prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .
# subject          predicate              object
drugbank:DB00316 rdfs:label "Acetaminophen" ;
    drugbank_vocab:target drugbank_target:290 .
drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2" .

[Figure: graph of the triples above. drugbank:DB00316 (subject) has rdfs:label "Acetaminophen" (object) and drugbank_vocab:target drugbank_target:290 (object / subject), which in turn has rdfs:label "Prostaglandin G/H synthase 2" (object)]

SPARQL (SPARQL Protocol and RDF Query Language)
• “SPARQL (pronounced "sparkle", a
recursive acronym for SPARQL
Protocol and RDF Query Language)
is an RDF query language, that is, a
query language for databases, able
to retrieve and manipulate data
stored in Resource Description
Framework format.”
(http://en.wikipedia.org/wiki/SPARQL)

How to use?
RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix drugbank: <http://bio2rdf.org/drugbank:> .
@prefix drugbank_vocab: <http://bio2rdf.org/drugbank_vocabulary:> .
@prefix drugbank_target: <http://bio2rdf.org/drugbank_target:> .
drugbank:DB00316 rdfs:label "Acetaminophen" ;
    drugbank_vocab:target drugbank_target:290 .
drugbank_target:290 rdfs:label "Prostaglandin G/H synthase 2" .

SPARQL
What is the target of "Acetaminophen"?
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
select distinct ?v where { # distinct excludes duplicates
    ?s rdfs:label "Acetaminophen" ;
       drugbank:target ?t .
    ?t rdfs:label ?v .
}

Results!
"Prostaglandin G/H synthase 2"
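
To make the matching behind this query concrete, here is a minimal pure-Python sketch that joins the three triple patterns over the three triples from the slide. It is not a SPARQL engine: triples are plain string tuples, literals and IRIs are not distinguished, and the helper names `match` and `query` are invented for this example.

```python
# The three triples from the slide, written as (subject, predicate, object).
TRIPLES = [
    ("drugbank:DB00316", "rdfs:label", "Acetaminophen"),
    ("drugbank:DB00316", "drugbank_vocab:target", "drugbank_target:290"),
    ("drugbank_target:290", "rdfs:label", "Prostaglandin G/H synthase 2"),
]

def match(pattern, bindings):
    """Yield bindings extended by one triple pattern ('?x' terms are variables)."""
    for triple in TRIPLES:
        candidate = dict(bindings)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield candidate

def query(patterns):
    """Join all patterns, carrying variable bindings from one to the next."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for b in match(pattern, s)]
    return solutions

# Same shape as the SPARQL query: ?s is labelled Acetaminophen, ?t is its target.
patterns = [
    ("?s", "rdfs:label", "Acetaminophen"),
    ("?s", "drugbank_vocab:target", "?t"),
    ("?t", "rdfs:label", "?v"),
]
targets = sorted({solution["?v"] for solution in query(patterns)})
```

The shared variables `?s` and `?t` are what link the patterns, which is how the query walks from the drug to the label of its target.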

SPARQL Endpoint
e.g. http://drugbank.bio2rdf.org/sparql

What is the target of "Acetaminophen"?

Results
• You can get results from the
endpoint.
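
An endpoint can also be queried from a program. The sketch below uses only the Python standard library and the usual SPARQL protocol conventions (a `query` parameter on a GET request, JSON results requested via the `Accept` header). It assumes the endpoint from the slides is still reachable and still stores labels in this exact form, which may not hold. The function names are invented for this example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://drugbank.bio2rdf.org/sparql"  # from the slides; may be offline

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank_vocabulary:>
SELECT DISTINCT ?v WHERE {
  ?s rdfs:label "Acetaminophen" ;
     drugbank:target ?t .
  ?t rdfs:label ?v .
}
"""

def build_request(endpoint, query):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )

def extract_values(results_json, var):
    """Pull the values bound to one variable out of a SPARQL JSON result."""
    return [
        row[var]["value"]
        for row in results_json["results"]["bindings"]
        if var in row
    ]

def run_query(endpoint, query, timeout=30):
    """Send the query over the network and return the decoded JSON result."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=timeout) as resp:
        return json.load(resp)

# Example use (requires network access):
#   values = extract_values(run_query(ENDPOINT, QUERY), "v")
```

Keeping request building and result parsing separate from the network call makes both easy to check without contacting the endpoint.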


RDFization in life science
• Much data has already been RDFized.
• Affymetrix, DrugBank, GO, OMIM, KEGG, PDB, UniProt, PubMed...

Let’s try!
• Bio2RDF
– http://bio2rdf.org/

• EBI RDF Platform
– http://www.ebi.ac.uk/rdf/

• SPARQL endpoint
– e.g. http://drugbank.bio2rdf.org/sparql

• How to learn?
– Learning SPARQL

Pros of RDF
• Well suited to life science data
• Comparison to RDB
– Easily expanded
– RDB  RDF

• Works well with NoSQL too
– key-value

Cons of RDF
• A bit hard to make RDF
• A bit hard to set up development environments
• Speed of SPARQL

Current situation in NIBIO
• Toxygates
– Johan-san and Igarashi-san have been developing it.

• Orphan Drug Data

Toxygates
• RDFization of Open TG-GATEs data
– microarray data, pathological data (kidney, liver, grade, ...)

• Linked to other databases using RDF
– KEGG pathway
– GO terms
– ChEMBL
– DrugBank

http://toxygates.nibio.go.jp/


Orphan Drug
• RDFize orphan drug information in
NIBIO.

<http://www.nibio.go.jp/orphanDrugTarget#80> drgn:designationFiscalYear "1996";
drgn:designationDate "1996/4/1";
drgn:number "(8yaku A) No. 81";
drgb:name "Imiglucerase";
dc:description "Improvement of symptoms of anaemia, thrombocytopenia, hepatosp
drgn:designationApplicant "Genzyme Japan K.K.";
drgb:pharmacology "Improvement of symptoms of anaemia, thrombocytopenia, hep
drgb:manufacturer "Genzyme Japan K.K.";
eob:approvalDate "1998/3/6";
drgb:product "Cerezyme injection 200U";
drgb:brand "CEREZYME_ injection";
drgn:approvedName "Imiglucerase (Genetical Recombination)";
drgn:status "Approved".

Let's try it, and give me your ideas!
• RDF will extend to many kinds of data in life science.
• NBDC encourages this movement.

Future Perspective
• RDFize other databases in NIBIO
– e.g. bioresources

• Examine the benefits
• Spread RDF to many scientists
• Build useful environments for people who are not familiar with computers
