The new EBI search engine: EB-eye Backend
An overview of what is under the hood
Industry Workshop, 21-22 May 2007
Franck Valentin, External Services group
Summary
What data is available?

More than 20 domains (Ligand among them), more than 137M entries, and more than 550 GB of data.
What data is available: formats

Formats shown for the domains (Ligand among them): XML files (<XML> . . . </XML>) and tagged flat files, for example:
  ID : ..   PARENT ID : ..   RANK : ..   ...
  ID ...   AC ...   DT ...
What data is available: sizes

Per-domain sizes shown: 43M, 4.2G, 57 GB (>500 files), 1G, 8.4G, 374 GB (>600 files), 6.3G, 25K, 81M.
Points to take into consideration
Parsing and indexing different formats

Flat files and XML files are parsed with one grammar per format. The extracted fields go to the indexer, which uses the Lucene API to build one index per domain.

  Sources:            flat files, XML files, Db (XML dump files)
  Parser (ANTLR):     EMBL grammar, Taxonomy grammar, UniProt grammar, ...
  Parser (ANTXR):     Medline grammar, InterPro grammar, dump file grammar, ...
  Indexer:            Lucene API
  Indexes:            UniProt index, EMBL index, Taxonomy index

Flat file example (EMBL):

  ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
  XX
  AC   AF030562;
  XX
  DT   04-DEC-1997 (Rel. 53, Created)
  DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
  XX
  DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
  DE   OPW-03, sequence tagged site.
  XX
  KW   STS.
  XX
  OS   Fusarium venenatum
  OC   Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes;
  OC   Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium.
  XX
  RN   [1]
  RP   1-852
  RA   Yoder W.T., Christianson L.M.;
  RT   "Species-specific primers resolve members of the section Fusarium.
  RT   Taxonomic status of the edible 'Quorn' fungus re-evaluated";
  RL   Fungal Genet. Biol. 0:0-0(1997).
  XX
  RN   [2]
  RP   1-852
  RA   Yoder W.T., Christianson L.M.;
  RT   ;
  RL   Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases.
  RL   Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616,
  RL   USA
  XX
  FH   Key             Location/Qualifiers
  FH
  FT   source          1..852
  FT                   /organism="Fusarium venenatum"
  FT                   /strain="ATCC20334"
  FT                   /db_xref="taxon:56646"
  . . .

Fields extracted from the EMBL entry: ID, AC, creation date / modification date, description, organism species, organism classes, references.

XML file examples (Medline):

  <MedlineCitationSet>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
      <PMID>10997935</PMID>
      <DateCreated>
        <Year>2000</Year> <Month>10</Month> <Day>04</Day>
      </DateCreated>
      …

  <MedlineCitationSet>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
      <PMID>14216186</PMID>
      <DateCreated>
        <Year>1965</Year> <Month>02</Month> <Day>01</Day>
      </DateCreated>
      <DateCompleted>
        <Year>1996</Year> <Month>12</Month> <Day>01</Day>
      </DateCompleted>
      <DateRevised>
        <Year>2007</Year> <Month>03</Month> <Day>01</Day>
      </DateRevised>
      <Article PubModel="Print">
        <Journal>
          <ISSN IssnType="Print">0009-8981</ISSN>
          <JournalIssue CitedMedium="Print">
            <Volume>10</Volume>
            <PubDate> <Year>1964</Year> <Month>Jul</Month> </PubDate>
          </JournalIssue>
          <Title>Clinica chimica acta; international journal of clinical chemistry</Title>
          <ISOAbbreviation>Clin. Chim. Acta</ISOAbbreviation>
        </Journal>
        . . .

Fields extracted from the Medline citation: ID, creation date, modification date, ISSN, volume, name.

Dump file (XML) example:

  <database>
    <name>IntAct.Experiment</name>
    <description>Experimental procedures that allowed to…</description>
    <release>1.0</release>
    <release_date>2007-Feb-16</release_date>
    <entry_count>5697</entry_count>
    <entries>
      <entry id="EBI-77680">
      …
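To make the flat-file step concrete, here is a minimal sketch of turning an EMBL-style record into named fields ready for indexing. It is not the EB-eye implementation: the slides describe ANTLR grammars feeding the Lucene API, whereas this sketch uses plain Python string handling. The function name `parse_embl_record` and the field names in `FIELD_NAMES` are hypothetical.

```python
from collections import defaultdict

# Two-letter EMBL line codes mapped to hypothetical index field names.
FIELD_NAMES = {
    "ID": "id",
    "AC": "accession",
    "DT": "dates",
    "DE": "description",
    "OS": "organism_species",
    "OC": "organism_classes",
}


def parse_embl_record(text):
    """Group the lines of one EMBL flat-file record by line code."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        # Skip separator lines (XX) and any code we do not index.
        if code in FIELD_NAMES and value.strip():
            grouped[FIELD_NAMES[code]].append(value.strip())
    # Continuation lines (for example two DE lines) are joined with a space.
    fields = {name: " ".join(values) for name, values in grouped.items()}
    # The ID line starts with the entry identifier, followed by other tokens.
    fields["id"] = fields["id"].split(";")[0]
    fields["accession"] = fields["accession"].rstrip(";")
    # Keep one date per DT line: creation first, then last update.
    fields["dates"] = [value.split(" ")[0] for value in grouped["dates"]]
    return fields


RECORD = """\
ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
XX
AC   AF030562;
XX
DT   04-DEC-1997 (Rel. 53, Created)
DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
XX
DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
DE   OPW-03, sequence tagged site.
XX
OS   Fusarium venenatum
"""

entry = parse_embl_record(RECORD)
print(entry["id"], entry["dates"])
```

A grammar-based parser earns its keep on the parts this sketch ignores, such as the nested reference blocks (RN, RA, RT, RL) and the feature table (FH, FT).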
Divide and conquer the indexing

  Domain      Entries    Input
  UniProt     >4M        2 files, ~9.4G
  EMBL        >83M       >600 files, ~375G
  Medline     >16M       >500 files, ~57G
  Taxonomy    >0.37M     1 file, ~81M
  GO          >0.23M     1 file, ~27M
  Others (ArrayExpress, Ensembl, IntAct, ...)

Inputs: XML files and XML dumps (from Db).
Processing: four workers with 8 CPUs each.
Output: UniProt index, EMBL index, Taxonomy index, Medline index, ArrayExpress index, Ensembl index, IntAct index.
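The divide-and-conquer idea can be sketched as: split the input into chunks, index each chunk independently, then combine the results. The sketch below is an assumption-laden illustration, not the EB-eye code. It uses a toy in-memory inverted index and a thread pool in place of Lucene and separate 8-CPU workers, and the slides do not say whether partial indexes are merged or kept separate. The names `index_chunk`, `merge_indexes` and `build_index` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_chunk(chunk):
    """Build a partial inverted index (term -> set of entry ids) for one chunk."""
    partial = defaultdict(set)
    for entry_id, text in chunk:
        for term in text.lower().split():
            partial[term].add(entry_id)
    return partial


def merge_indexes(partials):
    """Combine partial indexes by taking the union of their posting sets."""
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term] |= ids
    return merged


def build_index(entries, workers=4):
    """Split entries into `workers` chunks, index them concurrently, then merge."""
    chunks = [entries[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(index_chunk, chunks))
    return merge_indexes(partials)


# Toy entries built from identifiers that appear in the slides.
ENTRIES = [
    ("AF030562", "Fusarium venenatum clone VEN-A"),
    ("56646", "Fusarium venenatum"),
    ("EBI-77680", "experimental procedures"),
]

index = build_index(ENTRIES)
print(sorted(index["fusarium"]))
```

Because each chunk is indexed without reference to the others, the same structure carries over to separate processes or machines; only the merge step needs to see all partial results.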
Let's put some figures on it

Less than 18 hours to index all the EBI data.
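As a rough sanity check, the 18-hour figure can be turned into an aggregate throughput. This assumes the run covers the full ">137M entries" and ">550 Gb" quoted earlier in the deck, that "Gb" means gigabytes, and that 1 GB is 1000 MB. Since the slides give lower bounds on the data and an upper bound on the time, the results are lower bounds.

```python
ENTRIES = 137_000_000  # ">137M entries" from the data overview slide
DATA_GB = 550          # ">550 Gb of data", read here as gigabytes
HOURS = 18             # "less than 18 hours"

seconds = HOURS * 3600
entries_per_second = ENTRIES / seconds
megabytes_per_second = DATA_GB * 1000 / seconds

print(f"{entries_per_second:.0f} entries/s, {megabytes_per_second:.1f} MB/s")
# prints "2114 entries/s, 8.5 MB/s"
```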
Web side story

Requests pass through a load balancer to four Tomcat instances (Tomcat 1 to Tomcat 4), which serve searches over the indexes: UniProt index, EMBL index, Taxonomy index, Medline index, ArrayExpress index, Ensembl index, IntAct index.
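The slides do not say which policy the load balancer uses. As one common possibility, the sketch below shows round-robin dispatch across four backends. The class `RoundRobinBalancer` and the backend names are hypothetical and stand in for the four Tomcat instances.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand each incoming query to the next backend in a fixed rotation."""

    def __init__(self, backends):
        self._backends = cycle(backends)

    def route(self, query):
        # Return the chosen backend together with the query it should serve.
        return next(self._backends), query


balancer = RoundRobinBalancer(["tomcat1", "tomcat2", "tomcat3", "tomcat4"])
assigned = [balancer.route(f"query-{n}")[0] for n in range(6)]
print(assigned)
```

Round-robin only spreads load evenly if every backend can answer any query; that property is an assumption of this sketch rather than something the slides state.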
Being up to date
Libraries
Acknowledgements
