current

[Diagram: BHL and ubio.org]
future
process

identify -> disambiguate / reconcile -> ID lookup

process

identify -> disambiguate / reconcile -> ID lookup
Mature & scalable, well defined and standardized

process

identify -> disambiguate / reconcile -> ID lookup
in progress, needs API & standard

process

identify -> disambiguate / reconcile -> ID lookup
GNI has API, needs standards
current response
<entity>
  <nameString>Abietineae</nameString>
  <namebankID>8401003</namebankID>
  <weblinks>
    <website>
      <title>Tropicos</title>
      <link>http://mobot.mobot.org/W3T/Search/vast.html</link>
      <logo>http://names.ubio.org/tools/image/tropicos.png</logo>
      <links>
        <link nameString="Abietineae Eichler">http://mobot.mobot.org/cgi-bin/search_vast?onda=N50205444</link>
      </links>
    </website>
  </weblinks>
</entity>
issues
the TF API is doing jobs it shouldn’t do

NameBank is a large but outdated dataset

“taxonfinder” has no idea what a NameBank ID actually is; it only knows strings

current code is completely dependent on www.ubio.org and is not scalable
why change?
scaling - we can run 10,000 taxonfinding processes using any algorithm
that supports the standard, which means very fast indexing of BHL

future-proofing for devs - any new namefinding tool can take advantage
of the API and doesn’t need to write a web service or API of its own

future-proofing for BHL - any new namefinding tool can be added with
one parameter (&client=taxonfinder | &client=neti)

reliability - the existing TF API goes down when Rod runs a
screen-scraping tool on ubio.org
new API spec
API specs
Request
  input (string)
  type (text, url)
  format (xml = default, json)
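A minimal sketch of how a client might assemble such a request, assuming a plain GET query string. The endpoint URL and the function name are hypothetical (the slides give neither); `input`, `type` and `format` come from the spec above, and `client` is the parameter mentioned on the "why change?" slide.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not give a URL for the new API.
BASE_URL = "http://example.org/namefinder"


def build_request_url(input_value, input_type="text", response_format="xml", client=None):
    """Build a GET URL from the request parameters listed in the spec."""
    if input_type not in ("text", "url"):
        raise ValueError("type must be 'text' or 'url'")
    if response_format not in ("xml", "json"):
        raise ValueError("format must be 'xml' or 'json'")
    params = {"input": input_value, "type": input_type, "format": response_format}
    # Assumption: the namefinding engine is selected with an optional
    # `client` parameter, e.g. "taxonfinder" or "neti".
    if client is not None:
        params["client"] = client
    return BASE_URL + "?" + urlencode(params)


url = build_request_url("Mus musculus", client="neti")
```

Switching engines then only changes the `client` value; the rest of the request stays the same.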
Response
XML Response
A response example that corresponds to the XML schema:
<names xmlns="http://globalnames.org/namefinder" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <name>
    <verbatim>T. rotundata</verbatim>
    <dwc:scientificName>Tillandsia rotundata</dwc:scientificName>
    <!--   0-100   -->
    <score>100</score>
    <offset start="4550" end="4573" />
  </name>
</names>
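As a rough illustration, a response shaped like the example above could be read with Python's standard library as follows. The function name and the dictionary keys are illustrative choices, not part of the spec; the namespaces are the ones declared in the example.

```python
import xml.etree.ElementTree as ET

# Namespaces as declared in the example response.
NS = {
    "nf": "http://globalnames.org/namefinder",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}


def parse_names(xml_text):
    """Return one dict per <name> element in a namefinder response."""
    root = ET.fromstring(xml_text)
    results = []
    for name in root.findall("nf:name", NS):
        offset = name.find("nf:offset", NS)
        results.append({
            "verbatim": name.findtext("nf:verbatim", namespaces=NS),
            "scientific_name": name.findtext("dwc:scientificName", namespaces=NS),
            "score": int(name.findtext("nf:score", namespaces=NS)),
            "start": int(offset.get("start")),
            "end": int(offset.get("end")),
        })
    return results


SAMPLE = """<names xmlns="http://globalnames.org/namefinder" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <name>
    <verbatim>T. rotundata</verbatim>
    <dwc:scientificName>Tillandsia rotundata</dwc:scientificName>
    <score>100</score>
    <offset start="4550" end="4573" />
  </name>
</names>"""

names = parse_names(SAMPLE)
```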
New API
you give us text, we give you strings and offsets. This is the limit of
what a “namefinding” tool can and should do

separately, you also need IDs: NameBank, EOL, Tropicos, gn*, GBIF...

once you know Mus musculus is EOL ID “9872332”, you don’t need to look
that up again. If a book on mice has 40,000 instances of Mus musculus, you
need to know where they are, but you don’t need the NameBank ID 40,000 times
(this is a scaling problem)

Where do we get these? GNI has 19.3m names & IDs.
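The scaling point can be sketched as follows: keep every offset, but resolve each distinct name string only once. The function names, the input record shape and the stand-in `fake_lookup` are all assumptions for illustration; a real lookup would call an ID service such as GNI, whose API is not described in these slides.

```python
from collections import defaultdict


def group_offsets(found_names):
    """Map each distinct name string to the (start, end) offsets where it occurs."""
    occurrences = defaultdict(list)
    for item in found_names:
        occurrences[item["name"]].append((item["start"], item["end"]))
    return dict(occurrences)


def resolve_ids(occurrences, lookup):
    """Call `lookup` once per distinct name, not once per occurrence."""
    return {name: lookup(name) for name in occurrences}


# Stand-in for a real ID service; it records how often it is called.
calls = []


def fake_lookup(name):
    calls.append(name)
    # EOL ID taken from the slide's Mus musculus example.
    return {"Mus musculus": "9872332"}.get(name)


found = [
    {"name": "Mus musculus", "start": 10, "end": 22},
    {"name": "Mus musculus", "start": 340, "end": 352},
    {"name": "Mus musculus", "start": 901, "end": 913},
]
occurrences = group_offsets(found)
ids = resolve_ids(occurrences, fake_lookup)
```

With three occurrences of the same name, the lookup runs once while all three offsets are retained.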
issues

misspellings etc. need to be “reconciled”

this definitely isn’t the job of a name finding tool
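To make the separation concrete, here is a toy sketch of reconciliation as its own step, run after namefinding. It uses generic fuzzy string matching from Python's standard library, which is a stand-in technique and not the method the slides propose; the function name, the cutoff value and the small `KNOWN` list are assumptions.

```python
import difflib


def reconcile(candidate, known_names, cutoff=0.85):
    """Return the closest known name for a possibly misspelled string, or None."""
    matches = difflib.get_close_matches(candidate, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Tiny stand-in for a reference list of names.
KNOWN = ["Mus musculus", "Tillandsia rotundata", "Abietineae"]

fixed = reconcile("Mus muscullus", KNOWN)
unknown = reconcile("Homo sapiens", KNOWN)
```

The namefinding tool only reports the string it saw; a separate service like this decides which known name, if any, that string corresponds to.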
next?
we could make a tool that hacks together IDs and names...
... but that’s not dev time well spent

we could participate in a process to check off the latter two categories
of the name finding -> ID resolution process
... yes we can

Let’s make a spec, build some APIs.

silver lining - we can start now

Scaling Namefinding
