
Chitty taxo cleveland 2019 june

Taxonomy to ontology conversion case study: Preparing your taxonomy to be ready for data scientists and machine readability

Published in: Data & Analytics

  1. Preparing your taxonomy to be ready for data scientists & machine readability: A case study and work in progress. Mary Chitty, Library Director & Taxonomist, MSLS; Cambridge Healthtech, Needham MA; mchitty@healthtech.com. SLA Annual Conference, Cleveland, Ohio, Tuesday, June 18, 2019. Taxonomy-Ontology Conversions: Case Studies.
  2. Historical Taxonomy Process. Taxonomies & Ontologies glossary & taxonomy: http://www.genomicglossaries.com/content/ontologies.asp. Timeline (1992, 2000, 2006-14, 2016, 2018-19, 2019): Company founded. Taxonomy created by the CEO with a few hundred terms. Major products: conferences on emerging technologies, with a focus on preclinical drug discovery. • Acquired companies dealing with bioinformatics, clinical trials, energy and batteries. Still integrating their databases. • Met people from OntoForce, a Belgian semantic search engine company. Began informal collaboration. • Acquired companies in artificial intelligence and the Internet of Things. Still determining how to integrate databases. Several data scientists hired. Signed formal contract with OntoForce to use the Disqover search engine, https://www.ontoforce.com/ • Taxonomy stands at 1,600+ terms now. Conferences and other products in preclinical and clinical biotech and pharma, clinical trials, energy, AI, the Internet of Things and more. • Published Genomic Glossaries & Taxonomies, www.genomicglossaries.com
  3. Ongoing challenges: Legacy data with inconsistencies, redundancies and ambiguities. Integrating data from company acquisitions into the in-house database. Still cleaning up, disambiguating and documenting in-house data and the database. The difficulties of scaling up are often underestimated, and they are a major pain point for us right now.
  4. FAIR Data. Both the European Commission and NIH have allocated considerable resources to making data FAIRer. https://www.go-fair.org/fair-principles/ • Findable: First step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. … an essential component of the FAIRification process. • Accessible: Once the user finds the required data, she/he needs to know how they can be accessed. • Interoperable: Data usually need to be integrated with other data … need to interoperate with applications or workflows. • Reusable: Ultimate goal of FAIR is to optimise the reuse of data … metadata and data should be well-described so that they can be replicated and/or combined in different settings.
  5. Taxonomies and ontologies are critical for interoperability and reproducibility, particularly in the life sciences. Life sciences data are relatively sparse, with many attributes ("highly dimensional"), leading to complexity and sometimes chaos. Data on longitudinal health outcomes are limited by HIPAA and other privacy regulations, but are crucial for validation. Increasing attention is being paid to data stewardship and data curation. Support is still a tough sell. Reproducibility crisis? More than 70% of researchers have tried and failed to reproduce experiments. More than half have failed to reproduce their own experiments. Nature 2016 survey of researchers. https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
  6. Life science ontologies and taxonomies: So many to choose from! BioPortal, https://bioportal.bioontology.org/, a repository of biomedical ontologies, has almost 800 ontologies, and mapping from ontologies to I2B2, http://i2b2.bioontology.org/ Interdisciplinary work holds great promise and needs mapping of terms between disciplines. Pistoia Alliance Ontologies Mapping: https://www.pistoiaalliance.org/projects/current-projects/ontologies-mapping/ Data mapping is also known as "data wrangling" or "data munging". Many people are trying to automate it. These efforts are still works in progress.
  7. ROI (Return On Investment) & Cost Benefit. Cost of not having FAIR research data, PwC EU Services, 2018, European Union Publications. https://publications.europa.eu/en/publication-detail/-/publication/d375368c-1a0a-11e9-8d04-01aa75ed71a1 Stakeholders may balk at investing in taxonomies or ontologies. Software and other IT and technology considerations are only part of the issue. Educating decision makers is an ongoing process, even with CXOs who value taxonomies and ontologies. Estimated cost of not having FAIR research data: a minimum of 10.2 billion euros per year.
  8. Key insights: "…[T]here is a lot of work that needs doing to prepare the data sets for these technologies … there is a disproportionate amount being invested in the technologies as opposed to investing in 'data-readiness' … It's just not a slam dunk to mash up a lot of data and think it will work." Life Science Leader, March 1, 2019, "AI In Life Sciences: Seeing Past the Hype", Francois Nicolas, and comment by Christy Wilson. https://www.lifescienceleader.com/doc/ai-in-life-sciences-seeing-past-the-hype-0001 "The AI solution may help accelerate some tasks, but human expertise may be required for the broad scope of what is needed. Currently AI in healthcare is in the second stage of the Gartner Hype Cycle: 'the peak of inflated expectation.' However, if we don't allow it to catch up to the hype, it may fall back into what Gartner calls the 'trough of disillusionment.'"
  9. Key takeaways: Don't try to "boil the ocean". Prototype early and often. Think modular. • Pareto Principle (80/20): 80% of effects come from 20% of effort. Don't try for 100%. • Identify what your stakeholders value. Aim for quick wins. Understand existing workflows. • Seek out allies and shared buy-in for justification and sustainability. • Bundle stakeholders' key wants with items you know they will eventually need. Communicating ROI on taxonomies, ontologies and metadata is still challenging. • Expectations and change management are crucial skills to cultivate. • Report metrics, both quantitative and qualitative. • Recognize that some challenges have not yet been resolved by anyone.
  10. Acknowledgments: Many people have participated in this ongoing project. I'm grateful for their work, insights and encouragement. Cambridge Innovation Institute (CII) & Cambridge Healthtech: • Phillips Kuhl, President • Tonya Urquizo, Knowledge Information Services Analyst and IT Liaison • Sanaye Bartlett, Data Analyst & Project Manager • Kaushik Chaudhuri, Director of Product Marketing. CII Disqover Team: • Kaitlyn Barago, Associate Conference Producer • Nancy Clarke, Data Scientist • Mike Croft, Software Architect • Ben Lakin, Director New Initiatives • Jaime Parlee, Director Marketing Analytics • Craig Wohlers, Manager Knowledge Foundation. OntoForce: • Hans Constandt, CEO & Founder • Filip Pattyn, Scientific Lead • Carla Suijkerbuijk, Business Development North America • Niels Vanneste, Customer Data Scientist • Berenice Wulbrecht, Data Science Director, Systems Biology. Fruitful conversations and emails: • Ingrid Akerblom, IEA Diversified Consulting • Juliane Schneider, Lead Data Curator, eagle-I, Harvard Catalyst • Jane Lomax, Head Ontologist, SciBite • Terence Russell, Chief Technologist, IRODS Consortium • John Wilbanks, Chief Commons Officer, Sage Bionetworks
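
The slides describe cleaning up redundant legacy terms (slide 3) and mapping terms between vocabularies (slide 6) without showing what that work looks like in practice. The Python sketch below is one possible illustration, not the method used at Cambridge Healthtech or OntoForce. It assumes that simple label normalization is an acceptable first pass. The function names (`normalize`, `find_redundant`, `map_terms`) and the sample term lists are hypothetical and exist only for this example.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(label: str) -> str:
    """Reduce a label to a comparison key: ASCII-folded, lowercase, single-spaced."""
    folded = unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode("ascii")
    folded = folded.lower().replace("&", " and ")
    folded = re.sub(r"[^a-z0-9]+", " ", folded)  # punctuation and hyphens become spaces
    return " ".join(folded.split())


def find_redundant(terms):
    """Group labels that share a normalized key; keep only groups with 2+ labels."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize(term)].append(term)
    return {key: labels for key, labels in groups.items() if len(labels) > 1}


def map_terms(source_terms, target_terms):
    """Map each source label to a target label with the same normalized key, else None."""
    target_index = {normalize(t): t for t in target_terms}
    return {s: target_index.get(normalize(s)) for s in source_terms}


# Hypothetical sample data, invented for illustration only.
in_house = [
    "Clinical Trials",
    "clinical-trials",
    "Bio-informatics",
    "Internet of Things",
    "AI & Machine Learning",
]
external = ["Clinical trials", "Bioinformatics", "Internet of things"]

if __name__ == "__main__":
    print("Probable duplicates:", find_redundant(in_house))
    for source, target in map_terms(in_house, external).items():
        print(f"{source!r} -> {target!r}")
```

In this sample, "Bio-informatics" does not map to "Bioinformatics", because the hyphen becomes a space during normalization. Unmatched terms like this one would still need review by a taxonomist, which is consistent with the slides' point that automated mapping remains a work in progress.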