Four projects (compound risk dossier, text mining, screening data management, and support for cloud collaboration) were outlined during a breakout discussion led by Paul Bradley and Barry Hardy at the Pistoia Alliance Information Ecosystem Workshop in October 2011.
Low Hanging Fruit Breakout Discussion #2
1. Compound Risk Dossier
Objectives
Improved toxicological prediction demands the best integrated view of current and historic
data, both proprietary and public domain. The objective of the compound risk dossier (CRD)
would be to create a service that is able to gather and integrate risk/safety-related
information for a compound (including consideration of similar structures, key moieties,
metabolites, toxicology MoA, etc.). The harvested information would then be integrated and presented to
the user in the form of a “safety profile”.
Business Case
It is envisaged that the CRD could bring the following business benefits:
The system would enable an efficient “background check” for NCEs based on
structural or biological similarity, or possibly shared pharmacology, toxicology MoAs
or adverse event effects, i.e. what is known about molecules similar to my candidate?
Creation of a safety profile, in which safety categories are normalised and can be
grouped according to public ontologies, provides a powerful method of aligning data
and enables intelligent analysis.
Pharma companies duplicate effort in aligning internal, vendor and public data; this
work is currently costly, time-consuming, tedious and error-prone. A shared CRD
service could reduce the time each organisation spends on these common activities
to almost zero.
Open Standards
Open vocabularies, ontologies, e.g. PubChem, ChemIDplus, WHOINN, OBO,
OpenTox, ChEBI,…
Safety data sources: AERS, drug labels, regulatory documents, etc.
Open source methods (QSAR, CDK, Weka, R, OpenTox,..)
Open APIs (e.g., extend and test OpenTox API 1.2
http://www.opentox.org/dev/apis/api-1.2 for data integration into common rdf
resource)
Implementation
It is suggested that a limited set of public domain data sources is selected in the first
instance, to allow a proof of concept within 12 months.
Identify vocabulary and ontology sources for compounds, pathologies, etc. (See
Toxicology Ontology Roadmap, Hardy, B. et al., from the OpenTox-EBI Industry Forum
workshop, in press)
Identify data sources from which to harvest risk related information. Opt for a handful
of structured sources rather than free text (NDAs, etc.) in the first instance?
Compound safety data sources, both public and private, are mined for risk-related
content which is harmonised and organised using public domain ontologies (and held
as an RDF triple store?)
Text mining and other semantic technologies will be necessary at this stage.
This data store can be queried via APIs or provide information that can be
consumed by analysis tools, ELNs, etc.
Decide on quality metrics – on-the-fly profiles vs. curated, pre-canned data, accuracy
vs. recall
Other things to consider include provenance, governance, security, legal, etc.
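To make the harmonisation and safety-profile steps concrete, the sketch below maps source-specific pathology terms onto a shared vocabulary and emits simple triples. All compound identifiers, term mappings and predicate names are invented for illustration; a real implementation would use an RDF library (e.g. rdflib) with public ontology IRIs rather than plain tuples.

```python
# Sketch only: plain tuples stand in for RDF triples, and the vocabulary
# is a toy mapping; a production CRD would use public ontologies.

def harmonise(raw_findings, vocab):
    """Map source-specific pathology terms to a shared vocabulary
    and emit (compound, predicate, value) triples."""
    triples = set()
    for compound, source, term in raw_findings:
        concept = vocab.get(term.lower())   # normalise via ontology lookup
        if concept is None:
            continue                        # unmapped terms need curation
        triples.add((compound, "hasFinding", concept))
        triples.add((compound, "findingSource", source))
    return triples

def safety_profile(triples, compound):
    """Collect a compound's normalised findings: the 'safety profile' view."""
    return sorted(o for s, p, o in triples
                  if s == compound and p == "hasFinding")

# Toy vocabulary: source terms -> normalised ontology concept.
vocab = {"liver injury": "hepatotoxicity",
         "hepatic necrosis": "hepatotoxicity",
         "qt prolongation": "cardiotoxicity"}

# Invented findings mined from different sources for the same compound.
raw = [("CPD-001", "AERS", "Liver injury"),
       ("CPD-001", "drug label", "QT prolongation"),
       ("CPD-002", "AERS", "Hepatic necrosis")]

triples = harmonise(raw, vocab)
print(safety_profile(triples, "CPD-001"))  # ['cardiotoxicity', 'hepatotoxicity']
```

The point of the normalisation step is visible here: two different source terms ("Liver injury", "Hepatic necrosis") collapse to one ontology concept, so findings align across sources.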
Pistoia Alliance Role
Definition of Use Case
Guidance on best safety-related data sources
Guidance on open standards to use, and their extensions needed
Provide partners willing to integrate public, vendor and proprietary data
Funding of early phase POCs
2. Text Mining/Metadata Mark up of Unstructured Text
Objectives
Unstructured text sources, both public and proprietary, are rich in information but several
features limit their use in analysis, such as:
No mark-up of key concepts – important terms such as drug and target names are
buried within free text with no simple mechanism to surface this information
Linguistic diversity – widespread use of synonyms and ad hoc identifiers make it
difficult to carry out semantic searching of free text sources.
The objective is to carry out text mining and concept tagging of unstructured text to
provide a meta-data layer over documents. By linking the metadata to public ontologies, a
semantically consistent set of tags will be produced, allowing document sources to be queried
and clustered according to recognised standards. This resource could then be made available
using a cloud model to deliver value and standard search capabilities to Pharma and
Academics alike with appropriate consumption models.
Business Case
The mark-up and mapping of key terms from unstructured text would bring the following
benefits:
Enhanced search and document retrieval over free text sources
Linking of in-house structured data sources to unstructured information, in-house and
in the public domain
Repurpose unstructured text to produce actionable intelligence, for example by
creating assertional metadata
Drive towards a common standard for searching or at least a common “honest
broker” for search across different resources.
Open Standards
It is suggested that, in order to achieve a working implementation within a 12-month
time frame, a limited set of open standards is applied in the first instance. This could
be discussed more widely within the Pistoia Alliance, but the following areas are
worthy of consideration:
Limiting by domain, e.g. protein targets, drug terms, gene names, pathology
Limit to a single standard that covers multiple domains, e.g. SNOMED-CT, ICD-9-CM
Implementation
Select public domain free text source, e.g. Medline
Identify public ontologies and vocabulary sources
Use text mining/concept recognition tools to identify key concepts and map to
standards: Autonomy, Metawise (BioWisdom), Helium (Ceiba), etc.
Platform for search/display – Lucene, other open source
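A minimal illustration of the concept-recognition step is dictionary-based tagging, in which synonyms map to one shared concept identifier and the matches form a metadata layer over the untouched text. The vocabulary entries and identifiers below are invented; production systems would use the tools named above and full public ontologies.

```python
import re

# Toy vocabulary: surface form -> (entity type, concept id).
# Identifiers are illustrative, not verified ontology entries.
VOCAB = {
    "imatinib": ("DRUG", "CHEBI:45783"),
    "gleevec": ("DRUG", "CHEBI:45783"),   # synonym -> same concept
    "bcr-abl": ("TARGET", "TGT:0001"),
}

def tag(text):
    """Return (surface form, offset, type, concept id) for each match,
    so the metadata sits over the document without altering it."""
    pattern = re.compile(
        "|".join(re.escape(t) for t in sorted(VOCAB, key=len, reverse=True)),
        re.IGNORECASE)
    return [(m.group(0), m.start(), *VOCAB[m.group(0).lower()])
            for m in pattern.finditer(text)]

doc = "Gleevec (imatinib) inhibits BCR-ABL."
for surface, pos, kind, concept in tag(doc):
    print(f"{surface}@{pos}: {kind} -> {concept}")
```

The synonym handling addresses the linguistic-diversity problem directly: "Gleevec" and "imatinib" both resolve to the same concept identifier, so a semantic search for either term retrieves the document.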
Pistoia Alliance Role
Collaborate to define Use Case
Agree on document sources
Agree on open standards to use, extensions needed
Advise on best practice on document mark-up, search, analysis, governance,
security, etc.
Funding of early phase POCs to aid the development of the tools and a drive towards
standards.
Support for a free/reduced cost academic access mechanism to encourage common
methods of tagging and naming in the academic environment.
3. Improved Collaboration: Management of Screening Data
Objectives
To integrate screening data from multiple sources
To create a standard for expression of screening data, to allow easier integration
Business Case
Definition of a standard for reporting compound screening data allows easier
integration, with cost and time savings
Facilitates easier sharing of data and collaboration
Open Standards
MIABE, MIAME
ISA-TAB
Define a standard for dose response for HTS, HCS; include vocabulary, units; support
multiple plate formats and standardised statistical analysis
Define how to deal with incomplete data sets, null values, etc.
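To make the points above concrete, the sketch below shows one possible shape for a minimal dose-response record with explicit handling of null values (a missing response is recorded as None rather than zero or an empty string). The field names, units vocabulary and plate-format encoding are assumptions for illustration, not an agreed standard.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative record shapes only; a real standard would fix the
# vocabulary, unit strings and plate-format values by agreement.

@dataclass
class Well:
    row: str
    col: int
    concentration: Optional[float]  # None = missing/masked measurement
    response: Optional[float]       # explicit null, never 0 or ""

@dataclass
class DoseResponseResult:
    compound_id: str
    assay_id: str
    conc_unit: str                  # from a controlled vocabulary, e.g. "uM"
    plate_format: int               # 96, 384, 1536, ...
    wells: list = field(default_factory=list)

    def complete_wells(self):
        """Wells usable for curve fitting: nulls excluded, not imputed."""
        return [w for w in self.wells
                if w.concentration is not None and w.response is not None]

r = DoseResponseResult("CPD-001", "ASSAY-42", "uM", 384)
r.wells = [Well("A", 1, 0.1, 12.0),
           Well("A", 2, 1.0, None),   # failed well: explicit null
           Well("A", 3, 10.0, 88.5)]
print(len(r.complete_wells()))  # 2
```

Separating "null" from "zero" at the format level is the key design choice: downstream analysis tools can then decide how to treat incomplete data sets instead of silently absorbing placeholder values.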
Implementation
Create the standard, learning from existing standards such as MIAME
Apply the standard in a working project
Iterate and refine
Pistoia Alliance Role
Guidance on definition of the standard
Survey what has already been done in the area
4. Enabling better collaboration in the cloud, applied to
monitoring of NGS data
Objectives
To provide scientific, business and legal processes outlining best practices for
organisations collaborating in the cloud.
Application of these best practices in a system for monitoring the progress of NGS
projects.
Business Case
Time and cost savings in deciding whether a collaborative project should be carried
out in the cloud.
Streamline implementation of cloud-based collaborations by providing clear
guidelines.
Reduces delays in handovers.
Greater visibility of distributed project statuses across different organisations.
Early visibility, alerting of important events, allowing timely interventions.
Open Standards
Clear APIs and communication standards.
Define web services and service discovery mechanisms.
UDDI (Universal Description, Discovery and Integration).
MIAME?
Implementation
Outline best practice rules for working on the cloud
What is the use case? E.g. an alternative to an internally-hosted system, a method of
distributing large queries, etc.
What are the requirements for flexibility, such as how long is the service required for
and will capacity requirements change over time? What is the tie-in period?
Need clear APIs and communication standards.
Location – does data need to be held within certain boundaries, e.g. within the EU?
What level of encryption is required?
Create standard format for NGS data, consumable by analysis software, e.g. Spotfire.
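One way the monitoring piece could look: a small self-describing status message that a sequencing provider exposes through an API and that any collaborating organisation can consume and alert on. The stage vocabulary and field names below are illustrative assumptions, not a defined standard.

```python
import json
from datetime import datetime, timezone

# Illustrative stage vocabulary for an NGS project lifecycle; a real
# standard would agree these values across organisations.
STAGES = ["sample_received", "library_prep", "sequencing",
          "primary_analysis", "data_delivered"]

def status_event(project_id, stage, detail=""):
    """Build one progress event; rejects stages outside the shared
    vocabulary so all partners emit comparable messages."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return {
        "project_id": project_id,
        "stage": stage,
        "detail": detail,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

event = status_event("NGS-2011-007", "sequencing", "run 2 of 3 started")
print(json.dumps(event, indent=2))
```

Because the message is plain JSON with a controlled stage vocabulary, a monitoring dashboard or alerting rule (e.g. "notify when data_delivered") works identically across every provider, which is what gives the early visibility described above.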
Pistoia Alliance Role
Signposting best practice in the cloud.
Advise on standard representation of NGS data.