Kamel Mansouri, Chris Grulke, Ann Richard*, and Antony J. Williams
National Center for Computational Toxicology, US EPA, RTP, NC
The influence of data curation on
QSAR Modeling – examining issues
of quality versus quantity of data
March, 2016
ACS Spring Meeting, San Diego, CA
This work was reviewed by EPA and approved for presentation but does not necessarily reflect official Agency policy.
Sustainable
Chemistry
Human Health
Hazard
Ecosystems
Hazard
Environmental
Persistence
“design and use of chemicals that minimize impacts to human health, ecosystems and the environment”
• Screen chemicals for toxicity & exposure potential
• Prioritize testing of thousands of existing chemicals lacking data
• Evaluate how chemical transformations impact hazard & environmental persistence
DSSTox
Chemical
IDs
Chemistry Foundation to Support
Multiple NCCT & EPA Projects
EPA SRS
(TSCA, EcoTox)
EDSP21
ToxCast/Tox21
CPCAT
ToxRef
1:1 mapping
CAS:Name:structure
Quality scores
>700K substances
“Chemical Safety
& Sustainability”
(CSS) Public
Web-accessible
Dashboards
Knowledge-
informed features
& chemotypes
ToxCast/Tox21
High-throughput
screening (HTS)
projects
CAS look-up
List curation tools
Chemical search by
CAS-names
SMILES
Structures
Structure file
downloads
PK/ADME model
inputs & outputs
Structure “cleaning”
SAR-ready structure files
Structure-based predictions
Similarity searching
Fingerprints, feature sets
QSAR models;
phys-chem properties
Phys-chem calc &
measured properties
DSSTox Update
DSSTox_v1
• Manually curated 25K substance records
• EPA-focus, environmental tox datasets
• Emphasis on accurate CAS-name-structure annotations
• Public resource for high-quality structure-data files (SDF)
DSSTox_v2
• Convert DSSTox tables to MySQL
• Develop curation interface & cheminformatics workflow
• Expand chemical content
• Web-services & Dashboard access
Building DSSTox_v2
DSSTox_v1
~21K
~5K
1:1:1 CAS:name:structure
No conflicts
allowed
Environmental/Toxicity/Exposure Relevance
Manually Curated
EPA SRS
ChemID
PubChem
EPA ACToR
EPA-relevant lists
~735K ~150K
>1M
~54K
~310K ~625K ~5K ~16K ~33K ~101K
DSSTox QC Levels:
DSSTox_High
DSSTox_Low
Public_High
Public_Med
Public_Low
Public_Untrusted (not loaded)
DSSTox_v2 Totals
DSSTox_v2
PublicCurated
5. Low
2. Low
6. Untrusted
1. High
3. High
4. Med
7. Incomplete
QC Level Totals (12Jun2015)
QC level            Substances
DSSTox_High         4535 (highest confidence)
DSSTox_Low          16K
Public_High         33K
Public_Medium       101K
Public_Low          584K
Public_Untrusted    (no total listed)
~ 310K pending
~ 150K pending
~ 150K substances in
top 4 QC quality bins
(single source)
Building the Chemistry
Software Architecture
External Services
APIs and WEB SERVICES
DATA LAYER
Chemical-Biology Platforms
USER INTERFACES
UI COMPONENTS
Building Predictive Models
NCCT Modeling Activities
• Receptor-mediated activities (e.g., ER, AR)
• Toxicity endpoint QSAR models
• Metabolism prediction
• In vitro-in vivo toxicity extrapolation models (IVIVE)
• Read-across toxicity estimation
• Near and far-field exposure models
EPA-NCCT Collaborations
• Analytical dust and media screening for environmental contaminants
• Biotransformation, transport, fate modeling
• ADME, PBPK models
• Chemical use category modeling
• Green chemistry
Physical chemical properties
Water solubility
LogP
Partition coefficients (air/water/soil)
MP, BP, Vapor Pressure
Building a physical chemical
property database
Physical chemical properties
Measured (where available)
Predicted (where not available)
• Collect data (sort wheat from chaff)
• Determine how to best mesh data together (structure? CAS? Other?)
• Check data self-consistency
• Structure validation vs. property value validation (very different challenges)
• Build training/test sets of chemicals with experimental values (collapse or resolve dups)
• Apply structure-cleaning workflow to produce “QSAR-ready” structures
• Calculate descriptors
• Apply modeling algorithm (regression, ML, etc.)
Store measured & predicted values in relational
database, linked to data sources, model details, etc.
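The "collapse or resolve dups" step could be sketched roughly as below. This is only an illustration, not the workflow used for the PhysProp data: the function name `collapse_duplicates`, the `max_spread` tolerance, and the use of the median are all assumptions made for the example. Replicate measurements that share a structure key are collapsed to one value when they agree within the tolerance, and set aside for manual review otherwise.

```python
from collections import defaultdict
from statistics import median


def collapse_duplicates(measurements, max_spread=0.5):
    """Collapse replicate values per structure key.

    measurements: iterable of (structure_key, value) pairs, where the key is
    any canonical structure identifier produced upstream (hypothetical here).
    max_spread: assumed tolerance; replicates differing by more than this are
    not averaged but returned for manual review.
    """
    by_structure = defaultdict(list)
    for key, value in measurements:
        by_structure[key].append(value)

    collapsed, needs_review = {}, {}
    for key, values in by_structure.items():
        if max(values) - min(values) <= max_spread:
            # Replicates agree: keep a single representative value.
            collapsed[key] = median(values)
        else:
            # Replicates conflict: leave the decision to a curator.
            needs_review[key] = sorted(values)
    return collapsed, needs_review
```

Keeping the conflicting records separate, rather than averaging them, reflects the point made on the slide that duplicates are either collapsed or resolved.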
Curating structure-linked data
• Establish accurate linkage of measured data
to chemical structure
• Published experimental data → chemical IDs {CAS, name, and/or structure}
– Fix ID errors
– Resolve ID conflicts
– Normalize, clean structures
– Resolve duplicate mappings
Lots of experience from building DSSTox!
EPI Suite™ Property Estimation Software
• Developed by EPA’s Office of Pollution Prevention & Toxics (with Syracuse Research Corp.); initial models published over 20 years ago, desktop app released in 2000
• Estimates a wide range of physical and environmental properties using QSAR approaches:
– KOWWIN™, AOPWIN™, HENRYWIN™, MPBPWIN™, BIOWIN™, BioHCwin, KOCWIN™
• Incorporates these into dependent fate and toxicity models:
– WSKOWWIN™, WATERNT™, BCFBAF™, HYDROWIN™, KOAWIN, AEROWIN™, WVOLWIN™, STPWIN™, LEV3EPI™, ECOSAR™
• Used to fill data gaps in EPA Pre-Manufacture Notification (PMN) chemical submissions; widely used outside EPA
EPI Suite™ Property Estimation Software
• Downloadable Windows application available at: http://www.epa.gov/tsca-screening-tools/epi-suitetm-estimation-program-interface
• 14 PhysProp experimental property data sets used in training models can be freely accessed from within the app
EPI Suite KOWWIN:
2447 Training Compounds
• 2447 chemicals with
measured Log Kow values
used to train model
• >15K measured Log Kow
values available in
PhysProp file
Overall Predicted vs. Experimental
(>15k chemicals)
• Use EPI Suite model built
on 2447 measured
values to predict >15K
values in PhysProp file
Why derive new models from EPI
Suite PhysProp datasets?
• Significant advances in cheminformatics and
modeling approaches since <2000
• Enlarged training sets likely to improve models
and expand applicability domain
• But first…
Perhaps we should take a look at the >20 yr
old PhysProp datasets that we’ll be using to
rebuild the models!
Review of PhysProp datasets
• Exported SDF files from EPI Suite
application
• Basic manual review searching for errors:
– hypervalency, charge imbalance, undefined stereo
– deduplication
– mismatches between identifiers
CAS Numbers not matching structure
Names not matching structure
Collisions between identifiers
Incorrect valences
Differences in Values (MP)
Two experimental records:
Same structure
Different CAS, name,
experimental & predicted MP
Covalently bound salt structures
Collisions in Records
Same structure depictions (Molfiles) - different
CAS, different names, AND different SMILES
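A collision check of this kind can be sketched as a simple grouping step. The code below is an illustration under stated assumptions, not the review procedure used on the PhysProp files: the record layout (dicts with `structure` and `cas` fields), the function name `find_collisions`, and the placeholder identifier values are all hypothetical, and the structure key is assumed to be a canonical string generated upstream by a cheminformatics toolkit.

```python
from collections import defaultdict


def find_collisions(records, key="structure", field="cas"):
    """Return structure keys that map to more than one distinct identifier.

    records: iterable of dicts; `key` names the canonical structure field and
    `field` names the identifier to compare (both names are assumptions).
    """
    groups = defaultdict(set)
    for record in records:
        groups[record[key]].add(record[field])
    # A collision is one structure carrying two or more different identifiers.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Placeholder values only; these are not real CAS numbers or structures.
example_records = [
    {"structure": "STRUCT-A", "cas": "CAS-1", "name": "name one"},
    {"structure": "STRUCT-A", "cas": "CAS-2", "name": "name two"},
    {"structure": "STRUCT-B", "cas": "CAS-3", "name": "name three"},
]
collisions = find_collisions(example_records)
```

Calling the same function with `field="name"` would flag structures that appear under several names.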
Missing or erroneous
identifiers
• Many chemical
names are truncated
• Many chemicals don’t
have CAS Numbers
• Invalid CAS Checksum: 3646
• Invalid names: 555
• Invalid SMILES: 133
• Valence errors: 322 Molfile, 3782 SMILES
• Duplicates check:
– 31 MOLFILE; 626 SMILES; 531 NAMES
• SMILES vs. Molfiles (structure check):
– 1279 differ in stereochemistry
– 362 “covalent halogens”
– 191 differ as tautomers
– 436 are different compounds
Whoa!! Lots of problems…
15,809 chemicals in KOWWIN data file
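The CAS checksum test is easy to reproduce. A CAS Registry Number ends in a check digit equal to the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below applies that rule; the function name `is_valid_cas` is chosen for the example, and the slides do not say which tool was actually used for this check.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the format and check digit of a CAS Registry Number string."""
    match = _CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(pos * int(d) for pos, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit
```

A passing checksum only shows that the number is well formed; it does not show that the number belongs to the structure or name in the same record, which is why the identifier-mismatch checks are still needed.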
Quality flags added to data: 1-4 Stars
FLAG                Definition
4 Stars ENHANCED    4 levels of consistency with stereo information
4 Stars             4 levels of consistency, stereo ignored
3 Stars Plus        3 of 4 levels consistent; 4th is a tautomer
3 Stars ENHANCED    3 levels of consistency with stereo information
3 Stars             3 levels of consistency, stereo ignored
2 Stars Plus        2 of 4 levels consistent; 3rd is a tautomer
1 Star              Whatever's left – too many errors to count!
4 levels of consistency possible:
• Molblock
• SMILES string
• Chemical name (based on ACD/Labs dictionary)
• CAS Number (based on a DSSTox lookup)
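The flag definitions can be read as a small decision rule. The sketch below is one possible encoding, assuming the per-record comparison has already produced three inputs: how many of the four representations agree with stereo considered, how many agree with stereo ignored, and whether one further representation matches only as a tautomer. The function name, the input names, and the precedence between "3 Stars Plus" and "3 Stars ENHANCED" (taken from the order of the table) are assumptions, not details given in the slides.

```python
def assign_flag(n_stereo: int, n_flat: int, has_tautomer: bool) -> str:
    """Map consistency counts to a star flag.

    n_stereo: representations (of 4) that agree including stereochemistry.
    n_flat: representations (of 4) that agree when stereo is ignored.
    has_tautomer: True if one additional representation differs only as a
        tautomer of the agreed structure.
    """
    if n_stereo == 4:
        return "4 Stars ENHANCED"
    if n_flat == 4:
        return "4 Stars"
    if n_flat == 3 and has_tautomer:
        return "3 Stars Plus"
    if n_stereo == 3:
        return "3 Stars ENHANCED"
    if n_flat == 3:
        return "3 Stars"
    if n_flat == 2 and has_tautomer:
        return "2 Stars Plus"
    # Everything else falls into the lowest bin.
    return "1 Star"
```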
Auto-curation
KNIME structure-“cleaning” workflow
Indigo
• Combine community approaches to structure processing
• Develop a flexible workflow to be used by EPA and shared publicly
• Process DSSTox files to create “QSAR-ready” structures
Publicly available cheminformatics toolkits in KNIME: https://www.knime.org/knime
• Parse SDF, remove fragments
• Explicit hydrogen removed
• Dearomatization
• Removal of chirality info, isotopes and pseudo-atoms
• Aromatization + add explicit hydrogen atoms
• Standardize Nitro groups
• Other tautomerize/mesomerization
• Neutralize (when possible)
Building the Models
• QSAR ready forms for modeling – standardize
tautomers, remove stereochemistry, no salts
• Remove approx. 1800 chemicals due to poor
quality score
• Build using 3 STAR and BETTER chemicals
Rederived model for PhysProp
Log Kow (n=14049)
Weighted kNN model, 5-nearest
neighbors
Training: 11251 chemicals
Test set: 2798 chemicals
5 fold cross validation:
R2: 0.87
RMSE: 0.67
Open source
PaDEL
descriptors
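For readers unfamiliar with the method, a weighted kNN regression and the two reported statistics can be sketched in a few lines. The slides do not state the weighting scheme or distance metric, so the inverse-distance weights and Euclidean distance below are assumptions, as are the function names and the toy one-dimensional data (the real model uses PaDEL descriptor vectors). The sketch covers prediction and the metrics only, not the 5-fold cross-validation loop.

```python
import math


def knn_predict(train_X, train_y, query, k=5, eps=1e-9):
    """Inverse-distance-weighted mean of the k nearest training values."""
    # Pair each training point's Euclidean distance to the query with its value.
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    # eps avoids division by zero when the query coincides with a training point.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data for illustration only; not taken from the PhysProp set.
train_X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]
prediction = knn_predict(train_X, train_y, (2.5,))
```

With k=5 the distant point at 10.0 is ignored for this query, which is the basic reason a kNN model's predictions stay local to the training chemicals nearest the query.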
Remember the original EPI Suite
outlier scatter?
[Figure panels: EPI Suite Predicted vs. Experimental; EPI Suite vs. kNN Comparison of Cluster]
• Applicability domain (AD) of original EPI Suite LogKow model improved in updated PhysProp model
* 280-compound cluster outside AD of EPI Suite → no longer outliers in new model
Updated PhysProp results
• Performed curation and cleaning of all 14 PhysProp measured
property datasets
• Rederived models using modern ML methods & open source
descriptors provide significantly improved prediction accuracy
over older EPI Suite models
• Predictions generated for entire (>700K) DSSTox database,
stored in new property database
• Measured & predicted properties to be surfaced in first release of iCSS Chemistry Dashboard (April 2016)
• Global statistics insensitive to local curation improvements in >15K training set, BUT exceedingly important when surfacing measured properties in support of individual prediction values
iCSS Chemistry Dashboard
Releasing April 2016
• PHASE 1 Delivery – Web interface supporting CSS research
– access to DSSTox content: >700,000 chemicals
– access to experimental property data
• Initial set of models based on reanalysis of cleaned, curated
EPI Suite PHYSPROP datasets – logP, BP, MP, Wsol etc.
iCSS Chemistry Dashboard
Releasing in April 2016
Formerly DSSTox GSID, new unique
public substance ID, essential for
RDF/semantic web applications
Melting Point
iCSS Chemistry Dashboard
Releasing in April 2016
• Original raw source result view
• Users can submit comments
• Links to external resources: EPA, NIH, property predictors
iCSS Chemistry Dashboard
Releasing in April 2016
e.g., ChemAxon’s
“Chemicalize” web-service
e.g., PubChem’s
bioactivity summary
Linkage to External Predictor, e.g.
Chemicalize
Chemicalize.org (powered by ChemAxon)
Atomic
Charges
Molecule
pKa
Working to integrate other EPA
predictors
• ExpoCast
– near & far-field exposure models
• Environmental Fate Simulator (EFS)
– air/soil/water distribution, biotransformation
• CERAPP
– estrogen receptor activity QSAR model
• T.E.S.T. (in progress)
– phys-chem properties & toxicity endpoints
e.g., T.E.S.T.
http://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test
• Downloadable Windows app
• Both phys-chem and tox endpoints predicted
• Multiple QSAR modeling methods employed
• Multiple views of data
T.E.S.T. QSAR Model
Prediction Views
Model statistics
Query chemical
Prediction results
Building “NCCT Models”
• Collect “Open Data” for various endpoints of interest to EPA, from e.g.
– PubChem
– eChemPortal
– Open Data Sources
• & Build QSAR prediction models
– Using modern machine-learning approaches
– Ensuring high quality data as inputs
– Populating database with measured & predicted values
– Making experimental data training sets available
– Allowing real-time predictions (in the future)
Quality review required at all levels of chemical
“Representations”
Structure
Generic
Substance
Test
Sample
Supplier, Lot/Batch,
physical description
Features
Properties
Descriptors, fingerprints, charges,
phys-chem properties, etc.
SMILES
InChI
Chemical Name
CASRN
Increasing data aggregation
In conclusion…
• iCSS Chemistry Dashboard will provide public access to
data & services focused on chemicals of interest to EPA
• Initially serve up results for measured & pre-predicted
physical chemical property data
• Chemical-Data consistency quality flags for DSSTox and
property data add value to public domain data, BUT …
chemical-data linkages require curation/validation!!
• All data and models will be available as OPEN DATA and
OPEN CODE
Acknowledgements
NCCT Contributors to the iCSS Chemistry Dashboard
Jeff Edwards
Jeremy Fitzpatrick
Jordan Foster
Jason Harris
Dave Lyons
Kamel Mansouri
Aria Smith
Jennifer Smith
Indira Thillainadarajah
Chris Grulke
Antony Williams

Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)
 
Scoring and ranking of metabolic trees to computationally prioritize chemical...
Scoring and ranking of metabolic trees to computationally prioritize chemical...Scoring and ranking of metabolic trees to computationally prioritize chemical...
Scoring and ranking of metabolic trees to computationally prioritize chemical...
 
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor ActivityCoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
 
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
 
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
 
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
 
In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 

Último (20)

Velocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.pptVelocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.ppt
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 

The influence of data curation on QSAR Modeling – Presented at American Chemical Society, San Diego, CA, March 13 - 17, 2016.

  • 1. Kamel Mansouri, Chris Grulke, Ann Richard*, and Antony J. Williams. National Center for Computational Toxicology, US EPA, RTP, NC. The influence of data curation on QSAR Modeling – examining issues of quality versus quantity of data. March 2016, ACS Spring Meeting, San Diego, CA. This work was reviewed by EPA and approved for presentation but does not necessarily reflect official Agency policy.
  • 2. Sustainable Chemistry: Human Health Hazard; Ecosystems Hazard; Environmental Persistence. “Design and use of chemicals that minimize impacts to human health, ecosystems and the environment” • Screen chemicals for toxicity & exposure potential • Prioritize testing of thousands of existing chemicals lacking data • Evaluate how chemical transformations impact hazard & environmental persistence
  • 3. DSSTox Chemical IDs: Chemistry Foundation to Support Multiple NCCT & EPA Projects. [Diagram labels] EPA SRS (TSCA, EcoTox); EDSP21; ToxCast/Tox21; CPCAT; ToxRef; 1:1 mapping CAS:Name:structure; quality scores; >700K substances; “Chemical Safety & Sustainability” (CSS) public web-accessible dashboards; knowledge-informed features & chemotypes; ToxCast/Tox21 high-throughput screening (HTS) projects; CAS look-up; list curation tools; chemical search by CAS-names, SMILES, structures; structure file downloads; PK/ADME model inputs & outputs; structure “cleaning”; SAR-ready structure files; structure-based predictions; similarity searching; fingerprints, feature sets; QSAR models; phys-chem calculated & measured properties
  • 4. DSSTox Update. DSSTox_v1: • Manually curated 25K substance records • EPA-focus, environmental tox datasets • Emphasis on accurate CAS-name-structure annotations • Public resource for high-quality structure-data files (SDF). DSSTox_v2: • Convert DSSTox tables to MySQL • Develop curation interface & cheminformatics workflow • Expand chemical content • Web-services & Dashboard access
  • 5. Building DSSTox_v2. DSSTox_v1 (~21K, ~5K): 1:1:1 CAS:name:structure, no conflicts allowed; Environmental/Toxicity/Exposure Relevance; Manually Curated. Sources: EPA SRS, ChemID, PubChem, EPA ACToR, EPA-relevant lists. [Figure counts] ~735K, ~150K, >1M, ~54K, ~310K, ~625K, ~5K, ~16K, ~33K, ~101K. DSSTox QC Levels: Public_Med, Public_High, DSSTox_High, DSSTox_Low, Public_Low, Public_Untrusted (not loaded)
  • 6. DSSTox_v2 Totals. QC Level Totals (12Jun2015), highest confidence first. Curated: 1. High, 2. Low. Public: 3. High, 4. Med, 5. Low, 6. Untrusted. 7. Incomplete. [Figure labels] Public_Untrusted, Public_Low, Public_Medium, Public_High, DSSTox_Low, DSSTox_High; 4535, 16K, 33K, 101K, 584K; ~310K pending; ~150K pending. ~150K substances in top 4 QC quality bins (single source)
  • 7. Building the Chemistry Software Architecture. [Diagram labels] External Services; APIs and Web Services; Data Layer; Chemical-Biology Platforms; User Interfaces; UI Components
  • 8. Building Predictive Models. NCCT Modeling Activities: • Receptor-mediated activities (e.g., ER, AR) • Toxicity endpoint QSAR models • Metabolism prediction • In vitro-in vivo toxicity extrapolation models (IVIVE) • Read-across toxicity estimation • Near and far-field exposure models. EPA-NCCT Collaborations: • Analytical dust and media screening for environmental contaminants • Biotransformation, transport, fate modeling • ADME, PBPK models • Chemical use category modeling • Green chemistry. Physical chemical properties: water solubility; LogP; partition coefficients (air/water/soil); MP, BP, vapor pressure
  • 9. Building a physical chemical property database. Physical chemical properties: measured (where available), predicted (where not available) • Collect data (sort wheat from chaff) • Determine how to best mesh data together (structure? CAS? other?) • Check data self-consistency • Structure validation vs. property value validation (very different challenges) • Build training/test sets of chemicals with experimental values (collapse or resolve duplicates) • Apply structure-cleaning workflow to produce “QSAR-ready” structures • Calculate descriptors • Apply modeling algorithm (regression, ML, etc.). Store measured & predicted values in relational database, linked to data sources, model details, etc.
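The "collapse or resolve" step for duplicate measurements can be sketched as below. This is only an illustration, not the authors' workflow: the function name `collapse_duplicates`, the 0.5-unit tolerance, and the choice of the median are assumptions, and `structure_key` stands for whatever canonical identifier the records have already been mapped to.

```python
from collections import defaultdict
from statistics import median


def collapse_duplicates(records, tolerance=0.5):
    """Group (structure_key, value) pairs and collapse agreeing duplicates.

    Returns two dicts:
      collapsed - one representative value (the median) per structure whose
                  measurements agree within `tolerance`
      conflicts - the sorted raw values for structures whose measurements
                  disagree and need manual resolution
    """
    grouped = defaultdict(list)
    for structure_key, value in records:
        grouped[structure_key].append(value)

    collapsed, conflicts = {}, {}
    for structure_key, values in grouped.items():
        if max(values) - min(values) <= tolerance:
            collapsed[structure_key] = median(values)
        else:
            conflicts[structure_key] = sorted(values)
    return collapsed, conflicts


# Hypothetical records: keys and values are made up for illustration.
records = [("A", 1.0), ("A", 1.2), ("B", 3.0), ("C", 0.5), ("C", 2.5)]
collapsed, conflicts = collapse_duplicates(records)
```

Keeping the conflicting groups separate, rather than averaging them away, leaves a worklist for the manual review that the later slides describe.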
  • 10. Curating structure-linked data • Establish accurate linkage of measured data to chemical structure • Published experimental data → chemical IDs {CAS, name, and/or structure} – Fix ID errors – Resolve ID conflicts – Normalize, clean structures – Resolve duplicate mappings. Lots of experience from building DSSTox!
  • 11. EPI Suite™ Property Estimation Software • Developed by EPA’s Office of Pollution Prevention & Toxics (with Syracuse Research Corp); initial models published over 20 years ago, desktop app released in 2000 • Estimates a wide range of physical and environmental properties using QSAR approaches: KOWWIN™, AOPWIN™, HENRYWIN™, MPBPWIN™, BIOWIN™, BioHCwin, KOCWIN™ • Incorporates these into dependent fate and toxicity models: WSKOWWIN™, WATERNT™, BCFBAF™, HYDROWIN™, KOAWIN, AEROWIN™, WVOLWIN™, STPWIN™, LEV3EPI™, ECOSAR™ • Used to fill data gaps in EPA Pre-Manufacture Notification (PMN) chemical submissions; widely used outside EPA
  • 12. EPI Suite™ Property Estimation Software • Downloadable Windows application available at: http://www.epa.gov/tsca-screening-tools/epi-suitetm-estimation-program-interface • 14 PhysProp experimental property data sets used in training models can be freely accessed from within the app
  • 13. EPI Suite KOWWIN: 2447 Training Compounds • 2447 chemicals with measured Log Kow values used to train model • >15K measured Log Kow values available in PhysProp file
  • 14. Overall Predicted vs. Experimental (>15k chemicals) • Use EPI Suite model built on 2447 measured values to predict >15K values in PhysProp file
  • 15. Why derive new models from EPI Suite PhysProp datasets? • Significant advances in cheminformatics and modeling approaches since before 2000 • Enlarged training sets likely to improve models and expand applicability domain • But first… perhaps we should take a look at the >20-year-old PhysProp datasets that we’ll be using to rebuild the models!
  • 16. Review of PhysProp datasets • Exported SDF files from EPI Suite application • Basic manual review searching for errors: – hypervalency, charge imbalance, undefined stereo – deduplication – mismatches between identifiers: CAS Numbers not matching structure; names not matching structure; collisions between identifiers
  • 18. Differences in Values (MP). Two experimental records: same structure; different CAS, name, experimental & predicted MP
  • 19. Covalently bound salt structures
  • 20. Collisions in Records Same structure depictions (Molfiles) - different CAS, different names, AND different SMILES
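A collision check of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions: each record is a dict with hypothetical `structure` and `cas` fields, and `structure` already holds a canonical key (for example a canonical SMILES or InChIKey produced elsewhere). The slides do not show how the authors implemented the check.

```python
from collections import defaultdict


def find_collisions(records):
    """Report identifier collisions in both directions.

    Returns two dicts:
      by_structure - structure keys linked to more than one CAS number
      by_cas       - CAS numbers linked to more than one structure key
    """
    cas_by_structure = defaultdict(set)
    structure_by_cas = defaultdict(set)
    for record in records:
        cas_by_structure[record["structure"]].add(record["cas"])
        structure_by_cas[record["cas"]].add(record["structure"])

    by_structure = {
        s: sorted(c) for s, c in cas_by_structure.items() if len(c) > 1
    }
    by_cas = {
        c: sorted(s) for c, s in structure_by_cas.items() if len(s) > 1
    }
    return by_structure, by_cas


# Hypothetical records: the structure keys S1-S3 are placeholders.
records = [
    {"structure": "S1", "cas": "50-00-0"},
    {"structure": "S1", "cas": "71-43-2"},
    {"structure": "S2", "cas": "58-08-2"},
    {"structure": "S3", "cas": "58-08-2"},
]
by_structure, by_cas = find_collisions(records)
```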
  • 21. Missing or erroneous identifiers • Many chemical names are truncated • Many chemicals don’t have CAS Numbers
  • 22. 15,809 chemicals in KOWWIN data file • Invalid CAS checksum: 3646 • Invalid names: 555 • Invalid SMILES: 133 • Valence errors: 322 Molfile, 3782 SMILES • Duplicates check: – 31 Molfile; 626 SMILES; 531 names • SMILES vs. Molfiles (structure check): – 1279 differ in stereochemistry – 362 “covalent halogens” – 191 differ as tautomers – 436 are different compounds. Whoa!! Lots of problems…
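The CAS checksum test follows a public rule: the last digit of a CAS Registry Number equals the weighted sum of the preceding digits (weights 1, 2, 3, … counted from the right) modulo 10. A minimal sketch follows; the function name `is_valid_casrn` is illustrative, and the slides do not show the authors' own implementation.

```python
import re


def is_valid_casrn(casrn: str) -> bool:
    """Return True if `casrn` has the N...N-NN-N layout and a correct check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


water_ok = is_valid_casrn("7732-18-5")      # water
caffeine_ok = is_valid_casrn("58-08-2")     # caffeine
corrupted_ok = is_valid_casrn("7732-18-4")  # last digit altered
```

A passing checksum only shows that the number is well formed; it says nothing about whether the number is attached to the right structure, which is why the structure-level checks on the same slide are still needed.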
  • 23. Quality flags added to data: 1-4 Stars (auto-curation). 4 levels of consistency possible: • Molblock • SMILES string • chemical name (based on ACD/Labs dictionary) • CAS Number (based on a DSSTox lookup). Flag definitions:
      – 4 Stars ENHANCED: 4 levels of consistency, with stereo information
      – 4 Stars: 4 levels of consistency, stereo ignored
      – 3 Stars Plus: 3 of 4 levels consistent; 4th is a tautomer
      – 3 Stars ENHANCED: 3 levels of consistency, with stereo information
      – 3 Stars: 3 levels of consistency, stereo ignored
      – 2 Stars Plus: 2 of 4 levels consistent; 3rd is a tautomer
      – 1 Star: whatever's left – too many errors to count!
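The core of such a flag is a count of how many of the four representations agree on one structure. The sketch below is a deliberately simplified illustration, not the auto-curation code behind the slide. It assumes each representation (Molblock, SMILES, name, CAS) has already been resolved to a comparable structure key by some external toolkit or lookup, and it ignores the tautomer ("Plus") and stereo ("ENHANCED") refinements. The function name `consistency_level` is hypothetical.

```python
from collections import Counter


def consistency_level(molblock_key, smiles_key, name_key, cas_key):
    """Count how many of the four representations agree on one structure.

    Each argument is a structure key (or None if that representation is
    missing or could not be resolved). The result is the size of the
    largest agreeing group, from 0 to 4.
    """
    keys = [
        key
        for key in (molblock_key, smiles_key, name_key, cas_key)
        if key is not None
    ]
    if not keys:
        return 0
    return Counter(keys).most_common(1)[0][1]


# Hypothetical keys: K1-K3 stand for resolved structure identifiers.
all_agree = consistency_level("K1", "K1", "K1", "K1")
one_differs = consistency_level("K1", "K1", "K2", "K1")
none_agree = consistency_level("K1", "K2", "K3", None)
```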
  • 24. KNIME structure-“cleaning” workflow • Combine community approaches to structure processing • Develop a flexible workflow to be used by EPA and shared publicly • Process DSSTox files to create “QSAR-ready” structures. Publicly available cheminformatics toolkits in KNIME (Indigo): https://www.knime.org/knime. Workflow steps: • Parse SDF, remove fragments • Remove explicit hydrogens • Dearomatization • Removal of chirality info, isotopes and pseudo-atoms • Aromatization + add explicit hydrogen atoms • Standardize nitro groups • Other tautomerization/mesomerization • Neutralize (when possible)
  • 25. Building the Models • QSAR ready forms for modeling – standardize tautomers, remove stereochemistry, no salts • Remove approx. 1800 chemicals due to poor quality score • Build using 3 STAR and BETTER chemicals
  • 26. Rederived model for PhysProp Log Kow (n=14049) • Weighted kNN model, 5 nearest neighbors • Training: 11251 chemicals • Test set: 2798 chemicals • 5-fold cross validation: R2: 0.87, RMSE: 0.67 • Open source PaDEL descriptors
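For readers unfamiliar with the method, a weighted k-nearest-neighbor regression and the two reported statistics can be sketched as below. This is a toy illustration under stated assumptions: the slide does not say which weighting scheme or distance metric was used, so inverse-distance weights and Euclidean distance are assumed here, and the one-dimensional feature vectors stand in for the PaDEL descriptors.

```python
import math


def knn_predict(train_X, train_y, query, k=5):
    """Predict a value as the inverse-distance-weighted mean of the k nearest neighbors."""
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    # If the query coincides with training points, average those directly
    # to avoid dividing by a zero distance.
    exact = [y for d, y in neighbors if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in neighbors]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: a single made-up descriptor whose value equals the target.
train_X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]
midpoint_prediction = knn_predict(train_X, train_y, (0.5,), k=2)
exact_prediction = knn_predict(train_X, train_y, (2.0,), k=5)
```

Because each prediction is built from a handful of nearby training compounds, errors in those compounds' records feed directly into individual predictions even when global statistics barely move.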
  • 27. Remember the original EPI Suite outlier scatter? [Plots: EPI Suite predicted vs. experimental; EPI Suite vs. kNN, comparison of cluster] • Applicability domain (AD) of original EPI Suite LogKow model improved in updated PhysProp model * 280-compound cluster outside AD of EPI Suite → no longer outliers in new model
  • 28. Updated PhysProp results • Performed curation and cleaning of all 14 PhysProp measured property datasets • Rederived models using modern ML methods & open source descriptors provide significantly improved prediction accuracy over older EPI Suite models • Predictions generated for entire (>700K) DSSTox database, stored in new property database • Measured & predicted properties to be surfaced in first release of iCSS Chemistry Dashboard (April 2016) • Global statistics insensitive to local curation improvements in >15K training set, BUT exceedingly important when surfacing measured properties in support of individual prediction values
  • 29. iCSS Chemistry Dashboard Releasing April 2016 • PHASE 1 Delivery – Web interface supporting CSS research – access to DSSTox content: >700,000 chemicals – access to experimental property data • Initial set of models based on reanalysis of cleaned, curated EPI Suite PHYSPROP datasets – logP, BP, MP, Wsol, etc.
  • 30. iCSS Chemistry Dashboard Releasing in April 2016. [Screenshot annotations] Formerly DSSTox GSID, new unique public substance ID, essential for RDF/semantic web applications; Melting Point
  • 31. iCSS Chemistry Dashboard Releasing in April 2016 • Original raw source result view • Users can submit comments
  • 32. iCSS Chemistry Dashboard Releasing in April 2016 • Links to external resources: EPA, NIH, property predictors, e.g., ChemAxon’s “Chemicalize” web-service, PubChem’s bioactivity summary
  • 33. Linkage to External Predictor, e.g. Chemicalize. Chemicalize.org (powered by ChemAxon): atomic charges; molecule pKa
  • 34. Working to integrate other EPA predictors • ExpoCast → near & far-field exposure models • Environmental Fate Simulator (EFS) → air/soil/water distribution, biotransformation • CERAPP → estrogen receptor activity QSAR model • T.E.S.T. (in progress) → phys-chem properties & toxicity endpoints
  • 35. e.g., T.E.S.T. http://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test • Downloadable Windows app • Both phys-chem and tox endpoints predicted • Multiple QSAR modeling methods employed • Multiple views of data
  • 36. T.E.S.T. QSAR Model Prediction Views. [Screenshot labels] Model statistics; query chemical; prediction results
  • 37. Building “NCCT Models” • Collect “Open Data” for various endpoints of interest to EPA, from e.g. – PubChem – eChemPortal – Open Data Sources • & Build QSAR prediction models – Using modern machine-learning approaches – Ensuring high quality data as inputs – Populating database with measured & predicted values – Making experimental data training sets available – Allowing real-time predictions (in the future)
  • 38. Quality review required at all levels of chemical “Representations”. [Diagram labels] Structure; Generic Substance; Test Sample (supplier, lot/batch, physical description); Features, Properties (descriptors, fingerprints, charges, phys-chem properties, etc.); SMILES; InChI; Chemical Name; CASRN; increasing data aggregation
  • 39. In conclusion… • iCSS Chemistry Dashboard will provide public access to data & services focused on chemicals of interest to EPA • Initially serve up results for measured & pre-predicted physical chemical property data • Chemical-Data consistency quality flags for DSSTox and property data add value to public domain data, BUT … chemical-data linkages require curation/validation!! • All data and models will be available as OPEN DATA and OPEN CODE
  • 40. Acknowledgements. NCCT Contributors to the iCSS Chemistry Dashboard: Jeff Edwards, Jeremy Fitzpatrick, Jordan Foster, Jason Harris, Dave Lyons, Kamel Mansouri, Aria Smith, Jennifer Smith, Indira Thillainadarajah, Chris Grulke, Antony Williams