
Prediction of therapeutic targets using the Open Targets data



  1. Prediction of therapeutic targets using the Open Targets data
     Enrico Ferrero, PhD, Associate GSK Fellow
     Scientific Leader, Computational Biology, Target Sciences, GSK
     Artificial Intelligence in Drug Development, 27.09.2017
  2. Challenges in pharma R&D
     Time and costs are increasing but success rate is declining
     Slide footer: In silico prediction of novel therapeutic targets using gene – disease association data, Enrico Ferrero
  3. Why focus on targets?
     Late phase failures cost (a lot) more
     [Figure: relative cost (per molecule) and N molecules by stage: lead discovery, lead optimization, pre-clinical, FTIH, Phase 2, Phase 3. Source: Manhattan Institute, 2012]
  4. Rethink the drug discovery pipeline
     Spend more time and resources in target validation to reduce attrition in later phases
     [Figure: drug discovery pipeline diagrams with the stages potential targets, target validation, lead discovery, lead optimisation, pre-clinical, FTIH, Phase 2, Phase 3, launch]
  5. Target discovery and genetics evidence
     - 40% of efficacy failures are due to poor linkage between target and disease.
     - The proportion of drug mechanisms with direct genetic support increases significantly across the drug development pipeline.
     - Selecting genetically supported targets could double the success rate in clinical development.
     Cook et al., 2014; Nelson et al., 2015
  6. Open Targets
     A platform for therapeutic target identification and validation
  7. Could it be as easy as spotting spam emails?
     Predicting therapeutic targets
     - Is it possible to predict novel therapeutic targets using available gene – disease association data?
  8. A simple machine learning workflow
     Predict therapeutic targets only using gene – disease association data
     - Generate input data matrix
     - Assign labels and split into training, test and prediction sets
     - Exploratory data analysis
     - Tune, train and test classifiers using nested cross-validation
     - Evaluate best classifier performance on test set
     - Explore predicted targets across the drug discovery pipeline
     - Make predictions using best performing classifier
     - Validate with literature text mining
  9. Data sources and data processing
     Input data matrix generation
     - Obtain all gene – disease associations and supporting evidence from the Open Targets platform.
     - For all genes, create numeric features by taking the mean score across all diseases:
       - Genetic associations (germline)
       - Somatic mutations
       - Significant gene expression changes
       - Disease-relevant phenotype in animal model
       - Pathway-level evidence
     - Gather positive labels from Pharmaprojects: only consider targets with drugs currently on the market, in clinical trials or in preclinical studies. Exclude targets with drugs withdrawn from the market or whose development has been discontinued.
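
The feature construction on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the gene and disease names, the scores and the evidence-type identifiers are all made up, the mean is taken over the diseases for which a gene has a score, and missing evidence is filled with 0.0. The slides do not specify any of these details.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical identifiers for the five evidence types listed on the slide.
EVIDENCE_TYPES = [
    "genetic_association",
    "somatic_mutation",
    "rna_expression",
    "animal_model",
    "affected_pathway",
]

# Made-up association records: (gene, disease, evidence_type, score).
records = [
    ("GENE_A", "asthma", "genetic_association", 0.8),
    ("GENE_A", "psoriasis", "genetic_association", 0.4),
    ("GENE_A", "asthma", "animal_model", 0.5),
    ("GENE_B", "melanoma", "somatic_mutation", 1.0),
]


def build_feature_matrix(records, evidence_types):
    """Return {gene: [mean score per evidence type]}.

    Assumption: evidence types with no score for a gene become 0.0.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for gene, _disease, evidence, score in records:
        scores[gene][evidence].append(score)

    matrix = {}
    for gene, by_evidence in scores.items():
        matrix[gene] = [
            mean(by_evidence[e]) if by_evidence.get(e) else 0.0
            for e in evidence_types
        ]
    return matrix


features = build_feature_matrix(records, EVIDENCE_TYPES)
```

Each gene ends up as one row with five numeric columns, which is the shape of input a classifier would expect.
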
  10. A positive – unlabelled (PU) semi-supervised learning approach
      Split data into training, test and prediction set
      - A semi-supervised framework with only positive labels is used: targets according to Pharmaprojects constitute the positive class (P), while the rest of the proteome is used as the unlabelled class (U), containing both negatives and yet-to-be-discovered positives.
      - All positive cases (1421) and an equal number of randomly selected unlabelled cases (2842 in total) are set apart for training (80%) and testing (20%).
      - The remainder is kept as a prediction set, on which predictions from the final model will be made.
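
A minimal sketch of the split described on this slide, using only the Python standard library. The function name `pu_split`, the gene identifiers and the size of the unlabelled pool (10000) are placeholders, and a plain shuffled 80/20 cut is assumed; the slides do not say whether the split was stratified.

```python
import random


def pu_split(positives, unlabelled, train_fraction=0.8, seed=0):
    """Split genes into training, test and prediction sets for PU learning."""
    rng = random.Random(seed)

    # Draw as many unlabelled genes as there are positives.
    sampled_u = rng.sample(unlabelled, len(positives))
    sampled_set = set(sampled_u)

    # Everything not sampled is held back as the prediction set.
    prediction_set = [g for g in unlabelled if g not in sampled_set]

    # Label positives 1 and sampled unlabelled genes 0, then shuffle.
    labelled = [(g, 1) for g in positives] + [(g, 0) for g in sampled_u]
    rng.shuffle(labelled)

    cut = round(train_fraction * len(labelled))
    return labelled[:cut], labelled[cut:], prediction_set


# Placeholder identifiers; 1421 positives as on the slide.
positives = [f"P{i}" for i in range(1421)]
unlabelled = [f"U{i}" for i in range(10000)]
train, test, prediction = pu_split(positives, unlabelled)
```

With 1421 positives this yields 2842 labelled cases, split into 2274 for training and 568 for testing.
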
  11. Dimensionality reduction reveals structure in the data
      t-Distributed Stochastic Neighbour Embedding (t-SNE)
  12. What are the most “important” features?
      Chi-squared test + information gain
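
As a rough illustration of the information gain measure named on this slide, the sketch below scores one numeric feature against binary labels. Splitting the feature at a single fixed threshold is an assumption made here for simplicity; the slides do not say how features were discretised, and the toy values are invented.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * log2(p)
    return result


def information_gain(feature_values, labels, threshold):
    """Entropy reduction from splitting the labels at `threshold`."""
    below = [y for x, y in zip(feature_values, labels) if x <= threshold]
    above = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    conditional = (
        len(below) / n * entropy(below) + len(above) / n * entropy(above)
    )
    return entropy(labels) - conditional


# Invented toy data: one feature separates the classes, the other does not.
labels = [0, 0, 1, 1]
informative = [0.0, 0.0, 0.6, 0.7]
uninformative = [0.0, 0.7, 0.0, 0.7]

gain_informative = information_gain(informative, labels, threshold=0.5)
gain_uninformative = information_gain(uninformative, labels, threshold=0.5)
```

A feature that separates the two classes perfectly recovers the full label entropy, while one that is independent of the labels gains nothing.
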
  13. Nested cross-validation and bagging for tuning and model selection
      Tuning, training and testing four classifiers
      - Four classifiers are independently tuned, trained and tested on the training set using a nested cross-validation strategy (4 inner rounds for parameter tuning and 4 outer rounds to assess performance):
        - Random forest (tuned parameters: number of trees and number of features);
        - Feed-forward neural network with single hidden layer (tuned parameters: size and decay);
        - Support vector machine with radial kernel (tuned parameters: gamma and cost);
        - Gradient boosting machine with AdaBoost exponential loss function (tuned parameters: number of trees and interaction depth).
      - In PU learning, U contains both positive and negative cases, which results in classifier instability. Bagging (bootstrap aggregating) can improve the performance of unstable classifiers by randomly resampling P and U with replacement (bootstrap) and then aggregating the results by majority voting:
        - Bagging with 100 iterations was applied to the neural network, the support vector machine and the gradient boosting machine.
        - Random forests are already a special case of bagging.
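
The bagging step can be sketched as follows. The base learner here is a toy nearest-centroid classifier, swapped in purely to keep the example self-contained; it is not one of the four classifiers from the talk. The function names and the toy data are likewise hypothetical.

```python
import random
from collections import Counter


def fit_centroids(X, y):
    """Toy base learner (an assumption): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroids(centroids, x):
    """Return the label of the closest class centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def bagged_predict(P, U, x, n_iterations=100, seed=0):
    """Bootstrap P and U, fit one model per resample, majority-vote on x."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_iterations):
        boot_p = rng.choices(P, k=len(P))  # sample with replacement
        boot_u = rng.choices(U, k=len(U))
        X = boot_p + boot_u
        y = [1] * len(boot_p) + [0] * len(boot_u)
        model = fit_centroids(X, y)
        votes[predict_centroids(model, x)] += 1
    return votes.most_common(1)[0][0]


# Invented data: U hides one positive-looking case, as in PU learning.
P = [[0.8, 0.7], [0.9, 0.6], [0.7, 0.9]]
U = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [0.8, 0.8]]

label_high = bagged_predict(P, U, [0.85, 0.75])
label_low = bagged_predict(P, U, [0.05, 0.10])
```

Individual bootstrap models can be misled when a resample of U is dominated by the hidden positive; the majority vote over 100 resamples smooths this out.
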
  14. Evaluating classifier performance
      Neural network classifier achieves 71% accuracy (0.76 AUC) on test set
  15. Disease association evidence higher for more advanced targets
      Model predicts late-stage targets more easily than early-stage ones
  16. Literature text mining validation of predictions
      Highly significant overlap between predictions and text mining results (p-value = 5.05e-172)
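
One common way to assess the significance of an overlap between two gene sets is a one-sided hypergeometric test. The slide does not name the test behind the reported p-value, so the sketch below is an assumption about the general technique rather than a reconstruction of the analysis; the function name and the counts in the example are invented.

```python
from fractions import Fraction
from math import comb


def hypergeom_sf(overlap, population, n_predicted, n_text_mined):
    """Exact P(X >= overlap) under the hypergeometric distribution.

    X is the number of text-mined genes found among n_predicted genes
    drawn without replacement from `population` genes.
    """
    total = comb(population, n_predicted)
    upper = min(n_predicted, n_text_mined)
    tail = sum(
        comb(n_text_mined, k) * comb(population - n_text_mined, n_predicted - k)
        for k in range(overlap, upper + 1)
    )
    return Fraction(tail, total)


# Invented counts: 20 genes, 5 predicted, 5 text-mined, 3 in common.
p_value = hypergeom_sf(overlap=3, population=20, n_predicted=5, n_text_mined=5)
```

Exact integer arithmetic is used so that very small tail probabilities do not underflow; `float(p_value)` converts the result when needed.
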
  17. Conclusions
      In silico predictions of novel therapeutic targets using gene – disease association data
      - The gene – disease association data from Open Targets contains enough information to predict whether a protein can make a therapeutic target or not with decent accuracy (71%).
      - Aside from standard cross-validation and testing, prediction results were also validated by mining the scientific literature for therapeutic targets and assessing the significance of the overlap.
      - The ability of the neural network model to predict late-stage targets with greater accuracy confirms that clear linkage between target and disease is essential to maximise chances of success in the clinic.
      - Of the evidence types tested, animal models showing disease-relevant phenotypes, dysregulated gene expression in disease tissue and genetic associations between gene and disease appear to be the most informative.
      - Limitations:
        - Lack of prediction on indication;
        - No tractability considerations.
  18. Acknowledgements
      - Ian Dunham
      - Philippe Sanseau
      - Gautier Koscielny
      - Giovanni Dall’Olio
      - Pankaj Agarwal
      - Mark Hurle
      - Steven Barrett
      - Nicola Richmond
      - Jin Yao
  19. Thank you
  20. Pharmaprojects
      An industry-wide drug development database
  21. Exploratory data analysis reveals sparse data with little structure
      Hierarchical clustering + principal component analysis
  22. Tune, train and test classifiers using cross-validation
      Decision tree classification criteria
  23. Evaluating classifier performance
      Performance measures for supervised learning
  24. Neural network performance on independent test set
      Selected classifier with most balanced overall performance for further analyses
      Measure                  Cross-validation  Test
      Misclassification error  0.303             0.287
      Accuracy                 0.697             0.713
      AUC                      0.758             0.763
      Recall/Sensitivity       0.610             0.638
      Specificity              0.785             0.784
      Precision                0.742             0.736
      F1 Score                 0.670             0.683
  25. Tune, train and test classifiers using cross-validation
      Misclassification error
  26. Evaluate best classifier performance on test set
      Confusion matrices (rows: actual value; columns: prediction outcome)
      Cross-validation  Predicted Unknown  Predicted Target
      Actual Unknown    912                217
      Actual Target     445                700
      Test              Predicted Unknown  Predicted Target
      Actual Unknown    225                67
      Actual Target     99                 177
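
The standard performance measures can be derived from a confusion matrix as sketched below, treating "Target" as the positive class. The function name is hypothetical. Note that values computed from the test counts on this slide (for example, an accuracy of about 0.708) are close to, but not identical to, the figures reported on slide 24; the slides do not explain how the reported figures were aggregated.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute common binary classification measures from raw counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "misclassification_error": 1 - accuracy,
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Test-set counts from the slide, with "Target" as the positive class.
test_metrics = classification_metrics(tn=225, fp=67, fn=99, tp=177)
```
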
  27. Split into training, test and prediction sets
      Assess the effect of randomly sampling from unlabelled class: Monte Carlo simulation
  28. Tune, train and test classifiers using cross-validation
      Precision-recall curves
  29. Tune, train and test classifiers using cross-validation
      Overlap between predictions on training set
      [Figure panels: predicted targets, predicted non-targets]
  30. Majority of targets with discontinued programmes not predicted as targets
      Targets with lower disease association fail more often
  31. Generating predictions on remaining 15K genes
      Run model on prediction set (not used for training/testing)
  32. Validate with literature text mining
      Assess the significance of the literature-based validation: permutation test
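
A permutation test for the overlap between predicted targets and text-mined targets could look like the sketch below. It assumes the null model is "draw a random gene set of the same size as the predictions from the same gene universe"; the slides do not describe the permutation scheme, and the function name, gene identifiers and set sizes are invented for illustration.

```python
import random


def permutation_p_value(predicted, text_mined, universe,
                        n_permutations=1000, seed=0):
    """Empirical P(random overlap >= observed overlap)."""
    rng = random.Random(seed)
    text_mined = set(text_mined)
    predicted = set(predicted)
    universe = list(universe)
    observed = len(predicted & text_mined)

    hits = 0
    for _ in range(n_permutations):
        # Random gene set of the same size as the predictions.
        random_set = rng.sample(universe, len(predicted))
        if len(text_mined.intersection(random_set)) >= observed:
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Invented example: 1000 genes, 50 predicted, 50 text-mined, 25 shared.
universe = range(1000)
predicted = range(50)
text_mined = range(25, 75)
p_value = permutation_p_value(predicted, text_mined, universe)
```

The smallest p-value such a test can report is 1 / (n_permutations + 1), so it serves as a sanity check on an analytical p-value and cannot reproduce its magnitude.
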
