1. Knowledge-Based Expert System Development in Bioinformatics Applied to Multiple Sequence Alignment. Mohamed Radhouane Aniba, Laboratoire de Bioinformatique et de Génomique Intégratives. Supervisors: Julie Thompson, Aron Marchler-Bauer. 15/09/2010
2. Outline. INTRODUCTION: Infosphere and Knowledge Discovery; Biological Knowledge Discovery; Data Integration; Knowledge-Based Expert Systems in Bioinformatics; KBS: Application to Multiple Sequence Alignment. ALIGNMENT EXPERT SYSTEM: AlexSys: Design; Implementation; Evaluation. CONCLUSIONS AND PERSPECTIVES. When the Information Age meets the Postgenomic Era. Data storage, warehousing and quality. From Data Integration to Knowledge Discovery… Challenges. Why do we need them? Ideal case study? Data/Text mining, machine learning, knowledgeware. Not a software, not a workflow… Bioinformatics Mash-up: Unstructured Information + Apps + Artificial Intelligence. Benchmarking, training data, test data, performance.
5. Knowledge Discovery Cycle. Data: list of simple facts / observations WITHOUT context or meaning. Information: organized data generating meaning (relationship between pieces of data). Knowledge: what we learn after information absorption. Knowledge extraction is a complex process.
6. Biological Knowledge Discovery. [Diagram labels: Raw Data; Data Access; Selection & Cleaning; Target Data; Integration; Transformed Data; Data Mining / Text Mining; Patterns / Rules; Interpretation & Evaluation; Knowledge; Understanding. Access modes: Distributed Data Access; Data Warehouse. Systems shown: SRS, ENTREZ, ATLAS.]
7. Towards « Knowledgeware ». Bioinformatics resources integration; Reasoning and decision making; Artificial Intelligence and Machine Learning; Pipelines / Workflows; Metadata-based system; Human expertise.
8. Towards « Knowledgeware ». Bioinformatics resources integration; Reasoning and decision making; Artificial Intelligence and Machine Learning; Pipelines / Workflows; Metadata-based system; Human expertise. Software: problem solver at single level. Knowledgeware (expert system): problem solver at system level.
25. Data complexity. Multidomain proteins (P53/P63/P73); too many sequences (> 10 000); errors (sequencing errors, poor predictions ..) 40 ~ 50 %; increasingly long and complex protein sequences. Complicating the construction and analysis of MSA.
28. MSA construction stages and validation (expertise exists). Co-operative algorithms: non-redundant and important approach. Thompson et al., J. Mol. Biol., 2001; Thompson and Poch, Current Bioinformatics, 2006.
29. MSA state of the art. Complex protein families: programs behave differently. CONSERVED VS DIVERGENT. No single algorithm to solve all problems: cooperative approaches.
34. Expert System Development. Specifications; Design; Problem Definition; Development; Maintenance; Evolution; Knowledge Base; Tools Choice; Data access; Analysis modules; Bug Reporting; Code optimization; Testing; Results; Modules Extensions; Cross-platform; Deployment; Exploitation.
35. Knowledge-Based Expert System Design. Users; Experts. User Interface (UI): dialog between the user and the ES. Knowledge base: domain expertise, facts used by the ES to make determinations. Working Storage: database containing data specific to a problem being solved. Inference Engine: code at the core of the system that derives recommendations. Acquisition: update or expand the knowledge base. Analysis Modules.
37. AlexSys development platform: UIMA. Expert System architecture. Type System (Data Containers), example Sequence: ID : String; Sequence : String; Length : Integer; CrossReference : String; etc. … Type System (Data Containers), example Blast: Query : String; Result : String; Evalue : float; Hits : Integer; …. Analysis Module (1 module = 1 task), example: Blast / Alignment. Structured Data; Unstructured Data.
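The two data containers and the "1 module = 1 task" idea on this slide can be sketched as follows. This is only an illustration in Python, not the UIMA API (UIMA itself is a Java framework); the class names `SequenceRecord` and `BlastResult` and the function `length_annotator` are hypothetical, and `length` is derived from the sequence here rather than stored as a separate field.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SequenceRecord:
    """Hypothetical container mirroring the 'Sequence' type on the slide."""
    id: str
    sequence: str
    cross_reference: str = ""

    @property
    def length(self) -> int:
        # Assumption: length is computed rather than stored.
        return len(self.sequence)


@dataclass
class BlastResult:
    """Hypothetical container mirroring the 'Blast' type on the slide."""
    query: str
    result: str
    evalue: float
    hits: int


def length_annotator(records: List[SequenceRecord]) -> List[int]:
    """One module = one task: this module only reports sequence lengths."""
    return [record.length for record in records]
```

Each analysis module would read one container type and write another, which keeps modules independent of one another.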
38. AlexSys Core System: Milestone 1. Development of bio-scenarios. Data access and standardization; metadata retrieval and integration (structure, function, literature, clinical studies, …); data curation and validation (prediction errors, sequence quality, …); data classification according to the analysis scenario; alignment construction (combination of different algorithms); alignment validation, refinement and quality measurement; automatic alignment annotation.
39. AlexSys: MSA Construction. Choose a suitable MSA program to align input sequences. Input/Output management module (API: Biojava) → Type System (CAS): Sequences, New Format. Sequence feature extraction modules (number, length, %ID, helix, strands, hydrophobicity, composition, etc.; BIRD, MACSIMS, Interproscan) → Type System (CAS): Features. Multiple alignment modules (incorporation of different algorithms) → Type System (CAS): Alignment. Alignment Program = f(feature1, feature2, feature3, …, featureN)? Which feature combination is « dangerous » for a given program? What makes a given program sensitive to a given feature combination?
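A few of the simpler features named above (number of sequences, length, composition-based hydrophobicity) can be computed directly from unaligned sequences. The sketch below is a minimal illustration, not the AlexSys feature modules: the function name `extract_features`, the feature names, and the residue set used for "hydrophobic" are all assumptions. Features such as %ID or secondary structure need an alignment or an external predictor and are left out.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List

# Assumption: one simple grouping of hydrophobic residues; other scales exist.
HYDROPHOBIC = set("AILMFVW")


def extract_features(seqs: List[str]) -> Dict[str, float]:
    """Compute a small feature vector for a non-empty set of protein sequences."""
    lengths = [len(s) for s in seqs]
    residues = Counter("".join(seqs))
    total = sum(residues.values())
    hydrophobic = sum(n for aa, n in residues.items() if aa in HYDROPHOBIC)
    return {
        "n_sequences": float(len(seqs)),
        "mean_length": float(mean(lengths)),
        "length_stdev": pstdev(lengths),
        "hydrophobic_fraction": hydrophobic / total,
    }
```

The resulting dictionary plays the role of the (feature1, …, featureN) vector in the question on the slide.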
40. AlexSys intelligent decision: incorporation of a machine learning step. Alignment Program = f(feature1, feature2, feature3, …, featureN)? Steps: Collect Data; Choose Features; Choose Model; Train Classifier; Evaluate Classifier.
42. AlexSys intelligent decision: incorporation of a machine learning step. Training/test set: BAliBase 3.0 (218 alignments, 6222 sequences). Reference 1: equidistant sequences with various levels of conservation. Reference 2: families aligned with a highly divergent "orphan" sequence. Reference 3: subgroups with <25% residue identity between groups. Reference 4: sequences with N/C-terminal extensions. Reference 5: internal insertions. Reference 6: repeats. Reference 7: transmembrane regions. Reference 8: circular permutations. Thompson et al., Bioinformatics, 1999; Bahr et al., Nucl. Acids Res., 2001; Thompson et al., Proteins, 2005. http://www-bio3d-igbmc.u-strasbg.fr/balibase/
48. AlexSys intelligent decision: incorporation of a machine learning step. [Diagram: unaligned sequences are aligned with each program (ProbCons, Mafft, Muscle); each alignment is compared with the BAliBase reference, giving a Sum of Pairs score between 0 and 1 per program.] All-in-one model: instances are described by attributes; the class is the program (Mafft, Probcons, Muscle). 175 sets (80%) x 6 alignment programs = 1050 operations.
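The labelling step can be sketched as: score each program's alignment against the reference with a sum-of-pairs measure, then take the best-scoring program as the class. The code below is a simplified sum-of-pairs (fraction of reference residue pairs reproduced in the test alignment) written for illustration; the benchmark's own scoring program may differ, for example by restricting scoring to core blocks. The function names are hypothetical, and both alignments are assumed to list the same sequences in the same order.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# (sequence i, residue index in i, sequence j, residue index in j), with i < j
Pair = Tuple[int, int, int, int]


def aligned_pairs(alignment: List[str]) -> Set[Pair]:
    """Collect every pair of residues that share a column ('-' is a gap)."""
    counters = [0] * len(alignment)
    pairs: Set[Pair] = set()
    for col in range(len(alignment[0])):
        present = []
        for i, row in enumerate(alignment):
            if row[col] != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        for (i, a), (j, b) in combinations(present, 2):
            pairs.add((i, a, j, b))
    return pairs


def sum_of_pairs(test: List[str], reference: List[str]) -> float:
    """Fraction of reference residue pairs also aligned in the test alignment."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(aligned_pairs(test) & ref) / len(ref)


def best_program(scores: Dict[str, float]) -> str:
    """Class label for one training instance: the highest-scoring program."""
    return max(scores, key=scores.get)
```

With one score per program for each of the 175 training sets, `best_program` would yield the class column of the all-in-one model.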
49. AlexSys intelligent decision: incorporation of a machine learning step. All-in-one model: class is Mafft, Probcons or Muscle. Unknown sequences → Machine Learning Model → which class?
54. Decision Trees; Bayesian Methods; Hidden Markov Models; Support Vector Machines; Neural Networks; Clustering; Genetic Algorithms; Association Rules; Reinforcement Learning; Fuzzy Sets. Decision Trees: can be applied to metric, nominal, or mixed data.
55. AlexSys intelligent decision: incorporation of a machine learning step. Alignment Program = f(feature1, feature2, feature3, …, featureN)? Collect Data; Choose Features; Choose Model: J48 / RandomTree / Random Forest; Train Classifier: train set / 10-fold cross validation; Evaluate Classifier: test set / performance.
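The 10-fold cross validation mentioned for the training step splits the training instances into ten parts and uses each part once for validation. The sketch below shows only the index bookkeeping in plain Python; `k_fold_indices` is a hypothetical helper, and the shuffling seed and round-robin fold assignment are assumptions, not the procedure of any particular toolkit.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, validation_indices) splits over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one instance of each other.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, validation))
    return splits
```

For example, `k_fold_indices(175)` produces ten splits over a training set of 175 instances, each holding out 17 or 18 of them.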
56. AlexSys intelligent decision: incorporation of a machine learning step. All-in-one model, C4.5 (J48): correctly classified alignments ~ 42 %. Evaluation from TP, FP and FN counts: precision P = TP/(TP+FP); recall R = TP/(TP+FN); F-measure = 2x(PxR)/(P+R).
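The three formulas on this slide translate directly into code. The sketch below is a plain-Python illustration; the function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something stated on the slide.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F = 2 x (P x R) / (P + R)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

With made-up counts TP = 42, FP = 28, FN = 18 (not results from the talk), precision is 0.6 and recall is 0.7.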
57. AlexSys intelligent decision: incorporation of a machine learning step. All-in-one model, RandomTree: correctly classified alignments ~ 41 %.
58. AlexSys intelligent decision: incorporation of a machine learning step. All-in-one model, Random Forest: correctly classified alignments ~ 52 %.
59. AlexSys intelligent decision: incorporation of a machine learning step. Not accurate. Training set too small, not representative? Not enough features? Complex multi-dimensional model? Alignment programs are difficult to distinguish in some cases?
60. AlexSys intelligent decision: incorporation of a machine learning step. Alignment Program = f(feature1, feature2, feature3, …, featureN)? ADD MORE data: Oxbench (605 alignments, 3656 sequences); BAliBase 4.0 (new) (240 alignments, 19806 sequences); total: 1063 alignments, 29684 sequences. Choose Features; Choose Model; Train Classifier; Evaluate Classifier: not yet accurate.
61. AlexSys intelligent decision: incorporation of a machine learning step. Alignment Program = f(feature1, feature2, feature3, …, featureN)? becomes Alignment Quality = f(feature1, feature2, feature3, …, featureN)? Collect Data; Choose Features; CHANGE Model: binary classification models for alignment strength; Train Classifier; Evaluate Classifier.
64. AlexSys intelligent decision. Unaligned sequences + one binary model per program (ClustalW, Mafft, Dialign, Muscle, Kalign, Probcons). Mode 1: predictive, based on probability. Mode 2: intuitive, based on If rules. Example predictions: Pr(Probcons) = Strong; Pr(Mafft) = Strong; Pr(Dialign) = Weak; Pr(ClustalW) = Weak; Pr(Kalign) = Strong; Pr(Muscle) = Weak. Decision inside AlexSys: combine separate predictions into a single decision.
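Combining the per-program Strong/Weak predictions into one choice could look like the sketch below. The slide does not give the actual probabilities or rules, so both functions are hypothetical: Mode 1 is approximated as "pick the program with the highest predicted probability of a strong alignment", and Mode 2 as "pick the first program in a fixed preference order that is predicted Strong". The preference order and the fallback to the first preferred program are assumptions.

```python
from typing import Dict, List


def decide_by_probability(p_strong: Dict[str, float]) -> str:
    """Mode 1 (assumed): highest predicted probability of a strong alignment."""
    return max(p_strong, key=p_strong.get)


def decide_by_rules(labels: Dict[str, str], preference: List[str]) -> str:
    """Mode 2 (assumed): first preferred program labelled 'Strong'."""
    strong = [program for program in preference if labels.get(program) == "Strong"]
    # Assumption: fall back to the most preferred program if none is Strong.
    return strong[0] if strong else preference[0]
```

Using the labels shown on the slide with a hypothetical preference order of Mafft, Probcons, Kalign, Muscle, ClustalW, Dialign, the rule-based mode selects Mafft.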
65. AlexSys Intelligent System: Milestone 2. Analysis Engine to predict a suitable program for an unknown set of sequences: the Aligner Predictor.
78. Objectives achieved: core system incorporating different, complementary algorithms; understanding of relationships between sequence characteristics and algorithmic strengths and weaknesses; development of a system that can automatically define which algorithm to use, depending on the sequence features, using an Intelligent Engine; application in a high-throughput project.
80. Use knowledge gained to improve algorithms for alignment construction (ClustalW/X, …)
96. Information hierarchical classification (Luciano Floridi). Unstructured Data; Data (Structured). Types of data: Primary Information (info in databases …); Secondary Information (presence / absence …); Meta Information (copyright …); Operational Information (info about IS dynamics); Derivative Information (comparative/quantitative analyses). Structured data is Environmental or Semantic (content); semantic content is Instructional or Factual; factual content is Untrue or True (Information); untrue content is Unintentional (Misinformation) or Intentional (Disinformation); true information leads to Knowledge.
My PhD work is placed in the general context of information management and knowledge discovery. We live in a world where useful or useless information comes from everywhere: books, journals, magazines, audio and video sources and, especially since the Internet revolution, unexpected sources such as social networks and informational websites, which are only the means by which humans exchange their data and information. This whole universe constitutes the infosphere, where man is trapped in his own game and today faces many problems and challenges in managing this amount of data.
Biology, like many other fields, is a clear illustration of these data management challenges, especially given the rapid growth and evolution of computing and infrastructure, which has a direct effect on the creation of new biological data types and concepts.
Between the KB: learning. Tool choices are more complicated: several tools for the same thing; integrate everything.