Knowledge Based Expert System development in Bioinformatics applied to Multiple Sequence Alignment
Mohamed Radhouane Aniba, Laboratoire de Bioinformatique et de Génomique Intégratives
Supervisors: Julie Thompson, Aron Marchler-Bauer
15/09/2010
Outline
INTRODUCTION
 Infosphere and Knowledge Discovery
 Biological Knowledge Discovery
 Data Integration
 Knowledge Based Expert Systems in Bioinformatics
 KBS: Application to Multiple Sequence Alignment
ALIGNMENT EXPERT SYSTEM: AlexSys
 Design
 Implementation
 Evaluation
CONCLUSIONS AND PERSPECTIVES
Key points:
 When the Information Age meets the Postgenomic Era
 Data storage, warehousing and quality
 From data integration to knowledge discovery… Challenges
 Why do we need them?
 Ideal case study?
 Data/text mining, machine learning, knowledgeware
 Not a software, not a workflow …
 Bioinformatics mash-up: unstructured information + apps + artificial intelligence
 Benchmarking, training data, test data, performance
Infosphere and Knowledge Discovery
Biological Infosphere http://genomics.energy.gov
Knowledge Discovery Cycle
 List of simple facts / observations WITHOUT context or meaning
 What we learn after information absorption
 Organized data generating meaning (relationship between pieces of data)
 Knowledge extraction is a complex process
Biological Knowledge Discovery
[Diagram. Stage labels: Raw Data, Selection & Cleaning, Target Data, Transformed Data, Data Mining / Text Mining, Patterns / Rules, Interpretation & Evaluation, Understanding, Knowledge, Integration. Access labels: Data Warehouse, Distributed Data Access, Data Access. Systems named: SRS, ENTREZ, ATLAS.]
Towards « Knowledgeware »
 Bioinformatics resources integration
 Reasoning and decision making
 Artificial Intelligence and Machine Learning
 Pipelines / Workflows
 Metadata-based system
 Human expertise
Towards « Knowledgeware »
 Bioinformatics resources integration
 Reasoning and decision making
 Artificial Intelligence and Machine Learning
 Pipelines / Workflows
 Metadata-based system
 Human expertise
 Software: problem solver at single level
 Knowledgeware (expert system): problem solver at system level
Expert Systems in Bioinformatics, Why?
 Stop talking about AMOUNT OF DATA!!
 Bioinformatics: the real problem is data dynamics and complexity
 Data origins and evolution
 Tons of answers to one single question
 Necessity to rely on models and gold standards => Comparison
 Comparison = Conservation + Differences
 We need standards! We need to learn from the past and to make the right predictions
 Bioinformatics: large amount of data + data complexity and dynamics + large number of software tools and algorithms + intelligent decisions
Expert System for MSA
 Strategic application: impact on other fields
MSA complexity
 Sequence number forces process automation
 Noisy data (error propagation)
 Data complexity
 Multidomain proteins: P53/P63/P73
 Too many sequences (> 10 000)
 Errors (sequencing errors, poor predictions ...): 40 ~ 50 %
 More and more long and complex protein sequences
 Complicating the construction and analysis of MSA
MSA Algorithm Evolution
 MSA: a mature field
 MSA construction stages and validation (expertise exists)
 Co-operative algorithms: non-redundant and important approach
 Thompson et al., J. Mol. Biol., 2001; Thompson and Poch, Current Bioinformatics, 2006
MSA state of the art
 Complex protein families: programs behave differently
 CONSERVED vs DIVERGENT
 No single algorithm to solve all problems: cooperative approaches
Thesis Objectives: AlexSys Specification
Objectives
 Milestone 2: Automatically define algorithms to use in each step of multiple sequence alignment construction based on intelligent decisions
 Milestone 3: Understand relationships between sequence characteristics and algorithmic strengths and weaknesses
 Milestone 4: Develop different analysis protocols dedicated to different applications (Bio-Scenarios: comparative genomics, functional annotation, 3D modeling, evolutionary studies …)
Expert System Development
[Diagram labels: Specifications, Design, Problem Definition, Development, Maintenance, Evolution, Knowledge Base, Tools Choice, Data access, Analysis modules, Bug Reporting, Code optimization, Testing, Results, Modules Extensions, Cross platform, Deployment, Exploitation]
Knowledge Based Expert System Design
 Components: Users, User Interface, Analysis Modules, Inference Engine, Knowledge base, Working Storage, Acquisition, Experts
 Domain expertise: facts used by the ES to make determinations
 Database containing data specific to the problem being solved
 Inference Engine: code at the core of the system that derives recommendations
 Update or expand the knowledge base
 UI: dialog between the user and the ES
AlexSys development platform
 Development alternatives: time consuming, not easy to maintain (C, Prolog …)
 UIMA: Unstructured Information Management Architecture
 Scalable and extensible platform
 Deployment of unstructured information management solutions
 UIMA advantages: ready-to-use architecture; module-oriented; services and development tools; data-driven flows; XML-based components; active community; Apache incubator project; wide support; Java programming language
AlexSys development platform: UIMA Expert System architecture
 Type System (Data Containers), example Sequence: ID : String; Sequence : String; Length : Integer; CrossReference : String; etc.
 Type System (Data Containers), example Blast: Query : String; Result : String; Evalue : float; Hits : Integer; …
 Analysis Module (1 module = 1 task), example: Blast / Alignment
 Structured Data; Unstructured Data
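The two data containers on this slide can be pictured as typed records. The following is a minimal sketch in Python, assuming plain dataclasses stand in for real UIMA type-system descriptors (which the deck describes as XML-based). Field names follow the slide; the `WorkingStore` class and the sample values are hypothetical and only illustrate how analysis modules might share data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SequenceRecord:
    """Mirrors the 'Sequence' container from the slide."""
    id: str
    sequence: str
    cross_reference: str = ""

    @property
    def length(self) -> int:
        # Derived from the sequence here instead of being stored (a simplification).
        return len(self.sequence)


@dataclass
class BlastRecord:
    """Mirrors the 'Blast' container from the slide."""
    query: str
    result: str
    evalue: float
    hits: int


@dataclass
class WorkingStore:
    """Hypothetical shared store that analysis modules read from and write to."""
    sequences: List[SequenceRecord] = field(default_factory=list)
    blast_results: List[BlastRecord] = field(default_factory=list)


# Made-up example values, for illustration only.
store = WorkingStore()
store.sequences.append(SequenceRecord(id="seq1", sequence="MKTAYIAKQR"))
store.blast_results.append(
    BlastRecord(query="seq1", result="(raw report)", evalue=1e-30, hits=12)
)
```

Keeping one record type per kind of result matches the "1 module = 1 task" idea: each module only needs to know the containers it consumes and produces.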
AlexSys Core System: Milestone 1
 Development of Bio-scenarios
 Data access and standardization
 Metadata retrieval and integration (structure, function, literature, clinical studies, …)
 Data curation and validation (prediction errors, sequence quality …)
 Data classification according to the analysis scenario
 Alignment construction (combination of different algorithms)
 Alignment validation, refinement and quality measurement
 Alignment automatic annotation
AlexSys: MSA Construction
 Choose a suitable MSA program to align input sequences
 Input/Output management module (API: Biojava)
 Sequence feature extraction modules: number, length, %ID, helix, strands, hydrophobicity, composition, etc. (BIRD, MACSIMS, Interproscan)
 Multiple alignment modules: incorporation of different algorithms
 Type System (CAS): Sequences, New Format, Features, Alignment
 Alignment Program = f(feature1, feature2, feature3 … featureN) ?
 Which feature combination is « dangerous » for a given program?
 What makes a given program sensitive to a given feature combination?
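To make "Alignment Program = f(features)" concrete, here is a minimal sketch of a feature extraction step, assuming only a few of the listed features (number, length, %ID, hydrophobicity). The hydrophobic residue set and the ungapped position-by-position identity are simplifying assumptions, not the method used by BIRD, MACSIMS or Interproscan.

```python
from itertools import combinations
from statistics import mean

HYDROPHOBIC = set("AVILMFWC")  # assumed residue set, for illustration only


def percent_identity(a: str, b: str) -> float:
    """Crude identity over the shorter length, with no alignment step (assumption)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / n


def extract_features(seqs):
    """Summarise a non-empty set of unaligned sequences as a flat feature dict."""
    lengths = [len(s) for s in seqs]
    pairs = list(combinations(seqs, 2))
    total = sum(lengths)
    return {
        "number": len(seqs),
        "mean_length": mean(lengths),
        "mean_pct_id": mean(percent_identity(a, b) for a, b in pairs) if pairs else 100.0,
        "hydrophobic_fraction": sum(ch in HYDROPHOBIC for s in seqs for ch in s) / total,
    }


# Toy input, not taken from any benchmark.
features = extract_features(["MKVLA", "MKILA", "MRVLG"])
```

The resulting dictionary is one row of the attribute table that a classifier would consume.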
AlexSys intelligent decision: incorporation of machine learning step
 Alignment Program = f(feature1, feature2, feature3 … featureN) ?
 Steps: Collect Data, Choose Features, Choose Model, Train Classifier, Evaluate Classifier
AlexSys intelligent decision: incorporation of machine learning step
Training/Test set: BAliBase 3.0
 Reference 1: equi-distant sequences with various levels of conservation
 Reference 2: families aligned with a highly divergent "orphan" sequence
 Reference 3: subgroups with <25% residue identity between groups
 Reference 4: sequences with N/C-terminal extensions
 Reference 5: internal insertions
 Reference 6: repeats
 Reference 7: transmembrane regions
 Reference 8: circular permutations
 218 alignments, 6222 sequences
 Thompson et al., Bioinformatics, 1999; Bahr et al., Nucl Acids Res, 2001; Thompson et al., Proteins, 2005
 http://www-bio3d-igbmc.u-strasbg.fr/balibase/
AlexSys intelligent decision: incorporation of machine learning step
 Experience with alignment benchmarks
 Feature selection
AlexSys intelligent decision: incorporation of machine learning step
 Unaligned sequences are aligned with ProbCons, Mafft and Muscle
 Each alignment is compared with the reference (BAliBase): Sum of Pairs score between 0 and 1
 All in one model: Instances, Attributes, Class (Mafft, Probcons, Muscle)
 175 sets (80%) x 6 alignment programs = 1050 operations
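The scoring step can be sketched as follows. This is a simplified Sum of Pairs score, assuming alignments are given as equal-length gapped strings with "-" as the gap character and that every column of the reference counts (benchmark scoring tools may restrict the score to selected regions). Labelling each training instance with the highest-scoring program is an assumption about how the class attribute is derived; the program scores in the example are made up.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, residue index).
    """
    counters = [0] * len(alignment)  # next residue index for each sequence
    pairs = set()
    for column in zip(*alignment):
        present = []
        for i, ch in enumerate(column):
            if ch != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment (0 to 1)."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(aligned_pairs(test) & ref) / len(ref)


def best_program(scores):
    """Pick the program with the highest score as the class label (assumption)."""
    return max(scores, key=scores.get)


reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
sp = sum_of_pairs(test, reference)

# Hypothetical scores for one benchmark set.
label = best_program({"Mafft": 0.81, "Probcons": 0.86, "Muscle": 0.79})
```

In the toy example the test alignment reproduces three of the four residue pairs found in the reference.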
AlexSys intelligent decision: incorporation of machine learning step
 All in one model: Class (Mafft, Probcons, Muscle)
 Unknown sequences are passed to the machine learning model: which class?
AlexSys intelligent decision: incorporation of machine learning step
 Candidate methods: Decision Trees, Bayesian Methods, Hidden Markov Models, Support Vector Machines, Neural Networks, Clustering, Genetic Algorithms, Association Rules, Reinforcement Learning, Fuzzy Sets
 Selected: Decision Trees
 Decision trees are understandable by humans
 Trees are easily converted to rules
 Simple learning procedure, fast evaluation
 Can be applied to metric, nominal, or mixed data
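The point that trees convert easily to rules can be illustrated with a short sketch. The nested-dictionary tree format, the feature names and the thresholds below are all invented for the example; this is not a tree learned by AlexSys or produced by any particular library.

```python
def tree_to_rules(node, conditions=()):
    """Flatten a nested-dict decision tree into IF/THEN rule strings."""
    if "label" in node:  # leaf node: emit one rule for the path taken so far
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    feat, thr = node["feature"], node["threshold"]
    return (
        tree_to_rules(node["left"], conditions + (f"{feat} <= {thr}",))
        + tree_to_rules(node["right"], conditions + (f"{feat} > {thr}",))
    )


# Hypothetical tree: thresholds and labels are made up.
toy_tree = {
    "feature": "mean_pct_id",
    "threshold": 30,
    "left": {
        "feature": "number",
        "threshold": 50,
        "left": {"label": "Probcons"},
        "right": {"label": "Mafft"},
    },
    "right": {"label": "Muscle"},
}

rules = tree_to_rules(toy_tree)
```

Each root-to-leaf path becomes one rule, which is what makes the model readable by a domain expert.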
AlexSys intelligent decision: incorporation of machine learning step
 Alignment Program = f(feature1, feature2, feature3 … featureN) ?
 Collect Data
 Choose Features
 Choose Model: J48 / RandomTree / Random Forest
 Train Classifier: train set / 10-fold cross validation
 Evaluate Classifier: test set / performance
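The 10-fold cross validation mentioned for the training step can be sketched as a plain index split. This is a minimal, unstratified version using only the standard library, assuming the 175 training sets quoted earlier in the deck; the function name and the fixed seed are illustrative choices, and a machine learning toolkit would normally provide this step.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(175, k=10))
```

Every instance is used for testing exactly once and for training in the other nine folds, so the performance estimate does not depend on a single train/test partition.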

AlexSys intelligent decision: incorporation of machine learning step
 All in one model, C4.5 (J48): correctly classified alignments ~ 42 %
 Counts: TP (true positives), FP (false positives), FN (false negatives)
 Precision P = TP/(TP+FP)
 Recall R = TP/(TP+FN)
 F-measure = 2x(PxR)/(P+R)
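The three formulas on this slide are the standard precision, recall and F-measure. A small sketch shows how they combine; the counts used are made up for illustration and are not results from the deck.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall: 2 x (P x R) / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one class.
p = precision(tp=30, fp=10)
r = recall(tp=30, fn=20)
f = f_measure(p, r)
```

Because the F-measure is a harmonic mean, it stays low whenever either precision or recall is low.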
  • 57. AlexSys intelligent decision: incorporation of a machine learning step. All-in-one model, RandomTree. Correctly classified alignments ~ 41 %
  • 58. AlexSys intelligent decision: incorporation of a machine learning step. All-in-one model, Random Forest. Correctly classified alignments ~ 52 %
  • 59. AlexSys intelligent decision: incorporation of a machine learning step. Not accurate: Training set too small, not representative? Not enough features? Complex multi-dimensional model? Alignment programs are difficult to distinguish in some cases?
  • 60. AlexSys intelligent decision: incorporation of a machine learning step. Alignment Program = f(feature1, feature2, feature3, …, featureN)? Steps: ADD MORE Data (1063 alignments, 29684 sequences), including OXBench (605 alignments, 3656 sequences) and BAliBASE 4.0 (new; 240 alignments, 19806 sequences); Choose Features; Choose Model; Train Classifier; Evaluate Classifier (not yet accurate)
  • 61. AlexSys intelligent decision: incorporation of a machine learning step. Alignment Program = f(feature1, feature2, feature3, …, featureN)? becomes Alignment Quality = f(feature1, feature2, feature3, …, featureN)? Steps: Collect Data; Choose Features; CHANGE Model (binary classification models for alignment strength); Train Classifier; Evaluate Classifier
  • 62. AlexSys intelligent decision: incorporation of a machine learning step
  • 63. AlexSys intelligent decision: incorporation of a machine learning step
  • 64. AlexSys intelligent decision. Unaligned sequences are assessed for each program (ClustalW, Mafft, Dialign, Muscle, Kalign, Probcons). Mode 1: predictive, based on probability. Mode 2: intuitive, based on rules. Example: Pr(Probcons) = Strong, Pr(Mafft) = Strong, Pr(Dialign) = Weak, Pr(ClustalW) = Weak, Pr(Kalign) = Strong, Pr(Muscle) = Weak. Decision inside AlexSys: combine separate predictions into a single decision
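One way to combine the per-program predictions into a single decision is sketched below. The slide does not spell out the actual AlexSys logic, so everything here is an assumption: `decide_by_probability` stands for Mode 1 (pick the program with the highest predicted probability of being "Strong"), `decide_by_rules` stands for Mode 2 (pick the first "Strong" program in a preference order), and both the probabilities and the preference order are hypothetical.

```python
def decide_by_probability(prob_strong):
    """Mode 1 (assumed): program with the highest probability of 'Strong'."""
    return max(prob_strong, key=prob_strong.get)


def decide_by_rules(labels, preference):
    """Mode 2 (assumed): first program in `preference` labelled 'Strong'.

    Returns None when every program is predicted 'Weak', which a caller
    could treat as a warning that the set may be hard to align.
    """
    for program in preference:
        if labels.get(program) == "Strong":
            return program
    return None


# Strong/Weak labels taken from the slide's example.
labels = {
    "Probcons": "Strong", "Mafft": "Strong", "Dialign": "Weak",
    "ClustalW": "Weak", "Kalign": "Strong", "Muscle": "Weak",
}
# Hypothetical preference order, for example fastest program first.
preference = ["Mafft", "Kalign", "Muscle", "ClustalW", "Probcons", "Dialign"]
rule_choice = decide_by_rules(labels, preference)

# Hypothetical probabilities for Mode 1.
prob_strong = {"Probcons": 0.81, "Mafft": 0.77, "Kalign": 0.64, "Muscle": 0.35}
prob_choice = decide_by_probability(prob_strong)
```

Returning `None` rather than a default program keeps the "all programs weak" case visible to the caller.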
  • 65. AlexSys Intelligent System: Milestone 2. Analysis engine to predict a suitable program for an unknown set of sequences (Aligner Predictor)
  • 66. AlexSys evaluation: MSA accuracy
  • 67. Exploitation: Milestone 3 (sequence/algorithm relationship). RadViz (Radial Visualization): ClustalW
  • 68. Exploitation: Milestone 3 (sequence/algorithm relationship). RadViz (Radial Visualization): Mafft, ClustalW, Probcons, Dialign
  • 69. Exploitation: Milestone 3. Detecting unalignable sequence sets. High SP scores versus low SP scores; for alignments with low SP scores, all programs fail
  • 72. Search for homologs in 20 vertebrate species
  • 74. Reconstruct phylogenetic trees and genetic events. Domain organisation; reference genome; gene order; exon shuffling; duplication; insertion
  • 75. Exploitation: high-throughput project. 800 random MSAs (EvolHHuPro); 16/800 (2%) predicted to be « weak ». Example query ANR60_HUMAN (ankyrin repeats, ZU5, Death): AlexSys[Mafft] ~ 0.4 => either choose another program, or issue a warning « unalignable »
  • 77. We developed a novel system for MSA construction and validation
  • 78. Objectives achieved: core system, incorporating different, complementary algorithms; understanding of relationships between sequence characteristics and algorithmic strengths and weaknesses; development of a system that can automatically define which algorithm to use depending on the sequence features, using an Intelligent Engine; application in a high-throughput project
  • 80. Use knowledge gained to improve algorithms for alignment construction (ClustalW/X, …)
  • 81. Integration of additional algorithms (transmembrane, repeats, disordered regions, motif detection, …)
  • 82. Integration of additional data (domains, 3D structures, function, mutation, …)
  • 83. Integration of information from the literature (exploitation of UIMA)
  • 84. Extend knowledge base (BAliBASE, feature investigation)
  • 85. Develop Bio-Scenarios for specific tasks/projects
  • 87. Dedicated system design: bioinformatics problems
  • 88. Human expertise needs to be formalized (ontologies, logic programming, …)
  • 89. Dynamic, evolving: integration of new, useful data and algorithms as they are developed
  • 90. Evaluation of the quality of input data and results (Objective functions)
  • 92. Cloud computing: Amazon EC2, IBM, Google, Sun … (BlastReduce, Biodoop, CloudBurst, CloudBLAST, …)
  • 93. Create a community for ES in bioinformatics (standards development and open projects)
  • 94. Acknowledgement: Julie Thompson, Aron Marchler-Bauer
  • 95. Infosphere and Knowledge Discovery
  • 96. Information hierarchical classification (Luciano Floridi). Data: unstructured or structured. Structured data: environmental or semantic (content). Semantic content: instructional or factual. Factual content: untrue or true (information). Untrue: unintentional (misinformation) or intentional (disinformation). True information leads to knowledge. Information types: primary information (info in databases, …); secondary information (presence/absence, …); meta information (copyright, …); operational information (info about IS dynamics); derivative information (comparative/quantitative analyses)
  • 97. AlexSys prototype testing
  • 98. Transmembrane, repeats: Mafft, Dialign, Probcons
  • 99. Sequence/algorithm relationship: Milestone 3

Editor's Notes

  1. My PhD work is set in the context of information management and knowledge discovery. We live in a world where information, useful or not, comes from everywhere: from books, journals, magazines, audio and video sources and, especially since the Internet revolution, from unexpected sources such as social networks and informational websites, which are simply the means by which humans exchange their data and information. This whole universe constitutes the infosphere, in which man is trapped in his own game and today faces many problems and challenges in managing this amount of data.
  2. Biology, like many other fields, clearly illustrates these data management challenges, especially given the rapid growth and evolution of computation and infrastructure, which has a direct effect on the creation of new biological data types and concepts.
  3. Within the KB: learning. Tool choices are more complicated: several tools for the same task: integrate everything.