A Study of Random Forests Learning Mechanism with
 Application to the Identification of Informative Gene
           Interactions in Microarray Data

Jorge M. Arevalillo and Hilario Navarro
Dpt. Statistics and Operational Research
Universidad Nacional de Educación a Distancia




Salford Analytics and Data Mining Conference 2012. San Diego
Outline
      Weak Marginal / Strong bivariate genetic interactions


      RF learning mechanism


      RF bivariate interaction detector procedure


          Controlling the curse of dimensionality
          Handling the small sample effect


      Application to microarray data


      Conclusions


2   Jorge M. Arevalillo and H. Navarro. RF Learning Mechanism with Application to the Identification of Gene Interactions
Human Genetics Basics
 DNA is often described as the blueprint of living organisms. It is composed of
two complementary strands of nucleotides (A-T, C-G)

   Adenine (A) pairs with thymine (T) and cytosine (C) with guanine (G)

 Basically, a gene is a piece of the DNA that contains the genetic information for
the synthesis of a protein

                                                  The human genome in numbers

                                                  23 pairs of chromosomes
                                                  2 meters of DNA
                                                  A sequence about 3 billion base pairs (bp) long
                                                  30000 – 40000 genes
                                                  Over 99% of the genome is identical in all
                                                 human beings



The central dogma of molecular biology
                                                The expression of the genetic information stored
                                               in the DNA occurs in two stages

                                               •TRANSCRIPTION. At this stage DNA is transcribed into
                                               messenger RNA (mRNA).
                                               •TRANSLATION. At this stage mRNA is transported to the cell
                                               cytoplasm and translated to produce a protein

                                                 Amino acids are used to construct proteins which
                                               in turn will determine the observed phenotype

 DNA microarray technologies make it possible to measure the abundance of mRNA by
monitoring the expression levels of hundreds or thousands of genes under different
conditions of the phenotype




Weak marginal / Strong bivariate genetic
interactions
 In binary classification we define a WM/SB bivariate gene to gene interaction as a
pair of variables (genes) whose joint distribution discriminates the outcome but whose
marginal distributions are irrelevant for class separation
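
 The definition can be illustrated with a minimal synthetic sketch. It assumes two inputs drawn uniformly on [-1, 1] and an XOR-type class label (Y = 1 when the signs differ); the helper names `make_xor_sample` and `accuracy` are hypothetical and only serve the illustration. Thresholding either gene alone stays near chance level, while a rule that uses both genes separates the classes.

```python
import random

def make_xor_sample(n, seed=0):
    """Draw n cases with two inputs uniform on [-1, 1]; Y = 1 when the signs differ."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        y = 1 if (x1 > 0) != (x2 > 0) else 0
        rows.append((x1, x2, y))
    return rows

def accuracy(rows, rule):
    """Fraction of cases for which rule(x1, x2) reproduces the class label."""
    return sum(rule(x1, x2) == y for x1, x2, y in rows) / len(rows)

rows = make_xor_sample(2000)

# Marginal rules: threshold a single gene at zero (weak marginal signal).
acc_x1 = accuracy(rows, lambda x1, x2: int(x1 > 0))
acc_x2 = accuracy(rows, lambda x1, x2: int(x2 > 0))
# Joint rule: uses both genes at once (strong bivariate signal).
acc_joint = accuracy(rows, lambda x1, x2: int((x1 > 0) != (x2 > 0)))

print(f"x1 alone: {acc_x1:.2f}, x2 alone: {acc_x2:.2f}, joint: {acc_joint:.2f}")
```

 Each single-gene rule is right only about half of the time, whereas the joint rule reproduces the label exactly because it mirrors how the label was generated.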




RF learning mechanism

                                           Random Forest is an ensemble of decision trees
                                          grown in a special way

                                            Randomness is injected into the RF mechanism by
                                          bootstrap resampling to grow each tree in the forest
                                          and by searching for the best splitter at each node within
                                          a randomly selected subset of inputs


 The number ntree of trees in the forest and the number R of candidate inputs for
splitting each node must be set in advance. Defaults: ntree = 500 and R = square
root of the number p of inputs

  Each tree is grown on a bootstrap sample that contains about 63% of the distinct
cases. The classification error rate is estimated on the remaining 37% or so of
left-out cases. The error rate evaluated on these out-of-bag cases is called the oob error
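
 The 63% / 37% split follows from sampling n cases with replacement: a given case is left out with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.368. The sketch below checks this numerically. The function names are hypothetical, and rounding the square root down in `default_mtry` is an assumption, since the text only states R = square root of p.

```python
import math
import random

def in_bag_fraction(n, seed=0):
    """Fraction of distinct cases that appear in one bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

def default_mtry(p):
    """Default number R of candidate inputs per node: sqrt(p), rounded down (assumed)."""
    return max(1, math.floor(math.sqrt(p)))

n = 10_000
expected = 1 - (1 - 1 / n) ** n   # exact expectation, tends to 1 - 1/e
simulated = in_bag_fraction(n)

print(f"expected in-bag fraction: {expected:.3f}, simulated: {simulated:.3f}")
print("R for p = 2000 inputs:", default_mtry(2000))
```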


RF table of variable importance

  The high dimensional nature of the data obtained from gene expression microarray
 experiments has created the need for variable selection procedures that separate
 relevant predictors (genes), which carry useful information for classifying the
 phenotype, from irrelevant predictors (genes)

  RF generates variable importance measures that make it possible to rank predictors
 according to their contribution to the predictive accuracy of the ensemble

    RF gives two measures of variable importance

      •      GINI MEASURE. Each variable is assigned a score that accumulates all the improvements
             in the Gini index over all the nodes of the trees in the forest that use the variable as the
             splitting variable

      •      PERMUTATION BASED MEASURE. For each variable, its values are randomly permuted across
             the cases to produce a noisy predictor; this noisy predictor is used in place of the original
             one and the oob error is computed again. The importance of the variable is defined as the
             difference between the oob errors after and before permutation
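
 The permutation based measure can be sketched in a few lines. This is not the RF implementation: a fixed threshold rule stands in for the fitted forest and the whole toy sample stands in for the oob cases (both are assumptions made to keep the example self-contained). The function names are hypothetical.

```python
import random

def error_rate(predict, X, y):
    """Misclassification rate of predict on the cases (X, y)."""
    return sum(predict(row) != label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, column, seed=0):
    """Error after permuting one column minus the error before permuting it."""
    rng = random.Random(seed)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
    return error_rate(predict, X_perm, y) - error_rate(predict, X, y)

# Toy data: column 0 carries the class signal, column 1 is pure noise.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(1000)]
y = [int(row[0] > 0) for row in X]

# Stand-in for a fitted classifier (hypothetical; a forest would be used in practice).
predict = lambda row: int(row[0] > 0)

imp_signal = permutation_importance(predict, X, y, column=0)
imp_noise = permutation_importance(predict, X, y, column=1)
print(f"importance of column 0: {imp_signal:.2f}, column 1: {imp_noise:.2f}")
```

 Permuting the informative column pushes the error towards chance level, so its importance is large; permuting the noise column changes nothing because the rule never uses it.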


The oob error rate degradation in high
dimensional settings
     An extreme synthetic example. XOR interaction pattern




     The oob error rate degrades rapidly as the number of noisy inputs
    increases; hence the XOR signal will be lost
     The interaction is captured as long as it appears alone, without the disturbance of
    the noisy inputs; so an exhaustive search among all the pairs of inputs is required if
    we want the RF learning mechanism to detect the interaction
     Our proposal offers shortcuts and devices that simplify the search


Search procedure. Sequential stage
 RF ranking of variable importance gives new insights regarding the degradation
of the oob error rate

 Some alternatives, Díaz Uriarte (2006) and Genuer (2008), that explore this
ranking in a sequential manner have been proposed to identify relevant patterns
correlated to the outcome




Search procedure. Hunting stage
 The second stage is designed to hunt for hard-to-uncover bivariate associations,
which are missed by sequential search strategies

 The idea is to group the inputs in blocks, then use the oob error of an RF run on all
the variables belonging to each pair of blocks to highlight the block matches
where the WM / SB interactions are more likely to appear. This will limit the search
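
 A rough count shows how the block strategy limits the search. The sketch assumes that inputs are assigned to consecutive blocks and that only pairs of distinct blocks are matched; neither detail is stated here. The figures of 1700 inputs and block size 5 are borrowed from the colon cancer application reported later in the deck, and the gene labels are hypothetical.

```python
from itertools import combinations
from math import comb

def make_blocks(variables, block_size):
    """Split the list of inputs into consecutive blocks of block_size inputs."""
    return [variables[i:i + block_size] for i in range(0, len(variables), block_size)]

def block_matches(blocks):
    """Yield (i, j, inputs) for every pair of distinct blocks i < j."""
    for i, j in combinations(range(len(blocks)), 2):
        yield i, j, blocks[i] + blocks[j]

genes = [f"g{idx}" for idx in range(1700)]   # hypothetical gene labels
blocks = make_blocks(genes, block_size=5)
matches = list(block_matches(blocks))

print("blocks:", len(blocks))
print("block matches (one RF run each):", len(matches))
print("inputs per match:", len(matches[0][2]))
print("all gene pairs:", comb(len(genes), 2))
```

 Under these assumptions there are 340 blocks and 57630 block matches of 10 inputs each, against 1444150 individual gene pairs in an exhaustive search.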
 [Figure: the inputs are grouped in blocks; each match (i,j) of Block i with Block j is scored, which yields a ranking of block matches]




Drawback with the oob error rate
        Simulation experiment with block size = 6

        The boxplots show that the oob error rate cannot distinguish between block
       matches containing a weak marginal / strong bivariate association and block
       matches with only noisy inputs

        [Figure: boxplots of the oob error rate for block matches containing the XOR
        interaction versus block matches with only NOISY INPUTS. Left panel: sample sizes
        (40,40), overlap=0.31. Right panel: sample sizes (40,20), overlap=0.42]

        The curses of dimensionality and low sample size are coming up again




Data augmentation

 To overcome this drawback, the data are artificially augmented and then the oob
error rate of an RF run on the augmented data is computed

    Data perturbation is carried out according to the following scheme

      r is the sample range of X
      b is the number of bins the range is divided in. It controls the amount of
     perturbation
      An augmentation parameter k that gives the factor by which the dataset
     must be amplified is also introduced

      Details in Arevalillo and Navarro (2011), Fundamenta Informaticae Special
     issue on Machine Learning in Bioinformatics


 The new oob error computed on the augmented merged dataset is actually a
perturbed error rate measure. We call it perturbed oob
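
 The exact perturbation formula is not reproduced in this transcript, so the following is only a sketch under loud assumptions: each perturbed copy adds uniform noise of half-width r / (2b) to every value, where r is the sample range of the input, and k perturbed copies are merged with the original cases (so the merged set holds (k + 1) times the original cases). The function name `augment` and the toy data are hypothetical; the authors' scheme in Arevalillo and Navarro (2011) may differ.

```python
import random

def augment(X, y, b, k, seed=0):
    """Merge the original cases with k jittered copies (assumed scheme, see lead-in)."""
    rng = random.Random(seed)
    n_inputs = len(X[0])
    # One jitter half-width per input: sample range r divided into b bins, halved.
    half_widths = []
    for j in range(n_inputs):
        column = [row[j] for row in X]
        half_widths.append((max(column) - min(column)) / (2 * b))
    X_aug, y_aug = [list(row) for row in X], list(y)
    for _ in range(k):
        for row, label in zip(X, y):
            X_aug.append([x + rng.uniform(-h, h) for x, h in zip(row, half_widths)])
            y_aug.append(label)   # perturbed copies keep the class label
    return X_aug, y_aug

X = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]   # toy data, 3 cases and 2 inputs
y = [0, 1, 0]
X_aug, y_aug = augment(X, y, b=5, k=3)
print(len(X_aug), "cases after augmentation")
```

 A larger b gives a smaller perturbation and a larger k gives more perturbed cases; an RF run on `X_aug`, `y_aug` would then supply the perturbed oob.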


The perturbed oob measure
 [Figure: boxplots for block matches containing the XOR interaction versus block matches
 with only NOISY INPUTS. Left panels: oob error rate. Right panels: perturbed oob for
 k = 1, 3, 5, 7, 9. Top row: sample sizes (40,40). Bottom row: sample sizes (40,20)]

 Overlap values reported under the boxplots (oob error rate, then perturbed oob for each k)

   sample sizes   oob    k = 1   k = 3   k = 5   k = 7   k = 9
   (40,40)        0.31   0.15    0.07    0.05    0.05    0.03
   (40,20)        0.42   0.35    0.24    0.18    0.16    0.14

 The perturbed oob measure overcomes the initial drawback


Summary of the algorithm
 The details of the implementation of the algorithm are given in
Arevalillo and Navarro (2011), Fundamenta Informaticae Special issue on
Machine Learning in Bioinformatics

 Usually bsize = 6, 8, b = 5 and k = 3, 5, 7 are good settings

 Strategies for this step include: screeplots for variable importance, VARSEL
(Díaz Uriarte. BMC Bioinformatics 2006) and oob error smoothing (Genuer et al.
INRIA 2008)




Application to the colon cancer data
 Gene expression levels corresponding to 40 tumor and 22 healthy tissue samples
were collected with an Affymetrix oligonucleotide Hum6000 array (Alon et al. PNAS
1999). The expression levels were arranged in a matrix with 2000 columns (genes)
and 62 rows along with a column containing the clinical outcome variable Y

   Y=1 for tumorous samples and Y=0 for healthy samples




                                                                      The data are publicly available and
                                                                      can be downloaded from the
                                                                      package colonCA of Bioconductor
                                                                      www.bioconductor.org




Data pre-processing
 Gene expression intensities were pre-processed with a log transformation and a
standardization across genes
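
 A minimal sketch of this pre-processing, assuming that "standardization across genes" means that each sample (row) is centred and scaled over its genes; this reading, the function name `preprocess` and the toy intensities are assumptions. Under this reading the base of the logarithm does not matter, since a change of base only rescales each row before it is standardized.

```python
import math
from statistics import mean, pstdev

def preprocess(intensities):
    """Log-transform positive intensities, then standardize each sample (row)."""
    processed = []
    for sample in intensities:
        logged = [math.log(v) for v in sample]
        m, s = mean(logged), pstdev(logged)
        processed.append([(v - m) / s for v in logged])
    return processed

# Toy matrix: 2 samples (rows) and 4 genes (columns), hypothetical intensities.
raw = [[120.0, 950.0, 40.0, 3100.0],
       [80.0, 1500.0, 65.0, 2200.0]]
Z = preprocess(raw)
for row in Z:
    print([round(v, 2) for v in row])
```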

 The figure shows the potential outliers given by the RF outlier detector. Cases 18, 20,
52, 55 and 58 were previously identified as outliers in the specialized literature
(Chow et al. Physiol. Genomics 2001. Ambroise and McLachlan. PNAS 2002)



                                                                 These outliers might be caused by
                                                                 different sources of error while collecting
                                                                 the data. We eliminate them from the
                                                                 analysis and end up with a data set
                                                                 containing 57 cases and 2000 predictors




A first selection. Sequential search

                                                           Simple inspection of the screeplot of RF
                                                           variable importance allows us to identify the
                                                           most relevant variables. A forward sequential
                                                           search strategy as in Genuer (2008) gives a
                                                           selection containing the most informative
                                                           genes for classifying the clinical outcome




                                                             List of genes selected after the sequential
                                                             search step. It agrees closely with
                                                             previous selections (Ben-Dor et al.
                                                             J.Comp.Biol. 2000)




Results
 Control parameters for the hunting stage of the procedure have been set to block
size = 5, k = 5 and b = 5. RF controls ntree and mtry were set to their default values

Findings for three top ranked block matches (heat map plots of the oob for each
match and the scatter plots for the selected gene to gene interactions)


                                                                                       Bivariate gene
                                                                                         interaction

                                                                                    (X86693, M80815)


                                                                                    (R60883, U04953)


                                                                                    (L12350, X86693)



Additional insights

                                            Oob error rate with all the genes = 3.5%
                                            Oob error rate with the first 300 top ranked genes as
                                           predictors = 1.8%
                                            Oob error rate with all the genes but the 300 top
                                           ranked = 26.3%


                                            In this case the sequential stage is carried out
                                           manually by filtering the 300 top ranked genes
                                            The hunting step of the RF bivariate interaction detector
                                           procedure makes it possible to uncover interesting patterns
                                           in the remaining 1700 genes
                                            Interesting gene associations come up from the first
                                           100 positions of the ranking of block matches
                                            RF oob error for the best 10 gene to gene
                                           interactions is 10.5%
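
 The reported oob error rates can be read as counts of misclassified cases. The sketch below assumes that every rate is a fraction of the 57 cases retained after outlier removal; the helper `implied_errors` is hypothetical and simply searches for the counts that round to each reported percentage.

```python
n_cases = 57   # cases retained after removing the five outliers

reported = {
    "all genes": 3.5,
    "top 300 genes": 1.8,
    "all but the top 300": 26.3,
    "best 10 interactions": 10.5,
}

def implied_errors(rate_percent, n):
    """Counts of misclassified cases whose error rate rounds to rate_percent."""
    return [m for m in range(n + 1) if round(100 * m / n, 1) == rate_percent]

counts = {name: implied_errors(rate, n_cases) for name, rate in reported.items()}
for name, m in counts.items():
    print(f"{name}: {m} of {n_cases} cases")
```

 Under that assumption the rates correspond to 2, 1, 15 and 6 misclassified cases out of 57, respectively.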


Summary and conclusions
  RF is a widely used algorithm for classification and variable selection in high
 dimensional small sample data. However, sequential search strategies based on
 the oob error and its ranking of variable importance usually fail to uncover weak
 marginal / strong bivariate hidden interactions in these data structures

  This happens because of the curse of dimensionality and the small sample size;
 both degrade the performance of the RF classifier. Data augmentation and an
 exhaustive exploration of the feature space by blocks, with RF as the search
 engine, protect us from this phenomenon

   A perturbed oob measure is obtained when RF is run for all the features
 belonging to every pair of blocks in the augmented dataset

  The ranking of perturbed oobs then narrows the search from the set of all possible
 bivariate interactions to the variables within the top ranked blocks

  The application of the proposed bivariate interaction detector algorithm to a real
 gene expression data set uncovered WM/SB gene to gene interactions
 associated with the phenotype
Future research

 The method was proposed for binary classification. Its extension to multi-class
problems and the development of tricks and shortcuts that reduce the
computational cost open future research avenues

  The interaction detector algorithm uses RF as the search engine. The use of
other search engines with classifiers like LDA, QDA, SVM, … is also an issue for
future research. Recently, Arevalillo and Navarro (BMC Bioinformatics 2011)
proposed QDA as the search engine

    The development of an R package that incorporates all these improvements

   Finally, the study of the problem of finding informative WM/SB genomic
interactions in SNP data is an open research issue




Thank you for your attention




       Jorge M. Arevalillo: jmartin@ccia.uned.es
       Hilario Navarro: hnavarro@ccia.uned.es

       Department of Statistics and Operational Research
       Universidad Nacional de Educación a Distancia
       Paseo Senda del Rey nº 9. 28040 Madrid




22   Jorge M. Arevalillo and H. Navarro. RF Learning Mechanism with Application to the Identification of Gene Interactions

More Related Content

Similar to A Study of RandomForests Learning Mechanism with Application to the Identification of Informative Gene Interactions in Microarray Data

jin-HMG2014-post
jin-HMG2014-postjin-HMG2014-post
jin-HMG2014-postJin Yu
 
Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09Sean Davis
 
Association mapping in plants
Association mapping in plantsAssociation mapping in plants
Association mapping in plantsWaseem Hussain
 
RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities Paolo Dametto
 
Part 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goalPart 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goalJoachim Jacob
 
Analytical Study of Hexapod miRNAs using Phylogenetic Methods
Analytical Study of Hexapod miRNAs using Phylogenetic MethodsAnalytical Study of Hexapod miRNAs using Phylogenetic Methods
Analytical Study of Hexapod miRNAs using Phylogenetic Methodscscpconf
 
ASHG 2015 - Redundant Annotations in Tertiary Analysis
ASHG 2015 - Redundant Annotations in Tertiary AnalysisASHG 2015 - Redundant Annotations in Tertiary Analysis
ASHG 2015 - Redundant Annotations in Tertiary AnalysisJames Warren
 
Microarray @ujjwal sirohi
Microarray @ujjwal sirohiMicroarray @ujjwal sirohi
Microarray @ujjwal sirohiujjwal sirohi
 
High Sensitivity Sanger Sequencing for Minor Indel Detection and Characteriza...
High Sensitivity Sanger Sequencing for Minor Indel Detection and Characteriza...High Sensitivity Sanger Sequencing for Minor Indel Detection and Characteriza...
High Sensitivity Sanger Sequencing for Minor Indel Detection and Characteriza...Thermo Fisher Scientific
 
Stephen Friend National Heart Lung & Blood Institute 2011-07-19
Stephen Friend National Heart Lung & Blood Institute 2011-07-19Stephen Friend National Heart Lung & Blood Institute 2011-07-19
Stephen Friend National Heart Lung & Blood Institute 2011-07-19Sage Base
 
Stephen Friend SciLife 2011-09-20
Stephen Friend SciLife 2011-09-20Stephen Friend SciLife 2011-09-20
Stephen Friend SciLife 2011-09-20Sage Base
 
Genomics Technologies
Genomics TechnologiesGenomics Technologies
Genomics TechnologiesSean Davis
 
DNA microarray
DNA microarrayDNA microarray
DNA microarrayS Rasouli
 
Stephen Friend Institute of Development, Aging and Cancer 2011-11-28
Stephen Friend Institute of Development, Aging and Cancer 2011-11-28Stephen Friend Institute of Development, Aging and Cancer 2011-11-28
Stephen Friend Institute of Development, Aging and Cancer 2011-11-28Sage Base
 
Genome wide association studies---In genomics, a genome-wide association stud...
Genome wide association studies---In genomics, a genome-wide association stud...Genome wide association studies---In genomics, a genome-wide association stud...
Genome wide association studies---In genomics, a genome-wide association stud...DrAmitJoshi9
 

Similar to A Study of RandomForests Learning Mechanism with Application to the Identification of Informative Gene Interactions in Microarray Data (20)

jin-HMG2014-post
jin-HMG2014-postjin-HMG2014-post
jin-HMG2014-post
 
Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09
 
Association mapping in plants
Association mapping in plantsAssociation mapping in plants
Association mapping in plants
 
RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities
 
Part 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goalPart 1 of RNA-seq for DE analysis: Defining the goal
Part 1 of RNA-seq for DE analysis: Defining the goal
 
Analytical Study of Hexapod miRNAs using Phylogenetic Methods
Analytical Study of Hexapod miRNAs using Phylogenetic MethodsAnalytical Study of Hexapod miRNAs using Phylogenetic Methods

A Study of RandomForests Learning Mechanism with Application to the Identification of Informative Gene Interactions in Microarray Data

  • 1. A Study of Random Forests Learning Mechanism with Application to the Identification of Informative Gene Interactions in Microarray Data. Jorge M. Arevalillo and Hilario Navarro, Dpt. Statistics and Operational Research, Universidad Nacional de Educación a Distancia. Salford Analytics and Data Mining Conference 2012, San Diego.
  • 2. Outline: weak marginal / strong bivariate genetic interactions; RF learning mechanism; RF bivariate interaction detector procedure (controlling the curse of dimensionality, handling the small sample effect); application to microarray data; conclusions. Jorge M. Arevalillo and H. Navarro. RF Learning Mechanism with Application to the Identification of Gene Interactions
  • 3. Human Genetics Basics. DNA is often described as the blueprint of living organisms. It is composed of two complementary strands of nucleotides: adenine (A) pairs with thymine (T) and cytosine (C) with guanine (G). Basically, a gene is a piece of DNA that contains the genetic information for the synthesis of a protein. The human genome in numbers: 23 pairs of chromosomes; 2 meters of DNA; a sequence 3 billion base pairs long; 30000 – 40000 genes; over 99% of the genome is identical in all human beings.
  • 4. The central dogma of molecular biology. The expression of the genetic information stored in the DNA occurs in two stages. TRANSCRIPTION: DNA is transcribed into messenger RNA (mRNA). TRANSLATION: mRNA is transported to the cell cytoplasm and translated to produce a protein. Amino acids are used to construct proteins, which in turn determine the observed phenotype. DNA microarray technologies make it possible to measure the abundance of mRNA by monitoring the expression levels of hundreds or thousands of genes under different conditions of the phenotype.
  • 5. Weak marginal / strong bivariate genetic interactions. In binary classification we define a weak marginal / strong bivariate (WM/SB) gene-to-gene interaction as a pair of variables (genes) whose joint distribution discriminates the outcome but whose marginal distributions are irrelevant for class separation.
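The definition on slide 5 can be illustrated with a small synthetic XOR-type sample, the same kind of pattern the slides later use as an extreme example. This is only a sketch under assumptions that are not in the slides: the two inputs are drawn uniformly on [-1, 1], the class is 1 when their signs agree, and the helper names `make_xor_sample` and `accuracy` are hypothetical.

```python
import random

def make_xor_sample(n, seed=0):
    """Draw n cases with two inputs uniform on [-1, 1]; class 1 when the signs agree."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append((x1, x2, 1 if x1 * x2 > 0 else 0))
    return rows

def accuracy(rows, rule):
    """Fraction of cases whose class is reproduced by rule(x1, x2)."""
    return sum(rule(x1, x2) == y for x1, x2, y in rows) / len(rows)

rows = make_xor_sample(2000)

# A rule that looks at one input only (its marginal) does no better than chance ...
marginal_acc = accuracy(rows, lambda x1, x2: 1 if x1 > 0 else 0)
# ... while a rule that uses both inputs jointly separates the classes.
joint_acc = accuracy(rows, lambda x1, x2: 1 if x1 * x2 > 0 else 0)

print(f"single-input rule: {marginal_acc:.2f}, joint rule: {joint_acc:.2f}")
```

The single-input rule stays near 50% accuracy because, given the sign of one input, the class depends entirely on the sign of the other.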
  • 6. RF learning mechanism. Random Forest (RF) is an ensemble of decision trees grown in a special way. Randomness is injected into the RF mechanism in two ways: each tree in the forest is grown on a bootstrap resample, and the best splitter at each node is sought within a randomly selected subset of inputs. The number ntree of trees in the forest and the number R of candidate inputs for splitting each node must be set in advance. Defaults: ntree = 500 and R = square root of the number p of inputs. Each bootstrap sample contains nearly 63% of the distinct observations, so each tree is grown on about 63% of the data. The classification error rate is estimated using the remaining 37% of observations, the out-of-bag cases; the error rate evaluated on them is called the oob error.
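The 63% / 37% split quoted on slide 6 follows from sampling n cases with replacement: a given case is missed with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.37. A minimal sketch that checks this by simulation; the function name `in_bag_fraction` is hypothetical, and flooring the square root to obtain an integer R is an assumption (the slide only says "square root of p").

```python
import math
import random

def in_bag_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and return the share of distinct cases in it."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}
    return len(in_bag) / n

n = 1000
frac = in_bag_fraction(n)
limit = 1 - math.exp(-1)  # limiting in-bag share, about 0.632

# Candidate inputs per node for p = 2000 genes (assumption: floor of sqrt(p)).
p = 2000
R = math.isqrt(p)

print(f"simulated in-bag share: {frac:.3f}, limit: {limit:.3f}, R for p={p}: {R}")
```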
  • 7. RF table of variable importance. The high-dimensional nature of the data obtained from gene expression microarray experiments has created the need for variable selection procedures that separate relevant predictors (genes), which carry useful information for classifying the phenotype, from irrelevant predictors. RF generates variable importance measures that rank predictors according to their contribution to the predictive accuracy of the ensemble. RF gives two measures of variable importance. GINI MEASURE: each variable is assigned a score that accumulates the improvements in the Gini index over all the nodes, in all the trees of the forest, that use the variable as splitter. PERMUTATION-BASED MEASURE: for each variable, its values are randomly permuted across cases to obtain a noisy predictor; this noisy predictor is used in place of the original one and the oob error is computed again. The importance of the variable is the difference between the oob errors after and before permutation.
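The permutation-based measure on slide 7 can be sketched in a few lines. This is a simplified illustration, not RF's actual bookkeeping: a fixed rule `predict` stands in for the forest and the whole toy sample stands in for the out-of-bag cases. The names `error_rate`, `permutation_importance` and the toy data are hypothetical.

```python
import random

def error_rate(predict, X, y):
    """Share of cases misclassified by predict."""
    return sum(predict(row) != label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, col, seed=0):
    """Error after permuting column col minus error before permutation."""
    rng = random.Random(seed)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    # Rebuild the data with column col replaced by its permuted (noisy) version.
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return error_rate(predict, X_perm, y) - error_rate(predict, X, y)

# Toy data: column 0 determines the class, column 1 is pure noise.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(500)]
y = [1 if row[0] > 0 else 0 for row in X]

def predict(row):
    """Stand-in classifier (assumption): thresholds column 0."""
    return 1 if row[0] > 0 else 0

imp_signal = permutation_importance(predict, X, y, col=0)
imp_noise = permutation_importance(predict, X, y, col=1)
print(f"importance of signal column: {imp_signal:.2f}, noise column: {imp_noise:.2f}")
```

Permuting the informative column pushes the error towards chance, while permuting the noise column leaves it unchanged, which is what the ranking exploits.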
  • 8. The oob error rate degradation in high-dimensional settings. An extreme synthetic example: the XOR interaction pattern. The oob error rate degrades rapidly as the number of noisy inputs increases, so the XOR signal is lost. The interaction is captured only as long as it appears alone, without the disturbance of noisy inputs; an exhaustive search among all pairs of inputs is therefore required if we want the RF learning mechanism to detect the interaction. Our proposal offers shortcuts and tricks that simplify the search.
  • 9. Search procedure. Sequential stage. The RF ranking of variable importance gives new insights into the degradation of the oob error rate. Some alternatives that explore this ranking sequentially, Díaz Uriarte (2006) and Genuer (2008), have been proposed to identify relevant patterns correlated with the outcome.
  • 10. Search procedure. Hunting stage. The second stage is designed to hunt for hard-to-uncover bivariate associations, which are lost by sequential search strategies. The idea is to group the inputs in blocks, then run RF on all the variables belonging to each pair of blocks and use its oob error to highlight the block matches where WM/SB interactions are more likely to appear. This limits the search. [Figure: Block i and Block j form Match (i, j); ranking of block matches.]
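The block grouping described on slide 10 can be sketched as follows, to show how much it shrinks the search. Several things here are assumptions rather than the authors' implementation: blocks are formed from consecutive variables, a match is a pair of distinct blocks, and the names `make_blocks` and `block_matches` are hypothetical. The sizes (1700 genes, block size 5) are the ones quoted later in the slides for the colon cancer application.

```python
from itertools import combinations
from math import comb

def make_blocks(variables, bsize):
    """Split the list of variables into consecutive blocks of size bsize."""
    return [variables[i:i + bsize] for i in range(0, len(variables), bsize)]

def block_matches(blocks):
    """Yield ((i, j), variables) for every pair of distinct blocks i < j."""
    for i, j in combinations(range(len(blocks)), 2):
        yield (i, j), blocks[i] + blocks[j]

genes = [f"g{idx}" for idx in range(1700)]  # placeholder gene names
blocks = make_blocks(genes, bsize=5)
matches = list(block_matches(blocks))

n_pairs = comb(len(genes), 2)
print(len(blocks), len(matches), n_pairs)  # 340 57630 1444150
```

Under these assumptions each RF run sees only the 10 variables of one match, and about 58 thousand such runs replace a direct scan of more than 1.4 million gene pairs.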
  • 11. Drawback with the oob error rate. Simulation experiment with block size = 6. The boxplots show that the oob error rate cannot distinguish between block matches containing a weak marginal / strong bivariate association and block matches with only noisy inputs. The curses of dimensionality and low sample size come up again. [Figure: boxplots of the oob error rate for XOR versus noisy-input block matches; sample sizes (40,40), overlap = 0.31; sample sizes (40,20), overlap = 0.42.]
  • 12. Data augmentation. To overcome this drawback, the data are artificially augmented and the oob error rate of an RF run on the augmented data is computed. Data are perturbed according to a scheme in which r is the sample range of X and b is the number of bins into which the range is divided; b controls the amount of perturbation. An augmentation parameter k, which gives the factor by which the dataset must be amplified, is also introduced. The new oob error computed on the augmented, merged dataset is actually a perturbed error rate measure; we call it the perturbed oob. Details in Arevalillo and Navarro (2011), Fundamenta Informaticae, Special Issue on Machine Learning in Bioinformatics.
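The transcript of slide 12 names the ingredients of the augmentation (the range r, the number of bins b, the factor k) but not the perturbation formula itself, so the following is only a guess at the general shape of such a scheme, not the authors' method. Assumptions, all hypothetical: each perturbed copy adds uniform noise of at most half a bin width, r / (2b), to every value; the original cases are kept and k perturbed copies are appended; labels are copied unchanged. The function name `augment` is also hypothetical.

```python
import random

def augment(X, y, b=5, k=3, seed=0):
    """Return the original cases plus k perturbed copies (assumed scheme, see lead-in)."""
    rng = random.Random(seed)
    n_cols = len(X[0])
    # r: sample range of each input variable.
    ranges = [max(row[c] for row in X) - min(row[c] for row in X) for c in range(n_cols)]
    X_aug, y_aug = [row[:] for row in X], list(y)
    for _ in range(k):
        for row, label in zip(X, y):
            # Assumption: uniform jitter within half a bin width, r / (2 * b).
            noisy = [v + rng.uniform(-r / (2 * b), r / (2 * b)) for v, r in zip(row, ranges)]
            X_aug.append(noisy)
            y_aug.append(label)
    return X_aug, y_aug

X = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
y = [0, 1, 0]
X_aug, y_aug = augment(X, y, b=5, k=3)
print(len(X_aug), len(y_aug))
```

A larger b gives narrower bins and therefore smaller perturbations, which matches the slide's remark that b controls the amount of perturbation. The exact scheme is in the paper cited on the slide.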
  • 13. The perturbed oob measure. The perturbed oob measure overcomes the initial drawback. [Figure: boxplots of the oob error rate (XOR versus noisy inputs) next to the perturbed oob as a function of k. Sample sizes (40,40): oob overlap = 0.31; perturbed oob overlap = 0.15, 0.07, 0.05, 0.05, 0.03 for k = 1, 3, 5, 7, 9. Sample sizes (40,20): oob overlap = 0.42; perturbed oob overlap = 0.35, 0.24, 0.18, 0.16, 0.14 for k = 1, 3, 5, 7, 9.]
  • 14. Summary of the algorithm. The details of the implementation of the algorithm can be found in Arevalillo and Navarro (2011), Fundamenta Informaticae, Special Issue on Machine Learning in Bioinformatics. Usually bsize = 6, 8, b = 5 and k = 3, 5, 7 are good settings. Strategies for the sequential step include screeplots of variable importance, VARSEL (Díaz Uriarte, BMC Bioinformatics, 2006) and oob error smoothing (Genuer et al., INRIA, 2008).
  • 15. Application to the colon cancer data. Gene expression levels for 40 tumor and 22 healthy tissue samples were collected with an Affymetrix oligonucleotide Hum6000 array (Alon et al., PNAS 1999). The expression levels were arranged in a matrix with 62 rows (samples) and 2000 columns (genes), along with a column containing the clinical outcome variable Y: Y = 1 for tumor samples and Y = 0 for healthy samples. The data are publicly available and can be downloaded from the package colonCA of Bioconductor, www.bioconductor.org.
  • 16. Data pre-processing. Gene expression intensities were pre-processed with a log transformation and a standardization across genes. The figure shows the potential outliers flagged by the RF outlier detector. Cases 18, 20, 52, 55 and 58 were previously identified as outliers in the specialized literature (Chow et al., Physiol. Genomics 2001; Ambroise and McLachlan, PNAS 2002). These outliers might be caused by different sources of error in data collection. We eliminate them from the analysis and end up with a data set containing 57 cases and 2000 predictors.
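The pre-processing on slide 16 amounts to two arithmetic steps per sample. A minimal sketch, assuming that "standardization across genes" means each sample's log intensities are centred and scaled over its genes (the slide could also be read as standardizing each gene over samples); the natural logarithm, the population standard deviation and the function name `preprocess` are likewise assumptions.

```python
import math
import statistics

def preprocess(sample):
    """Log-transform one sample's intensities, then standardize them across its genes."""
    logged = [math.log(v) for v in sample]  # intensities must be positive
    mu = statistics.mean(logged)
    sd = statistics.pstdev(logged)
    return [(v - mu) / sd for v in logged]

# Toy sample with three made-up gene intensities.
z = preprocess([10.0, 100.0, 1000.0])
print([round(v, 3) for v in z])
```

After this step every sample has mean 0 and unit variance over its genes, so differences in overall array intensity no longer drive the splits.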
  • 17. A first selection. Sequential search. Simple inspection of the screeplot of RF variable importance allows us to identify the most relevant variables. A forward sequential search strategy, as in Genuer (2008), gives a selection containing the most informative genes for classifying the clinical outcome. The list of genes selected after the sequential search step agrees closely with previous selections (Ben-Dor et al., J. Comp. Biol. 2000).
  • 18. Results. Control parameters for the hunting stage of the procedure were set to block size = 5, k = 5 and b = 5. The RF controls ntree and mtry were set to their default values. Findings for the three top-ranked block matches are shown as heat maps of the oob for each match and scatter plots of the selected gene-to-gene interactions. Bivariate gene interactions: (X86693, M80815), (R60883, U04953), (L12350, X86693).
  • 19. Additional insights. Oob error rate with all the genes = 3.5%. Oob error rate with the 300 top-ranked genes as predictors = 1.8%. Oob error rate with all the genes except the 300 top-ranked = 26.3%. In this case the sequential stage is carried out manually by filtering out the 300 top-ranked genes. The hunting step of the RF bivariate interaction detector procedure makes it possible to uncover interesting patterns among the remaining 1700 genes. Interesting gene associations come up in the first 100 positions of the ranking of block matches. The RF oob error for the best 10 gene-to-gene interactions is 10.5%.
  • 20. Summary and conclusions. RF is a widely used algorithm for classification and variable selection in high-dimensional, small-sample data. However, sequential search strategies based on the oob error and the RF ranking of variable importance usually fail to uncover hidden weak marginal / strong bivariate interactions in these data structures. This happens because of the curse of dimensionality and the small sample size, both of which degrade the performance of the RF classifier. Data augmentation and an exhaustive exploration of the feature space by blocks, which uses RF as the search engine, protect us from this phenomenon. A perturbed oob measure is obtained when RF is run on all the features belonging to every pair of blocks in the augmented dataset. The ranking of perturbed oobs then limits the search from the set of all possible bivariate interactions to the variables within the top-ranked blocks. Applied to a real gene expression data set, the proposed bivariate interaction detector algorithm was able to uncover WM/SB gene-to-gene interactions associated with the phenotype.
  • 21. Future research. The method was proposed for binary classification. Its extension to multi-class problems and the development of tricks and shortcuts that reduce the computational cost open avenues for future research. The interaction detector algorithm uses RF as the search engine. The use of other search engines with classifiers like LDA, QDA, SVM, … is also an issue for future research; recently, Arevalillo and Navarro (2011, BMC Bioinformatics) proposed QDA as the search engine. A further goal is the development of an R package that incorporates all these improvements. Finally, the problem of finding informative WM/SB genomic interactions in SNP data is an open research issue.
  • 22. Thank you for your attention. Jorge M. Arevalillo: jmartin@ccia.uned.es. Hilario Navarro: hnavarro@ccia.uned.es. Department of Statistics and Operational Research, Universidad Nacional de Educación a Distancia, Paseo Senda del Rey nº 9, 28040 Madrid.