Retrieval from Software Libraries for Bug Localization:
A Comparative Study of Generic and Composite Text Models

Shivani Rao and Avinash Kak
School of ECE, Purdue University

May 21, 2011
MSR, Hawaii







Outline

  1  Bug localization
  2  IR (Information Retrieval)-based bug localization
  3  Text Models
  4  Preprocessing of the source files
  5  Evaluation Metrics
  6  Results
  7  Conclusion




Bug localization

  Bug localization means to locate the files, methods, classes, etc., that are directly related to the problem causing the abnormal execution behavior of the software.

  IR-based bug localization means to locate a bug from its textual description.








                         A typical bug localization process








A typical bug report: JEdit








Past work on IR-based bug localization

  Authors/Paper       Model            Software dataset
  Marcus et al. [1]   VSM              JEdit
  Cleary et al. [2]   LM, LSA and CA   Eclipse JDT
  Lukins et al. [3]   LDA              Mozilla, Eclipse, Rhino and JEdit

Drawbacks
  1  None of the work reported has been evaluated on a standard dataset.
  2  Inability to compare with static and dynamic techniques.
  3  The number of bugs studied is on the order of 5-30.





iBUGS

  Created by Dallmeier and Zimmermann [4], iBUGS contains a large number of real bugs with corresponding test suites that can generate failing and passing test runs.

  ASPECTJ software:

    Software Library Size (Number of files)   6546
    Lines of Code                             75 KLOC
    Vocabulary Size                           7553
    Number of bugs                            291

  Table: The iBUGS dataset after preprocessing








           A typical bug report in the iBUGS repository








Text models

  VSM  : Vector Space Model
  LSA  : Latent Semantic Analysis Model
  UM   : Unigram Model
  LDA  : Latent Dirichlet Allocation Model
  CBDM : Cluster-Based Document Model








Vector Space Model

  If V is the vocabulary, then queries and documents are |V|-dimensional vectors.

    $$\text{sim}(q, d_m) = \frac{w_q \cdot w_m}{|w_q|\,|w_m|}$$

  Sparse yet high-dimensional space.
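To make the model concrete, here is a minimal sketch (not from the paper) of VSM retrieval with raw term-frequency vectors; the toy matrix and query are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(q, d):
    """Cosine similarity between a query vector and a document vector."""
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d) / denom if denom > 0 else 0.0

# Hypothetical toy library: one row per document, one column per vocabulary term.
A = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0]], dtype=float)
q = np.array([1, 0, 1], dtype=float)  # query term-frequency vector

# Rank documents by decreasing similarity to the query.
scores = [cosine_similarity(q, d) for d in A]
print(np.argsort(scores)[::-1], scores)
```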








         Latent semantic analysis: Eigen decomposition




$$A = U \Sigma V^T$$
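A minimal sketch (illustrative, not the authors' code) of computing the rank-K decomposition with NumPy; the random term-document matrix stands in for a real library:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(7, 5)).astype(float)  # hypothetical |V| x |D| matrix

K = 3  # number of latent dimensions retained
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_K, S_K = U[:, :K], np.diag(s[:K])

A_K = U_K @ S_K @ Vt[:K, :]  # rank-K approximation used by LSA
```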





LSA based models

  Topic-based representation: w_K(m), a K-dimensional vector that represents the mth document w_m in the eigenspace.

    $$w_K(m) = \Sigma_K^{-1} U_K^T w_m$$
    $$q_K = \Sigma_K^{-1} U_K^T q$$
    $$\text{sim}(q, d_m) = \frac{q_K \cdot w_K(m)}{|q_K|\,|w_K(m)|}$$

  LSA2: Fold the K-dimensional representation back into a smoothed |V|-dimensional representation $\tilde{w} = U_K \Sigma_K w_K$ and compare it directly with the query q.

  Combined Representation: combines LSA2 with the VSM representation using the mixture parameter λ: $A_{combined} = \lambda \tilde{A} + (1 - \lambda) A$.
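A sketch of these projections, reusing U_K and S_K from the SVD snippet above (again an illustration, not the paper's implementation):

```python
import numpy as np

def lsa_project(x, U_K, S_K):
    """Project a |V|-dimensional vector into the K-dimensional LSA space."""
    return np.linalg.inv(S_K) @ U_K.T @ x

def lsa_similarity(q, d, U_K, S_K):
    """Cosine similarity between a query and a document in the LSA space."""
    qk, dk = lsa_project(q, U_K, S_K), lsa_project(d, U_K, S_K)
    return float(qk @ dk) / (np.linalg.norm(qk) * np.linalg.norm(dk))

def lsa2_fold_back(d, U_K, S_K):
    """LSA2: smoothed |V|-dimensional representation of a document."""
    return U_K @ S_K @ lsa_project(d, U_K, S_K)
```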





Unigram model to represent documents using probability distributions [5]

  The term frequencies in a document are considered to be its probability distribution.

  The term frequencies in a query become the query's probability distribution.

  The similarities are established by comparing the probability distributions using KL divergence.

  To add smoothing, we mix in the probability distribution over the entire source library:

    $$p_{uni}(w|D_m) = \mu \frac{c(w, d_m)}{|d_m|} + (1 - \mu) \frac{\sum_{m=1}^{|D|} c(w, d_m)}{\sum_{m=1}^{|D|} |d_m|}$$

    $$p_{uni}(w|q) = \mu \frac{c(w, q)}{|q|} + (1 - \mu) \frac{\sum_{m=1}^{|D|} c(w, d_m)}{\sum_{m=1}^{|D|} |d_m|}$$
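A minimal sketch of the smoothed unigram model and the KL comparison, assuming term-count vectors as input (the µ default is an illustrative choice):

```python
import numpy as np

def smoothed_unigram(counts, collection_counts, mu=0.5):
    """Mix a document's ML distribution with the collection model."""
    doc = counts / counts.sum()
    coll = collection_counts / collection_counts.sum()
    return mu * doc + (1 - mu) * coll

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0) for unseen terms."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))
```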




                     LDA: A mixture model to represent
                     documents using topics/concepts [6]








LDA based models [7]

  Topic-based representation: θ_m, a K-dimensional probability vector that gives the topic proportions present in the mth document.

  Maximum Likelihood Representation: folds back into the |V|-dimensional term space:

    $$p_{lda}(w|D_m) = \sum_{t=1}^{K} p(w|z = t)\, p(z = t|D_m) = \sum_{t=1}^{K} \phi(t, w)\, \theta_m(t)$$

  Combined Representation: combines the Unigram representation of the document with its MLE-LDA representation:

    $$p_{combined}(w|D_m) = \lambda\, p_{lda}(w|D_m) + (1 - \lambda)\, p_{uni}(w|D_m)$$
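A sketch of the MLE-LDA fold-back and the combined model, assuming φ and θ_m come from an already-fitted LDA model (the λ = 0.9 default echoes the experimental setting reported in the Results):

```python
import numpy as np

def mle_lda(phi, theta_m):
    """Fold a document's topic proportions back into the term space.

    phi: (K, |V|) matrix with phi[t, w] = p(w | z = t)
    theta_m: (K,) vector with theta_m[t] = p(z = t | D_m)
    Returns the |V|-dimensional distribution p_lda(w | D_m).
    """
    return theta_m @ phi

def combined_lda_unigram(p_lda, p_uni, lam=0.9):
    """Mixture of the MLE-LDA and smoothed unigram distributions."""
    return lam * p_lda + (1 - lam) * p_uni
```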




Cluster Based Document Model (CBDM) [8]

  Cluster the documents into K clusters using deterministic algorithms like K-means, hierarchical or agglomerative clustering, and so on.

  Represent each of the clusters using a multinomial distribution over the terms in the vocabulary. This distribution is commonly denoted by $p_{ML}(w|Cluster_j)$, and the probability distribution for words in a document $d_m \in Cluster_j$ is:

    $$p_{cbdm}(w|w_m) = \lambda_1 \frac{w_m(n)}{\sum_{n=1}^{|V|} w_m(n)} + \lambda_2\, p_c(w) + \lambda_3\, p_{ML}(w|Cluster_j) \quad (1)$$
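A sketch of Eq. (1) using K-means from scikit-learn; the λ values echo the best-performing row of the CBDM results table, but the function itself is an illustrative reconstruction, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def cbdm_distributions(counts, n_clusters=100, lambdas=(0.81, 0.09, 0.1)):
    """Mix document, collection, and cluster language models per Eq. (1).

    counts: (|D|, |V|) term-count matrix; lambdas must sum to 1.
    """
    l1, l2, l3 = lambdas
    doc = counts / counts.sum(axis=1, keepdims=True)   # document model
    coll = counts.sum(axis=0) / counts.sum()           # collection model p_c(w)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc)
    p_cbdm = np.empty_like(doc)
    for j in range(n_clusters):
        members = labels == j
        if not members.any():
            continue  # K-means occasionally leaves a cluster empty
        cluster = counts[members].sum(axis=0) / counts[members].sum()
        p_cbdm[members] = l1 * doc[members] + l2 * coll + l3 * cluster
    return p_cbdm
```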







                    Summary of Text Models used in the
                           comparative study








Summary of Text Models used in the comparative study (cont.)

  Model     Representation                                  Similarity Metric
  VSM       frequency vector                                Cosine similarity
  LSA       K-dimensional vector in the eigenspace          Cosine similarity
  Unigram   |V|-dimensional probability vector (smoothed)   KL divergence
  LDA       K-dimensional probability vector                KL divergence
  CBDM      |V|-dimensional combined probability vector     KL divergence or likelihood

  Table: Generic models used in the comparative evaluation






Summary of Text Models used in the comparative study (cont.)

  Model     Representation                                 Similarity Metric
  LSA2      |V|-dimensional representation in term-space   Cosine similarity
  MLE-LDA   |V|-dimensional MLE-LDA probability vector     KL divergence or likelihood

  Table: The variations on two of the generic models used in the comparative evaluation








Summary of Text Models used in the comparative study (cont.)

  Model           Representation                                        Similarity Metric
  Unigram + LDA   |V|-dimensional combined probability vector           KL divergence or likelihood
  VSM + LSA       |V|-dimensional combined VSM and LSA representation   Cosine similarity

  Table: The two composite models used







Preprocessing of the source files

  If a patch file does not exist in /trunk, it is searched for in the other branches/tags of ASPECTJ and added to the source library.

  The source library consists of ".java" files only. After this step, our library ended up with 6546 Java files.

  The repository.xml file documents all the information related to a bug: the BugID, the bug description, the relevant source files, and so on. We shall call this ground-truth information the relevance judgements.

  Bugs documented in iBUGS that have no relevant source files in the library resulting from the previous step are eliminated. After this step, we are left with 291 bugs.





Preprocessing of the source files (contd.)

  Hard-words, camel-case words and soft-words are handled by using popular identifier-splitting methods [9, 10]; a simple splitter is sketched after this list.

  The stop-list consists of the most commonly occurring words, for example "for," "else," "while," "int," "double," "long," "public," "void," etc. There are 375 such words in the iBUGS ASPECTJ software. We also drop all unicode strings from the vocabulary.

  The vocabulary is pruned further by calculating the relative importance of terms and eliminating ubiquitous and rarely-occurring terms.
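A minimal regex-based identifier splitter with stop-word filtering (an illustrative stand-in for the cited methods [9, 10]; the stop-word set shown is only a fragment of the 375-word list):

```python
import re

def split_identifier(name):
    """Split a camel-case or underscore-separated identifier into soft words."""
    parts = re.split(r"[_\W]+", name)
    words = []
    for part in parts:
        # Break camel-case runs: "getXMLParser" -> ["get", "XML", "Parser"]
        words += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
    return [w.lower() for w in words if w]

STOP_WORDS = {"for", "else", "while", "int", "double", "long", "public", "void"}

def tokenize(identifier):
    return [w for w in split_identifier(identifier) if w not in STOP_WORDS]

print(tokenize("getXMLParser_forLongRuns"))  # ['get', 'xml', 'parser', 'runs']
```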





Mean Average Precision (MAP)

  Calculated using the following two sets:

  retrieved(Nr): the set of the top Nr documents from a ranked list of documents retrieved vis-a-vis the query.
  relevant: the set extracted from the relevance judgements available in repository.xml.

  Precision and Recall:

    $$\text{Precision}(P@N_r) = \frac{|\{relevant\} \cap \{retrieved\}|}{|\{retrieved\}|}$$

    $$\text{Recall}(R@N_r) = \frac{|\{relevant\} \cap \{retrieved\}|}{|\{relevant\}|}$$
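A direct sketch of the two definitions for a single query (`retrieved` is assumed to be rank-ordered):

```python
def precision_recall_at(retrieved, relevant, n_r):
    """P@Nr and R@Nr for one query."""
    top = set(retrieved[:n_r])
    hits = len(top & set(relevant))
    return hits / n_r, hits / len(relevant)
```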





Mean Average Precision (MAP) (cont.)

  1  If we were to plot a typical P-R curve from the values of P@Nr and R@Nr, we would get a monotonically decreasing curve that has high values of Precision at low values of Recall and vice versa.
  2  The area under the P-R curve is called the Average Precision.
  3  Taking the mean of the Average Precision over all the queries gives the Mean Average Precision (MAP); a sketch follows this list.
  4  Physical significance of MAP: same as that of Precision.
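A sketch using the common discrete form of Average Precision (precision averaged at each relevant hit), which corresponds to the area under the P-R curve described above:

```python
def average_precision(retrieved, relevant):
    """Average Precision for one query (retrieved is rank-ordered)."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (retrieved, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```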






Rank of Retrieved Files [3]

  The number of queries/bugs for which relevant source files were retrieved at ranks r_low ≤ R ≤ r_high is reported.

  For the retrieval performance reported in [3], the rank ranges used are R = 1, 2 ≤ R ≤ 5, 6 ≤ R ≤ 10 and R > 10, as in the bucketing sketch below.
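A minimal sketch of that bucketing, given the best (lowest) rank of any relevant file per query:

```python
def rank_buckets(best_ranks):
    """Count queries per rank range, using the ranges from [3]."""
    buckets = {"R = 1": 0, "2 <= R <= 5": 0, "6 <= R <= 10": 0, "R > 10": 0}
    for r in best_ranks:
        if r == 1:
            buckets["R = 1"] += 1
        elif r <= 5:
            buckets["2 <= R <= 5"] += 1
        elif r <= 10:
            buckets["6 <= R <= 10"] += 1
        else:
            buckets["R > 10"] += 1
    return buckets
```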






SCORE [11]

  1  Indicates the proportion of the program that needs to be examined in order to locate or localize a fault.
  2  For each range of this proportion (for example, 10-20%), the number of test runs (bugs) is reported.








                                             Models using LDA




Figure: MAP using the three LDA models for different values of K. The experimental parameters for the LDA+Unigram model are λ = 0.9, µ = 0.5, β = 0.01 and α = 50/K.




                     The combined LDA+Unigram model




         Figure: MAP plotted for different values of mixture proportions (λ and
         µ) of the LDA+Unigram combined model.





                                             Models using LSA




Figure: MAP using the LSA model and its variations and combinations for different values of K. The experimental parameter for the LSA+VSM combined model is λ = 0.5.




CBDM

  Model parameters         MAP for different values of K
  λ1      λ2      λ3       K=100      K=250     K=500     K=1000
  0.25    0.25    0.5      0.093144   0.0914    0.08666   0.07664
  0.15    0.35    0.5      0.0883     0.0897    0.0963    0.0932
  0.81    0.09    0.1      0.143      0.102     0.108     0.09952
  0.27    0.63    0.1      0.1306     0.117     0.111     0.0998
  0.495   0.495   0.01     0.141      0.141     0.141     0.141
  0.05    0.05    0.99     0.069      0.075     0.072     0.065

  Table: Retrieval performance using MAP with the CBDM. λ1 + λ2 + λ3 = 1; λ1 weights the Unigram model, λ2 the Collection model, and λ3 the Cluster model.







                                             Rank based metric




         Figure: The height of the bars shows the number of queries (bugs) for
         which at least one relevant source file was retrieved at rank 1.





                 SCORE: IR based bug localization tools








SCORE: Comparison with AMPLE and FINDBUGS

  SCORE with FINDBUGS: none of the bugs were localized correctly.

  Figure: SCORE values calculated over 44 bugs in iBUGS ASPECTJ using AMPLE [12]




Conclusion

  IR-based bug localization techniques are as effective as, or more effective than, static or dynamic bug localization tools.

  Sophisticated models like LDA, LSA or CBDM do not outperform simpler models like Unigram or VSM for IR-based bug localization on large software systems.

  An analysis of the spread of the word distributions over the source files, with the help of measures such as tf and idf, can give useful insights into the usability of topic- and cluster-based models for localization.








                                        End of Presentation



                                              Thanks to




                                             Questions?








Threats to validity

  We have tested on a single dataset, iBUGS. How does this generalize?

  We have eliminated xml files from those that are indexed and queried. Maybe not a valid assumption?








References

  A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic, "An Information Retrieval Approach to Concept Location in Source Code," in Proceedings of the 11th Working Conference on Reverse Engineering (WCRE 2004), pp. 214–223, IEEE Computer Society, 2004.

  B. Cleary, C. Exton, J. Buckley, and M. English, "An Empirical Analysis of Information Retrieval based Concept Location Techniques in Software Comprehension," Empirical Software Engineering, vol. 14, no. 1, pp. 93–130, 2009.

  S. K. Lukins, N. A. Kraft, and L. H. Etzkorn, "Source Code Retrieval for Bug Localization using Latent Dirichlet Allocation," in 15th Working Conference on Reverse Engineering, 2008.





References (cont.)

  V. Dallmeier and T. Zimmermann, "Extraction of Bug Localization Benchmarks from History," in ASE '07: Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering, (New York, NY, USA), pp. 433–436, ACM, 2007.

  C. Zhai and J. Lafferty, "A Study of Smoothing Methods for Language Models Applied to Information Retrieval," ACM Transactions on Information Systems, pp. 179–214, 2004.

  D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, pp. 993–1022, 2003.







References (cont.)

  X. Wei and W. B. Croft, "LDA-Based Document Models for Ad-hoc Retrieval," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2006.

  X. Liu and W. B. Croft, "Cluster-Based Retrieval Using Language Models," in ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2004.

  H. Feild, D. Binkley, and D. Lawrie, "An Empirical Comparison of Techniques for Extracting Concept Abbreviations from Identifiers," in Proceedings of the IASTED International Conference on Software Engineering and Applications, 2006.






References (cont.)

  E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, "Mining Source Code to Automatically Split Identifiers for Software Analysis," in Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, MSR '09, (Washington, DC, USA), pp. 71–80, IEEE Computer Society, 2009.

  J. A. Jones and M. J. Harrold, "Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique," in Automated Software Engineering, 2005.

  V. Dallmeier and T. Zimmermann, "Automatic Extraction of Bug Localization Benchmarks from History," tech. rep., Universität des Saarlandes, Saarbrücken, Germany, June 2007.
