SlideShare uma empresa Scribd logo
1 de 30
1




       Over the Horizon:
       ML+SBSE = What?
                        tim@menzies.us
                            wvu




                       30 - 31 January 2013
         The 24th CREST Open Workshop, UCL, London, UK
Machine Learning and Search Based Software Engineering (ML & SBSE)
2

             Data miners can find
             signals in SE artifacts
• apps store data
• recommender systems
• emails human networks
• process data  project effort
• process models  project changes
• bug databases  defect prediction
• execution traces  normal usage patterns
• operating system logs  software power consumption
• natural language requirements  connections between
  program components
• Etc

So what’s next?
3

     Better algorithms ≠ better mining
                   (yet ...)
•   Dejaeger, K.; Verbeke, W.;           •   Hall, T.; Beecham, S.; Bowes, D.;
    Martens, D.; Baesens, B.; , "Data        Gray, D.; Counsell, S.; , "A Systematic
    Mining Techniques for Software           Review of Fault Prediction Performance
    Effort Estimation: A Comparative         in Software Engineering," Software
    Study," Software Engineering, IEEE       Engineering, IEEE Transactions, doi:
    Transactions, doi:                       10.1109/TSE.2011.103
    10.1109/TSE.2011                           – Support Vector Machine (SVM)
                                                  perform less well.
     – Simple, understandable                  – Models based on C4.5 seem to
       techniques like Ordinary least             under-perform if they use
       squares regressions with log               imbalanced data.
       transformation of attributes            – Models performing comparatively
       and target perform as well as              well are relatively simple
       (or better than) nonlinear                 techniques that are easy to use and
       techniques.                                well understood.. E.g. Naïve Bayes
                                                  and Logistic regression
4



                What matters: sharing
        Tutorials : ICSE’13                        Workshops : ICSE’13

• Data Science for SE                      • DAPSE’13:
    •   How to share data and models           •   If you can’t share data, or models….
                                               •   At least, share our analysis methods
                                               •   Data analysis patterns in SE




• SE in the Age of Data Privacy
                                           • RAISE’13:
    •   If you do want to share data ...
    •   … how to privatize it                 •    realizing AI synergies in SE
                                              •    State of the art (archival)
                                              •    Over the horizon (short, non-archival)
5



              What else matters
• Tools, availability
   – Simpler, the better

• Not algorithms, but users:
   – CS discounts “user effects”
   – The notion of ‘user’ cannot be precisely defined and
     therefore has no place in CS and SE -- Edsger
     Dijkstra, ICSE’4, 1979

• Not predictive power
   – Need “insight”and “engagement”
No one thing matters
                        SE = locality effects


                                         6




  Microsoft research,    Other studios,
Redmond, Building 99     many other projects
7

     Localism:
Not general models
8

Goal: general methods for
  building local models




  Local models:
  • very simple,
  • very different to each other
“Discussion Mining” :     9



            guiding the walk across
           the space of local models
• Assumption #1:
  – Principle of rationality
    *Popper, Newell+ ; “If an
    agent has knowledge that
    one of its actions will lead
    to one of its goals, then
    the agent will select that
    action.”

• Assumption #2:
  – Agents walk clusters of
    data to make decisions
  – Typology of the data =
    space of possible
    decisions
10
1. Landscape mining:             find local regions &score them
2. Decision mining:              find intra-region deltas
3. Discussion mining:            help a team walk the regions
       A formal model for                     A successful “Bird”
         “engagement”                              session:

                                        • Knowledge engineers enter
                                          with sample data
                                        • Users take over the
                                          spreadsheet
• Christian Bird, knowledge             • Run many ad hoc queries
  engineering,                          • In such meetings, users often…
   – Microsoft Research, Redmond            • demolish the model
  – Assesses learners not by                • offer more data
     correlation, accuracy, recall, e       • demand you come back
     tc                                        next week with something
  – But by “engagement”                        better
11

        Over the Horizon:
    ML+SBSE = Discussion mining
yesterday                   today


0. algorithm            1. landscape               2. decision             3. discussion
   mining                   mining                   mining                   mining



                                                   tomorrow                     future




                              • A1: because all the primitives for the above are
 Q: why call it                 in the data mining literature
   mining?                        • So we know how to get from here to there

                              • A2: because data mining scales


               Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
12



          Towards a science of localism
1.       Data
     –         that a community can share
2.       A spark
     –         a (possibly) repeatable effect
               that questions accepted thinking
           •        E.g. Rutherford scattering
3.       Maths
     –         E.g. a data structure
4.       Synthesis:
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial
         partners, grad students
13



          Towards a science of localism
1.       Data
     –         that a community can share
2.       A spark
     –         a (possibly) repeatable effect
               that questions accepted thinking
           •        E.g. Rutherford scattering
3.       Maths
     –         E.g. a data structure
4.       Synthesis:
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics           STOP PRESS: ISBSG
7.       Funding, industrial                       samplers now in
         partners, grad students                      PROMSE
14



          Towards a science of localism
1.       Data
     –         that a community can share
2.       A spark
     –         a (possibly) repeatable effect
               that questions accepted thinking
           •        E.g. Rutherford scattering
3.       Maths
     –         E.g. a data structure
4.       Synthesis:
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial
         partners, grad students
15



          Towards a science of localism
1.       Data
     –         that a community can share
2.       A spark
     –         a (possibly) repeatable effect
               that questions accepted thinking
           •        E.g. Rutherford scattering
3.       Maths
     –         E.g. a data structure
4.       Synthesis:
     –         N olde things are really 1 thing
5.       Baselines                                Christian Bird:
     –         Tools a community can access       • The engagement pattern.
     –         Results that scream for            • users don’t want correct
               extension, improvement                models
6.       Big data:                                • They want to correct their
     –         scalability, stochastics              own models
7.       Funding, industrial
         partners, grad students
16



          Towards a science of localism
1.       Data                                        F = features = Observable+ | Controllable+
     –         that a community can share            O = objectives = O1, O2, O3,.. = score(Features)
2.       A spark
     –         a (possibly) repeatable effect        Eg = example = <F,0>
               that questions accepted thinking
           •        E.g. the photo-electric effect   C = bicluster of examples, clustered by F and O
3.       Maths                                       • Each cell has N examples
     –         E.g. a data structure
4.       Synthesis:                                    Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                                Objectives
         partners, grad students
17



          Towards a science of localism
1.       Data                                        Menzies & Shepperd, EMSE 2012
     –         that a community can share            • Special issue on conclusion instability
                                                     • Of course solutions are not stable- they specific to
2.       A spark                                       each cell in the bicluster.
     –         a (possibly) repeatable effect
               that questions accepted thinking
           •        E.g. the photo-electric effect
3.       Maths
     –         E.g. a data structure
4.       Synthesis:                                   Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                              Objectives
         partners, grad students
18



          Towards a science of localism
1.       Data                                        Applications to MOEA
     –         that a community can share
                                                     Krall & Menzies (in progress) :
2.       A spark                                     • Cluster on objectives using recursive Fastmap
     –         a (possibly) repeatable effect        • At each level, check dominance on N items from
               that questions accepted thinking
           •        E.g. the photo-electric effect       left& right branch
                                                            • Don’t recurse on dominated branches
3.       Maths                                       • Selects examples in clusters on Pareto frontier
     –         E.g. a data structure
4.       Synthesis:                                   Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                            Objectives
         partners, grad students
19

20 to 50 times fewer
objective evaluations
•   Find a large variance dimension
    in O(2N) comparisons
     –   W= any point;
     –   West= furthest of W;
     –    East= furthest of West
     –   c = dist(X,Y)

• Each example X:
     –   a = dist(X ,West);
     –   b = dist(X, East)
     –   Falls x=(a2+c2-b2)/2c
     –   And at y =sqrt( x2 – a2)

•   If X or Y dominates, don’t recurse on the other
     –   Finds clusters on the Pareto frontier in linear time

•   Split on median points, & recurse
     –   Stop when leaf has less than, say, 30 items
     –   These are parents of next generation
20



         Towards a science of localism
1.       Data                                        Transfer learning
     –         that a community can share            • Domain = <examples, distribution>
                                                     • One data set may have many domains
2.       A spark                                           • i.e. . multiple clusters of features & objectives
     –         a (possibly) repeatable effect
               that questions accepted thinking      •    Kocaguneli, & Menzies EMSE’11
           •        E.g. the photo-electric effect   •    TEAK select best cluster for cross-company learning
                                                            • Cross company== within company learning
3.       Maths
     –         E.g. a data structure
4.       Synthesis:                                      Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                               Objectives
         partners, grad students
21



          Towards a science of localism
1.       Data                                        Kocaguneli & Menzies et al, TSE’12 (March)
     –         that a community can share            • TEAK : recursively cluster on features
                                                     • Kill clusters with large objective variance
2.       A spark                                     • Recursively cluster the survivors
     –         a (possibly) repeatable effect        • kNN, select k via sub-tree inspection
               that questions accepted thinking
           •        E.g. the photo-electric effect   • Generated only a few clusters per data set
3.       Maths
     –         E.g. a data structure
4.       Synthesis:                                   Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                             Objectives
         partners, grad students
22



          Towards a science of localism
1.       Data                                        Active learning
     –         that a community can share
                                                     Kocaguneli & Menzies et al, TSE’13 (pre-print)
2.       A spark                                     • Grow clusters via reverse nearest neighbor counts
     –         a (possibly) repeatable effect        • Stop at N+3 if no improvement over N examples
               that questions accepted thinking
           •        E.g. the photo-electric effect        • Evaluated via k=1 NN on a holdout set
                                                     • Finds the most informative next question
3.       Maths
     –         E.g. a data structure
4.       Synthesis:                                   Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                            Objectives
         partners, grad students
23



          Towards a science of localism
1.       Data                                        Peters & Menzies & Zhang et al., TSE’13 (pre-print)
     –         that a community can share            • Find divisions of data that separate classes
                                                     • To effectively privatize data…
2.       A spark
                                                           • Find ranges that drive you to different classes
     –         a (possibly) repeatable effect
               that questions accepted thinking            • Remove examples without those ranges
           •        E.g. the photo-electric effect         • Mutate survivors, do not cross boundaries
3.       Maths
     –         E.g. a data structure
4.       Synthesis:                                    Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                               Objectives
         partners, grad students
24



          Towards a science of localism
1.       Data                                        Menzies & Butcher & Marcus et al., TSE’13 (pre-print)
     –         that a community can share            • Cluster on features using recursive Fastmap
                                                     • Combine sibling leaves if their density not too low
2.       A spark                                     • Train from cluster you most “envy”; test on you
     –         a (possibly) repeatable effect        • Envy-based models out-perform global models
               that questions accepted thinking
           •        E.g. the photo-electric effect
3.       Maths
     –         E.g. a data structure
4.       Synthesis:                                   Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                              Objectives
         partners, grad students
25



          Towards a science of localism
1.       Data                                        Baseline results: See above. Got a better clusterer?
     –         that a community can share            Source: http://unbox.org/things/var/timm/13/sbs/rrsl.py
                                                     Tutorial: http://unbox.org/things/var/timm/13/sbs/rrsl.pdf
2.       A spark
     –         a (possibly) repeatable effect        Open issues:
               that questions accepted thinking      • Can we track engagement.
           •        E.g. the photo-electric effect   • The acid test for structured reviews.
3.       Maths                                       • Is data mining and MOEA really different.
     –         E.g. a data structure
4.       Synthesis:                                   Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                                Objectives
         partners, grad students
26



          Towards a science of localism
1.       Data                                        Stochastic recursive Fastmap: O(N.logN)
     –         that a community can share            • Can’t explore all data?
                                                     • Just use a random sample
2.       A spark
     –         a (possibly) repeatable effect        On-line learning:
               that questions accepted thinking      • Anomalies = examples that fall outside leaf clusters
           •        E.g. the photo-electric effect   • Re-learn, but just on sub-trees with many anomalies
3.       Maths
     –         E.g. a data structure
4.       Synthesis:                                   Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                                   Objectives
         partners, grad students
27



          Towards a science of localism
1.       Data                                        New 4 year project:
                                                     • WVU + Fraunhofer , Maryland (director = Forrest Shull)
     –         that a community can share
2.       A spark                                     Transfer learning on data from
     –         a (possibly) repeatable effect        • PROMISE (open source)
               that questions accepted thinking      • Fraunhofer (proprietary data)
           •        E.g. the photo-electric effect
                                                     Core data structure? See below…
3.       Maths
     –         E.g. a data structure
4.       Synthesis:                                   Features
     –         N olde things are really 1 thing
5.       Baselines
     –         Tools a community can access
     –         Results that scream for
               extension, improvement
6.       Big data:
     –         scalability, stochastics
7.       Funding, industrial                                                               Objectives
         partners, grad students
28

        Over the Horizon:
    ML+SBSE = Discussion mining
yesterday                   today


0. algorithm            1. landscape               2. decision             3. discussion
   mining                   mining                   mining                   mining



                                                   tomorrow                     future




                              • A1: because all the primitives for the above are
 Q: why call it                 in the data mining literature
   mining?                        • So we know how to get from here to there

                              • A2: because data mining scales


               Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
29



       Care to join a new cult?

Features




                            Objectives




   ML+SBSE = exploring clusters of features and objectives
30



Questions? Comments?

Mais conteúdo relacionado

Mais procurados

Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Vincenzo Lomonaco
 

Mais procurados (20)

Deep Learning for Artificial Intelligence (AI)
Deep Learning for Artificial Intelligence (AI)Deep Learning for Artificial Intelligence (AI)
Deep Learning for Artificial Intelligence (AI)
 
Artificial Neural Network Seminar - Google Brain
Artificial Neural Network Seminar - Google BrainArtificial Neural Network Seminar - Google Brain
Artificial Neural Network Seminar - Google Brain
 
AI Deep Learning - CF Machine Learning
AI Deep Learning - CF Machine LearningAI Deep Learning - CF Machine Learning
AI Deep Learning - CF Machine Learning
 
Machine Learning vs. Deep Learning
Machine Learning vs. Deep LearningMachine Learning vs. Deep Learning
Machine Learning vs. Deep Learning
 
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...Deep Learning for Computer Vision: A comparision between Convolutional Neural...
Deep Learning for Computer Vision: A comparision between Convolutional Neural...
 
Deep learning tutorial 9/2019
Deep learning tutorial 9/2019Deep learning tutorial 9/2019
Deep learning tutorial 9/2019
 
Intro to deep learning
Intro to deep learning Intro to deep learning
Intro to deep learning
 
Gervigreind
GervigreindGervigreind
Gervigreind
 
The Deep Learning Glossary
The Deep Learning GlossaryThe Deep Learning Glossary
The Deep Learning Glossary
 
AI/ML as an empirical science
AI/ML as an empirical scienceAI/ML as an empirical science
AI/ML as an empirical science
 
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
Makine Öğrenmesi, Yapay Zeka ve Veri Bilimi Süreçlerinin Otomatikleştirilmesi...
 
Deep Learning Explained
Deep Learning ExplainedDeep Learning Explained
Deep Learning Explained
 
Deep learning presentation
Deep learning presentationDeep learning presentation
Deep learning presentation
 
Intro deep learning
Intro deep learningIntro deep learning
Intro deep learning
 
An introduction to Deep Learning
An introduction to Deep LearningAn introduction to Deep Learning
An introduction to Deep Learning
 
Machine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin UniversityMachine Reasoning at A2I2, Deakin University
Machine Reasoning at A2I2, Deakin University
 
Deep learning
Deep learningDeep learning
Deep learning
 
Deep Learning 2.0
Deep Learning 2.0Deep Learning 2.0
Deep Learning 2.0
 
Demystifying Ml, DL and AI
Demystifying Ml, DL and AIDemystifying Ml, DL and AI
Demystifying Ml, DL and AI
 
What is Deep Learning?
What is Deep Learning?What is Deep Learning?
What is Deep Learning?
 

Destaque (6)

world politics our team
world politics our teamworld politics our team
world politics our team
 
How to do better experiments in SE
How to do better experiments in SEHow to do better experiments in SE
How to do better experiments in SE
 
Learning Analytics in a Teachers' Social Network
Learning Analytics in a Teachers' Social NetworkLearning Analytics in a Teachers' Social Network
Learning Analytics in a Teachers' Social Network
 
A Short Swim through the Personal Learning Pool
A Short Swim through the Personal Learning PoolA Short Swim through the Personal Learning Pool
A Short Swim through the Personal Learning Pool
 
Think Health Presentation
Think Health PresentationThink Health Presentation
Think Health Presentation
 
Learning Analytics in a Mobile World - A Community Information Systems Perspe...
Learning Analytics in a Mobile World - A Community Information Systems Perspe...Learning Analytics in a Mobile World - A Community Information Systems Perspe...
Learning Analytics in a Mobile World - A Community Information Systems Perspe...
 

Semelhante a Ml pluss ejan2013

Dm sei-tutorial-v7
Dm sei-tutorial-v7Dm sei-tutorial-v7
Dm sei-tutorial-v7
CS, NcState
 
Jana Diesner, "Words and Networks: Considering the Content of Text Data for N...
Jana Diesner, "Words and Networks: Considering the Content of Text Data for N...Jana Diesner, "Words and Networks: Considering the Content of Text Data for N...
Jana Diesner, "Words and Networks: Considering the Content of Text Data for N...
summersocialwebshop
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
Alexandru Iosup
 
Dagstuhl14 intro-v1
Dagstuhl14 intro-v1Dagstuhl14 intro-v1
Dagstuhl14 intro-v1
CS, NcState
 
Icse 2013-tutorial-data-science-for-software-engineering
Icse 2013-tutorial-data-science-for-software-engineeringIcse 2013-tutorial-data-science-for-software-engineering
Icse 2013-tutorial-data-science-for-software-engineering
CS, NcState
 
SP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with GephiSP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with Gephi
John Breslin
 
Machine Learning of Natural Language
Machine Learning of Natural LanguageMachine Learning of Natural Language
Machine Learning of Natural Language
butest
 

Semelhante a Ml pluss ejan2013 (20)

Dm sei-tutorial-v7
Dm sei-tutorial-v7Dm sei-tutorial-v7
Dm sei-tutorial-v7
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Promise notes
Promise notesPromise notes
Promise notes
 
The Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software DataThe Art and Science of Analyzing Software Data
The Art and Science of Analyzing Software Data
 
S.P.A.C.E. Exploration for Software Engineering
 S.P.A.C.E. Exploration for Software Engineering S.P.A.C.E. Exploration for Software Engineering
S.P.A.C.E. Exploration for Software Engineering
 
Jana Diesner, "Words and Networks: Considering the Content of Text Data for N...
Jana Diesner, "Words and Networks: Considering the Content of Text Data for N...Jana Diesner, "Words and Networks: Considering the Content of Text Data for N...
Jana Diesner, "Words and Networks: Considering the Content of Text Data for N...
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
 
Icse15 Tech-briefing Data Science
Icse15 Tech-briefing Data ScienceIcse15 Tech-briefing Data Science
Icse15 Tech-briefing Data Science
 
Dagstuhl14 intro-v1
Dagstuhl14 intro-v1Dagstuhl14 intro-v1
Dagstuhl14 intro-v1
 
Icse 2013-tutorial-data-science-for-software-engineering
Icse 2013-tutorial-data-science-for-software-engineeringIcse 2013-tutorial-data-science-for-software-engineering
Icse 2013-tutorial-data-science-for-software-engineering
 
A Pragmatic Perspective on Software Visualization
A Pragmatic Perspective on Software VisualizationA Pragmatic Perspective on Software Visualization
A Pragmatic Perspective on Software Visualization
 
OAI7 Research Objects
OAI7 Research ObjectsOAI7 Research Objects
OAI7 Research Objects
 
Introduction to the FP7 CODE project @ BDBC
Introduction to the FP7 CODE project @ BDBCIntroduction to the FP7 CODE project @ BDBC
Introduction to the FP7 CODE project @ BDBC
 
SP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with GephiSP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with Gephi
 
Machine Learning of Natural Language
Machine Learning of Natural LanguageMachine Learning of Natural Language
Machine Learning of Natural Language
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data Science
 
Distributed data mining
Distributed data miningDistributed data mining
Distributed data mining
 
Spark
SparkSpark
Spark
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
Text Mining : Experience
Text Mining : ExperienceText Mining : Experience
Text Mining : Experience
 

Mais de CS, NcState

Lexisnexis june9
Lexisnexis june9Lexisnexis june9
Lexisnexis june9
CS, NcState
 
Ai4se lab template
Ai4se lab templateAi4se lab template
Ai4se lab template
CS, NcState
 
Automated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUAutomated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSU
CS, NcState
 
In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?
CS, NcState
 

Mais de CS, NcState (20)

Talks2015 novdec
Talks2015 novdecTalks2015 novdec
Talks2015 novdec
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
GALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringGALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software Engineering
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest link
 
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
 
Lexisnexis june9
Lexisnexis june9Lexisnexis june9
Lexisnexis june9
 
Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).
 
Kits to Find the Bits that Fits
Kits to Find  the Bits that Fits Kits to Find  the Bits that Fits
Kits to Find the Bits that Fits
 
Ai4se lab template
Ai4se lab templateAi4se lab template
Ai4se lab template
 
Automated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUAutomated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSU
 
Requirements Engineering
Requirements EngineeringRequirements Engineering
Requirements Engineering
 
172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia
 
Automated Software Engineering
Automated Software EngineeringAutomated Software Engineering
Automated Software Engineering
 
Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data Science
 
Goldrush
GoldrushGoldrush
Goldrush
 
Know thy tools
Know thy toolsKnow thy tools
Know thy tools
 
What Metrics Matter?
What Metrics Matter? What Metrics Matter?
What Metrics Matter?
 
In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?
 
Sayyad slides ase13_v4
Sayyad slides ase13_v4Sayyad slides ase13_v4
Sayyad slides ase13_v4
 

Ml pluss ejan2013

  • 1. 1 Over the Horizon: ML+SBSE = What? tim@menzies.us wvu 30 - 31 January 2013 The 24th CREST Open Workshop, UCL, London, UK Machine Learning and Search Based Software Engineering (ML & SBSE)
  • 2. 2 Data miners can find signals in SE artifacts • apps store data • recommender systems • emails human networks • process data  project effort • process models  project changes • bug databases  defect prediction • execution traces  normal usage patterns • operating system logs  software power consumption • natural language requirements  connections between program components • Etc So what’s next?
  • 3. 3 Better algorithms ≠ better mining (yet ...) • Dejaeger, K.; Verbeke, W.; • Hall, T.; Beecham, S.; Bowes, D.; Martens, D.; Baesens, B.; , "Data Gray, D.; Counsell, S.; , "A Systematic Mining Techniques for Software Review of Fault Prediction Performance Effort Estimation: A Comparative in Software Engineering," Software Study," Software Engineering, IEEE Engineering, IEEE Transactions, doi: Transactions, doi: 10.1109/TSE.2011.103 10.1109/TSE.2011 – Support Vector Machine (SVM) perform less well. – Simple, understandable – Models based on C4.5 seem to techniques like Ordinary least under-perform if they use squares regressions with log imbalanced data. transformation of attributes – Models performing comparatively and target perform as well as well are relatively simple (or better than) nonlinear techniques that are easy to use and techniques. well understood.. E.g. Naïve Bayes and Logistic regression
  • 4. 4 What matters: sharing Tutorials : ICSE’13 Workshops : ICSE’13 • Data Science for SE • DAPSE’13: • How to share data and models • If you can’t share data, or models…. • At least, share our analysis methods • Data analysis patterns in SE • SE in the Age of Data Privacy • RAISE’13: • If you do want to share data ... • … how to privatize it • realizing AI synergies in SE • State of the art (archival) • Over the horizon (short, non-archival)
  • 5. 5 What else matters • Tools, availability – Simpler, the better • Not algorithms, but users: – CS discounts “user effects” – The notion of ‘user’ cannot be precisely defined and therefore has no place in CS and SE -- Edsger Dijkstra, ICSE’4, 1979 • Not predictive power – Need “insight”and “engagement”
  • 6. No one thing matters SE = locality effects 6 Microsoft research, Other studios, Redmond, Building 99 many other projects
  • 7. 7 Localism: Not general models
  • 8. 8 Goal: general methods for building local models Local models: • very simple, • very different to each other
  • 9. “Discussion Mining” : 9 guiding the walk across the space of local models • Assumption #1: – Principle of rationality *Popper, Newell+ ; “If an agent has knowledge that one of its actions will lead to one of its goals, then the agent will select that action.” • Assumption #2: – Agents walk clusters of data to make decisions – Typology of the data = space of possible decisions
  • 10. 10 1. Landscape mining: find local regions &score them 2. Decision mining: find intra-region deltas 3. Discussion mining: help a team walk the regions A formal model for A successful “Bird” “engagement” session: • Knowledge engineers enter with sample data • Users take over the spreadsheet • Christian Bird, knowledge • Run many ad hoc queries engineering, • In such meetings, users often… – Microsoft Research, Redmond • demolish the model – Assesses learners not by • offer more data correlation, accuracy, recall, e • demand you come back tc next week with something – But by “engagement” better
  • 11. 11 Over the Horizon: ML+SBSE = Discussion mining yesterday today 0. algorithm 1. landscape 2. decision 3. discussion mining mining mining mining tomorrow future • A1: because all the primitives for the above are Q: why call it in the data mining literature mining? • So we know how to get from here to there • A2: because data mining scales Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
  • 12. 12 Towards a science of localism 1. Data – that a community can share 2. A spark – a (possibly) repeatable effect that questions accepted thinking • E.g. Rutherford scattering 3. Maths – E.g. a data structure 4. Synthesis: – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial partners, grad students
  • 13. 13 Towards a science of localism 1. Data – that a community can share 2. A spark – a (possibly) repeatable effect that questions accepted thinking • E.g. Rutherford scattering 3. Maths – E.g. a data structure 4. Synthesis: – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics STOP PRESS: ISBSG 7. Funding, industrial samplers now in partners, grad students PROMSE
  • 14. 14 Towards a science of localism 1. Data – that a community can share 2. A spark – a (possibly) repeatable effect that questions accepted thinking • E.g. Rutherford scattering 3. Maths – E.g. a data structure 4. Synthesis: – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial partners, grad students
  • 15. 15 Towards a science of localism 1. Data – that a community can share 2. A spark – a (possibly) repeatable effect that questions accepted thinking • E.g. Rutherford scattering 3. Maths – E.g. a data structure 4. Synthesis: – N olde things are really 1 thing 5. Baselines Christian Bird: – Tools a community can access • The engagement pattern. – Results that scream for • users don’t want correct extension, improvement models 6. Big data: • They want to correct their – scalability, stochastics own models 7. Funding, industrial partners, grad students
  • 16. 16 Towards a science of localism 1. Data F = features = Observable+ | Controllable+ – that a community can share O = objectives = O1, O2, O3,.. = score(Features) 2. A spark – a (possibly) repeatable effect Eg = example = <F,0> that questions accepted thinking • E.g. the photo-electric effect C = bicluster of examples, clustered by F and O 3. Maths • Each cell has N examples – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 17. 17 Towards a science of localism 1. Data Menzies & Shepperd, EMSE 2012 – that a community can share • Special issue on conclusion instability • Of course solutions are not stable- they specific to 2. A spark each cell in the bicluster. – a (possibly) repeatable effect that questions accepted thinking • E.g. the photo-electric effect 3. Maths – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 18. 18 Towards a science of localism 1. Data Applications to MOEA – that a community can share Krall & Menzies (in progress) : 2. A spark • Cluster on objectives using recursive Fastmap – a (possibly) repeatable effect • At each level, check dominance on N items from that questions accepted thinking • E.g. the photo-electric effect left& right branch • Don’t recurse on dominated branches 3. Maths • Selects examples in clusters on Pareto frontier – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 19. 19 20 to 50 times fewer objective evaluations • Find a large variance dimension in O(2N) comparisons – W= any point; – West= furthest of W; – East= furthest of West – c = dist(X,Y) • Each example X: – a = dist(X ,West); – b = dist(X, East) – Falls x=(a2+c2-b2)/2c – And at y =sqrt( x2 – a2) • If X or Y dominates, don’t recurse on the other – Finds clusters on the Pareto frontier in linear time • Split on median points, & recurse – Stop when leaf has less than, say, 30 items – These are parents of next generation
  • 20. 20 Towards a science of localism 1. Data Transfer learning – that a community can share • Domain = <examples, distribution> • One data set may have many domains 2. A spark • i.e. . multiple clusters of features & objectives – a (possibly) repeatable effect that questions accepted thinking • Kocaguneli, & Menzies EMSE’11 • E.g. the photo-electric effect • TEAK select best cluster for cross-company learning • Cross company== within company learning 3. Maths – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 21. 21 Towards a science of localism 1. Data Kocaguneli & Menzies et al, TSE’12 (March) – that a community can share • TEAK : recursively cluster on features • Kill clusters with large objective variance 2. A spark • Recursively cluster the survivors – a (possibly) repeatable effect • kNN, select k via sub-tree inspection that questions accepted thinking • E.g. the photo-electric effect • Generated only a few clusters per data set 3. Maths – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 22. 22 Towards a science of localism 1. Data Active learning – that a community can share Kocaguneli & Menzies et al, TSE’13 (pre-print) 2. A spark • Grow clusters via reverse nearest neighbor counts – a (possibly) repeatable effect • Stop at N+3 if no improvement over N examples that questions accepted thinking • E.g. the photo-electric effect • Evaluated via k=1 NN on a holdout set • Finds the most informative next question 3. Maths – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 23. 23 Towards a science of localism 1. Data Peters & Menzies & Zhang et al., TSE’13 (pre-print) – that a community can share • Find divisions of data that separate classes • To effectively privatize data… 2. A spark • Find ranges that drive you to different classes – a (possibly) repeatable effect that questions accepted thinking • Remove examples without those ranges • E.g. the photo-electric effect • Mutate survivors, do not cross boundaries 3. Maths – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 24. 24 Towards a science of localism 1. Data Menzies & Butcher & Marcus et al., TSE’13 (pre-print) – that a community can share • Cluster on features using recursive Fastmap • Combine sibling leaves if their density not too low 2. A spark • Train from cluster you most “envy”; test on you – a (possibly) repeatable effect • Envy-based models out-perform global models that questions accepted thinking • E.g. the photo-electric effect 3. Maths – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 25. 25 Towards a science of localism 1. Data Baseline results: See above. Got a better clusterer? – that a community can share Source: http://unbox.org/things/var/timm/13/sbs/rrsl.py Tutorial: http://unbox.org/things/var/timm/13/sbs/rrsl.pdf 2. A spark – a (possibly) repeatable effect Open issues: that questions accepted thinking • Can we track engagement. • E.g. the photo-electric effect • The acid test for structured reviews. 3. Maths • Is data mining and MOEA really different. – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 26. 26 Towards a science of localism 1. Data Stochastic recursive Fastmap: O(N.logN) – that a community can share • Can’t explore all data? • Just use a random sample 2. A spark – a (possibly) repeatable effect On-line learning: that questions accepted thinking • Anomalies = examples that fall outside leaf clusters • E.g. the photo-electric effect • Re-learn, but just on sub-trees with many anomalies 3. Maths – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 27. 27 Towards a science of localism 1. Data New 4 year project: • WVU + Fraunhofer , Maryland (director = Forrest Shull) – that a community can share 2. A spark Transfer learning on data from – a (possibly) repeatable effect • PROMISE (open source) that questions accepted thinking • Fraunhofer (proprietary data) • E.g. the photo-electric effect Core data structure? See below… 3. Maths – E.g. a data structure 4. Synthesis: Features – N olde things are really 1 thing 5. Baselines – Tools a community can access – Results that scream for extension, improvement 6. Big data: – scalability, stochastics 7. Funding, industrial Objectives partners, grad students
  • 28. 28 Over the Horizon: ML+SBSE = Discussion mining yesterday today 0. algorithm 1. landscape 2. decision 3. discussion mining mining mining mining tomorrow future • A1: because all the primitives for the above are Q: why call it in the data mining literature mining? • So we know how to get from here to there • A2: because data mining scales Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
  • 29. 29 Care to join a new cult? Features Objectives ML+SBSE = exploring clusters of features and objectives