1. Over the Horizon: ML+SBSE = What?
tim@menzies.us, WVU
30-31 January 2013
The 24th CREST Open Workshop, UCL, London, UK
Machine Learning and Search Based Software Engineering (ML & SBSE)
2. Data miners can find signals in SE artifacts
• app store data → recommender systems
• emails → human networks
• process data → project effort
• process models → project changes
• bug databases → defect prediction
• execution traces → normal usage patterns
• operating system logs → software power consumption
• natural language requirements → connections between program components
• etc.
So what's next?
3. Better algorithms ≠ better mining (yet ...)
• Dejaeger, K.; Verbeke, W.; Martens, D.; Baesens, B., "Data Mining Techniques for Software Effort Estimation: A Comparative Study," IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2011
  – Simple, understandable techniques like ordinary least squares regression with log transformation of attributes and target perform as well as (or better than) nonlinear techniques.
• Hall, T.; Beecham, S.; Bowes, D.; Gray, D.; Counsell, S., "A Systematic Review of Fault Prediction Performance in Software Engineering," IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2011.103
  – Support Vector Machines (SVMs) perform less well.
  – Models based on C4.5 seem to under-perform if they use imbalanced data.
  – Models performing comparatively well are relatively simple techniques that are easy to use and well understood, e.g. Naïve Bayes and logistic regression.
4. What matters: sharing
Tutorials: ICSE'13
• Data Science for SE
  – how to share data and models
• SE in the Age of Data Privacy
  – if you do want to share data ...
  – ... how to privatize it
Workshops: ICSE'13
• DAPSE'13:
  – if you can't share data, or models ...
  – at least, share our analysis methods
  – data analysis patterns in SE
• RAISE'13:
  – realizing AI synergies in SE
  – state of the art (archival)
  – over the horizon (short, non-archival)
5. What else matters
• Tools, availability
  – the simpler, the better
• Not algorithms, but users:
  – CS discounts "user effects"
  – "The notion of 'user' cannot be precisely defined and therefore has no place in CS and SE" -- Edsger Dijkstra, ICSE 4, 1979
• Not predictive power
  – need "insight" and "engagement"
6. No one thing matters: SE = locality effects
[Figure: Microsoft Research, Redmond, Building 99 vs. other studios, many other projects]
8. Goal: general methods for building local models
Local models:
• very simple,
• very different to each other
9. "Discussion mining": guiding the walk across the space of local models
• Assumption #1:
  – Principle of rationality [Popper, Newell]: "If an agent has knowledge that one of its actions will lead to one of its goals, then the agent will select that action."
• Assumption #2:
  – Agents walk clusters of data to make decisions
  – Topology of the data = space of possible decisions
10.
1. Landscape mining: find local regions & score them
2. Decision mining: find intra-region deltas
3. Discussion mining: help a team walk the regions

A formal model for "engagement":
• Christian Bird, knowledge engineering, Microsoft Research, Redmond
  – assesses learners not by correlation, accuracy, recall, etc.
  – but by "engagement"
A successful "Bird" session:
• Knowledge engineers enter with sample data
• Users take over the spreadsheet
• Run many ad hoc queries
• In such meetings, users often ...
  – demolish the model
  – offer more data
  – demand you come back next week with something better
11. Over the Horizon: ML+SBSE = Discussion mining
yesterday: 0. algorithm mining
today: 1. landscape mining
tomorrow: 2. decision mining
future: 3. discussion mining

Q: why call it mining?
• A1: because all the primitives for the above are in the data mining literature
  – so we know how to get from here to there
• A2: because data mining scales

Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
12. Towards a science of localism
1. Data
   – that a community can share
2. A spark
   – a (possibly) repeatable effect that questions accepted thinking
   – e.g. Rutherford scattering
3. Maths
   – e.g. a data structure
4. Synthesis
   – N olde things are really 1 thing
5. Baselines
   – tools a community can access
   – results that scream for extension, improvement
6. Big data
   – scalability, stochastics
7. Funding, industrial partners, grad students
13. Towards a science of localism
On point 1, data that a community can share:
• STOP PRESS: ISBSG samplers now in PROMISE
15. Towards a science of localism
Christian Bird: the engagement pattern
• Users don't want correct models
• They want to correct their own models
16. Towards a science of localism
The maths: a core data structure:
• F = features = Observable+ | Controllable+
• O = objectives = O1, O2, O3, ... = score(Features)
• Eg = example = <F, O>
• C = bicluster of examples, clustered by F and O
  – each cell has N examples
[Figure: the grid of Features × Objectives biclusters]
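To make that structure concrete, here is a minimal Python sketch of the <F, O> examples and the bicluster grid. The names (`Example`, `BiCluster`, `bicluster`) and the label-function interface are illustrative assumptions, not anything from the talk.

```python
# A minimal sketch of the core data structure above, under assumed names.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Example:
    features: Dict[str, float]     # F: observables and controllables
    objectives: Tuple[float, ...]  # O = O1, O2, ... = score(features)

@dataclass
class BiCluster:
    """One cell of the Features x Objectives grid."""
    f_label: int                   # which feature cluster this cell is in
    o_label: int                   # which objective cluster this cell is in
    examples: List[Example] = field(default_factory=list)

def bicluster(examples, f_cluster, o_cluster):
    """Group examples into cells, clustering on F and (separately) on O.
    f_cluster and o_cluster are any functions mapping an example to a label."""
    cells: Dict[Tuple[int, int], BiCluster] = {}
    for eg in examples:
        key = (f_cluster(eg), o_cluster(eg))
        cells.setdefault(key, BiCluster(*key)).examples.append(eg)
    return cells
```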
17. Towards a science of localism
Menzies & Shepperd, EMSE 2012
• Special issue on conclusion instability
• Of course solutions are not stable: they are specific to each cell in the bicluster
18. Towards a science of localism
Applications to MOEA
Krall & Menzies (in progress):
• Cluster on objectives using recursive FastMap
• At each level, check dominance on N items from the left & right branches
• Don't recurse on dominated branches
• Selects examples in clusters on the Pareto frontier
19. 20 to 50 times fewer objective evaluations
• Find a large-variance dimension in O(2N) comparisons:
  – W = any point
  – West = point furthest from W
  – East = point furthest from West
  – c = dist(West, East)
• Each example X falls at:
  – a = dist(X, West); b = dist(X, East)
  – x = (a^2 + c^2 - b^2) / (2c)
  – y = sqrt(a^2 - x^2)
• If East or West dominates the other, don't recurse on the dominated half
  – finds clusters on the Pareto frontier in linear time
• Split on the median x, & recurse
  – stop when a leaf has fewer than, say, 30 items
  – these are the parents of the next generation
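The recipe above fits in a few lines of Python. This is a sketch of the idea, not Krall & Menzies' actual code: `features(z)` and `objectives(z)` are assumed accessor functions, every objective is assumed to be minimized, and the per-level dominance check is simplified to comparing just the two poles.

```python
import math, random

def dist(p, q):  # Euclidean distance between two feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dominates(o1, o2):  # binary domination; all objectives minimized
    return all(a <= b for a, b in zip(o1, o2)) and \
           any(a <  b for a, b in zip(o1, o2))

def fastmap_split(items, features):
    """Find two distant poles in O(2N) comparisons, then project
    every item onto the West-East line via the cosine rule."""
    w    = random.choice(items)
    west = max(items, key=lambda z: dist(features(z), features(w)))
    east = max(items, key=lambda z: dist(features(z), features(west)))
    c    = dist(features(west), features(east))
    def x(z):
        a = dist(features(z), features(west))
        b = dist(features(z), features(east))
        return (a**2 + c**2 - b**2) / (2 * c) if c else 0
    items = sorted(items, key=x)
    mid   = len(items) // 2          # split on the median projection
    return west, east, items[:mid], items[mid:]

def pareto_leaves(items, features, objectives, enough=30, out=None):
    """Recurse, skipping any half whose pole is dominated by the other
    pole; the leaves are the parents of the next generation."""
    if out is None:
        out = []
    if len(items) < enough:
        out.append(items)
        return out
    west, east, lefts, rights = fastmap_split(items, features)
    if not dominates(objectives(east), objectives(west)):
        pareto_leaves(lefts, features, objectives, enough, out)   # keep West half
    if not dominates(objectives(west), objectives(east)):
        pareto_leaves(rights, features, objectives, enough, out)  # keep East half
    return out
```

Note that the y coordinate from the slide (sqrt(a^2 - x^2)) is only needed if you also split orthogonally; the median split above uses x alone.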
20. Towards a science of localism
Transfer learning
• Domain = <examples, distribution>
• One data set may have many domains
  – i.e. multiple clusters of features & objectives
• Kocaguneli & Menzies, EMSE'11
  – TEAK selects the best cluster for cross-company learning
  – cross-company learning then performs as well as within-company learning
21. Towards a science of localism
Kocaguneli & Menzies et al., TSE'12 (March)
• TEAK: recursively cluster on features
• Kill clusters with large objective variance
• Recursively cluster the survivors
• kNN; select k via sub-tree inspection
• Generated only a few clusters per data set
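A toy version of TEAK's variance pruning might look like the sketch below, reusing `dist()` and `fastmap_split()` from the earlier sketch. The real TEAK grows its trees by greedy agglomerative clustering over effort data; the keep-a-half-only-if-its-variance-does-not-grow rule here is a simplifying assumption.

```python
import statistics

def variance(items, objective):
    """Population variance of a scalar objective (e.g. effort)."""
    xs = [objective(z) for z in items]
    return statistics.pvariance(xs) if len(xs) > 1 else 0.0

def teak_prune(items, features, objective, enough=4):
    """Recursively cluster on features; keep only low-variance regions.
    Estimates are then made by kNN inside the surviving clusters."""
    if len(items) <= enough:
        return [items]
    _, _, lefts, rights = fastmap_split(items, features)
    keep = []
    for half in (lefts, rights):
        # kill halves whose objective variance exceeds the parent's:
        if variance(half, objective) <= variance(items, objective):
            keep += teak_prune(half, features, objective, enough)
    return keep or [items]   # never prune everything away
```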
22. Towards a science of localism
Active learning
Kocaguneli & Menzies et al., TSE'13 (pre-print)
• Grow clusters via reverse nearest-neighbor counts
• Stop at N+3 if no improvement over N examples
• Evaluated via k=1 NN on a holdout set
• Finds the most informative next question
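A sketch of that reverse-nearest-neighbor ordering and the N+3 stopping rule: `error(labeled)` is an assumed user-supplied callback (e.g. k=1 NN error on a holdout set), and `dist()` is reused from the FastMap sketch above.

```python
def popularity_order(items, features):
    """Sort examples by how often each one is some other example's
    nearest neighbor (its reverse-nearest-neighbor count)."""
    counts = {id(z): 0 for z in items}
    for z in items:
        nn = min((o for o in items if o is not z),
                 key=lambda o: dist(features(o), features(z)))
        counts[id(nn)] += 1
    return sorted(items, key=lambda z: -counts[id(z)])

def active_learn(items, features, error, patience=3):
    """Label examples in popularity order; stop once `patience` extra
    labels (the slide's 'N+3') bring no improvement."""
    labeled, best, stale = [], float("inf"), 0
    for z in popularity_order(items, features):
        labeled.append(z)                 # the most informative next question
        e = error(labeled)
        best, stale = (e, 0) if e < best else (best, stale + 1)
        if stale >= patience:
            break
    return labeled
```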
23. Towards a science of localism
Peters & Menzies & Zhang et al., TSE'13 (pre-print)
• Find divisions of the data that separate the classes
• To effectively privatize data ...
  – find the ranges that drive you to different classes
  – remove examples without those ranges
  – mutate the survivors; do not cross the boundaries
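The mutate-without-crossing-boundaries step might look like this sketch, in the spirit of that paper's mutation operator; the nearest-unlike-neighbor rule and the step bound `gamma` are assumptions here, and `dist()` is reused from above.

```python
import random

def privatize(survivors, features, label, gamma=0.2):
    """Nudge each survivor toward its nearest unlike neighbor by a
    random fraction below gamma. Keeping gamma well under 0.5 means a
    mutant never reaches the midpoint between the two classes, so it
    stays on its original side of that boundary."""
    private = []
    for z in survivors:
        unlike = [o for o in survivors if label(o) != label(z)]
        if not unlike:                 # no boundary to respect
            private.append(list(features(z)))
            continue
        nun = min(unlike, key=lambda o: dist(features(o), features(z)))
        r = random.uniform(0, gamma)   # step strictly shorter than the gap
        private.append([a + r * (b - a)
                        for a, b in zip(features(z), features(nun))])
    return private
```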
24. Towards a science of localism
Menzies & Butcher & Marcus et al., TSE'13 (pre-print)
• Cluster on features using recursive FastMap
• Combine sibling leaves if their density is not too low
• Train from the cluster you most "envy"; test on you
• Envy-based models out-perform global models
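A sketch of the "envy" step, under two assumptions not in the talk: a scalar `score()` where lower is better, and centroid distance as the notion of the "nearest" cluster. `dist()` comes from the FastMap sketch.

```python
import statistics

def centroid(leaf, features):
    """Component-wise mean of a leaf's feature vectors."""
    cols = zip(*(features(z) for z in leaf))
    return [sum(col) / len(leaf) for col in cols]

def envied(leaf, leaves, features, score):
    """Return the nearest other leaf whose median score beats this
    leaf's median; train a model there, then test it back on `leaf`."""
    mine   = statistics.median(score(z) for z in leaf)
    better = [l for l in leaves if l is not leaf and
              statistics.median(score(z) for z in l) < mine]
    if not better:
        return leaf            # nothing to envy: train on yourself
    here = centroid(leaf, features)
    return min(better, key=lambda l: dist(centroid(l, features), here))
```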
25. Towards a science of localism
Baseline results: see above. Got a better clusterer?
• Source: http://unbox.org/things/var/timm/13/sbs/rrsl.py
• Tutorial: http://unbox.org/things/var/timm/13/sbs/rrsl.pdf
Open issues:
• Can we track engagement?
• The acid test for structured reviews.
• Are data mining and MOEA really different?
26. Towards a science of localism
Stochastic recursive FastMap: O(N log N)
• Can't explore all the data? Just use a random sample
On-line learning:
• Anomalies = examples that fall outside the leaf clusters
• Re-learn, but just on the sub-trees with many anomalies
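Two sketches of those scale-up tricks, reusing `dist()` and `centroid()` from the earlier sketches; the sample size and the centroid-radius anomaly test are assumptions, not the talk's definitions.

```python
import random

def stochastic_poles(items, features, sample=64):
    """Pick the FastMap poles from a random sample, so each split costs
    O(sample) distance calls instead of scanning all N items twice."""
    cand = random.sample(items, min(sample, len(items)))
    w    = random.choice(cand)
    west = max(cand, key=lambda z: dist(features(z), features(w)))
    east = max(cand, key=lambda z: dist(features(z), features(west)))
    return west, east

def is_anomaly(z, leaves, features):
    """z is an anomaly if it falls outside every leaf: farther from each
    leaf's centroid than that leaf's own radius. Sub-trees that collect
    many anomalies are the ones to re-learn."""
    for leaf in leaves:
        mid    = centroid(leaf, features)
        radius = max(dist(features(o), mid) for o in leaf)
        if dist(features(z), mid) <= radius:
            return False
    return True
```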
27. Towards a science of localism
New 4-year project: WVU + Fraunhofer, Maryland (director: Forrest Shull)
Transfer learning on data from:
• PROMISE (open source)
• Fraunhofer (proprietary data)
Core data structure? The Features × Objectives bicluster, as above
28. Over the Horizon: ML+SBSE = Discussion mining
yesterday: 0. algorithm mining
today: 1. landscape mining
tomorrow: 2. decision mining
future: 3. discussion mining

Q: why call it mining?
• A1: because all the primitives for the above are in the data mining literature
  – so we know how to get from here to there
• A2: because data mining scales

Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
29. Care to join a new cult?
[Figure: the Features × Objectives bicluster grid]
ML+SBSE = exploring clusters of features and objectives