1. Over the Horizon: ML+SBSE = What?
tim@menzies.us, WVU
30-31 January 2013
The 24th CREST Open Workshop, UCL, London, UK
Machine Learning and Search Based Software Engineering (ML & SBSE)
2. Data miners can find signals in SE artifacts
• app store data → recommender systems
• emails → human networks
• process data → project effort
• process models → project changes
• bug databases → defect prediction
• execution traces → normal usage patterns
• operating system logs → software power consumption
• natural language requirements → connections between program components
• etc.
So what's next?
3. Better algorithms ≠ better mining (yet ...)
• Dejaeger, K.; Verbeke, W.; Martens, D.; Baesens, B., "Data Mining Techniques for Software Effort Estimation: A Comparative Study," IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2011
  – Simple, understandable techniques like ordinary least squares regression with log transformation of attributes and target perform as well as (or better than) nonlinear techniques.
• Hall, T.; Beecham, S.; Bowes, D.; Gray, D.; Counsell, S., "A Systematic Review of Fault Prediction Performance in Software Engineering," IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2011.103
  – Support Vector Machines (SVMs) perform less well.
  – Models based on C4.5 seem to under-perform if they use imbalanced data.
  – Models performing comparatively well are relatively simple techniques that are easy to use and well understood, e.g. Naïve Bayes and logistic regression.
4. What matters: sharing
Tutorials: ICSE'13
• Data Science for SE
  – how to share data and models
• SE in the Age of Data Privacy
  – if you do want to share data ...
  – ... how to privatize it
Workshops: ICSE'13
• DAPSE'13:
  – if you can't share data, or models ...
  – at least, share our analysis methods
  – data analysis patterns in SE
• RAISE'13:
  – realizing AI synergies in SE
  – state of the art (archival)
  – over the horizon (short, non-archival)
5. What else matters
• Tools, availability
  – the simpler, the better
• Not algorithms, but users:
  – CS discounts "user effects"
  – "The notion of 'user' cannot be precisely defined and therefore has no place in CS and SE" -- Edsger Dijkstra, ICSE 4, 1979
• Not predictive power
  – need "insight" and "engagement"
6. No one thing matters: SE = locality effects
[Figure: Microsoft Research, Redmond, Building 99 vs. other studios, many other projects]
8. Goal: general methods for building local models
Local models:
• very simple,
• very different to each other
9. "Discussion mining": guiding the walk across the space of local models
• Assumption #1:
  – Principle of rationality [Popper, Newell]: "If an agent has knowledge that one of its actions will lead to one of its goals, then the agent will select that action."
• Assumption #2:
  – Agents walk clusters of data to make decisions
  – Topology of the data = space of possible decisions
10.
1. Landscape mining: find local regions & score them
2. Decision mining: find intra-region deltas
3. Discussion mining: help a team walk the regions

A formal model for "engagement":
• Christian Bird, knowledge engineering, Microsoft Research, Redmond
  – assesses learners not by correlation, accuracy, recall, etc.
  – but by "engagement"
A successful "Bird" session:
• Knowledge engineers enter with sample data
• Users take over the spreadsheet
• Run many ad hoc queries
• In such meetings, users often ...
  – demolish the model
  – offer more data
  – demand you come back next week with something better
11. Over the Horizon: ML+SBSE = Discussion mining
yesterday: 0. algorithm mining
today: 1. landscape mining
tomorrow: 2. decision mining
future: 3. discussion mining

Q: why call it mining?
• A1: because all the primitives for the above are in the data mining literature
  – so we know how to get from here to there
• A2: because data mining scales

Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
12. Towards a science of localism
1. Data
   – that a community can share
2. A spark
   – a (possibly) repeatable effect that questions accepted thinking
   – e.g. Rutherford scattering
3. Maths
   – e.g. a data structure
4. Synthesis
   – N olde things are really 1 thing
5. Baselines
   – tools a community can access
   – results that scream for extension, improvement
6. Big data
   – scalability, stochastics
7. Funding, industrial partners, grad students
13. Towards a science of localism
On point 1, data that a community can share:
• STOP PRESS: ISBSG samplers now in PROMISE
15. Towards a science of localism
Christian Bird: the engagement pattern
• Users don't want correct models
• They want to correct their own models
16. Towards a science of localism
The maths: a core data structure:
• F = features = Observable+ | Controllable+
• O = objectives = O1, O2, O3, ... = score(Features)
• Eg = example = <F, O>
• C = bicluster of examples, clustered by F and O
  – each cell has N examples
[Figure: the grid of Features × Objectives biclusters]
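To make that structure concrete, here is a minimal Python sketch of the <F, O> examples and the bicluster grid. The names (`Example`, `BiCluster`, `bicluster`) and the label-function interface are illustrative assumptions, not anything from the talk.

```python
# A minimal sketch of the core data structure above, under assumed names.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Example:
    features: Dict[str, float]     # F: observables and controllables
    objectives: Tuple[float, ...]  # O = O1, O2, ... = score(features)

@dataclass
class BiCluster:
    """One cell of the Features x Objectives grid."""
    f_label: int                   # which feature cluster this cell is in
    o_label: int                   # which objective cluster this cell is in
    examples: List[Example] = field(default_factory=list)

def bicluster(examples, f_cluster, o_cluster):
    """Group examples into cells, clustering on F and (separately) on O.
    f_cluster and o_cluster are any functions mapping an example to a label."""
    cells: Dict[Tuple[int, int], BiCluster] = {}
    for eg in examples:
        key = (f_cluster(eg), o_cluster(eg))
        cells.setdefault(key, BiCluster(*key)).examples.append(eg)
    return cells
```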
17. Towards a science of localism
Menzies & Shepperd, EMSE 2012
• Special issue on conclusion instability
• Of course solutions are not stable: they are specific to each cell in the bicluster
18. Towards a science of localism
Applications to MOEA
Krall & Menzies (in progress):
• Cluster on objectives using recursive FastMap
• At each level, check dominance on N items from the left & right branches
• Don't recurse on dominated branches
• Selects examples in clusters on the Pareto frontier
19. 20 to 50 times fewer objective evaluations
• Find a large-variance dimension in O(2N) comparisons:
  – W = any point
  – West = point furthest from W
  – East = point furthest from West
  – c = dist(West, East)
• Each example X falls at:
  – a = dist(X, West); b = dist(X, East)
  – x = (a^2 + c^2 - b^2) / (2c)
  – y = sqrt(a^2 - x^2)
• If East or West dominates the other, don't recurse on the dominated half
  – finds clusters on the Pareto frontier in linear time
• Split on the median x, & recurse
  – stop when a leaf has fewer than, say, 30 items
  – these are the parents of the next generation
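The recipe above fits in a few lines of Python. This is a sketch of the idea, not Krall & Menzies' actual code: `features(z)` and `objectives(z)` are assumed accessor functions, every objective is assumed to be minimized, and the per-level dominance check is simplified to comparing just the two poles.

```python
import math, random

def dist(p, q):  # Euclidean distance between two feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dominates(o1, o2):  # binary domination; all objectives minimized
    return all(a <= b for a, b in zip(o1, o2)) and \
           any(a <  b for a, b in zip(o1, o2))

def fastmap_split(items, features):
    """Find two distant poles in O(2N) comparisons, then project
    every item onto the West-East line via the cosine rule."""
    w    = random.choice(items)
    west = max(items, key=lambda z: dist(features(z), features(w)))
    east = max(items, key=lambda z: dist(features(z), features(west)))
    c    = dist(features(west), features(east))
    def x(z):
        a = dist(features(z), features(west))
        b = dist(features(z), features(east))
        return (a**2 + c**2 - b**2) / (2 * c) if c else 0
    items = sorted(items, key=x)
    mid   = len(items) // 2          # split on the median projection
    return west, east, items[:mid], items[mid:]

def pareto_leaves(items, features, objectives, enough=30, out=None):
    """Recurse, skipping any half whose pole is dominated by the other
    pole; the leaves are the parents of the next generation."""
    if out is None:
        out = []
    if len(items) < enough:
        out.append(items)
        return out
    west, east, lefts, rights = fastmap_split(items, features)
    if not dominates(objectives(east), objectives(west)):
        pareto_leaves(lefts, features, objectives, enough, out)   # keep West half
    if not dominates(objectives(west), objectives(east)):
        pareto_leaves(rights, features, objectives, enough, out)  # keep East half
    return out
```

Note that the y coordinate from the slide (sqrt(a^2 - x^2)) is only needed if you also split orthogonally; the median split above uses x alone.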
20. Towards a science of localism
Transfer learning
• Domain = <examples, distribution>
• One data set may have many domains
  – i.e. multiple clusters of features & objectives
• Kocaguneli & Menzies, EMSE'11
  – TEAK selects the best cluster for cross-company learning
  – cross-company learning then performs as well as within-company learning
21. Towards a science of localism
Kocaguneli & Menzies et al., TSE'12 (March)
• TEAK: recursively cluster on features
• Kill clusters with large objective variance
• Recursively cluster the survivors
• kNN; select k via sub-tree inspection
• Generated only a few clusters per data set
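A toy version of TEAK's variance pruning might look like the sketch below, reusing `dist()` and `fastmap_split()` from the earlier sketch. The real TEAK grows its trees by greedy agglomerative clustering over effort data; the keep-a-half-only-if-its-variance-does-not-grow rule here is a simplifying assumption.

```python
import statistics

def variance(items, objective):
    """Population variance of a scalar objective (e.g. effort)."""
    xs = [objective(z) for z in items]
    return statistics.pvariance(xs) if len(xs) > 1 else 0.0

def teak_prune(items, features, objective, enough=4):
    """Recursively cluster on features; keep only low-variance regions.
    Estimates are then made by kNN inside the surviving clusters."""
    if len(items) <= enough:
        return [items]
    _, _, lefts, rights = fastmap_split(items, features)
    keep = []
    for half in (lefts, rights):
        # kill halves whose objective variance exceeds the parent's:
        if variance(half, objective) <= variance(items, objective):
            keep += teak_prune(half, features, objective, enough)
    return keep or [items]   # never prune everything away
```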
22. Towards a science of localism
Active learning
Kocaguneli & Menzies et al., TSE'13 (pre-print)
• Grow clusters via reverse nearest-neighbor counts
• Stop at N+3 if no improvement over N examples
• Evaluated via k=1 NN on a holdout set
• Finds the most informative next question
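A sketch of that reverse-nearest-neighbor ordering and the N+3 stopping rule: `error(labeled)` is an assumed user-supplied callback (e.g. k=1 NN error on a holdout set), and `dist()` is reused from the FastMap sketch above.

```python
def popularity_order(items, features):
    """Sort examples by how often each one is some other example's
    nearest neighbor (its reverse-nearest-neighbor count)."""
    counts = {id(z): 0 for z in items}
    for z in items:
        nn = min((o for o in items if o is not z),
                 key=lambda o: dist(features(o), features(z)))
        counts[id(nn)] += 1
    return sorted(items, key=lambda z: -counts[id(z)])

def active_learn(items, features, error, patience=3):
    """Label examples in popularity order; stop once `patience` extra
    labels (the slide's 'N+3') bring no improvement."""
    labeled, best, stale = [], float("inf"), 0
    for z in popularity_order(items, features):
        labeled.append(z)                 # the most informative next question
        e = error(labeled)
        best, stale = (e, 0) if e < best else (best, stale + 1)
        if stale >= patience:
            break
    return labeled
```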
23. Towards a science of localism
Peters & Menzies & Zhang et al., TSE'13 (pre-print)
• Find divisions of the data that separate the classes
• To effectively privatize data ...
  – find the ranges that drive you to different classes
  – remove examples without those ranges
  – mutate the survivors; do not cross the boundaries
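The mutate-without-crossing-boundaries step might look like this sketch, in the spirit of that paper's mutation operator; the nearest-unlike-neighbor rule and the step bound `gamma` are assumptions here, and `dist()` is reused from above.

```python
import random

def privatize(survivors, features, label, gamma=0.2):
    """Nudge each survivor toward its nearest unlike neighbor by a
    random fraction below gamma. Keeping gamma well under 0.5 means a
    mutant never reaches the midpoint between the two classes, so it
    stays on its original side of that boundary."""
    private = []
    for z in survivors:
        unlike = [o for o in survivors if label(o) != label(z)]
        if not unlike:                 # no boundary to respect
            private.append(list(features(z)))
            continue
        nun = min(unlike, key=lambda o: dist(features(o), features(z)))
        r = random.uniform(0, gamma)   # step strictly shorter than the gap
        private.append([a + r * (b - a)
                        for a, b in zip(features(z), features(nun))])
    return private
```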
24. Towards a science of localism
Menzies & Butcher & Marcus et al., TSE'13 (pre-print)
• Cluster on features using recursive FastMap
• Combine sibling leaves if their density is not too low
• Train from the cluster you most "envy"; test on you
• Envy-based models out-perform global models
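A sketch of the "envy" step, under two assumptions not in the talk: a scalar `score()` where lower is better, and centroid distance as the notion of the "nearest" cluster. `dist()` comes from the FastMap sketch.

```python
import statistics

def centroid(leaf, features):
    """Component-wise mean of a leaf's feature vectors."""
    cols = zip(*(features(z) for z in leaf))
    return [sum(col) / len(leaf) for col in cols]

def envied(leaf, leaves, features, score):
    """Return the nearest other leaf whose median score beats this
    leaf's median; train a model there, then test it back on `leaf`."""
    mine   = statistics.median(score(z) for z in leaf)
    better = [l for l in leaves if l is not leaf and
              statistics.median(score(z) for z in l) < mine]
    if not better:
        return leaf            # nothing to envy: train on yourself
    here = centroid(leaf, features)
    return min(better, key=lambda l: dist(centroid(l, features), here))
```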
25. Towards a science of localism
Baseline results: see above. Got a better clusterer?
• Source: http://unbox.org/things/var/timm/13/sbs/rrsl.py
• Tutorial: http://unbox.org/things/var/timm/13/sbs/rrsl.pdf
Open issues:
• Can we track engagement?
• The acid test for structured reviews.
• Are data mining and MOEA really different?
26. Towards a science of localism
Stochastic recursive FastMap: O(N log N)
• Can't explore all the data? Just use a random sample
On-line learning:
• Anomalies = examples that fall outside the leaf clusters
• Re-learn, but just on the sub-trees with many anomalies
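Two sketches of those scale-up tricks, reusing `dist()` and `centroid()` from the earlier sketches; the sample size and the centroid-radius anomaly test are assumptions, not the talk's definitions.

```python
import random

def stochastic_poles(items, features, sample=64):
    """Pick the FastMap poles from a random sample, so each split costs
    O(sample) distance calls instead of scanning all N items twice."""
    cand = random.sample(items, min(sample, len(items)))
    w    = random.choice(cand)
    west = max(cand, key=lambda z: dist(features(z), features(w)))
    east = max(cand, key=lambda z: dist(features(z), features(west)))
    return west, east

def is_anomaly(z, leaves, features):
    """z is an anomaly if it falls outside every leaf: farther from each
    leaf's centroid than that leaf's own radius. Sub-trees that collect
    many anomalies are the ones to re-learn."""
    for leaf in leaves:
        mid    = centroid(leaf, features)
        radius = max(dist(features(o), mid) for o in leaf)
        if dist(features(z), mid) <= radius:
            return False
    return True
```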
27. Towards a science of localism
New 4-year project: WVU + Fraunhofer, Maryland (director: Forrest Shull)
Transfer learning on data from:
• PROMISE (open source)
• Fraunhofer (proprietary data)
Core data structure? The Features × Objectives bicluster, as above
28. Over the Horizon: ML+SBSE = Discussion mining
yesterday: 0. algorithm mining
today: 1. landscape mining
tomorrow: 2. decision mining
future: 3. discussion mining

Q: why call it mining?
• A1: because all the primitives for the above are in the data mining literature
  – so we know how to get from here to there
• A2: because data mining scales

Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
29. Care to join a new cult?
[Figure: the Features × Objectives bicluster grid]
ML+SBSE = exploring clusters of features and objectives