ICSE’13 Tutorial: Data Science for
Software Engineering
Tim Menzies, West Virginia University
Ekrem Kocaguneli, West Virginia University
Fayola Peters, West Virginia University
Burak Turhan, University of Oulu
Leandro L. Minku, The University of Birmingham
ICSE 2013
May 18th - 26th, 2013
San Francisco, CA
http://bit.ly/icse13tutorial
Who we are…
1
Tim Menzies
West Virginia University
tim@menzies.us
Ekrem Kocaguneli
West Virginia University
ekrem@kocaguneli.com
Fayola Peters
West Virginia University
fayolapeters@gmail.com
Burak Turhan
University of Oulu
turhanb@computer.org
Leandro L. Minku
The University of Birmingham
L.L.Minku@cs.bham.ac.uk
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
2
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
3
What can we share?
• Two software project
managers meet
– What can they learn
from each other?
• They can share
1. Data
2. Models
3. Methods
• techniques for turning
data into models
4. Insight into the domain
• The standard mistake
– Generally assumed that
models can be shared,
without modification.
– Yeah, right…
4
SE research = sparse sample of a
very diverse set of activities
5
Microsoft research,
Redmond, Building 99
Other studios,
many other projects
And they are all different.
Models may not move
(effort estimation)
• 20 * 66% samples of
data from NASA
• Linear regression on
each sample to learn
effort = a * LOC^b * Σi βi·xi
• Back select to remove
useless xi
• Result?
– Wide βi variance
6
* T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, T. Zimmermann, "Local vs. Global Lessons for Defect Prediction and
Effort Estimation," IEEE TSE pre-print 2012. http://menzies.us/pdf/12gense.pdf
Models may not move
(defect prediction)
7
* T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, T. Zimmermann, "Local vs. Global Lessons for Defect Prediction and
Effort Estimation," IEEE TSE pre-print 2012. http://menzies.us/pdf/12gense.pdf
Oh woe is me
• No generality in SE?
• Nothing we can learn
from each other?
• Forever doomed to never
make a conclusion?
– Always, laboriously,
tediously, slowly, learning
specific lessons that hold
only for specific projects?
• No: 3 things we might
want to share
– Models, methods, data
• If no general models, then
– Share methods
• general methods for
quickly turning local data
into local models.
– Share data
• Find and transfer relevant
data from other projects to
us
8
The rest of this tutorial
• Data science
– How to share data
– How to share methods
• Maybe one day, in the future,
– after we’ve shared enough data and methods
– We’ll be able to report general models
– ICSE 2020?
• But first,
– Some general notes on data mining
9
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
10
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
–Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
11
The great myth
• Let’s face it:
– Humans are a pest
– And experts doubly so.
• “The notion of ‘user’ cannot be
precisely defined and therefore
has no place in CS and SE”
– Edsger Dijkstra, ICSE 4, 1979
• http://en.wikipedia.org/wiki/List_
of_cognitive_biases
• 96 Decision-making, belief and
behavioral biases
– Attentional bias – paying more
attention to emotionally dominant
stimuli in one's environment and to
neglect relevant data
• 23 Social biases
– Worse-than-average effect –
believing we are worse than others at
tasks which are difficult
• 52 Memory errors and biases
– Illusory correlation – inaccurately
remembering a relationship between
two events
12
The great myth
• Wouldn’t it be
wonderful if we did not
have to listen to them
– The dream of
olde-worlde machine
learning
• Circa 1980s
– Dispense with live
experts and resurrect
dead ones.
• But any successful
learner needs biases
– Ways to know what’s
important
• What’s dull
• What can be ignored
– No bias? Can’t ignore
anything
• No summarization
• No generalization
• No way to predict the future
13
Christian Bird, data miner,
Microsoft Research, Redmond
• Microsoft Research,
Redmond
– Assesses learners by
“engagement”
A successful “Bird”
session:
• Knowledge engineers enter
with sample data
• Users take over the
spreadsheet
• Run many ad hoc queries
• In such meetings, users often…
• demolish the model
• offer more data
• demand you come back
next week with something
better
14
Expert data scientists spend more time
with users than algorithms
Also: Users control budgets
• Why talk to users?
– Cause they own the wallet
• As the Mercury astronauts used to say
– No bucks, no Buck Rogers
• We need to give users a sense of comfort that
we know what we are doing
– That they are part of the process
– That we understand their problem and processes
– Else, budget = $0
15
The Inductive
Engineering Manifesto
• Users before algorithms:
– Mining algorithms are only useful in industry if
users fund their use in real-world applications.
• Data science
– Understanding user goals to inductively generate
the models that most matter to the user.
16
• T. Menzies, C. Bird, T. Zimmermann, W. Schulte, and E. Kocaguneli.
The inductive software engineering manifesto. (MALETS '11).
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
–Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
17
Algorithms are only part of the story
18
• Drew Conway, The Data Science Venn Diagram, 2009,
• http://www.dataists.com/2010/09/the-data-science-venn-diagram/
• Dumb data miners miss important
domain semantics
• An ounce of domain knowledge is
worth a ton of algorithms.
• Math and statistics only get you
machine learning.
• Science is about discovery and building
knowledge, which requires some
motivating questions about the world.
• The culture of academia does not
reward researchers for understanding
domains.
Case Study #1: NASA
• NASA’s Software Engineering Lab, 1990s
– Gave free access to all comers to their data
– But you had to come to get it (to learn the domain)
– Otherwise: mistakes
• E.g. one class of software module with far more errors than
anything else.
– Dumb data mining algorithms: might learn that this kind of module is
inherently more error prone
• Smart data scientists might question “what kind of
programmers work on that module?”
– A: we always give that stuff to our beginners as a learning exercise
19
* F. Shull, M. Mendonça, V. Basili, J. Carver, J. Maldonado, S. Fabbri, G. Travassos, and M. Ferreira, "Knowledge-
Sharing Issues in Experimental Software Engineering", EMSE 9(1): 111-137, March 2004.
Case Study #2: Microsoft
• Distributed vs centralized
development
• Who owns the files?
– Who owns the files with most bugs
• Result #1 (which was wrong)
– A very small number of people
produce most of the core changes to
a “certain Microsoft product”.
– Kind of an uber-programmer result
– I.e. given thousands of programmers
working on a project
• Most are just re-arranging deck chairs
• To improve software process, ignore
the drones and focus mostly on the
queen bees
• WRONG:
– Microsoft does much auto-
generation of intermediary build
files.
– And only a small number of people
are responsible for the builds
– And that core build team “owns”
those auto-generated files
– Skewed the results. Sent us down
the wrong direction
• Needed to spend weeks/months
understanding build practices
– BEFORE doing the defect studies
20
* E. Kocaguneli, T. Zimmermann, C. Bird, N. Nagappan, T. Menzies. Distributed Development
Considered Harmful? ICSE 2013 SEIP Track, San Francisco, CA, USA, May 2013.
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
–Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
21
You go mining with the data you have—not
the data you might want
• In the usual case, you cannot control data
collection.
– For example, data mining at NASA 1999 – 2008
• Information collected from layers of sub-contractors and
sub-sub-contractors.
• Any communication to data owners had to be mediated by
up to a dozen account managers, all of whom had much
higher priority tasks to perform.
• Hence, we caution that usually you must:
– Live with the data you have or dream of accessing at
some later time.
22
Rinse before use
• Data quality tests (*)
– Linear time checks for (e.g.) repeated rows
• Column and row pruning for tabular data
– Bad columns contain noise, irrelevancies
– Bad rows contain confusing outliers
– Repeated results:
• Signal is a small nugget within the whole data
• R rows and C columns can be pruned back to R/5 rows and C^0.5 columns
• Without losing signal
23
* M. Shepperd, Q. Song, Z. Sun, C. Mair,
"Data Quality: Some Comments on the NASA Software Defect Data Sets," IEEE TSE, 2013, pre-prints
e.g. NASA
effort data
24
NASA data: most
projects rated highly complex,
i.e. there is no information in saying
“complex”.
The more features we
remove for smaller
projects, the better
the predictions.
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
–Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
25
Do it again, and again,
and again, and …
26
In any industrial
application, data science
is repeated multiple
times to either answer an
extra user question,
make some
enhancement and/or
bug fix to the method,
or to deploy it to a
different set of users.
Thou shalt not click
• For serious data science studies,
– to ensure repeatability,
– the entire analysis should be automated
– using some high level scripting language;
• e.g. R-script, Matlab, Bash, ….
27
The feedback process
28
The feedback process
29
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
30
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
–How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
31
How to Solve Lack or Scarcity of Local
Data
32
What are my options?
Isn’t local (within) data better?
It may not be available
It may be scarce
Tedious data collection effort
Too slow to collect
The verdict with global (cross) data?
Effort estimation [1]:
No clear winners, either way
Defect prediction [2]:
Can use global data as a stop gap
33
1 Barbara A. Kitchenham, Emilia Mendes, Guilherme Horta Travassos: Cross versus Within-Company Cost Estimation Studies: A Systematic
Review. IEEE Trans. Software Eng. 33(5): 316-329 (2007)
2 B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the relative value of cross-company and within-company data for defect
prediction”, Empirical Software Engineering Journal, Vol.14/5, pp.540-578, 2009.
Comparing options
• For NASA data
– Seven test sets from 10% of
each source
• Treatment CC (using global)
– Train on the 6 other data sets
• Treatment WC (using local)
– Train on the remaining 90% of
the local data
34
NN-Filtering
Step 1: Calculate the pairwise
Euclidean distances between
the local (test) set and the
candidate (global) training set.
Step 2: For each test datum,
pick its k nearest neighbors
from global set.
Step 3: Pick the unique instances
from the union of those
selected across the whole local set
to construct the final training set.
35
Now, train your favorite model on the
filtered training set!
B. Turhan, A. Bener, and T. Menzies, “Nearest Neighbor Sampling for Cross Company Defect Predictors”, in Proceedings of the 1st
International Workshop on Defects in Large Software Systems (DEFECTS 2008), pp. 26, 2008.
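For concreteness, here is a minimal NumPy sketch of the three steps above. It assumes the local and global sets are numeric feature matrices with matching columns; the k=10 default follows the study's setup, while the function and variable names are ours.

```python
# A minimal sketch of NN-filtering; names are illustrative, not the paper's code.
import numpy as np

def nn_filter(local_X, global_X, k=10):
    """Return the union of each local instance's k nearest global instances."""
    # Step 1: pairwise Euclidean distances, a |local| x |global| matrix.
    dists = np.linalg.norm(local_X[:, None, :] - global_X[None, :, :], axis=2)
    # Step 2: for each local instance, the indices of its k nearest global rows.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Step 3: union of all selections, discarding repeats.
    keep = np.unique(nearest.ravel())
    return global_X[keep], keep
```

The returned indices can then pull the matching labels, and the filtered rows train whatever model you prefer.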
More Comparisons: PD
• For NASA data
– Seven test sets from 10% of each
source
• Treatment CC (using global)
– Train on the 6 other data sets
• Treatment WC (using local)
– Train on the remaining 90% of the
local data
• Treatment NN (using global+NN)
– Initialize train set with 6 other data
sets,
– Prune the train set to just the 10
nearest neighbors (Euclidean)
of the test set (discarding repeats)
36
B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the
relative value of cross-company and within-company data for
defect prediction”, Empirical Software Engineering Journal,
Vol.14/5, pp.540-578, 2009.
More Comparisons: PF
• For NASA data
– Seven test sets from 10% of each
source
• Treatment CC (using global)
– Train on the 6 other data sets
• Treatment WC (using local)
– Train on the remaining 90% of the
local data
• Treatment NN (using global+NN)
– Initialize train set with 6 other data
sets,
– Prune the train set to just the 10
nearest neighbors (Euclidean)
of the test set (discarding repeats)
37
B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the
relative value of cross-company and within-company data for
defect prediction”, Empirical Software Engineering Journal,
Vol.14/5, pp.540-578, 2009.
B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the
relative value of cross-company and within-company data for
defect prediction”, Empirical Software Engineering Journal,
Vol.14/5, pp.540-578, 2009.
• For SOFTLAB data
– Three test sets from embedded
systems
• Treatment CC (using global)
– Train on the seven NASA data sets
• Treatment WC (using local)
– Train on the remaining two local
test data
• Treatment NN (using global+NN)
– Initialize train set with 7 NASA data
sets,
– Prune the train set to just the 10
nearest neighbors (Euclidean)
of the test set (discarding repeats)
External Validity
39
“Theories can be learned from a
very small sample of available data”
Microsampling
• Given N defective modules:
– M = {25, 50, 75, ...} <= N
– Select M defective and M
defect-free modules.
– Learn theories on 2M
instances
• Undersampling: M=N
• 8/12 datasets -> M = 25
• 1/12 datasets -> M = 75
• 3/12 datasets -> M = {200,
575, 1025}
T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, Y. Jiang, “Implications of Ceiling Effects in Defect Predictors”, in Proceedings of the 4th
International Workshop on Predictor Models in Software Engineering (PROMISE 2008), pp. 47-54, 2008.
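A minimal sketch of the microsampling step just described, assuming y is a 0/1 defect label vector (M <= N); variable names are ours:

```python
import numpy as np

def microsample(X, y, m, rng=np.random.default_rng(1)):
    """Pick m defective and m defect-free modules; learn on the 2m instances."""
    defective = np.flatnonzero(y == 1)
    clean = np.flatnonzero(y == 0)
    pick = np.concatenate([rng.choice(defective, m, replace=False),
                           rng.choice(clean, m, replace=False)])
    return X[pick], y[pick]
```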
How about mixing local and global?
• Is it feasible to use additional data from other projects:
– (Case 1) When there is limited local project history, i.e. no prior releases
– (Case 2) When there is existing local project history, i.e. many releases over some period
42
B. Turhan, A. T. Mısırlı, A. Bener, “Empirical Evaluation of The Effects of Mixed Project Data on Learning Defect Predictors”, (in press)
Journal of Information and Software Technology, 2013
• For 73 versions of 41 projects
– Reserve test sets from 10% of each
project
– Additional test sets if the project has
multiple releases
• Treatment WP (using local)
– Train on 10%..90% of the local data
– Train on the previous releases
• Treatment WP+CP (using global)
– Enrich training sets above with NN-
filtered data from all other projects
Case 1: WP(10%) + CP is as good
as WP(90%)
Case 2: WP+CP is significantly
better than WP (with small effect
size)
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
–How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
43
How to Prune Data,
Simpler and Smarter
44
Data is the new
oil
And it has a cost too
45
e.g. $1.5M spent by NASA in the period 1987 to 1990
to understand the historical records of all their
software in support of the planning activities for the
International Space Station [1]
Do we need to discuss all
the projects and all the
features in a client meeting
or in a Delphi session?
Similarly, do we need all
the labels for supervised
methods?
[1] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy, “Active learning and effort estimation: Finding the essential
content of software effort estimation data,” IEEE Trans. on Softw. Eng., vol. Preprints, 2013.
Data for Industry / Active Learning
46
Concepts of E(k) matrices and popularity…
Let’s see it in action: Point to the person closest to you
Data for Industry / Active Learning
47
Instance pruning
1. Calculate “popularity” of
instances
2. Sort by popularity
3. Label one instance at a time
4. Find the stopping point
5. Return closest neighbor from
active pool as estimate
We want the instances that are
similar to others.
Synonym pruning
1. Calculate the popularity of
features
2. Select non-popular features
We want to find the dissimilar
features, those that are unlike others.
Data for Industry / Active Learning
48
Finding the stopping point
Stop asking for labels if one of these rules fires:
• All popular instances are exhausted.
• There is no MRE (magnitude of relative error = abs(actual −
predicted)/actual) improvement for n consecutive times.
• The ∆ between the best and the worst error of the last n
times is very small (∆ = 0.1; n = 3).
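A hedged sketch of the popularity bookkeeping behind the E(k) idea: an instance's popularity counts how often it appears in other instances' k-nearest-neighbor lists. The names and the k=1 default are ours, not QUICK's actual code.

```python
import numpy as np

def popularity(X, k=1):
    """Count how often each row of X is a k-nearest neighbor of another row."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)             # an instance cannot vote for itself
    nearest = np.argsort(dists, axis=1)[:, :k]  # the E(k) neighbor lists
    pop = np.zeros(len(X), dtype=int)
    for row in nearest:                         # tally the E(k) "votes"
        pop[row] += 1
    return pop
```

Instances would then be labeled in decreasing popularity order until one of the stopping rules above fires.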
Data for Industry / Active Learning
49
QUICK: An active learning solution, i.e. unsupervised
Instances are labeled, at a cost, by the expert
• We want to stop before all the instances are labeled
50
Picking random
training instances is
not a good idea
More popular instances
in the active pool
decrease error
One of the stopping
point conditions fires
Data for Industry / Active Learning
X-axis: instances sorted in decreasing popularity
Y-axis: median MRE
51
Data for Industry / Active Learning
At most 31% of all
the cells;
on median, 10%
Intrinsic dimensionality: There is a consensus in
the high-dimensional data analysis community
that the only reason any methods work in very
high dimensions is that, in fact, the data is not
truly high-dimensional [1]
[1] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information
Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
–How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
52
Case-based reasoning
(CBR) methods make use
of similar past projects
for estimation
53
They are very widely used because [1]:
• No model-calibration to local data
• Can better handle outliers
• Can work with 1 or more attributes
• Easy to explain
Two promising research areas
• weighting the selected analogies[2]
• improving design options [3]
How to Advance Simple CBR Methods
[1] F. Walkerden and R. Jeffery, “An empirical study of analogy-based software effort estimation,” Empirical Software
Engineering, vol. 4, no. 2, pp. 135–158, 1999.
[2] E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost estimation models for web
hypermedia applications,” Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
[3] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 15th Asia-Pacific
Software Engineering Conference, pp. 495– 502, 2008.
In none of the scenarios did we
see a significant improvement
54
Compare performance of
each k-value with and
without weighting.
Building on the previous research [1], we adopted two different
strategies [2]
We used kernel weighting to
weigh selected analogies
a) Weighting analogies [3]
How to Advance Simple CBR Methods
[1] E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost estimation models for web
hypermedia applications,” Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003.
[2] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 15th Asia-Pacific
Software Engineering Conference, pp. 495–502, 2008.
[3] Kocaguneli, Ekrem, Tim Menzies, and Jacky W. Keung. "Kernel methods for software effort estimation." Empirical Software
Engineering 18.1 (2013): 1-24.
55
D-ABE
• Get best estimates of all training
instances
• Remove all the training instances
within half of the worst MRE (acc.
to TMPA).
• Return closest neighbor’s estimate
to the test instance.
[Figure: training instances a–f around a test instance t; the instance with the
worst MRE and those close to it are removed; the closest remaining
neighbor's estimate is returned.]
b) Designing ABE methods
Easy-path: Remove training
instance that violate assumptions
TEAK will be discussed later.
D-ABE: Built on theoretical
maximum prediction accuracy
(TMPA) [1]
How to Advance Simple CBR Methods
[1] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-
Based Software Cost Estimation,” 15th Asia-Pacific Software Engineering
Conference, pp. 495–502, 2008.
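Under our reading of the slide, a D-ABE sketch looks like the following; the leave-one-out analogy step and the "within half of the worst MRE" pruning rule are our interpretation, and all names are illustrative.

```python
import numpy as np

def d_abe(train_X, train_y, test_x):
    # Best estimate of each training instance: its nearest other instance's effort.
    d = np.linalg.norm(train_X[:, None, :] - train_X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    mre = np.abs(train_y - train_y[d.argmin(axis=1)]) / train_y
    # Remove all training instances within half of the worst MRE.
    keep = mre < 0.5 * mre.max()
    if not keep.any():                 # degenerate case: keep the best instance
        keep = mre == mre.min()
    kept_X, kept_y = train_X[keep], train_y[keep]
    # Return the closest surviving neighbor's estimate for the test instance.
    return kept_y[np.linalg.norm(kept_X - test_x, axis=1).argmin()]
```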
D-ABE Comparison to
static k w.r.t. MMRE
56
D-ABE Comparison to
static k w.r.t. win, tie, loss
How to Advance Simple CBR Methods
Finding enough local training data is
a fundamental problem [1]
Merits of using cross-data
from another company is
questionable [2]
We use a relevancy filtering method called TEAK
on public and proprietary data sets.
How to Advance Simple CBR Methods/
Using CBR for cross company learning
[1] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect
prediction,” Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009.
[2] E. Kocaguneli and T. Menzies, “How to find relevant data for effort estimation,” in ESEM’11: International Symposium on Empirical
Software Engineering and Measurement, 2011.
[3] B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,” IEEE
Trans. Softw. Eng., vol. 33, no. 5, pp. 316–329, 2007.
[4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: A large scale experiment on data
vs. domain vs. process,” ESEC/FSE, pp. 91–100, 2009.
Similar amounts of evidence for
and against the performance of
cross-data [3, 4]
58
Cross data works as well as within data
for 6 out of 8 proprietary data sets, 19 out
of 21 public data sets after TEAK’s
relevancy filtering
Similar projects,
dissimilar effort
values, hence
high variance
Similar projects,
similar effort
values, hence
low variance
How to Advance Simple CBR Methods/
Using CBR for cross company learning
Build a second GAC
tree with low-
variance instances
Return closest neighbor’s
value from the lowest
variance region
In summary: the design options of CBR help, but not
fiddling with single instances and weights!
[1] E. Kocaguneli and T. Menzies, “How to find relevant data for effort estimation,” in ESEM’11: International Symposium on Empirical
Software Engineering and Measurement, 2011.
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
–How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
59
Is Data Sharing Worth the Risk to
Individual Privacy?
• Former Governor of Massachusetts.
• Victim of re-identification privacy breach.
• Led to sensitive attribute disclosure of his medical records.
What would William Weld say?
Is Data Sharing Worth the Risk to
Individual Privacy?
What about NASA contractors?
Subject to competitive bidding
every 2 years.
Unwilling to share data
that would lead to
sensitive attribute disclosure.
e.g. actual software
development times
When To Share – How To Share
So far we cannot guarantee
100% privacy.
What we have is a directive
as to whether data is private
and useful enough to share...
We have a lot of privacy
algorithms geared toward
minimizing risk.
Old School
K-anonymity
L-diversity
T-closeness
But What About Maximizing Benefits (Utility)?
The degree of risk to the
data sharing entity must
not exceed the benefits of
sharing.
Balancing Privacy and Utility
or...
Minimize risk of privacy disclosure while maximizing utility.
Instance Selection with CLIFF
Small random moves with MORPH
= CLIFF + MORPH
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
CLIFF
Don't share all the data.
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
CLIFF
Don't share all the data.
"a=r1"
powerful for selection for
class=yes
more common in "yes"
than "no"
CLIFF
step1:
for each class find ranks
of all values
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
CLIFF
Don't share all the data.
"a=r1"
powerful for selection for
class=yes
more common in "yes"
than "no"
CLIFF
step2:
multiply ranks of each
row
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
CLIFF
Don't share all the data.
CLIFF
step3: select the most powerful
rows of each class
Note: linear time.
Can reduce N rows to 0.1N,
so an O(N^2) nearest unlike neighbor (NUN) algorithm
now
takes time O(0.01 · N^2)
Scalability
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
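Putting CLIFF's three steps together, a hedged sketch for discrete (binned) features; the b²/(b+r) "power" is a BORE-style scoring assumption, and the names and keep fraction are ours.

```python
import numpy as np

def cliff(X, y, keep_frac=0.1):
    """Keep only the most 'powerful' rows of each class (X holds binned values)."""
    selected = []
    for c in np.unique(y):
        inside, outside = X[y == c], X[y != c]
        power = np.ones(len(inside))
        for j in range(X.shape[1]):
            for i, v in enumerate(inside[:, j]):
                b = np.mean(inside[:, j] == v)       # step 1: how common in this class
                r = np.mean(outside[:, j] == v)      # ... vs. in the other classes
                power[i] *= b * b / (b + r + 1e-9)   # step 2: multiply ranks per row
        top = np.argsort(-power)[: max(1, int(keep_frac * len(inside)))]
        selected.extend(np.flatnonzero(y == c)[top]) # step 3: most powerful rows
    return np.array(sorted(selected))
```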
MORPH
Push the CLIFF data from their original position.
y = x ± (x − z) * r
x ∈ D, the original
instance
z ∈ D, the nearest unlike neighbor (NUN) of x
y, the resulting
MORPHed
instance
F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Software Engineering (ICSE), 2012 34th
International Conference on, june 2012, pp. 189 –199.
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction,"
IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
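A minimal MORPH sketch under the formula above; the random-factor range and the per-dimension ± choice are assumptions, and boundary checks are omitted.

```python
import numpy as np

def morph(X, y, r_lo=0.15, r_hi=0.35, rng=np.random.default_rng(1)):
    out = X.astype(float).copy()
    for i, x in enumerate(X):
        unlike = X[y != y[i]]                                    # candidates for z
        z = unlike[np.linalg.norm(unlike - x, axis=1).argmin()]  # NUN of x
        r = rng.uniform(r_lo, r_hi, size=x.shape)
        sign = np.where(rng.random(x.shape) < 0.5, 1.0, -1.0)
        out[i] = x + sign * (x - z) * r                          # y = x ± (x − z) * r
    return out
```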
Case Study: Cross-Company Defect Prediction (CCDP)
Sharing Required.
Zimmermann et al.
Local data not always
available
• companies too small
• product in first release, so
no past data.
Kitchenham et al.
• no time for collection
• new technology can make all
data irrelevant
T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process.”
in ESEC/SIGSOFT FSE’09,2009
B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,”
IEEE Transactions on Software Engineering, vol. 33, pp. 316–329, 2007
- Company B has little or no data to build a defect model;
- Company B uses data from Company A to build defect models;
CCDP
Better with data filtering
Initial results with cross-company defect prediction
- negative (Zimmermann FSE '09)
- or inconclusive (Kitchenham TSE '07)
More recent work show better results
- Turhan et al. 2009 (The Burak Filter)
B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,”
Empirical Software Engineering, vol. 14, pp. 540–578, 2009.
F. Peters, T. Menzies, and A. Marcus, “Better Cross Company Defect Prediction,” Mining Software Repositories (MSR), 2013 10th IEEE Working Conference
on, (to appear)
Making Data Private for CCDP
Here is how we look at the data
Terms
Non-Sensitive Attribute (NSA)
Sensitive Attribute
Class Attribute
Measuring the Risk
IPR = Increased Privacy Ratio
(a query "breaches" privacy when the privatized data returns the same
sensitive answer as the original data)
Queries | Original | Privatized | Privacy Breach
Q1 | 0 | 0 | yes
Q2 | 0 | 1 | no
Q3 | 1 | 1 | yes
breaches: yes = 2/3
IPR = 1 − 2/3 = 0.33
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
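The bookkeeping in the worked example is small enough to transcribe directly; here a query "breaches" privacy when the privatized answer matches the original (function and variable names are ours).

```python
def ipr(original, privatized):
    """Increased Privacy Ratio over a set of query answers."""
    breaches = sum(o == p for o, p in zip(original, privatized))
    return 1 - breaches / len(original)

print(ipr([0, 0, 1], [0, 1, 1]))   # prints 0.333..., i.e. 1 - 2/3 as in the example
```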
Measuring the Utility
The g-measure
Probability of detection (pd)
Probability of false alarm (pf)

              | Actual yes | Actual no
Predicted yes | TP         | FP
Predicted no  | FN         | TN

pd = TP/(TP+FN)
pf = FP/(FP+TN)
g-measure = 2*pd*(1−pf)/(pd+(1−pf))
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
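A direct transcription of the table into code:

```python
def g_measure(tp, fp, fn, tn):
    pd = tp / (tp + fn)               # probability of detection
    pf = fp / (fp + tn)               # probability of false alarm
    return 2 * pd * (1 - pf) / (pd + (1 - pf))
```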
Making Data Private for CCDP
Comparing CLIFF+MORPH to Data Swapping and K-anonymity
Data Swapping (s10, s20, s40)
A standard perturbation
technique used for privacy
To implement...
• For each NSA, a certain percent
of the values are swapped with
any other value in that NSA.
• For our experiments, these
percentages are 10, 20 and 40.
k-anonymity (k2, k4)
The Datafly Algorithm.
To implement...
• Make a generalization hierarchy.
• Replace values in the
NSA according to the hierarchy.
• Continue until there are k or
fewer distinct instances, and
suppress them.
K. Taneja, M. Grechanik, R. Ghani, and T. Xie, “Testing software in age of data privacy: a balancing act,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European
conference on Foundations of software engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 201–211.
L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 571–588, Oct. 2002.
Making Data Private for CCDP
Comparing CLIFF+MORPH to Data Swapping and K-anonymity
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
Making Data Private for CCDP
Comparing CLIFF+MORPH to Data Swapping and K-anonymity
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
Making Data Private for CCDP
F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering,
24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
80
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
–Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
81
Problems of SE Models
• Instability is the problem of not being able to
elicit the same or similar results under changing
conditions
82
• We will look at instability in 2 areas
– Instability in Effort Estimation
– Instability in Process
83
There is no agreed upon
best estimation method [1]
Methods change ranking w.r.t.
conditions such as data sets, error
measures [2]
Experimenting with: 90 solo-
methods, 20 public data sets, 7
error measures
Problems of SE Models/
Instability in Effort
[1] M. Jorgensen and M. Shepperd, “A systematic review of software development cost estimation studies,” IEEE Trans. Softw.
Eng., vol. 33, no. 1, pp. 33–53, 2007.
[2] I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in comparative studies of software prediction models,”
IEEE Trans. Softw. Eng., vol. 31, no. 5, pp. 380–391, May 2005.
84
Problems of SE Models/
Instability in Effort
1. Rank methods acc. to win, loss
and win-loss values
2. δr is the max. rank change
3. Sort methods acc. to loss and
observe δr values
85
We have a set of
superior methods to
recommend
Assembling solo-methods
may be a good idea
Baker et al. [1], Kocaguneli et al.
[2], Khoshgoftaar et al. [3] failed to
outperform solo-methods
But the previous evidence of
assembling multiple methods in
SEE is discouraging
Problems of SE Models/
Instability in Effort
Top 13 methods are CART & ABE
methods (1NN, 5NN)
[1] D. Baker, “A hybrid approach to expert and model-based effort estimation,” Master’s thesis, Lane Department of Computer
Science and Electrical Engineering, West Virginia University, 2007, available from
https://eidr.wvu.edu/etd/documentdata.eTD?documentid=5443.
[2] E. Kocaguneli, Y. Kultur, and A. Bener, “Combining multiple learners induced on multiple datasets for software effort
prediction,” in International Symposium on Software Reliability Engineering (ISSRE), 2009, student paper.
[3] T. M. Khoshgoftaar, P. Rebours, and N. Seliya, “Software quality analysis by combining multiple projects and learners,”
Software Quality Control, vol. 17, no. 1, pp. 25–49, 2009.
86
Combine top 2,4,8,13 solo-methods
via mean, median and IRWM
Problems of SE Models/
Instability in Effort
Re-rank solo and multi-methods
together
Problems of SE Models/
Instability in Process: Dataset Shift/Concept Drift
87
Candela JQ, Sugiyama M, Schwaighofer A, Lawrence ND (eds) (2009) Dataset shift in machine learning. The MIT Press, Cambridge, MA
Dataset Shift: Covariate Shift
• Consider a size-based effort
estimation model
– Effective for projects within the
traditional operational
boundaries of a company
• What if a change with impact
on products’ size:
– new business domains
– change in technologies
– change in development
techniques
[Figure: effort vs. size, before and after the change]
p(Xtrain) ≠ p(Xtest)
89
B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2,
pp.62-74, 2012.
Dataset Shift: Prior Probability Shift
• Now, consider a defect
prediction model…
• … and again, what if defect
characteristics change:
– Process improvement
– More QA resources
– Increased experience over
time
– Basically you improve over
time!
[Figure: %defects vs. kLOC, before and after the change]
p(Ytrain) ≠ p(Ytest)
90
B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2,
pp.62-74, 2012.
Dataset Shift: Usual Suspects
Sample Selection Bias & Imbalanced Data
91
B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2,
pp.62-74, 2012.
Dataset Shift: Usual Suspects
Sample Selection Bias &Imbalanced Data
92
B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2,
pp.62-74, 2012.
Dataset Shift
Domain shift
• Be consistent in the way you
measure concepts for
model training and testing!
• *: “…the metrics based
assessment of a software
system and measures taken
to improve its design differ
considerably from tool to
tool.”
Source Component Shift
• a.k.a. Data Heterogeneity
• Ex: ISBSG contains data
from 6000+ projects from
30+ countries.
Where do the training data
come from?
vs.
Where do the test data come
from?
93
* Rüdiger Lincke, Jonas Lundberg, and Welf Löwe. “Comparing software metrics tools”, ISSTA '08
B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-
2, pp.62-74, 2012.
94
B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol.17/1-2,
pp.62-74, 2012.
Managing Dataset Shift
[Diagram: techniques mapped to the kinds of shift they manage]
• Techniques: outlier detection, relevancy filtering, instance weighting,
stratification, cost curves, mixture models
• Kinds of shift: covariate shift, prior probability shift, sampling /
imbalanced data, domain shift, source component shift
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
–Solutions
• Envy-based learning
• Ensembles 95
• Seek the fence
where the grass
is greener on the
other side.
• Learn from
there
• Test on here
• Cluster to find
“here” and
“there”
12/1/2011 96
Envy =
The WisDOM Of
the COWs
12/1/2011 97
@attribute recordnumber real
@attribute projectname {de,erb,gal,X,hst,slp,spl,Y}
@attribute cat2 {Avionics, application_ground, avionicsmonitoring, … }
@attribute center {1,2,3,4,5,6}
@attribute year real
@attribute mode {embedded,organic,semidetached}
@attribute rely {vl,l,n,h,vh,xh}
@attribute data {vl,l,n,h,vh,xh}
…
@attribute equivphyskloc real
@attribute act_effort real
@data
1,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,25.9,117.6
2,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,24.6,117.6
3,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,7.7,31.2
4,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,8.2,36
5,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,9.7,25.2
6,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,2.2,8.4
….
DATA = MULTI-DIMENSIONAL VECTORS
CAUTION: data may not divide neatly
on raw dimensions
• The best description for SE projects may be
synthesized dimensions extracted from the raw
dimensions
12/1/2011 98
Fastmap
12/1/2011 99
Fastmap: Faloutsos [1995]
O(2N) generation of an axis of large variability
• Pick any point W;
• Find X furthest from W,
• Find Y furthest from X.
c = dist(X,Y)
All points have distances a, b to (X,Y)
• x = (a^2 + c^2 − b^2) / (2c)
• y = sqrt(a^2 − x^2)
Find median(x), median(y)
Recurse on four quadrants
Hierarchical partitioning
Grow:
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants
Prune:
• Combine quadtree leaves
with similar densities
• Score each cluster by median
score of class variable
100
Q: why cluster via FASTMAP?
• A1: Circular methods (e.g. k-means)
assume round clusters.
• But density-based clustering allows
clusters to be any shape
• A2: No need to pre-set the number of
clusters
• A3: Because other methods
(e.g. PCA) are much slower;
Fastmap is O(2N)
• Unoptimized Python (sketch below):
12/1/2011 101
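The "unoptimized Python" snippet on the original slide did not survive extraction; a minimal reconstruction of the pivot-and-project step, following the math on the earlier Fastmap slide, might look like this (names are ours, not the original code's):

```python
import numpy as np

def fastmap_axis(X):
    """Project rows of X onto the line between two distant pivots (assumes c > 0)."""
    dist = lambda p, q: np.linalg.norm(p - q)
    w = X[0]                                                  # pick any point W
    xp = X[max(range(len(X)), key=lambda i: dist(w, X[i]))]   # X furthest from W
    yp = X[max(range(len(X)), key=lambda i: dist(xp, X[i]))]  # Y furthest from X
    c = dist(xp, yp)
    a = np.array([dist(p, xp) for p in X])
    b = np.array([dist(p, yp) for p in X])
    xs = (a**2 + c**2 - b**2) / (2 * c)          # position along the X-Y axis
    ys = np.sqrt(np.maximum(a**2 - xs**2, 0))    # orthogonal component
    return xs, ys  # recurse on quadrants split at median(xs), median(ys)
```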
12/1/2011 102
Learning via “envy”
• Seek the fence
where the grass
is greener on the
other side.
• Learn from
there
• Test on here
• Cluster to find
“here” and
“there”
12/1/2011 103
Envy =
The WisDOM Of
the COWs
Hierarchical partitioning
Grow:
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants
Prune:
• Combine quadtree leaves
with similar densities
• Score each cluster by median
score of class variable
104
Hierarchical partitioning
Grow:
• Find two orthogonal dimensions
• Find median(x), median(y)
• Recurse on four quadrants
Prune:
• Combine quadtree leaves
with similar densities
• Score each cluster by median
score of class variable
• This cluster envies its neighbor with
better score and max
abs(score(this) - score(neighbor))
105
Where is grass greenest?
Q: How to learn rules from
neighboring clusters
• A: it doesn’t really matter
– Many competent rule learners
• But to evaluate global vs local rules:
– Use the same rule learner for local vs global rule learning
• This study uses WHICH (Menzies [2010])
– Customizable scoring operator
– Faster termination
– Generates very small rules (good for explanation)
106
Data from
http://promisedata.googlecode.com
• Effort reduction =
{ NasaCoc, China } :
COCOMO or function points
• Defect reduction =
{lucene, xalan, jedit, synapse, etc.}:
CK metrics (OO)
• Clusters have untreated class
distribution.
• Rules select a subset of the
examples:
– generate a treated class
distribution
107
[Figure: class distributions at the 25th, 50th, 75th and 100th percentiles for
three treatments: untreated; global = treated with rules learned from all
data; local = treated with rules learned from the neighboring cluster]
• Lower median efforts/defects (50th percentile)
• Greater stability (75th – 25th percentile)
• Decreased worst case (100th percentile)
By any measure,
Local BETTER THAN GLOBAL
108
Rules learned in each cluster
• What works best “here” does not work “there”
– Misguided to try and tame conclusion instability
– Inherent in the data
• Can’t tame conclusion instability.
• Instead, you can exploit it
• Learn local lessons that do better than overly generalized global theories
12/1/2011 109
OUTLINE
• PART 0: Introduction
• PART 1: Organization Issues
– Rule #1: Talk to the users
– Rule #2: Know your domain
– Rule #3: Suspect your data
– Rule #4: Data science is cyclic
• PART 2: Data Issues
– How to solve lack or scarcity of data
– How to prune data, simpler & smarter
– How to advance simple CBR methods
– How to keep your data private
• PART 3: Model Issues
– Problems of SE models
– Solutions
• Envy-based learning
• Ensembles
110
Solutions to SE Model Problems/
Ensembles of Learning Machines*
 Sets of learning machines grouped together.
 Aim: to improve predictive performance.
[Diagram: base learners B1, B2, ..., BN produce estimation1, estimation2, ..., estimationN]
E.g.: ensemble estimation = Σi wi · estimationi
* T. Dietterich. Ensemble Methods in Machine Learning. Proceedings of the First International Workshop in
Multiple Classifier Systems. 2000.
Solutions to SE Model Problems/
Ensembles of Learning Machines
 One of the keys:
Diverse* ensemble: “base learners” make different
errors on the same instances.
* G. Brown, J. Wyatt, R. Harris, X. Yao. Diversity Creation Methods: A Survey and Categorisation. Journal of
Information Fusion 6(1): 5-20, 2005.
Solutions to SE Model Problems/
Ensembles of Learning Machines
 One of the keys:
Diverse ensemble: “base learners” make different
errors on the same instances.
 Three different types of ensembles that have
been applied for software effort estimation will
be presented in the next slides.
Different ensemble approaches can be seen as different ways to
generate diversity among base learners!
Solutions to SE Model Problems/
Static Ensembles
 An existing training set (completed projects) is used for
creating/training the ensemble.
[Diagram: training data → ensemble of base learners B1, B2, ..., BN]
Solutions to SE Model Problems/
Static Ensembles
 Bagging ensembles of Regression Trees (Bag+RTs)*
Study with 13 data sets from PROMISE and ISBSG
repositories.
Bag+RTs:
 Obtained the highest rank across data sets in terms of Mean
Absolute Error (MAE).
 Rarely performed considerably worse (>0.1 SA, where SA = 1 − MAE /
MAE_rguess, rguess = random guessing) than the best approach:
* L. Minku, X. Yao. Ensembles and Locality: Insight on Improving Software Effort Estimation. Information and
Software Technology, Special Issue on Best Papers from PROMISE 2011, 2012 (in
press), http://dx.doi.org/10.1016/j.infsof.2012.09.012.
Solutions to SE Model Problems/
Static Ensembles
 Bagging* ensembles of regression trees
* L. Breiman. Bagging Predictors. Machine Learning 24(2):123-140, 1996.
[Diagram: sample uniformly with replacement from the training data
(completed projects) to train regression trees RT1, RT2, ..., RTN in the
ensemble]
Solutions to SE Model Problems/
Static Ensembles
 Bagging ensembles of regression trees
[Example tree:
Functional Size < 253:
  Functional Size < 151:  Effort = 1086
  Functional Size >= 151: Effort = 2798
Functional Size >= 253:   Effort = 5376]
Regression trees:
 Estimation by analogy.
 Divide projects
according to attribute
value.
 Most impactful
attributes are in higher
levels.
 Attributes with
insignificant impact are
not used.
 E.g., REPTrees*.
* M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten.
The WEKA Data Mining Software: An Update. SIGKDD Explorations 11(1), 2009.
http://www.cs.waikato.ac.nz/ml/weka.
Solutions to SE Model Problems/
Static Ensembles
 Bagging ensembles of regression trees
 Weka: classifiers – meta – bagging
 classifiers – trees – REPTree
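For readers outside Weka, a rough scikit-learn analogue of that recipe (REPTree is Weka-specific, so an ordinary regression tree stands in for it here; the estimator count is arbitrary):

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Bagging over regression trees; train_X/train_y would be a completed-projects table.
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=1)
# model.fit(train_X, train_y); estimates = model.predict(test_X)
```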
Solutions to SE Model Problems/
Static Ensembles
 Multi-objective Pareto ensembles
There are different measures/metrics of performance for
evaluating SEE models.
Different measures capture different quality features of the
models.
 E.g.: MAE, standard
deviation, PRED, etc.
 There is no agreed
single measure.
 A model doing well
for a certain
measure may not do
so well for another.
Multilayer
Perceptron (MLP)
models created
using Cocomo81.
Solutions to SE Model Problems/
Static Ensembles
 Multi-objective Pareto ensembles*
We can view SEE as a multi-objective learning
problem.
A multi-objective approach (e.g. Multi-Objective
Evolutionary Algorithm (MOEA)) can be used to:
 Better understand the relationship among measures.
 Create ensembles that do well for a set of measures, in
particular for larger data sets (>=60).
Sample result: Pareto ensemble of MLPs (ISBSG):
* L. Minku, X. Yao. Software Effort Estimation as a Multi-objective Learning Problem. ACM Transactions on
Software Engineering and Methodology, 2012 (accepted). Author's final version:
http://www.cs.bham.ac.uk/~minkull/publications/MinkuYaoTOSEM12.pdf.
Solutions to SE Model Problems/
Static Ensembles
 Multi-objective Pareto ensembles
[Diagram: training data (completed projects) → ensemble of models B1, B2, B3]
Multi-objective evolutionary
algorithm creates nondominated
models with several different
trade-offs.
The model with the best performance
in terms of each particular measure
can be picked to form an ensemble
with a good trade-off.
Solutions to SE Model Problems/
Dynamic Adaptive Ensembles
 Companies are not
static entities – they
can change with time
(data set shift /
concept drift*).
Models need to learn new
information and adapt to
changes.
Companies can start
behaving more or less
similarly to other
companies.
* L. Minku, A. White, X. Yao. The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept
Drift. IEEE Transactions on Knowledge and Data Engineering, 22(5):730-742, 2010.
Predicting effort for a single company from ISBSG based
on its projects and other companies' projects.
Solutions to SE Model Problems/
Dynamic Adaptive Ensembles
 Dynamic Cross-company Learning (DCL)*
[Diagram: m cross-company (CC) training sets with different productivities
(completed projects) train CC models 1..m; within-company (WC) training
data (projects arriving with time) trains WC models; predictions combine
all models with dynamic weights w1..wm, wm+1.]
* L. Minku, X. Yao. Can Cross-company Data Improve Performance in Software Effort Estimation? Proceedings
of the 8th International Conference on Predictive Models in Software Engineering, p. 69-78, 2012.
http://dx.doi.org/10.1145/2365324.2365334.
• Dynamic weights control how much a
certain model contributes to predictions:
 At each time step, “loser” models
have weight multiplied by Beta.
 Models trained with “very different”
projects from the one to be predicted can
be filtered out.
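One possible reading of that weighting scheme as code (what counts as a "loser" at each time step is our assumption, as is the beta default):

```python
def update_weights(weights, predictions, actual, beta=0.5):
    """Multiply each 'loser' model's weight by beta once the actual effort arrives."""
    errors = [abs(p - actual) for p in predictions]
    best = min(errors)
    return [w * (beta if e > best else 1.0) for w, e in zip(weights, errors)]
```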
Solutions to SE Model Problems/
Dynamic Adaptive Ensembles
 Dynamic Cross-company Learning (DCL)
DCL uses new completed projects that arrive with time.
DCL determines when CC data is useful.
DCL adapts to changes by using CC data.
Predicting effort for a single company from ISBSG based on its projects and other companies' projects.
What have we covered?
125
Organizational Issues
Data Issues
Model Issues
126
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
 
An Obligatory Introduction to Data Science
An Obligatory Introduction to Data ScienceAn Obligatory Introduction to Data Science
An Obligatory Introduction to Data Science
 
Building Data Scientists
Building Data ScientistsBuilding Data Scientists
Building Data Scientists
 
The Search for Truth in Objective & Subject Crowdsourcing
The Search for Truth in Objective & Subject CrowdsourcingThe Search for Truth in Objective & Subject Crowdsourcing
The Search for Truth in Objective & Subject Crowdsourcing
 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdf
 
Real-time applications of Data Science.pptx
Real-time applications  of Data Science.pptxReal-time applications  of Data Science.pptx
Real-time applications of Data Science.pptx
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
Data_Science_Applications_&_Use_Cases.pdf
Data_Science_Applications_&_Use_Cases.pdfData_Science_Applications_&_Use_Cases.pdf
Data_Science_Applications_&_Use_Cases.pdf
 

Mais de CS, NcState

Talks2015 novdec
Talks2015 novdecTalks2015 novdec
Talks2015 novdecCS, NcState
 
GALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringGALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringCS, NcState
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest linkCS, NcState
 
Lexisnexis june9
Lexisnexis june9Lexisnexis june9
Lexisnexis june9CS, NcState
 
Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).CS, NcState
 
Kits to Find the Bits that Fits
Kits to Find  the Bits that Fits Kits to Find  the Bits that Fits
Kits to Find the Bits that Fits CS, NcState
 
Ai4se lab template
Ai4se lab templateAi4se lab template
Ai4se lab templateCS, NcState
 
Automated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUAutomated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUCS, NcState
 
Requirements Engineering
Requirements EngineeringRequirements Engineering
Requirements EngineeringCS, NcState
 
172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginiaCS, NcState
 
Automated Software Engineering
Automated Software EngineeringAutomated Software Engineering
Automated Software EngineeringCS, NcState
 
Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)CS, NcState
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceCS, NcState
 
Dagstuhl14 intro-v1
Dagstuhl14 intro-v1Dagstuhl14 intro-v1
Dagstuhl14 intro-v1CS, NcState
 
What Metrics Matter?
What Metrics Matter? What Metrics Matter?
What Metrics Matter? CS, NcState
 
In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?CS, NcState
 
Sayyad slides ase13_v4
Sayyad slides ase13_v4Sayyad slides ase13_v4
Sayyad slides ase13_v4CS, NcState
 

Mais de CS, NcState (20)

Talks2015 novdec
Talks2015 novdecTalks2015 novdec
Talks2015 novdec
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
GALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software EngineeringGALE: Geometric active learning for Search-Based Software Engineering
GALE: Geometric active learning for Search-Based Software Engineering
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest link
 
Lexisnexis june9
Lexisnexis june9Lexisnexis june9
Lexisnexis june9
 
Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).Welcome to ICSE NIER’15 (new ideas and emerging results).
Welcome to ICSE NIER’15 (new ideas and emerging results).
 
Kits to Find the Bits that Fits
Kits to Find  the Bits that Fits Kits to Find  the Bits that Fits
Kits to Find the Bits that Fits
 
Ai4se lab template
Ai4se lab templateAi4se lab template
Ai4se lab template
 
Automated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSUAutomated Software Enging, Fall 2015, NCSU
Automated Software Enging, Fall 2015, NCSU
 
Requirements Engineering
Requirements EngineeringRequirements Engineering
Requirements Engineering
 
172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia172529main ken and_tim_software_assurance_research_at_west_virginia
172529main ken and_tim_software_assurance_research_at_west_virginia
 
Automated Software Engineering
Automated Software EngineeringAutomated Software Engineering
Automated Software Engineering
 
Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)Next Generation “Treatment Learning” (finding the diamonds in the dust)
Next Generation “Treatment Learning” (finding the diamonds in the dust)
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data Science
 
Goldrush
GoldrushGoldrush
Goldrush
 
Dagstuhl14 intro-v1
Dagstuhl14 intro-v1Dagstuhl14 intro-v1
Dagstuhl14 intro-v1
 
Know thy tools
Know thy toolsKnow thy tools
Know thy tools
 
What Metrics Matter?
What Metrics Matter? What Metrics Matter?
What Metrics Matter?
 
In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?
 
Sayyad slides ase13_v4
Sayyad slides ase13_v4Sayyad slides ase13_v4
Sayyad slides ase13_v4
 

Último

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Último (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

ICSE 2013 tutorial: Data Science for Software Engineering (slide transcript)

  • 12. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles
  • 13. The great myth • Let’s face it: – Humans are a pest – And experts doubly so. • “The notion of ‘user’ cannot be precisely defined and therefore has no place in CS and SE” – Edsger Dijkstra, ICSE 4, 1979 • http://en.wikipedia.org/wiki/List_of_cognitive_biases • 96 decision-making, belief and behavioral biases – Attentional bias – paying more attention to emotionally dominant stimuli in one’s environment while neglecting relevant data • 23 social biases – Worse-than-average effect – believing we are worse than others at tasks which are difficult • 52 memory errors and biases – Illusory correlation – inaccurately remembering a relationship between two events
  • 14. The great myth • Wouldn’t it be wonderful if we did not have to listen to them – The dream of olde-worlde machine learning • Circa 1980s – Dispense with live experts and resurrect dead ones. • But any successful learner needs biases – Ways to know what’s important • What’s dull • What can be ignored – No bias? Can’t ignore anything • No summarization • No generalization • No way to predict the future
  • 15. Christian Bird, data miner, Microsoft Research, Redmond – Assesses learners by “engagement”. A successful “Bird” session: • Knowledge engineers enter with sample data • Users take over the spreadsheet • Run many ad hoc queries • In such meetings, users often… • demolish the model • offer more data • demand you come back next week with something better. Expert data scientists spend more time with users than algorithms
  • 16. Also: Users control budgets • Why talk to users? – Cause they own the wallet • As the Mercury astronauts used to say – No bucks, no Buck Rogers • We need to give users a sense of comfort that we know what we are doing – That they are part of the process – That we understand their problem and processes – Else, budget = $0
  • 17. The Inductive Engineering Manifesto • Users before algorithms: – Mining algorithms are only useful in industry if users fund their use in real-world applications. • Data science – Understanding user goals to inductively generate the models that most matter to the user. • T. Menzies, C. Bird, T. Zimmermann, W. Schulte, and E. Kocaguneli. The inductive software engineering manifesto. (MALETS '11).
  • 18. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles
  • 19. Algorithms are only part of the story • Drew Conway, The Data Science Venn Diagram, 2009, • http://www.dataists.com/2010/09/the-data-science-venn-diagram/ • Dumb data miners miss important domain semantics • An ounce of domain knowledge is worth a ton of algorithms. • Math and statistics only get you machine learning. • Science is about discovery and building knowledge, which requires some motivating questions about the world • The culture of academia does not reward researchers for understanding domains.
  • 20. Case Study #1: NASA • NASA’s Software Engineering Lab, 1990s – Gave free access to all comers to their data – But you had to come to get it (to learn the domain) – Otherwise: mistakes • E.g. one class of software module with far more errors than anything else. – Dumb data mining algorithms might learn that this kind of module is inherently more error prone • Smart data scientists might ask “what kind of programmer works on that module?” – A: we always give that stuff to our beginners as a learning exercise. * F. Shull, M. Mendonça, V. Basili, J. Carver, J. Maldonado, S. Fabbri, G. Travassos, and M. Ferreira, "Knowledge-Sharing Issues in Experimental Software Engineering", EMSE 9(1): 111-137, March 2004.
  • 21. Case Study #2: Microsoft • Distributed vs centralized development • Who owns the files? – Who owns the files with most bugs • Result #1 (which was wrong) – A very small number of people produce most of the core changes to a “certain Microsoft product”. – Kind of an uber-programmer result – I.e. given thousands of programmers working on a project • Most just re-arrange deck chairs • To improve software process, ignore the drones and focus mostly on the queen bees • WRONG: – Microsoft does much auto-generation of intermediary build files. – And only a small number of people are responsible for the builds – And that core build team “owns” those auto-generated files – Skewed the results, sending us in the wrong direction • Needed to spend weeks/months understanding build practices – BEFORE doing the defect studies. * E. Kocaguneli, T. Zimmermann, C. Bird, N. Nagappan, T. Menzies. Distributed Development Considered Harmful?. ICSE 2013 SEIP Track, San Francisco, CA, USA, May 2013.
  • 22. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles
  • 23. You go mining with the data you have—not the data you might want • In the usual case, you cannot control data collection. – For example, data mining at NASA 1999 – 2008 • Information collected from layers of sub-contractors and sub-sub-contractors. • Any communication to data owners had to be mediated by up to a dozen account managers, all of whom had much higher priority tasks to perform. • Hence, we caution that usually you must: – Live with the data you have or dream of accessing at some later time. 22
  • 24. Rinse before use • Data quality tests (*) – Linear time checks for (e.g.) repeated rows • Column and row pruning for tabular data – Bad columns contain noise, irrelevancies – Bad rows contain confusing outliers – Repeated results: • Signal is a small nugget within the whole data • R rows and C cols can be pruned back to R/5 and C^0.5 (i.e. √C) • Without losing signal. * M. Shepperd, Q. Song, Z. Sun, C. Mair, "Data Quality: Some Comments on the NASA Software Defect Data Sets," IEEE TSE, 2013, pre-prints.
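As a concrete illustration, here is a minimal numpy sketch of such rinsing, assuming a numeric feature matrix X and labels y; the duplicate-row and zero-variance checks below are illustrative stand-ins for the fuller battery of tests in Shepperd et al.:

    import numpy as np

    def rinse(X, y):
        """Drop repeated rows and constant (zero-signal) columns."""
        data = np.column_stack([X, y])
        _, keep = np.unique(data, axis=0, return_index=True)  # repeated rows
        keep = np.sort(keep)
        X, y = X[keep], y[keep]
        good = X.std(axis=0) > 0           # constant columns carry no signal
        return X[:, good], y

    X = np.array([[1.0, 5.0, 3.0],
                  [1.0, 5.0, 3.0],         # a repeated row
                  [2.0, 5.0, 1.0]])
    y = np.array([0, 0, 1])
    Xc, yc = rinse(X, y)
    print(Xc.shape)                        # (2, 2): one row, one column gone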
  • 25. e.g. NASA effort data: most projects are rated as highly complex, i.e. there is no information in saying “complex”. The more features we remove for smaller projects, the better the predictions.
  • 26. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles
  • 27. Do it again, and again, and again, and … In any industrial application, data science is repeated multiple times to answer an extra user question, make some enhancement and/or bug fix to the method, or deploy it to a different set of users.
  • 28. Thou shall not click • For serious data science studies, – to ensure repeatability, – the entire analysis should be automated – using some high level scripting language; • e.g. R-script, Matlab, Bash, …. 27
  • 31. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles 30
  • 32. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles
  • 33. How to Solve Lack or Scarcity of Local Data 32
  • 34. What are my options? Isn’t local (within) data better? It may not be available; it may be scarce; data collection is tedious and too slow. The verdict with global (cross) data? Effort estimation [1]: no clear winners, either way. Defect prediction [2]: can use global data as a stopgap. [1] Barbara A. Kitchenham, Emilia Mendes, Guilherme Horta Travassos: Cross versus Within-Company Cost Estimation Studies: A Systematic Review. IEEE Trans. Software Eng. 33(5): 316-329 (2007). [2] B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the relative value of cross-company and within-company data for defect prediction”, Empirical Software Engineering Journal, Vol.14/5, pp.540-578, 2009.
  • 35. Comparing options • For NASA data – Seven test sets from 10% of each source • Treatment CC (using global) – Train on the 6 other data sets • Treatment WC (using local) – Train on the remaining 90% of the local data 34
  • 36. NN-Filtering Step 1: Calculate the pairwise Euclidean distances between the local (test) set and the candidate (global) training set. Step 2: For each test datum, pick its k nearest neighbors from the global set. Step 3: Take the unique instances from the union of those selections across the whole local set to construct the final training set. Now, train your favorite model on the filtered training set! B. Turhan, A. Bener, and T. Menzies, “Nearest Neighbor Sampling for Cross Company Defect Predictors”, in Proceedings of the 1st International Workshop on Defects in Large Software Systems (DEFECTS 2008), pp. 26, 2008.
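A minimal sketch of the filter, assuming numeric feature matrices for the local (test) and global (candidate training) sets; the names are ours, not the paper's:

    import numpy as np

    def nn_filter(local_X, global_X, k=10):
        """Burak-style filter: keep each local row's k nearest global rows."""
        selected = set()
        for row in local_X:
            d = np.sqrt(((global_X - row) ** 2).sum(axis=1))   # step 1
            selected.update(np.argsort(d)[:k].tolist())        # step 2
        return sorted(selected)                                # step 3: unique

    local_X = np.random.rand(5, 3)
    global_X = np.random.rand(100, 3)
    idx = nn_filter(local_X, global_X)
    # now train your favorite model on global_X[idx] (and its labels)
    print(len(idx), "of", len(global_X), "global rows kept")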
  • 37. More Comparisons: PD • For NASA data – Seven test sets from 10% of each source • Treatment CC (using global) – Train on the 6 other data sets • Treatment WC (using local) – Train on the remaining 90% of the local data • Treatment NN (using global+NN) – Initialize train set with 6 other data sets, – Prune the train set to just the 10 nearest neighbors (Euclidean) of the test set (discarding repeats) 36 B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the relative value of cross-company and within-company data for defect prediction”, Empirical Software Engineering Journal, Vol.14/5, pp.540-578, 2009.
  • 38. More Comparisons: PF • For NASA data – Seven test sets from 10% of each source • Treatment CC (using global) – Train on the 6 other data sets • Treatment WC (using local) – Train on the remaining 90% of the local data • Treatment NN (using global+NN) – Initialize train set with 6 other data sets, – Prune the train set to just the 10 nearest neighbors (Euclidean) of the test set (discarding repeats) 37 B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the relative value of cross-company and within-company data for defect prediction”, Empirical Software Engineering Journal, Vol.14/5, pp.540-578, 2009.
  • 39. External Validity • For SOFTLAB data – Three test sets from embedded systems • Treatment CC (using global) – Train on the seven NASA data sets • Treatment WC (using local) – Train on the remaining two local data sets • Treatment NN (using global+NN) – Initialize train set with 7 NASA data sets, – Prune the train set to just the 10 nearest neighbors (Euclidean) of the test set (discarding repeats). B. Turhan, T. Menzies, A. Bener and J. Distefano, “On the relative value of cross-company and within-company data for defect prediction”, Empirical Software Engineering Journal, Vol.14/5, pp.540-578, 2009.
  • 41. “Theories can be learned from a very small sample of available data” Microsampling • Given N defective modules: – M = {25, 50, 75, ...} <= N – Select M defective and M defect-free modules. – Learn theories on 2M instances • Undersampling: M=N • 8/12 datasets -> M = 25 • 1/12 datasets -> M = 75 • 3/12 datasets -> M = {200, 575, 1025} T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, Y. Jiang, “Implications of Ceiling Effects in Defect Predictors”, in Proceedings of the 4th International Workshop on Predictor Models in Software Engineering (PROMISE 2008), pp. 47-54, 2008.
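A minimal sketch of microsampling, assuming binary labels (1 = defective); the names are illustrative:

    import numpy as np

    def microsample(X, y, m, rng):
        """Train on just m defective plus m defect-free modules
        (m = N defectives would be plain undersampling)."""
        pos = np.where(y == 1)[0]
        neg = np.where(y == 0)[0]
        take = np.concatenate([rng.choice(pos, m, replace=False),
                               rng.choice(neg, m, replace=False)])
        rng.shuffle(take)
        return X[take], y[take]            # 2m training instances

    rng = np.random.default_rng(1)
    X = rng.random((1000, 20))
    y = (rng.random(1000) < 0.2).astype(int)
    for m in (25, 50, 75):
        Xs, ys = microsample(X, y, m, rng)
        print(m, Xs.shape)                 # learn your theory on (Xs, ys)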
  • 42. How about mixing local and global? • Is it feasible to use additional data from other projects: – (Case 1) When there is limited local project history, i.e. no prior releases – (Case 2) When there is existing local project history, i.e. many releases over some period 42B. Turhan, A. T. Mısırlı, A. Bener, “Empirical Evaluation of The Effects of Mixed Project Data on Learning Defect Predictors”, (in print) Journal of Information and Software Technology, 2013 • For 73 versions of 41 projects – Reserve test sets from 10% of each project – Additional test sets if the project has multiple releases • Treatment WP (using local) – Train on 10%..90% of the local data – Train on the previous releases • Treatment WP+CP (using global) – Enrich training sets above with NN- filtered data from all other projects Case 1: WP(10%) + CP is as good as WP(90%) Case 2: WP+CP is significantly better than WP (with small effect size)
  • 43. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles
  • 44. How to Prune Data, Simpler and Smarter 44 Data is the new oil
  • 45. And it has a cost too 45 e.g. $1.5M spent by NASA in the period 1987 to 1990 to understand the historical records of all their software in support of the planning activities for the International Space Station [1] Do we need to discuss all the projects and all the features in a client meeting or in a Delphi session? Similarly, do we need all the labels for supervised methods? [1] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy, “Active learning and effort estimation: Finding the essential content of software effort estimation data,” IEEE Trans. on Softw. Eng., vol. Preprints, 2013.
  • 46. Data for Industry / Active Learning 46 Concepts of E(k) matrices and popularity… Let’s see it in action: Point to the person closest to you
  • 47. Data for Industry / Active Learning. Instance pruning (we want the instances that are similar to others): 1. Calculate the “popularity” of instances 2. Sort by popularity 3. Label one instance at a time 4. Find the stopping point 5. Return the closest neighbor from the active pool as the estimate. Synonym pruning (we want the dissimilar features that are unlike the others): 1. Calculate the popularity of features 2. Select non-popular features.
  • 48. Data for Industry / Active Learning. Finding the stopping point: stop asking for labels as soon as one of these rules fires: • All popular instances are exhausted. • There is no MRE (magnitude of relative error = abs(actual − predicted)/actual) improvement for n consecutive times. • The ∆ between the best and the worst error of the last n times is very small (∆ = 0.1; n = 3).
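A minimal sketch of this popularity-then-label loop, assuming numeric features, numeric (effort-style) labels supplied on demand by an ask_label callback standing in for the expert, and the stopping rules above; an illustrative reconstruction, not the exact published code:

    import numpy as np

    def popularity(X):
        """How often each row is some other row's nearest neighbor."""
        d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
        np.fill_diagonal(d, np.inf)
        return np.bincount(d.argmin(axis=1), minlength=len(X))

    def active_loop(X, ask_label, n=3, delta=0.1):
        order = np.argsort(-popularity(X))       # most popular first
        labeled, errors = {}, []
        for i in order:                          # rule 1: stop when exhausted
            labeled[i] = ask_label(i)            # buy one label at a time
            if len(labeled) >= 2:
                idx = np.array(sorted(labeled))
                d = np.sqrt(((X[idx][:, None] - X[idx][None, :]) ** 2).sum(axis=2))
                np.fill_diagonal(d, np.inf)
                pred = np.array([labeled[idx[j]] for j in d.argmin(axis=1)])
                actual = np.array([labeled[j] for j in idx])
                errors.append(np.median(np.abs(actual - pred) / actual))
            # rules 2 and 3: error has plateaued over the last n rounds
            if len(errors) >= n and max(errors[-n:]) - min(errors[-n:]) < delta:
                break
        return labeled

    rng = np.random.default_rng(0)
    X = rng.random((30, 4))
    y = rng.random(30) + 0.5                     # pretend effort values
    got = active_loop(X, lambda i: y[i])
    print(len(got), "of", len(X), "labels bought")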
  • 49. Data for Industry / Active Learning. QUICK: an active learning solution, i.e. it starts unsupervised and instances are then labeled, at a cost, by the expert. • We want to stop before all the instances are labeled.
  • 50. Data for Industry / Active Learning. [Plot: instances sorted in decreasing popularity (x-axis) vs. median MRE (y-axis).] Picking a random training instance is not a good idea; more popular instances in the active pool decrease error; then one of the stopping-point conditions fires.
  • 51. Data for Industry / Active Learning. At most 31% of all the cells are needed; on median, 10%. Intrinsic dimensionality: there is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data is not truly high-dimensional [1]. [1] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.
  • 52. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles
  • 53. How to Advance Simple CBR Methods. Case-based reasoning (CBR) methods make use of similar past projects for estimation. They are very widely used because [1]: • No model calibration to local data • Can better handle outliers • Can work with 1 or more attributes • Easy to explain. Two promising research areas: • weighting the selected analogies [2] • improving design options [3]. [1] F. Walkerden and R. Jeffery, “An empirical study of analogy-based software effort estimation,” Empirical Software Engineering, vol. 4, no. 2, pp. 135–158, 1999. [2] E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost estimation models for web hypermedia applications,” Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003. [3] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008.
  • 54. How to Advance Simple CBR Methods. a) Weighting analogies: building on previous research [1], we adopted two different strategies [2] and used kernel weighting to weigh the selected analogies [3], comparing the performance of each k-value with and without weighting. In none of the scenarios did we see a significant improvement. [1] E. Mendes, I. D. Watson, C. Triggs, N. Mosley, and S. Counsell, “A comparative study of cost estimation models for web hypermedia applications,” Empirical Software Engineering, vol. 8, no. 2, pp. 163–196, 2003. [2] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008. [3] E. Kocaguneli, T. Menzies, and J. W. Keung, "Kernel methods for software effort estimation," Empirical Software Engineering 18.1 (2013): 1-24.
  • 55. How to Advance Simple CBR Methods. b) Designing ABE methods. Easy-path: remove training instances that violate assumptions (TEAK, discussed later, takes this route). D-ABE, built on theoretical maximum prediction accuracy (TMPA) [1]: • Get the best estimates of all training instances • Remove all the training instances within half of the worst MRE (according to TMPA) • Return the closest neighbor’s estimate for the test instance. [Figure: training instances a–f around a test instance; those close to the worst MRE are removed, and the closest remaining neighbor’s estimate is returned.] [1] J. W. Keung, “Theoretical Maximum Prediction Accuracy for Analogy-Based Software Cost Estimation,” 15th Asia-Pacific Software Engineering Conference, pp. 495–502, 2008.
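A minimal sketch of those three D-ABE steps, assuming numeric features and effort values; an illustrative reconstruction rather than the authors' implementation:

    import numpy as np

    def dists(a, B):
        return np.sqrt(((B - a) ** 2).sum(axis=1))

    def d_abe(train_X, train_y, test_x):
        # step 1: best estimate of each training row = its closest neighbor
        mre = np.empty(len(train_X))
        for i, (x, actual) in enumerate(zip(train_X, train_y)):
            d = dists(x, train_X)
            d[i] = np.inf                        # leave-one-out
            est = train_y[d.argmin()]
            mre[i] = abs(actual - est) / actual
        # step 2: drop the rows within half of the worst MRE
        # (a real implementation would guard against dropping everything)
        keep = mre < mre.max() / 2
        # step 3: answer with the closest surviving neighbor's effort
        return train_y[keep][dists(test_x, train_X[keep]).argmin()]

    # usage: est = d_abe(np.array(Xtrain), np.array(ytrain), np.array(xtest))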
  • 56. How to Advance Simple CBR Methods. [Charts: D-ABE compared to static-k ABE w.r.t. MMRE, and w.r.t. win, tie, loss counts.]
  • 57. How to Advance Simple CBR Methods / Using CBR for cross-company learning. Finding enough local training data is a fundamental problem [1]. The merits of using cross-data from another company are questionable [2]; there are similar amounts of evidence for and against the performance of cross-data [3, 4]. We use a relevancy filtering method called TEAK on public and proprietary data sets. [1] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,” Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, 2009. [2] E. Kocaguneli and T. Menzies, “How to find relevant data for effort estimation,” in ESEM’11: International Symposium on Empirical Software Engineering and Measurement, 2011. [3] B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,” IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 316–329, 2007. [4] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: A large scale experiment on data vs. domain vs. process,” ESEC/FSE, pp. 91–100, 2009.
  • 58. How to Advance Simple CBR Methods / Using CBR for cross-company learning. TEAK: similar projects with dissimilar effort values give high variance; similar projects with similar effort values give low variance. Build a second GAC tree with the low-variance instances and return the closest neighbor’s value from the lowest-variance region. Cross data works as well as within data for 6 out of 8 proprietary data sets and 19 out of 21 public data sets after TEAK’s relevancy filtering [1]. In summary: the design options of CBR help, but not fiddling with single instances and weights! [1] E. Kocaguneli and T. Menzies, “How to find relevant data for effort estimation,” in ESEM’11: International Symposium on Empirical Software Engineering and Measurement, 2011.
  • 59. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles
  • 60. Is Data Sharing Worth the Risk to Individual Privacy? What would William Weld say? • Former Governor of Massachusetts • Victim of a re-identification privacy breach • Led to sensitive attribute disclosure of his medical records.
  • 61. Is Data Sharing Worth the Risk to Individual Privacy What about NASA contractors? Subject to competitive bidding every 2 years. Unwilling to share data that would lead to sensitive attribute disclosure. e.g. actual software development times
  • 62. When To Share – How To Share. So far we cannot guarantee 100% privacy; what we have is a directive as to whether data is private and useful enough to share... We have a lot of privacy algorithms geared toward minimizing risk (old school: k-anonymity, l-diversity, t-closeness). But what about maximizing benefits (utility)? The degree of risk to the data-sharing entity must not exceed the benefits of sharing.
  • 64. Balancing Privacy and Utility, or... minimize the risk of privacy disclosure while maximizing utility: instance selection with CLIFF + small random moves with MORPH. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE Computer Society Digital Library.
  • 65. CLIFF Don't share all the data. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
  • 66. CLIFF step 1: for each class, find the ranks (power) of all attribute values; e.g. "a=r1" is powerful for selecting class=yes because it is more common in "yes" than in "no". F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013.
  • 67. CLIFF step 2: multiply the ranks of each row. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013.
  • 68. CLIFF step 3: select the most powerful rows of each class. Scalability: CLIFF runs in linear time and can reduce N rows to 0.1N, so an O(N^2) NUN algorithm then takes 0.01 of the original time. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013.
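A minimal sketch of the three CLIFF steps for discrete attributes; the exact "power" formula below is one plausible choice, not necessarily the paper's:

    import numpy as np
    from collections import Counter

    def cliff(rows, labels, keep=0.2):
        rows, labels = np.asarray(rows, dtype=object), np.asarray(labels)
        kept = []
        for c in set(labels):
            inside, outside = rows[labels == c], rows[labels != c]
            power = np.ones(len(inside))
            for col in range(rows.shape[1]):
                fin = Counter(inside[:, col])       # step 1: value counts
                fout = Counter(outside[:, col])     #         per class
                for i, v in enumerate(inside[:, col]):
                    a = fin[v] / len(inside)
                    b = fout[v] / max(len(outside), 1)
                    power[i] *= a * a / (a + b)     # step 2: multiply ranks
            top = np.argsort(-power)[: max(1, int(keep * len(inside)))]
            kept.extend(inside[top])                # step 3: strongest rows
        return kept

    rows = [["r1", "s1"], ["r1", "s2"], ["r2", "s2"], ["r2", "s1"]]
    print(len(cliff(rows, ["yes", "yes", "no", "no"], keep=0.5)))  # 2 rows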
  • 69. MORPH: push the CLIFF data from their original position: y = x ± (x − z) * r, where x ∈ D is the original instance, z ∈ D is the nearest unlike neighbor (NUN) of x, and y is the resulting MORPHed instance. F. Peters and T. Menzies, “Privacy and utility for defect prediction: Experiments with morph,” in Software Engineering (ICSE), 2012 34th International Conference on, June 2012, pp. 189–199. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013.
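A minimal sketch of MORPH on numeric features; the range for the random factor r (0.15 to 0.35) is the kind of small range the paper intends, but treat the exact numbers as an assumption:

    import numpy as np

    def morph(X, y, rng, lo=0.15, hi=0.35):
        out = X.astype(float).copy()
        for i, x in enumerate(X):
            unlike = X[y != y[i]]                  # candidates for the NUN
            z = unlike[np.sqrt(((unlike - x) ** 2).sum(axis=1)).argmin()]
            r = rng.uniform(lo, hi, size=x.shape)
            sign = rng.choice([-1.0, 1.0], size=x.shape)
            out[i] = x + sign * (x - z) * r        # y = x ± (x − z) * r
        return out, y                              # labels stay unchanged

    rng = np.random.default_rng(0)
    X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]])
    Xp, _ = morph(X, np.array([0, 0, 1]), rng)
    print(Xp)                                      # privatized feature rows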
  • 70. Case Study: Cross-Company Defect Prediction (CCDP) Sharing Required. Zimmermann et al. Local data not always available • companies too small • product in first release, so no past data. Kitchenham et al. • no time for collection • new technology can make all data irrelevant T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process.” in ESEC/SIGSOFT FSE’09,2009 B. A. Kitchenham, E. Mendes, and G. H. Travassos, “Cross versus within-company cost estimation studies: A systematic review,” IEEE Transactions on Software Engineering, vol. 33, pp. 316–329, 2007 - Company B has little or no data to build a defect model; - Company B uses data from Company A to build defect models;
  • 71. CCDP: better with data filtering. Initial results with cross-company defect prediction were negative (Zimmermann FSE ’09) or inconclusive (Kitchenham TSE ’07); more recent work shows better results (Turhan et al. 2009, the “Burak filter”). B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, “On the relative value of cross-company and within-company data for defect prediction,” Empirical Software Engineering, vol. 14, pp. 540–578, 2009. F. Peters, T. Menzies, and A. Marcus, “Better Cross Company Defect Prediction,” Mining Software Repositories (MSR), 2013 10th IEEE Working Conference on (to appear).
  • 72. Making Data Private for CCDP Here is how we look at the data Terms Non-Sensitive Attribute (NSA) Sensitive Attribute Class Attribute
  • 73. Measuring the Risk: IPR = Increased Privacy Ratio. Worked example over three queries against the original and privatized data:
    Query  Original  Privatized  Privacy breach?
    Q1     0         0           yes
    Q2     0         1           no
    Q3     1         1           yes
    yes = 2/3, so IPR = 1 - 2/3 = 0.33. F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013.
  • 74. Measuring the Utility: the g-measure, built from the probability of detection (pd) and the probability of false alarm (pf):
    (Actual yes, predicted yes) = TP; (actual no, predicted yes) = FP; (actual yes, predicted no) = FN; (actual no, predicted no) = TN.
    pd = TP/(TP+FN); pf = FP/(FP+TN); g-measure = 2*pd*(1-pf)/(pd+(1-pf)). F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013.
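Both measures reduce to a few lines; a small sketch (the counts in the demo are illustrative):

    def ipr(breaches, queries):
        """Increased privacy ratio: 1 - fraction of queries breached."""
        return 1 - breaches / queries

    def g_measure(tp, fp, fn, tn):
        pd = tp / (tp + fn)                 # probability of detection
        pf = fp / (fp + tn)                 # probability of false alarm
        return 2 * pd * (1 - pf) / (pd + (1 - pf))

    print(round(ipr(2, 3), 2))                  # 0.33, as in the IPR example
    print(round(g_measure(40, 10, 10, 40), 2))  # pd=0.8, pf=0.2 -> 0.8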
  • 75. Making Data Private for CCDP: comparing CLIFF+MORPH to data swapping and k-anonymity. Data swapping (s10, s20, s40), a standard perturbation technique used for privacy: for each NSA, a certain percent of the values are swapped with any other value in that NSA; for our experiments, these percentages are 10, 20 and 40. k-anonymity (k2, k4), via the Datafly algorithm: make a generalization hierarchy, replace values in the NSA according to the hierarchy, and continue until there are k or fewer distinct instances, then suppress them. K. Taneja, M. Grechanik, R. Ghani, and T. Xie, “Testing software in age of data privacy: a balancing act,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, ser. ESEC/FSE ’11. New York, NY, USA: ACM, 2011, pp. 201–211. L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 571–588, Oct. 2002.
  • 76. Making Data Private for CCDP Comparing CLIFF+MORPH to Data Swapping and K-anonymity F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
  • 77. Making Data Private for CCDP Comparing CLIFF+MORPH to Data Swapping and K-anonymity F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
  • 78. Making Data Private for CCDP F. Peters, T. Menzies, L. Gong, H. Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction," IEEE Transactions on Software Engineering, 24 Jan. 2013. IEEE computer Society Digital Library. IEEE Computer Society
  • 79. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles 80
  • 80. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles
  • 81. Problems of SE Models • Instability is the problem of not being able to elicit same/similar results under changing conditions – E.g. data set, performance measure etc. 82 • We will look at instability in 2 areas – Instability in Effort Estimation – Instability in Process
  • 82. Problems of SE Models / Instability in Effort. There is no agreed-upon best estimation method [1]; methods change ranking w.r.t. conditions such as data sets and error measures [2]. Experimenting with: 90 solo-methods, 20 public data sets, 7 error measures. [1] M. Jorgensen and M. Shepperd, “A systematic review of software development cost estimation studies,” IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33–53, 2007. [2] I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in comparative studies of software prediction models,” IEEE Trans. Softw. Eng., vol. 31, no. 5, pp. 380–391, May 2005.
  • 83. 84 Problems of SE Models/ Instability in Effort 1. Rank methods acc. to win, loss and win-loss values 2. δr is the max. rank change 3. Sort methods acc. to loss and observe δr values
  • 84. Problems of SE Models / Instability in Effort. We have a set of superior methods to recommend: the top 13 methods are CART & ABE methods (1NN, 5NN). Assembling solo-methods may be a good idea, but the previous evidence of assembling multiple methods in SEE is discouraging: Baker et al. [1], Kocaguneli et al. [2], and Khoshgoftaar et al. [3] failed to outperform solo-methods. [1] D. Baker, “A hybrid approach to expert and model-based effort estimation,” Master’s thesis, Lane Department of Computer Science and Electrical Engineering, West Virginia University, 2007, available from https://eidr.wvu.edu/etd/documentdata.eTD?documentid=5443. [2] E. Kocaguneli, Y. Kultur, and A. Bener, “Combining multiple learners induced on multiple datasets for software effort prediction,” in International Symposium on Software Reliability Engineering (ISSRE), 2009, student paper. [3] T. M. Khoshgoftaar, P. Rebours, and N. Seliya, “Software quality analysis by combining multiple projects and learners,” Software Quality Control, vol. 17, no. 1, pp. 25–49, 2009.
  • 85. 86 Combine top 2,4,8,13 solo-methods via mean, median and IRWM Problems of SE Models/ Instability in Effort Re-rank solo and multi-methods together
  • 86. Problems of SE Models/ Instability in Process: Dataset Shift/Concept Drift 87Candela JQ, Sugiyama M, Schwaighofer A, Lawrence ND (eds) (2009) Dataset shift in machine learning. The MIT Press, Cambridge, MA
  • 87. Dataset Shift: Covariate Shift – p(Xtrain) ≠ p(Xtest) • Consider a size-based effort estimation model – Effective for projects within the traditional operational boundaries of a company • What if a change impacts the products’ size: – new business domains – change in technologies – change in development techniques • A simple shift-check sketch follows. (Figure: effort vs. size distributions before and after the change.) 89 B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol. 17/1-2, pp. 62-74, 2012.
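One practical way to spot covariate shift is a two-sample test per input feature. A minimal sketch using SciPy's Kolmogorov–Smirnov test; the 0.05 threshold and the per-feature testing strategy are our assumptions, not from the paper.

```python
import numpy as np
from scipy.stats import ks_2samp

def shifted_features(X_train, X_test, alpha=0.05):
    """Return indices of features where p(X_train) and p(X_test) appear to differ."""
    shifted = []
    for j in range(X_train.shape[1]):
        _, p_value = ks_2samp(X_train[:, j], X_test[:, j])
        if p_value < alpha:  # distributions differ -> possible covariate shift
            shifted.append(j)
    return shifted

# e.g., project size roughly doubled after a move to a new business domain
X_old = np.random.default_rng(0).normal(100, 20, size=(50, 1))
X_new = np.random.default_rng(1).normal(200, 20, size=(50, 1))
print(shifted_features(X_old, X_new))  # -> [0]
```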
  • 88. Dataset Shift: Prior Probability Shift – p(Ytrain) ≠ p(Ytest) • Now, consider a defect prediction model… • … and again, what if defect characteristics change: – Process improvement – More QA resources – Increased experience over time – Basically, you improve over time! (Figure: %defects vs. kLOC distributions before and after the change.) 90 B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol. 17/1-2, pp. 62-74, 2012.
  • 89. Dataset Shift: Usual Suspects – Sample Selection Bias & Imbalanced Data 91 B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol. 17/1-2, pp. 62-74, 2012.
  • 90. Dataset Shift: Usual Suspects – Sample Selection Bias & Imbalanced Data 92 B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol. 17/1-2, pp. 62-74, 2012.
  • 91. Dataset Shift: Domain Shift & Source Component Shift • Domain shift – Be consistent in the way you measure concepts for model training and testing! – *: “…the metrics based assessment of a software system and measures taken to improve its design differ considerably from tool to tool.” • Source component shift – a.k.a. data heterogeneity – Ex: ISBSG contains data from 6000+ projects from 30+ countries. Where do the training data come from? vs. Where do the test data come from? 93 * Rüdiger Lincke, Jonas Lundberg, and Welf Löwe. “Comparing software metrics tools”, ISSTA ’08. B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol. 17/1-2, pp. 62-74, 2012.
  • 92. Managing Dataset Shift – a matrix mapping techniques (outlier detection, relevancy filtering, instance weighting, stratification, cost curves, mixture models) to the shift types above (covariate shift, prior probability shift, sampling/selection bias, imbalanced data, domain shift, source component shift). 94 B. Turhan, “On the Dataset Shift Problem in Software Engineering Prediction Models”, Empirical Software Engineering Journal, Vol. 17/1-2, pp. 62-74, 2012.
  • 93. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles 95
  • 94. Envy = The Wisdom of the Cows • Seek the fence where the grass is greener on the other side • Learn from there • Test on here • Cluster to find “here” and “there” 96
  • 95. Data = multi-dimensional vectors. Example (COCOMO-style ARFF; see the loading sketch below):
@attribute recordnumber real
@attribute projectname {de,erb,gal,X,hst,slp,spl,Y}
@attribute cat2 {Avionics, application_ground, avionicsmonitoring, … }
@attribute center {1,2,3,4,5,6}
@attribute year real
@attribute mode {embedded,organic,semidetached}
@attribute rely {vl,l,n,h,vh,xh}
@attribute data {vl,l,n,h,vh,xh}
…
@attribute equivphyskloc real
@attribute act_effort real
@data
1,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,25.9,117.6
2,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,24.6,117.6
3,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,7.7,31.2
4,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,8.2,36
5,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,9.7,25.2
6,de,avionicsmonitoring,g,2,1979,semidetached,h,l,h,n,n,l,l,n,n,n,n,h,h,n,l,2.2,8.4
…
97
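Such PROMISE-style ARFF files can be read directly into vectors; a minimal sketch (the file name is hypothetical):

```python
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("coc81.arff")   # hypothetical file name
df = pd.DataFrame(data)                    # one row = one project vector
X = df.drop(columns=["act_effort"])        # independent attributes
y = df["act_effort"]                       # class variable: actual effort
print(meta.names())                        # attribute names from the header
```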
  • 96. CAUTION: data may not divide neatly on raw dimensions • The best description for SE projects may be synthesized dimensions extracted from the raw dimensions 98
  • 97. Fastmap: Faloutsos [1995], an O(2N) generation of the axis of large variability • Pick any point W • Find X, the point furthest from W • Find Y, the point furthest from X • Let c = dist(X,Y) • Every point has distances a, b to (X, Y): x = (a² + c² − b²) / 2c and y = sqrt(a² − x²) • Find median(x), median(y) • Recurse on the four quadrants (see the sketch below) 99
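A minimal Python sketch of that projection step; the distance function is left to the caller, and this is our reading of the slide's recipe rather than the tutorial's own code.

```python
import math

def fastmap_axis(points, dist):
    """Project points onto an axis of large variability using O(2N) distance calls."""
    w = points[0]                                    # pick any point W
    x_pt = max(points, key=lambda p: dist(w, p))     # X: furthest from W
    y_pt = max(points, key=lambda p: dist(x_pt, p))  # Y: furthest from X
    c = dist(x_pt, y_pt)
    coords = []
    for p in points:
        a, b = dist(x_pt, p), dist(y_pt, p)
        x = (a**2 + c**2 - b**2) / (2 * c)           # position along the X-Y axis
        y = math.sqrt(max(a**2 - x**2, 0.0))         # distance off that axis
        coords.append((x, y))
    return coords                                    # then split at median(x), median(y)

# e.g., Euclidean distance over numeric project vectors
euclid = lambda p, q: math.dist(p, q)
print(fastmap_axis([(0, 0), (10, 0), (5, 4)], euclid))
```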
  • 98. Hierarchical partitioning • Grow: – Find two orthogonal dimensions – Find median(x), median(y) – Recurse on four quadrants • Prune: – Combine quadtree leaves with similar densities – Score each cluster by the median score of the class variable 100
  • 99. Q: Why cluster via FASTMAP? • A1: “Circular” methods (e.g. k-means) assume round clusters, but density-based clustering allows clusters of any shape • A2: No need to pre-set the number of clusters • A3: Because other methods (e.g. PCA) are much slower – Fastmap is O(2N), even in unoptimized Python 101
  • 101. Envy = The Wisdom of the Cows • Seek the fence where the grass is greener on the other side • Learn from there • Test on here • Cluster to find “here” and “there” 103
  • 102. Hierarchical partitioning • Grow: – Find two orthogonal dimensions – Find median(x), median(y) – Recurse on four quadrants • Prune: – Combine quadtree leaves with similar densities – Score each cluster by the median score of the class variable 104
  • 103. Hierarchical partitioning • Grow: – Find two orthogonal dimensions – Find median(x), median(y) – Recurse on four quadrants • Prune: – Combine quadtree leaves with similar densities – Score each cluster by the median score of the class variable • Where is the grass greenest? A cluster envies the neighbor with a better score and the maximum abs(score(this) − score(neighbor)); see the sketch below. 105
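A sketch of that envy step, assuming each leaf cluster already carries a median class score and a list of adjacent clusters, and that lower scores (effort, defects) are better:

```python
def enviable_neighbor(cluster, neighbors, score):
    """Return the neighbor this cluster envies: a better (lower) score,
    maximizing abs(score(cluster) - score(neighbor))."""
    better = [n for n in neighbors if score(n) < score(cluster)]
    if not better:
        return None  # the grass is greenest right here
    return max(better, key=lambda n: abs(score(cluster) - score(n)))
```

Rules are then learned from the envied (greener) cluster and tested on the envying one.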
  • 104. Q: How to learn rules from neighboring clusters? • A: It doesn’t really matter – there are many competent rule learners • But to evaluate global vs. local rules: – Use the same rule learner for local and global rule learning • This study uses WHICH (Menzies [2010]) – Customizable scoring operator – Faster termination – Generates very small rules (good for explanation) 106
  • 105. Data from http://promisedata.googlecode.com • Effort reduction = {NasaCoc, China}: COCOMO or function points • Defect reduction = {lucene, xalan, jedit, synapse, etc.}: CK metrics (OO) • Clusters have an untreated class distribution • Rules select a subset of the examples: – generating a treated class distribution (Figure: 25th/50th/75th/100th percentile distributions for untreated data, data treated with rules learned from all data (global), and data treated with rules learned from the neighboring cluster (local).) 107
  • 106. By any measure, local is better than global: • Lower median efforts/defects (50th percentile) • Greater stability (75th − 25th percentile) • Decreased worst case (100th percentile) 108
  • 107. Rules learned in each cluster • What works best “here” does not work “there” – It is misguided to try to tame conclusion instability – it is inherent in the data • You can’t tame conclusion instability; instead, you can exploit it • Learn local lessons that do better than overly generalized global theories 109
  • 108. OUTLINE • PART 0: Introduction • PART 1: Organization Issues – Rule #1: Talk to the users – Rule #2: Know your domain – Rule #3: Suspect your data – Rule #4: Data science is cyclic • PART 2: Data Issues – How to solve lack or scarcity of data – How to prune data, simpler & smarter – How to advance simple CBR methods – How to keep your data private • PART 3: Model Issues – Problems of SE models – Solutions • Envy-based learning • Ensembles 110
  • 109. Solutions to SE Model Problems / Ensembles of Learning Machines* • Sets of learning machines grouped together • Aim: to improve predictive performance • Base learners B1, B2, …, BN produce estimation1, estimation2, …, estimationN • E.g.: ensemble estimation = Σ wi · estimationi (see the sketch below) * T. Dietterich. Ensemble Methods in Machine Learning. Proceedings of the First International Workshop in Multiple Classifier Systems. 2000.
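The weighted combination on this slide, as a minimal sketch; the base learners are assumed to expose a scikit-learn-style predict method, which is our convention here, not the slide's:

```python
import numpy as np

def ensemble_estimate(base_learners, weights, project):
    """ensemble estimation = sum_i w_i * estimation_i for one project vector."""
    project = np.asarray(project).reshape(1, -1)   # single project as a 2-D row
    return sum(w * b.predict(project)[0]
               for b, w in zip(base_learners, weights))
```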
  • 110. Solutions to SE Model Problems / Ensembles of Learning Machines • One of the keys: a diverse* ensemble: “base learners” make different errors on the same instances. * G. Brown, J. Wyatt, R. Harris, X. Yao. Diversity Creation Methods: A Survey and Categorisation. Journal of Information Fusion 6(1): 5-20, 2005.
  • 111. Solutions to SE Model Problems / Ensembles of Learning Machines • One of the keys: a diverse ensemble: “base learners” make different errors on the same instances • Different ensemble approaches can be seen as different ways to generate diversity among base learners! • Three different types of ensembles that have been applied to software effort estimation are presented in the next slides.
  • 112. Solutions to SE Model Problems / Static Ensembles • An existing training set (completed projects) is used for creating/training the ensemble of base learners B1, B2, …, BN.
  • 113. Solutions to SE Model Problems / Static Ensembles • Bagging ensembles of Regression Trees (Bag+RTs)* – Study with 13 data sets from the PROMISE and ISBSG repositories – Bag+RTs obtained the highest rank across data sets in terms of Mean Absolute Error (MAE) – Rarely performed considerably worse (>0.1 SA, where SA = 1 − MAE / MAErguess) than the best approach (an SA sketch follows). * L. Minku, X. Yao. Ensembles and Locality: Insight on Improving Software Effort Estimation. Information and Software Technology, Special Issue on Best Papers from PROMISE 2011, 2012 (in press), http://dx.doi.org/10.1016/j.infsof.2012.09.012.
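SA (standardized accuracy) compares a model's MAE to the MAE of random guessing. A minimal sketch; estimating MAE_rguess by repeatedly guessing a randomly drawn known effort is our reading of the usual formulation, so treat the details as an assumption:

```python
import numpy as np

def standardized_accuracy(y_true, y_pred, n_guesses=1000, seed=0):
    """SA = 1 - MAE / MAE_rguess. SA near 0 means 'no better than guessing'."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.abs(y_true - y_pred).mean()
    # random guessing: predict a randomly chosen known effort for each project
    guesses = rng.choice(y_true, size=(n_guesses, len(y_true)))
    mae_rguess = np.abs(y_true - guesses).mean()
    return 1 - mae / mae_rguess
```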
  • 114. Solutions to SE Model Problems / Static Ensembles • Bagging* ensembles of regression trees: sample the training data (completed projects) uniformly with replacement, and train one tree RT1, RT2, …, RTN of the ensemble on each sample. * L. Breiman. Bagging Predictors. Machine Learning 24(2):123-140, 1996.
  • 115. Solutions to SE Model Problems / Static Ensembles • Bagging ensembles of regression trees • Regression trees: – Estimation by analogy – Divide projects according to attribute values – The most impactful attributes are in the higher levels – Attributes with insignificant impact are not used – E.g., REPTrees* – Example tree: split on Functional Size (< 253 vs. >= 253, then < 151 vs. >= 151), with leaf estimates Effort = 1086, Effort = 2798 and Effort = 5376. * M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations 11(1), 2009. http://www.cs.waikato.ac.nz/ml/weka.
  • 116. Solutions to SE Model Problems / Static Ensembles • Bagging ensembles of regression trees in Weka: classifiers – meta – Bagging, with classifiers – trees – REPTree as the base learner (an equivalent scikit-learn sketch follows).
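Outside Weka, the same Bag+RTs setup can be sketched in recent scikit-learn (1.2+); the hyperparameters and toy data below are illustrative assumptions, and DecisionTreeRegressor stands in for Weka's REPTree:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for completed projects: attributes -> effort
rng = np.random.default_rng(1)
X = rng.uniform(10, 500, size=(100, 3))            # e.g., size/complexity attributes
y = 5 * X[:, 0] + rng.normal(0, 50, size=100)      # synthetic effort values

bag_rt = BaggingRegressor(
    estimator=DecisionTreeRegressor(),             # regression-tree base learner
    n_estimators=50,
    bootstrap=True,                                # sample uniformly with replacement
    random_state=1,
).fit(X, y)

print(bag_rt.predict(X[:3]))                       # effort estimates for new projects
```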
  • 117. Solutions to SE Model Problems / Static Ensembles • Multi-objective Pareto ensembles – There are different performance measures/metrics for evaluating SEE models; different measures capture different quality features of the models – E.g.: MAE, standard deviation, PRED, etc. – There is no agreed single measure – A model doing well on a certain measure may not do so well on another (Figure: Multilayer Perceptron (MLP) models created using Cocomo81.)
  • 118. Solutions to SE Model Problems / Static Ensembles • Multi-objective Pareto ensembles* – We can view SEE as a multi-objective learning problem – A multi-objective approach (e.g. a Multi-Objective Evolutionary Algorithm (MOEA)) can be used to: • Better understand the relationship among measures • Create ensembles that do well for a set of measures, in particular for larger data sets (>= 60) – Sample result: Pareto ensemble of MLPs (ISBSG). * L. Minku, X. Yao. Software Effort Estimation as a Multi-objective Learning Problem. ACM Transactions on Software Engineering and Methodology, 2012 (accepted). Author’s final version: http://www.cs.bham.ac.uk/~minkull/publications/MinkuYaoTOSEM12.pdf.
  • 119. Solutions to SE Model Problems / Static Ensembles • Multi-objective Pareto ensembles – From the training data (completed projects), a multi-objective evolutionary algorithm creates nondominated models with several different trade-offs – The model with the best performance in terms of each particular measure can be picked to form an ensemble (B1, B2, B3) with a good trade-off; a nondominated-filter sketch follows.
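The nondominated filter at the heart of this idea is easy to sketch; here every measure is an error to be minimized, and the toy values are invented for illustration:

```python
def pareto_front(errors):
    """Return indices of nondominated models.
    errors[i] is a tuple of error values (one per measure, all minimized)."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b)) and
                any(x < y for x, y in zip(a, b)))
    return [i for i, e in enumerate(errors)
            if not any(dominates(o, e) for j, o in enumerate(errors) if j != i)]

# e.g., (MAE, standard deviation) for four candidate MLPs
print(pareto_front([(0.30, 0.2), (0.25, 0.4), (0.50, 0.5), (0.30, 0.1)]))  # -> [1, 3]
```

From that front, the ensemble picks the model with the best value on each individual measure.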
  • 120. Solutions to SE Model Problems / Dynamic Adaptive Ensembles • Companies are not static entities – they can change with time (data set shift / concept drift*) • Models need to learn new information and adapt to changes • Companies can start behaving more or less similarly to other companies (Figure: predicting effort for a single company from ISBSG based on its projects and other companies’ projects.) * L. Minku, A. White, X. Yao. The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift. IEEE Transactions on Knowledge and Data Engineering, 22(5):730-742, 2010.
  • 121. Solutions to SE Model Problems / Dynamic Adaptive Ensembles • Dynamic Cross-company Learning (DCL)* – m cross-company (CC) training sets with different productivity (completed projects) train CC models 1…m; a within-company (WC) model is trained on projects arriving with time – Dynamic weights w1 … wm, wm+1 control how much each model contributes to predictions: • At each time step, “loser” models have their weight multiplied by Beta • Models trained with “very different” projects from the one to be predicted can be filtered out – A weight-update sketch follows. * L. Minku, X. Yao. Can Cross-company Data Improve Performance in Software Effort Estimation? Proceedings of the 8th International Conference on Predictive Models in Software Engineering, p. 69-78, 2012. http://dx.doi.org/10.1145/2365324.2365334.
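A minimal sketch of that weight update, in the spirit of the slide; the Beta value, the definition of "loser" (any model worse than the step's best), and the renormalization are our assumptions:

```python
def update_weights(models, weights, project, actual_effort, beta=0.5):
    """Once a project's true effort is known, punish the losing models."""
    errors = [abs(m.predict([project])[0] - actual_effort) for m in models]
    best = min(errors)
    for i, err in enumerate(errors):
        if err > best:            # 'loser': worse than this step's best model
            weights[i] *= beta
    total = sum(weights)
    return [w / total for w in weights]  # keep weights comparable across steps
```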
  • 122. Solutions to SE Model Problems / Dynamic Adaptive Ensembles • Dynamic Cross-company Learning (DCL) – DCL uses new completed projects that arrive with time – DCL determines when CC data is useful – DCL adapts to changes by using CC data (Figure: predicting effort for a single company from ISBSG based on its projects and other companies’ projects.)
  • 123. What have we covered? Organizational Issues • Data Issues • Model Issues 125

Editor's Notes

  1. Burak
  2. Tim, Ekrem
  3. Tim, Ekrem
  4. Burak
  5. Burak