Keynote presentation from the ECBS conference. The talk is about how to use machine learning and AI to improve software engineering, based on experiences from our project in Software Center (www.software-center.se).
2. Everything software!
• Software is eating the world, in all sectors
"In the future, all companies will be software companies"
Marc Andreessen, co-founder of Netscape
4. Take-aways from this talk
• Big data is the most important enabler in AI4SE
• AI4SE is closer than we think
• We will still be needed to teach ML/AI
5. Who am I?
• Professor in Software Engineering at Chalmers | University of Gothenburg
• Specialization in software measurement
– Machine learning in software engineering
– Autonomous, artificial-intelligence-based measurement
– Measurement knowledge discovery
– Simulation of outcomes before decision formulation
– Metrological foundations of measurement reference etalons
• Actively working with standards
– ISO/IEC 15939 - Software and Systems Engineering - Measurement Processes
– ISO/IEC 25000 (series) - Software Quality Requirements and Evaluation (SQuaRE)
– ISO/IEC 14598 - Information Technology - Software Product Evaluation
• Software Center – a collaboration between 13 companies and 5 universities
6. Challenges of modern SE
• Need for Speed
– New releases are expected by the market almost on a daily basis
– Years -> Months -> Weeks
• Data driven development
– Development decisions are taken based on data from software development
• Empowerment
– The teams who have the data should make the decisions
• Ecosystems
– Services grow around products
– Products grow around platforms
7. Why AI and ML are a paradigm shift…
[Image: a handwritten digit 5]
The classical view: "This is the number 5."
The ML view: there is
- a 60% probability that this is the number 5
- a 30% probability that this is the number 3
- a 10% probability that this is the number 1
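To make the shift concrete, here is a minimal sketch (scikit-learn, using its bundled digits dataset; the model choice is illustrative) showing that a classifier answers with a probability distribution rather than a single verdict:

```python
# Sketch: a classifier returns a probability per class,
# not a single "this is a 5".
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
clf = LogisticRegression(max_iter=2000).fit(X[:-1], y[:-1])

# One probability per digit class 0..9 for the held-out image
print(clf.predict_proba(X[-1:]).round(2))
```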
8. AI4SE is already here, we just did not know it yet
• Intelligent software development environments [1]
– Visual Studio (IntelliCode), Kite (Python), Codota
• Requirements engineering [2]
– Algorithms for natural language processing, hill climbing for requirements evolution
• Automated testing [3]
– Test automation, test identification, test orchestration
[1] https://livablesoftware.com/smart-intelligent-ide-programming/
[2] Groen, E.C., Harrison, R., Murukannaiah, P.K. et al. Autom Softw Eng (2019).
[3] T. M. King, J. Arbon, D. Santiago, D. Adamo, W. Chin and R. Shanmugam, "AI for Testing Today and Tomorrow: Industry Perspectives," 2019 IEEE International Conference on Artificial Intelligence Testing (AITest).
10.
Application | Data source | ML methods | Difficulty level | ROI / Impact
Defect prediction | JIRA, ClearQuest, Bugzilla | Regression [Excel, R, Weka, Python]; Classification [R, Weka, Python] | Low | High / decision support
CCFlex ML metrics | Git, SVN, ClearCase | Decision trees [CCFlex, R, Weka, Python] | Medium | Medium / data collection
Test optimization | Test tools, Portals, Test DBs | Classification, Cluster analysis, Reinforcement learning [R, Weka, Python] | High | High / development practices
Customer data analysis | Field data DB | Classification, Cluster analysis, Decision trees [R, Weka, Python] | High | High / decision support
KPI trend analysis | Metrics DB | Classification, Regression [R, Weka, Python] | Medium | Medium / dissemination
Requirements quality assessment | Requirements DB, ReqPro, DOORS | Classification, Clustering [R, Weka, Python] | Low | Medium / development practices
Dashboard support | Metrics DB | Classification, Time series [R, Weka, Python] | Low | Medium / decision support
Defect classification | JIRA, ClearQuest, Bugzilla | Decision trees, Clustering [R, Weka, Python] | Medium | Medium / development practices
Speed / CI | Gerrit, Jenkins | Deep learning, Decision trees [R, Weka, Python] | High | Medium / development practices
(The original slide also showed example visualizations for each row; these were images and are not reproduced here.)
11. Typical application of AI in SE
Data mining: raw data exports -> feature acquisition -> scaling, cleaning, wrangling -> machine learning -> decision support / AI
Image by Gerd Altmann from Pixabay
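A minimal sketch of this pipeline in Python; the file name, column names, and model choice below are hypothetical, for illustration only:

```python
# Minimal sketch of the mining -> features -> ML -> decision support pipeline.
# All file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# 1. Raw data export (e.g. a CSV dump from a defect database)
df = pd.read_csv("defects_export.csv")

# 2. Feature acquisition + cleaning/wrangling
df = df.dropna(subset=["severity"])
X = df[["days_until_assigned", "description_length", "phase_found_code"]]
y = (df["severity"] == "A").astype(int)

# 3. Scaling + machine learning
model = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(max_depth=5)),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)

# 4. Decision support: flag likely severity-A defects for early triage
print("Accuracy:", model.score(X_test, y_test))
```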
12. Machine learning / AI is just a small part of the whole pipeline
• Production ML systems for software engineering are still some way off
– Lack of high-quality, labelled data
– Limited analysis capabilities due to non-obfuscated data sets, which cannot be shared
– Non-standardized feature extraction
– Manual configuration of data workflows
Source: https://developers.google.com/machine-learning/crash-course/production-ml-systems
13. One of the fundamental challenges of applying ML in software engineering – feature extraction
How we see the number: a handwritten "5"
How the AI sees the number: a matrix of pixel intensities
0 1 1 1 1 0
0 1 0 0 0 0
0 1 0 0 0 0
0 1 1 1 0.5 0
0 1 0 0.5 1 0
0 1 0 0 1 0
0 1 0 0.5 1 0
0 0.5 1 1 0.5 0
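In code, the same matrix is simply flattened into a feature vector; a minimal numpy sketch using the values above:

```python
# Sketch: to an ML model, the digit is just a flat vector of pixel intensities.
import numpy as np

digit = np.array([
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0.5, 0],
    [0, 1, 0, 0.5, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0.5, 1, 0],
    [0, 0.5, 1, 1, 0.5, 0],
])
features = digit.flatten()   # 48-dimensional feature vector
print(features.shape)        # (48,)
```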
14. One of the fundamental challenges of applying ML in software engineering – feature extraction (requirement)
How we see the requirement:
When ContainerType changes to "not available" then ContainerCapacity should be set to the last value as long as ContainerReset is requested.
How the AI sees the requirement:
Keyword: system | Keyword: should | Keyword: can | Keyword: and | Has_reference
0 | 1 | 0 | 0 | 0
The AI's ability to distinguish two requirements strongly depends on which features we extract.
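A minimal sketch of such keyword-based feature extraction in Python; the feature names follow the slide, while the extraction rules (e.g. the Has_reference heuristic) are illustrative assumptions:

```python
# Sketch: extracting simple keyword features from a requirement text.
KEYWORDS = ["system", "should", "can", "and"]

def extract_features(requirement: str) -> dict:
    tokens = requirement.lower().split()
    features = {f"keyword_{kw}": int(kw in tokens) for kw in KEYWORDS}
    # Crude, assumed proxy for Has_reference: does the text point elsewhere?
    features["has_reference"] = int("reference" in tokens or "see" in tokens)
    return features

req = ('When ContainerType changes to "not available" then ContainerCapacity '
       "should be set to the last value as long as ContainerReset is requested.")
print(extract_features(req))
# -> keyword_should = 1, everything else = 0, matching the table above
```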
15. Another fundamental challenge – lack of high-quality labelled data
Example of a good requirement:
When ContainerType changes to "not available" then ContainerCapacity should be set to the last value as long as ContainerReset is requested.
Example of a "bad" requirement:
The xxxxx concept shall allow changes in the configuration of the yyyyy modules after the software has been built. For detailed specification of which modules and parameters are changeable see reference zzzzz configuration specification.
To train an ANN we need 100,000+ data points, which we need to label manually.
16. Lack of high-quality labelled data – human inconsistency
Example of a good requirement:
When ContainerType changes to "not available" then ContainerCapacity should be set to the last value as long as ContainerReset is requested.
Example of a "bad" requirement:
The xxxxx concept shall allow changes in the configuration of the yyyyy modules after the software has been built. For detailed specification of which modules and parameters are changeable see reference zzzzz configuration specification.

Good requirements (green in the original):
Tool | Reviewer 1 | Reviewer 2
78 | 4 | 4
67 | 5 | 3
62 | 4 | 4
62 | 5 | 4
62 | 4 | 4
60 | 4 | 4
60 | 4 | 4
58 | 4 | 3
55 | 4 | 5
53 | 4 | 4
49 | 3 | 4
49 | 4 | 3
47 | 1 | 1
46 | 4 | 3
42 | 3 | 3

"Bad" requirements (red in the original):
Tool | Reviewer 1 | Reviewer 2
-65 | 4 | 2
-15 | 2 | 2
-14 | 1 | 2
-13 | 2 | 2
-5 | 2 | 3
0 | 4 | 2
1 | Not req | 1
1 | 2 | 1
1 | 5 | 3
2 | 3 | 2
7 | 3 | 2
8 | 3 | 3
9 | 4 | 3
10 | 2 | 1
11 | 4 | 1
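Reviewer (in)consistency like this can be quantified with an agreement measure such as Cohen's kappa; a minimal sketch using the scores from the first table above:

```python
# Sketch: quantifying reviewer agreement with Cohen's kappa
# (scores taken from the "good requirements" table).
from sklearn.metrics import cohen_kappa_score

reviewer1 = [4, 5, 4, 5, 4, 4, 4, 4, 4, 4, 3, 4, 1, 4, 3]
reviewer2 = [4, 3, 4, 4, 4, 4, 4, 3, 5, 4, 4, 3, 1, 3, 3]

# 1.0 = perfect agreement, 0.0 = agreement no better than chance
print(cohen_kappa_score(reviewer1, reviewer2))
```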
18. Modern SW architecture: computer on wheels
• Industry (practice)
– Automotive software architectures are moving from federated (distributed) to integrated (centralized, virtualized)
-> execution of more computationally demanding algorithms
– Modern automotive software combines stochastic and probabilistic algorithms
-> new methods for safety assurance, fault detection/correction and diagnostics are needed
• Academia (theory)
– Data quality measures (consistency) are not related to the quality of AI algorithms (precision/recall)
-> novel data quality measures are needed to assess how well our data sets reflect the entire solution space
– ML and AI are difficult to test (development) and diagnose (runtime)
-> new methods for testing and diagnostics are needed
19. [Image: four road scenes with classifier outputs]
"No other cars" (35%) – false negative
"There is snow" (66.7%) – false positive
"There is an animal" (99%) – true positive
"You can drive here" (99%) – true positive
???
20. Way forward with ML/AI and automotive software
• We need new ways to create/develop sustainable architectural designs.
– Automotive software architectures are moving from federated to integrated.
– Execution is becoming computationally demanding.
– Automotive software development is moving to Agile (post-deployment updates, Adaptive AUTOSAR).
• We need new ways to assure the quality of such systems.
– Existing data quality measures are not related to the quality of AI algorithms.
– ML and AI are difficult to test (development) and diagnose (runtime).
– Traditional assertions do not accommodate the stochastic nature of modern algorithms (see the sketch below).
– There are no systematic ways of handling training/test datasets for QA.
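As a thought experiment on the assertion problem above, a "statistical assertion" could require a minimum agreement rate over a sample instead of exact equality; this sketch illustrates the idea and is not an established API:

```python
# Sketch: a "statistical assertion" for stochastic components.
# Instead of assert f(x) == y, require a minimum pass rate over a sample.
def assert_mostly(predictions, expected, min_rate=0.95):
    rate = sum(p == expected for p in predictions) / len(predictions)
    assert rate >= min_rate, f"only {rate:.0%} matched, need {min_rate:.0%}"

# 98 of 100 frames classified as "animal" passes a 95% threshold
assert_mostly(["animal"] * 98 + ["road"] * 2, "animal", min_rate=0.95)
```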
22. From machine learning to AI/ML-based measurement
• How to quantify entities without predefined patterns?
• How to flexibly define measurement instruments based on machine learning?
• How to discover the patterns of countable attributes using machine learning?
• How to discover new data patterns (e.g. anomalies)?
• How to define the measurement functions using machine learning algorithms?
• How to discover new patterns in data which can be communicated to the stakeholders?
• How to use machine learning to describe the patterns?
• How to use machine learning in visual analytics?
• How can we use machine learning to mine for standard models?
• How can we generate new decision criteria using machine learning?
• We study the use of machine learning to
– Identify the behavior of SW code by finding where the relevant code is
– Classify which defects are important, based on their description, to save analysis time
– Identify bottlenecks in continuous integration, based on integration stop-patterns
– Identify which KPIs should be removed because they do not provide any value
23. OUR EXPERIENCES FROM USING MACHINE LEARNING IN SE
WHICH DEFECT SHOULD WE FIX FIRST?
25. Defects database
• Product: large > 10 MLOC
• Period: 2010-17
• Total records: ~14K
• Different filters …
26. Problem formulation
• How can we predict the severity of the defect?
– Imagine we discover a bug
– We need to quickly assess if this bug should be fixed in this release or not
– We need to assess if this is going to be a lot of work
• Today’s solution
– The architect and the quality engineer make the assessment
• We can do better!
27. Mining association rules for defect prioritization
supp=0.0016, confidence=0.83, lift=9.95
{phaseFound=PRODUCT VALIDATION TESTING,
answerCode=B2 - To be corrected in this release,
Importance=30}
=> {Severity=A}

supp=0.0011, confidence=0.88, lift=10.45
{phaseFound=Customer,
answerCode=B2 - To be corrected in this release,
submittedOnSystemPart=VERY IMPORTANT PART}
=> {Severity=A}

supp=0.0013, confidence=0.80, lift=9.55
{phaseFound=PRODUCT VALIDATION TESTING,
answerCode=B2 - To be corrected in this release,
FollowUpOn=,
ClonedToAllReleases=YES,
submittedOnSystemPart=LI}
=> {Severity=A}
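Rules of this shape can be mined in Python as well as in R; a minimal sketch with mlxtend over a tiny, hypothetical one-hot-encoded defect table:

```python
# Sketch: mining association rules over defect attributes.
# The table below is a tiny hypothetical stand-in (1 = attribute present).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

defects = pd.DataFrame({
    "phaseFound=Customer":             [1, 0, 1, 1, 0],
    "answerCode=B2 - To be corrected": [1, 1, 1, 0, 0],
    "Severity=A":                      [1, 0, 1, 0, 0],
})

frequent = apriori(defects.astype(bool), min_support=0.001, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```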
28. Can we distinguish Severity A defects from others?
Decision tree: J48 (Weka) + ClassBalancer
J48 pruned tree (example)
------------------
VerificationLevelRequired =
| phaseFound = : A (1.62)
| phaseFound = Customer: A (60.88/12.3)
| phaseFound = Design Test (DT): Other (38.48/8.1)
| phaseFound = Document review (CPI): Other (11.75/1.62)
| phaseFound = FOA: A (28.66/10.85)
| phaseFound = Function Test (FT): Other (228.56/40.48)
| phaseFound = PRODUCT VALIDATION TEST: Other (6.86/3.24)
| phaseFound = INTERNAL TEST: Other (5.79)
| phaseFound = Requirement Review: Other (5.06)
| phaseFound = System Test (ST): Other (148.34/61.53)
VerificationLevelRequired = Customer: A (3.24)
VerificationLevelRequired = Design Test (DT): A (22.67)
VerificationLevelRequired = Function Test (FT): A (66.39)
VerificationLevelRequired = PRODUCT VALIDATION TEST: A (6.48)
VerificationLevelRequired = Requirement Review: A (4.86)
VerificationLevelRequired = System Test (ST): A (66.39)
Number of Leaves : 16
Size of the tree : 18
Accuracy = 77.70 %
True Positive rate (A) = 0.642
False Positive rate (A) = 0.088
F-Score (A) = 0.742
True Positive rate (Other) = 0.912
False Positive rate (Other) = 0.358
F-Score (Other) = 0.804
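An approximate scikit-learn analogue of this setup, for readers who do not use Weka: J48 implements C4.5, and DecisionTreeClassifier with class_weight="balanced" is a rough stand-in for J48 + ClassBalancer, not an exact reimplementation; the data below is hypothetical.

```python
# Sketch: sklearn analogue of J48 + ClassBalancer (approximate).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical defect attributes, one-hot encoded
df = pd.get_dummies(pd.DataFrame({
    "VerificationLevelRequired": ["Customer", "Design Test (DT)", "System Test (ST)"] * 20,
    "phaseFound": ["Customer", "Function Test (FT)", "System Test (ST)"] * 20,
}))
y = [1, 0, 0] * 20  # 1 = Severity A, 0 = Other

clf = DecisionTreeClassifier(class_weight="balanced")
print(cross_val_score(clf, df, y, cv=5, scoring="f1").mean())
```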
29. Can we distinguish Severity A defects from others?
• Potentially valuable features (using a filter):
– phaseFound
– Keywords in headline: branch, test case, underscore
– Keywords in description: descr_info, descr_requirement, descr_test, descr_debug, descr_log…
– DaysUntilAssigned
• Records = 6342
• Features = 49
– Directly available
– Time periods between changes of states
– Keyword appearance in description and header
30. How many parameters do we need to make good classifications?
Observations: 14K; supp=0.001, conf=0.8 => 263 rules, 37 pruned
32. Practical implications
• We can get much faster with ML
– Human assessment is deferred to later phases
• We need to learn how to work with probabilities
– We can no longer treat answers as binary (yes/no)
• Machine programming
– In the next few years we may see programs that repair and even write themselves using ML approaches
33. EXAMPLE OF OUR RESEARCH
SPEED UP SOFTWARE DEVELOPMENT
USING MACHINE LEARNING
IN COLLABORATION WITH M. OCHODEK (POZNAN UNIV. OF TECHNOLOGY), R. HEBIG (CHALMERS | UNIV. OF GOTHENBURG), W. MEDING (ERICSSON), G. FROST (GRUNDFOS)
34. Initial diagnosis: recognizing coding violations
• How to quantify entities without predefined patterns?
• How to flexibly define measurement instruments based on machine learning?
• How to discover the patterns of countable attributes using machine learning?
• Problem
– How can we measure the quality of source code based on arbitrary coding guidelines?
• Solutions
– Manual code reviews
– Static analysis
– Manual coding of new rules for static analysis
– Machine learning of arbitrary coding guidelines
35. Measuring code quality – Cycle 1: manual examples
• Problem
– How can we detect violations of coding styles in a dynamic way? Dynamic == the rules can change without the need for tool reconfiguration
• Solution at a glance (see the sketch below)
– Teach the code counter to recognize coding standards (e.g. use the examples from the company's coding standard tutorials)
– Use machine learning as the tool's engine to define the formal rules
– Apply the tool to the code base to find violations
• Results
– 95%–99% accuracy of violation detection on open-source projects
Flow: coding standard examples + product code base -> machine learning -> violations
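A minimal sketch of Cycle 1 in scikit-learn; character n-grams stand in for the tool's feature extraction, and the labeled example lines are hypothetical stand-ins for a company's coding-standard tutorial examples:

```python
# Sketch: learn a coding rule from labeled example lines,
# then apply the learned rule to new code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

examples = ["int x=5;", "int x = 5;", "if(x>0){", "if (x > 0) {"]
labels   = ["violation", "ok", "violation", "ok"]

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),  # character n-grams
    DecisionTreeClassifier(),
)
clf.fit(examples, labels)
print(clf.predict(["int y=7;"]))  # classify a previously unseen line
```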
36. Feature acquisition
Source code (training set) -> feature engineering and extraction engine -> ML-encoded training set:
File type | #Characters | If | … | Decision class
java | 25 | TRUE | … | Violation
… | … | … | … | …
37. Example features
• Plain text (F01-F04):
– File extension
– Full and trimmed length (characters)
– Tokens
• Programming language (F05-F19):
– Assignment,
– Brackets,
– Class,
– Comment,
– Semicolons,
– …
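A sketch of what such line-level feature extraction could look like; the F01–F09 names mirror the groups above, while the concrete definitions are illustrative assumptions:

```python
# Sketch: encoding one line of source code as a feature vector.
def line_features(line: str, file_ext: str) -> dict:
    stripped = line.strip()
    return {
        "F01_file_extension": file_ext,          # plain-text features
        "F02_full_length": len(line),
        "F03_trimmed_length": len(stripped),
        "F04_num_tokens": len(stripped.split()),
        "F05_has_assignment": int("=" in stripped and "==" not in stripped),
        "F06_has_brackets": int(any(c in stripped for c in "{}()[]")),
        "F07_has_class": int("class" in stripped.split()),
        "F08_is_comment": int(stripped.startswith(("//", "/*", "*"))),
        "F09_has_semicolon": int(";" in stripped),
    }

print(line_features("int x = 5;", "java"))
```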
38. Company 1: proprietary code (pilot)
• Set-up:
– Code base of ca. 7 MLOC
– One guideline:
• Top diagram:
– The size of the training set (the examples) is one of two major factors determining accuracy.
– The other factor is the algorithm (not shown in the diagram).
• Bottom diagram:
– The first trials did not find anything.
– Trial #5 resulted in finding all violations, plus some false positives (non-violations).
39. Results in the context of evolving code and guidelines
Company 2: preprocessor directives should start at the beginning of the line
40. Recognizing more rules on a larger code base
Company 1 (again): 7 different violations
(Bar chart reconstructed as a table; series follow the chart legend: F1-score, recall, precision.)
Rule | F1-score | Recall | Precision
1 | 1.00 | 1.00 | 1.00
2 | 0.35 | 0.97 | 0.21
3 | 0.98 | 1.00 | 0.97
4 | 0.77 | 0.99 | 0.63
5 | 0.82 | 1.00 | 0.69
6 | 0.91 | 0.97 | 0.86
7 | 0.65 | 0.98 | 0.49
41. What did we learn?
• Providing the examples is "boring"
• Training is "boring"
• Conclusion: faster than human reviewers, but still time-consuming
• Solution #2: Gerrit!
– Gerrit is a code review tool developed by Google
42. Measuring code quality – Cycle 2: automated examples
• Problem
– How can we detect violations of coding styles in a dynamic way? Dynamic == the rules can change over time based on the team's programming style
• Solution at a glance
– Teach the code counter to recognize coding standards by analyzing code reviews
– Use machine learning as the tool's engine to define the formal rules
– Apply the tool to the code base to find violations
• Results
– 75% accuracy
Flow: Gerrit reviews + product code base -> machine learning -> violations
43. Feature acquisition
Source code (training set) -> feature engineering and extraction engine -> ML-encoded training set:
File type | #Characters | If | … | Decision class
java | 25 | TRUE | … | Violation
… | … | … | … | …
Data set expansion: ca. 1,000 LOC -> 180,000 LOC
45. Network architecture: encoded lines -> input layer -> recurrent layer -> convolution layer -> output layer
– Recognize low-level patterns (e.g. a non-standard "for")
– Recognize high-level patterns (e.g. non-compiled code)
– Output example: 90% probability of violation, 9.9% probability of non-violation, 0.1% probability of undecided
Technical challenges (examples):
• How many layers?
• How many neurons per layer?
• Convolution first vs. recurrent first?
• Convolution parameters: window, stride, filters
• Recurrent parameters: forget function
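A sketch of such a network in Keras, following the layer order on the slide; all sizes and parameters are illustrative assumptions, not the tuned values from the study:

```python
# Sketch: line classifier with recurrent and convolutional layers (Keras).
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 20_000   # ca. 20,000 words in the vocabulary (see slide 46)
MAX_LINE_LEN = 50     # tokens per encoded line (assumption)

model = keras.Sequential([
    layers.Input(shape=(MAX_LINE_LEN,), dtype="int32"),  # encoded lines
    layers.Embedding(VOCAB_SIZE, 64),                    # word embeddings
    layers.LSTM(64, return_sequences=True),              # recurrent layer
    layers.Conv1D(32, kernel_size=5, activation="relu"), # convolution layer
    layers.GlobalMaxPooling1D(),
    layers.Dense(3, activation="softmax"),  # violation / non-violation / undecided
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```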
46. The NN understands the programming language
• Word embeddings provide the context
• We use the Linux kernel as the vocabulary
• The larger the code base, the better the results from the neural network
– Ca. 20,000 words in the vocabulary
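A sketch of training such code-token embeddings with gensim; the slide uses the Linux kernel as the corpus, so the toy token lists below are illustrative stand-ins:

```python
# Sketch: word embeddings over tokenized source-code lines.
from gensim.models import Word2Vec

corpus = [
    ["for", "(", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"],
    ["while", "(", "ptr", "!=", "NULL", ")"],
    ["if", "(", "err", ")", "return", "err", ";"],
]
model = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1)

# Tokens that appear in similar contexts get similar vectors
print(model.wv.most_similar("for"))
```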
48. Conclusions and take-aways
• Big data is the most important enabler in AI4SE
• AI4SE is closer than we think
• We will still be needed to teach ML/AI