MSR 2022 Foundational Contribution Award Talk on "Software Analytics: Reflection and Path Forward" by Dongmei Zhang and Tao Xie
https://conf.researchr.org/info/msr-2022/awards
1. Software Analytics: Reflection and Path Forward
Dr. Dongmei Zhang, Data, Knowledge, and Intelligence (DKI) Group, Microsoft Research Asia
Prof. Tao Xie, School of Computer Science, Peking University
2. Outline
• Origin and early research
• Community building
• New research topics
• Reflections
05/20/2022 MSR 2022 2
5. Software Analytics Research
Utilize a data-driven approach to help create high-quality, user-friendly, and efficiently developed and operated software and services
Technology pillars (figure labels): Information Visualization, Analysis Algorithms, Large-scale Computing
Figure axes: Vertical, Horizontal
https://www.microsoft.com/en-us/research/group/software-analytics/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
7. Defining Software Analytics
Software analytics is to enable software practitioners to perform data
exploration and analysis in order to obtain insightful and actionable
information for data-driven tasks around software and services.
D. Zhang, Y. Dang, J. Lou, S. Han, H. Zhang, and T. Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
8. Six dimensions
• Research Topics
• Technology Pillars
• Target Audience
• Connection to Practice
• Output
• Input
9. Research topics – the trinity view
• Covering major areas of software domain
• Throughout entire development cycle
• Enabling practitioners to obtain insights
Trinity (figure labels): Software Users, Software Development Process, Software System
10. Input - data sources
• Runtime traces, program logs, system events, perf counters, …
• Usage log, user surveys, online forum posts, blog & Twitter, …
• Source code, bug history, check-in history, test cases, …
11. Output – insightful information
• Conveys meaningful and useful understanding or knowledge towards
completing the target task
• Not easily attainable via directly investigating raw data without aid of
analytics technologies
• Examples
• It is easy to count the number of re-opened bugs, but how to find out the
primary reasons for these re-opened bugs?
• When the availability of an online service drops below a threshold, how to
localize the problem?
12. Output – actionable information
• Enables software practitioners to come up with concrete solutions
towards completing the target task
• Examples
• Why were bugs re-opened?
  • A list of bug groups, each with the same reason for re-opening
• Why did the availability of online services drop?
  • A list of problematic areas with associated confidence values
• Which part of my code should be refactored?
  • A list of cloned code snippets easily explored from different perspectives
13. Technology pillars
Technology pillars: Information Visualization, Analysis Algorithms, Large-scale Computing
Research topics: Software Users, Software Development Process, Software System
Figure axes: Vertical, Horizontal
15. Connection to practice
• Software Analytics is naturally tied to software development practice
• Getting real: Real Data, Real Problems, Real Users, Real Tools
16. Early projects
StackMine – Performance debugging in the large via mining millions of stack traces
Scalable code clone analysis
Data exploration for Customer Experience Improvement Program (CEIP)
17.
Performance Debugging in the Large via Mining Millions of Stack Traces
S. Han, Y. Dang, S. Ge, D. Zhang, and T. Xie, ICSE 2012
Comprehending Performance from Real-World Execution Traces: A Device-Driver Case
X. Yu, S. Han, D. Zhang, and T. Xie, ASPLOS 2014
18. Performance Debugging in the Large via Mining Millions of Stack Traces (ICSE 2012)
Selected as the representative paper for 2012, 1 of 20 representative papers (one paper a year)
25. Tutorials/Tech Briefings at ICSE/FSE/ASE...
• [ASE 11 Tutorial] Zhang & Xie. xSA: eXtreme Software Analytics -
Marriage of eXtreme Computing and Software Analytics
• [CSEE&T 12 Tutorial] Zhang, Dang, Han & Xie. Teaching and Training
for Software Analytics
• [ICSE 12 SEIP Mini Tutorial] Zhang & Xie. Software Analytics in
Practice: Mini Tutorial
• [ICSE 13 Tutorial] Zhang & Xie. Software Analytics: Achievements
and Challenges
• [FSE 14 Tutorial] Zhang & Xie. Software Analytics: Achievements
and Challenges
26. Community Building by Others
• IEEE Software 2013 Special Issue
• Dagstuhl Seminar 2014
• International Workshop on Software Analytics (SWAN) 2015, 2016, 2017, 2018
...
28. Beyond SE Communities: ASPLOS 2021 Keynote
ASPLOS is the premier forum for interdisciplinary systems research, intersecting computer architecture, hardware
and emerging technologies, programming languages and compilers, operating systems, and networking.
30. Cloud Services
• Shift to cloud becoming mainstream
• Critical role of cloud computing platforms fortified by COVID-19
Cloud shift proportion by category
Source: Gartner (August 2018)
Category | 2018 | 2019 | 2020 | 2021 | 2022
System infrastructure | 11% | 13% | 16% | 19% | 22%
Infrastructure software | 13% | 15% | 17% | 18% | 20%
Application software | 34% | 36% | 38% | 39% | 40%
Business process outsourcing | 27% | 28% | 29% | 29% | 30%
Total | 19% | 21% | 24% | 26% | 28%
Worldwide public cloud services end-user spending forecast (Millions of USD)
Source: Gartner (November 2020)
Segment | 2019 | 2020 | 2021 | 2022
BPaaS | 45,212 | 44,741 | 47,521 | 50,336
PaaS | 37,512 | 43,823 | 55,486 | 68,964
SaaS | 102,064 | 101,480 | 117,773 | 138,261
IaaS | 44,457 | 51,421 | 65,264 | 82,225
DaaS | 616 | 1,204 | 1,945 | 2,542
Total Market | 242,696 | 257,549 | 304,990 | 362,263
Note: Totals may not add up due to rounding.
31. Focusing on Cloud Computing
• Huge space for improvement for cloud computing platforms
• Software Analytics is the digital transformation of the software industry
• Cloud intelligence
• Software Analytics focusing on cloud computing
• Re-emergence of AI
• Making impact is key
32. Cloud Intelligence
Using AI/ML technologies to effectively and efficiently design, build and
operate complex cloud services at scale
Figure labels: Customers, Engineering, Services
• AI for System
Designing and building high-quality services with better
reliability, performance, and efficiency
• AI for Customers
Improving customer satisfaction with intelligence and
better user experiences
• AI for DevOps
Achieving high productivity in DevOps via empowering
engineers with intelligent tooling
33. Related Efforts
• Cloud Intelligence Workshop
• @ AAAI 2020
• @ ICSE 2021
• @ SysML 2022
• Program Chair
Jian Zhang, Microsoft Azure
• Steering Committee
Rama Akkiraju, IBM
Ricardo Bianchini, Microsoft Research
Mike Dahlin, Google
Marcus Fontoura, Microsoft Azure
Ahmed E. Hassan, Queen’s University
Michael Lyu, Chinese University of Hong Kong
Erik Meijer, Facebook
Tao Xie, Peking University
Dongmei Zhang, Microsoft Research
Yuanyuan Zhou, UCSD
• AIOps by Gartner
“Put simply, AIOps is the application of machine learning
(ML) and data science to IT operations problems. AIOps
platforms combine big data and ML functionality to
enhance and partially replace all primary IT operations
functions, including availability and performance
monitoring, event correlation and analysis, and IT service
management and automation.”
• AIOps extended
AIOps: Real-world Challenges and Research Innovations
Yingnong Dang, Qingwei Lin, Peng Huang
Technical Briefing, ICSE 2019
34. Scenarios
Service:
Service health measuring (KPI)
• Availability / reliability
• Performance
• Security
Anomalous behavior detection
• KPI (Overall, component)
• Resource (overhead / leak)
Health prediction
• Infrastructure (e.g., power, cooling)
• HW, SW Failure
• Workload
• System capacity
Auto-recovery/adjustment/healing
• Recovery option optimization
• Auto healing
Engineering:
Programming
• API/code suggestion
• Code defect, smell, code review
• Test coverage, test selection
CI/CD
• Integration testing and strategy
• Rollout risk assessment and strategy
Auto-triage & diagnosis
• Auto-triage (investigation owner)
• Diagnosis intelligence
Repair/mitigation decision
• Solution recommendation
• Decision support
Customer:
Customer behavior understanding
• Usage experience
• Customer churn
Proactive customer engagement
• Service auto-scale (up/down)
• Engaging before reporting
Intelligent customer support
• Self-serve
• Efficient communication
• Intelligent suggestion/hints
35. Problems and Challenges
Detection
• Problems: time-series anomaly detection; log-based anomaly detection; multi-dimensional change detection; …
• Challenges: diverse requirements, noisy data, high dimensions, lack of labeled data, …
Diagnosis
• Problems: log pattern mining; correlation analysis; dependency graph diagnosis; …
• Challenges: diverse causes, complex service dependency, scattered knowledge, …
Optimization
• Problems: multi-constraint/objective optimization; DL-based combinatorial search; optimization under prediction uncertainty; …
• Challenges: huge problem space, large-scale data, complex constraints and tradeoffs, …
Prediction
• Problems: context/dependency-aware prediction; automated feature engineering; extremely imbalanced data prediction; …
• Challenges: highly imbalanced classes, fast system evolution, unpredictable behavior changes, …
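As a concrete reference point for the "time-series anomaly detection" problem listed under Detection, here is a minimal sketch of one of the simplest baselines: a trailing-window z-score test. This is not a method from the talk; the function name `zscore_anomalies`, the window size, and the threshold are all assumptions chosen for illustration. Real service KPIs are noisy and seasonal, which is exactly why the slide lists this as an open problem.

```python
from statistics import mean, stdev


def zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices of points that deviate strongly from recent history.

    Each point is compared with the mean and sample standard deviation of
    the `window` points that precede it. `window` and `threshold` are
    illustrative defaults, not values taken from the talk.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as anomalous.
            if series[i] != mu:
                anomalies.append(i)
        elif abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    kpi = [10, 11, 10, 9, 10, 10, 50, 10, 11, 10]  # toy latency samples
    print(zscore_anomalies(kpi))
```

A baseline like this ignores seasonality, trend, and missing labels, so it mainly serves to show where the challenges named on the slide (noisy data, diverse requirements) start to bite.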
36. Disk Failure Prediction in Cloud Computing Platform
Improving Service Availability of Cloud Systems by Predicting Disk Error, Y. Xu, K. Sui, R. Yao, H. Zhang, Q. Lin, Y. Dang, P. Li, K. Jiang, W. Zhang, J. Lou, M. Chintalapati, D. Zhang, USENIX ATC 2018.
NTAM: Neighborhood-Temporal Attention Model for Disk Failure Prediction in Cloud Platforms, C. Luo, P. Zhao, B. Qiao, Y. Wu, H. Zhang, W. Wu, W. Lu, Y. Dang, S. Rajmohan, Q. Lin, D. Zhang, the Web Conference 2021.
37. Virtual Machine (VM) Availability and Disk Failures
• Hardware issues are among the top reasons for VMs going down and rebooting
• Disk failures contribute most to the hardware issues
Source: https://www.backblaze.com/blog/hard-drive-stats-for-2018/
Source: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf
SSD Annualized Failure Rates
38. Binary Classification Problem
The training set is a collection of $N$ training samples, denoted as
$D = \{(X_1, y_1), (X_2, y_2), \ldots, (X_N, y_N)\}$
$X_i$ represents the corresponding disk $d_i$'s own status data and neighborhood information, i.e., $X_i = A_i \cup B_i$, where $A_i \in \mathbb{R}^{h \times n}$ represents $d_i$'s own status data and $B_i$ is a subset of the union of all $A_j$.
$y_i \in \{0, 1\}$ is the label
$y_i = 1$ means that the corresponding disk will fail in the near future
$y_i = 0$ means ‘healthy’
Loss function
$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log \hat{y}_i + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right]$
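The loss on this slide is the standard mean binary cross-entropy. A minimal sketch in plain Python follows, assuming that the predicted value for each disk is a failure probability in [0, 1]. The function name `binary_cross_entropy` and the clamping constant `eps` are assumptions of this sketch, added only to avoid evaluating log(0); they are not part of the slide.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N samples.

    y_true: labels in {0, 1}; 1 means the disk will fail in the near future.
    y_pred: predicted failure probabilities in [0, 1].
    eps:    clamp applied to predictions (an assumption of this sketch).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() arguments positive
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


if __name__ == "__main__":
    labels = [1, 0, 0, 0]          # one failing disk among four
    scores = [0.8, 0.1, 0.2, 0.05]  # hypothetical model outputs
    print(binary_cross_entropy(labels, scores))
```

With a heavily imbalanced disk population, an unweighted loss like this is dominated by healthy disks, which is one motivation for the data-enhancement step discussed later in the talk.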
39. Related Work
• Traditional machine learning based approaches
• Support Vector Machine (SVM) [MSST 2013]
• Decision Tree (DT) [DSN 2014]
• Random Forest (RF) [DSN 2018]
• Gradient Boosting Decision Tree (GBDT) [Ph.D. Dissertation, UCLA 2017]
• Regularized Greedy Forest (RGF) [KDD 2016]
• Cloud Disk Error Forecasting (CDEF) [USENIX ATC 2018]
• Deep Learning based approaches
• Recurrent Neural Network (RNN) [IEEE Transactions on Computers 2016]
• Long Short-Term Memory (LSTM) [ICDM 2018]
• Temporal Convolution Neural Network (TCNN) [DAC 2019]
• Convolution Neural Network with Long Short-Term Memory (CNN+LSTM) [FAST 2020]
• Neighborhood-Temporal Attention Model (NTAM) [Web Conference 2021]
40. Observations (1)
• VMs can be impacted before disks completely fail
• Disk errors occur before a disk completely fails
• Disk errors are often reflected by system-level signals such as OS events
Name | Description
Timestamp | The timestamp $t$ at which the feature vector was recorded.
Disk ID | The unique ID of disk $d_i$.
Node ID | The unique ID of the computing server (i.e., node) that $d_i$ is associated with.
SMART attributes | The SMART attributes of $d_i$ recorded at $t$, providing information such as the Current Pending Sector Count, Seek Error Rate, Soft Read Error Rate, etc.
System-related attributes | OS events such as paging error, file system error, device reset, telemetry loss, etc.
Driver-related attributes | Gathered from the disk driver, with information on Flush Count, IO Latency, Controller Reset, etc.
41. Observation (2)
• A disk’s health status may be impacted by its neighboring disks
• Incorporating individual disk’s status and its neighborhood info
Figure 2: The architecture of the neighborhood-aware component underlying NTAM.
42. Observation (3)
• Extremely imbalanced disk population
• Data enhancement via Temporal Progressive Sampling (TPS)
Figure 4: The design of the Temporal Progressive Sampling (TPS) method.
43. Neighborhood-Temporal Attention Model (NTAM)
• Neighborhood-aware component: to effectively incorporate neighborhood information
• Temporal component: to better capture temporal information
• Decision component: to decide whether the corresponding disk will fail in the near future or not
Figure 1: Overview of the Neighborhood-Temporal Attention Model (NTAM). Figure labels, from input to output: disk $A_i$ and neighbors $B_i$; neighbor-encoded vectors; temporal-encoded vector; failure probability.
44. AI & Software Engineering
New Research Topic (2)
48. Making IntelliTest More Intelligent
Pex journey [ASE 2014]
Pex shipped as IntelliTest in Visual Studio Enterprise Edition since 2015
Self-learning (data driven)
Thummalapenta, Xie, Tillmann, de Halleux, and Schulte. MSeqGen: Object-Oriented Unit-Test Generation via Mining Source Code. ESEC/FSE 2009.
50. Programming is not easy, even for an easy task
-- brand_sales is a placeholder name: the slide does not name the table.
SELECT e1.brand AS brand, e1.year AS year
FROM (SELECT SUM(sales) AS salesum, year, brand
      FROM brand_sales GROUP BY year, brand) AS e1
-- Pair each brand with the brands that sold at least as much in the same year.
LEFT OUTER JOIN (SELECT SUM(sales) AS salesum, year, brand
                 FROM brand_sales GROUP BY year, brand) AS e2
  ON (e1.year = e2.year AND e1.salesum <= e2.salesum)
GROUP BY e1.brand, e1.year
HAVING COUNT(*) <= 2
ORDER BY e1.year;
Question: write a SQL statement for “top 2 selling brands in each year”
given a table of three columns “sales”, “brand”, and “year”.
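To make the "top 2 selling brands in each year" question concrete, the sketch below runs a self-join ranking query against an in-memory SQLite database. The table name `brand_sales`, the toy rows, and the helper `top_two_brands_per_year` are hypothetical; only the three column names come from the slide. The ranking idea is that a brand is in the top 2 for a year when at most two brands (itself included) have yearly sales at least as large.

```python
import sqlite3

# Hypothetical toy data: (brand, year, sales). Not from the talk.
ROWS = [
    ("A", 2020, 10), ("A", 2020, 5), ("B", 2020, 12), ("C", 2020, 7),
    ("A", 2021, 3), ("B", 2021, 9), ("C", 2021, 8),
]

QUERY = """
WITH totals AS (
    SELECT brand, year, SUM(sales) AS salesum
    FROM brand_sales
    GROUP BY brand, year
)
SELECT e1.brand, e1.year
FROM totals AS e1
JOIN totals AS e2
  ON e1.year = e2.year AND e1.salesum <= e2.salesum
GROUP BY e1.brand, e1.year
HAVING COUNT(*) <= 2      -- at most two brands sold this much or more
ORDER BY e1.year, e1.brand;
"""


def top_two_brands_per_year(rows):
    """Load rows into an in-memory table and return (brand, year) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE brand_sales (brand TEXT, year INTEGER, sales INTEGER)"
        )
        conn.executemany("INSERT INTO brand_sales VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(top_two_brands_per_year(ROWS))
    # [('A', 2020), ('B', 2020), ('B', 2021), ('C', 2021)]
```

Ties change the result: two brands tied for second place are both dropped, because each then counts three brands at or above its total. That kind of subtlety is part of why even this "easy" task is easy to get wrong.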
51. NL2Regex, NL2SQL, ...
Zhong, Guo, Yang, Peng, Xie, Lou, Liu and Zhang. SemRegex: A Semantics-Based Approach for Generating Regular Expressions from Natural Language Specifications. EMNLP 2018.
Guo, Liu, Lou, Li, Liu, Xie, and Liu. Benchmarking Meaning Representations in Neural Semantic Parsing. EMNLP 2020.
Dong, Sun, Liu, Lou, and Zhang. Data-Anonymous Encoding for Text-to-SQL Generation. EMNLP 2019.
Conversational Interface for
53. aiXcoder
Within 1 month of aiXcoder 2.0 going online (currently 4.0), #downloads > 130K!
So far: 2C: 300K users; 2B: major banks/IT companies
https://aixcoder.com/en/
54. aiXcoder L and Next
Billion-scale model parameters NL2Code
55. New Trend: Big Pre-trained Model + Task Adaptation
GPT-3 can program?
58. AI + Human Intelligence
59. Making Impact in Practice
• Finding the critical scenario
• Closing the loop
• End-to-end and fast iteration
Technology readiness framework
Perspective | Potential Impact
Problem | Applicability
Assumption | Problem validity
Constraint, Requirement | Formulation and solution
Evaluation | Usefulness in practice
60. Takeaways
• Software Analytics: digital transformation of the software industry
• Thriving community
• New research topics
• Cloud Intelligence
• AI and Software Engineering
• Reflections
• Data driven vs. problem driven
• AI + human intelligence
• Making impact in practice
• WE ARE HIRING!
61. Acknowledgement
Sincere thank-you to all the academic collaborators, colleagues and
partners in Microsoft, and our talented intern students for the
collaboration and partnership over the years!