ICSE2017 - Analytics Driven Load Testing: An Industrial Experience Report on Load Testing of Large-Scale Systems
1. Analytics-Driven Load Testing: An
Industrial Experience Report on Load
Testing of Large-Scale Systems
Mohamed Nasser, Parminder Flora
Tse-Hsun (Peter) Chen, Ahmed E. Hassan, Weiyi Shang, Zhen Ming Jiang, Mark D. Syer
2. Downtime of large-scale systems is
very costly
Gmail’s 25-to-55-minute outage affected 42 million users.
Azure service was interrupted for 11 hours, affecting Azure users worldwide.
Facebook went down for 35 minutes, losing $854,700.
Timeline labels (2014): Jan 24th, Nov 19th, Oct 28th
Often caused by load-related problems
3. Load testing may detect problems
before they occur in the field
Performance
counters
System under test
…
System
execution logs
Controlled test environment
Require a large amount of resources
Need to analyze terabytes of data
Need to design realistic tests
4. Our industrial load testing and
research experience
Over 10 years of industrial load
testing and research experience
5. Using data analytics to assist different
phases in load testing
Reducing needed
test execution
resources
Assisting load
test analysis
Designing realistic
load tests
6. Traditional approach: measuring
frequency of events
Field logs
Domain knowledge from
performance analysts
Event Frequency
Login 50/min
Purchase 200/min
Browse 1000/min
…
Test plan
Considering only
aggregated user behaviors
may not be sufficient!
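The frequency table above can be derived with a simple count over the field logs. The following is a minimal sketch, assuming the logs have already been reduced to a list of event names covering a known time span; the function name `event_rates` and the sample data are hypothetical and only chosen to reproduce the rates shown on the slide.

```python
from collections import Counter


def event_rates(events, duration_min):
    """Return the number of occurrences per minute for each event name."""
    counts = Counter(events)
    return {name: n / duration_min for name, n in counts.items()}


# Hypothetical two-minute sample of field events.
log = ["Login"] * 100 + ["Purchase"] * 400 + ["Browse"] * 2000
rates = event_rates(log, duration_min=2)
```

Note that this aggregation discards who performed each event, which is the limitation the slide points out.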
7. Our approach: including user-
centric workloads
00:01, Alice starts a conversation with Bob
00:01, Alice says 'hi' to Bob
00:02, Alice says 'are you busy?' to Bob
00:11, Bob says 'yes' to Alice
00:12, Alice says 'ok' to Bob
00:18, Alice ends a conversation with Bob
Event                                   Alice   Bob
USER starts a conversation with USER    1       0
USER says MSG to USER                   3       1
USER ends a conversation with USER      1       0
Including user-centric workloads helps detect
more problems during load tests.
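The per-user counts on this slide can be reproduced by abstracting each log line into an event template and counting templates per acting user. The sketch below assumes the log format shown on the slide; the regular expressions, the `PATTERNS` table, and the function `user_signatures` are illustrative assumptions, not the tooling used in the study.

```python
import re
from collections import Counter, defaultdict

LOG = """\
00:01, Alice starts a conversation with Bob
00:01, Alice says 'hi' to Bob
00:02, Alice says 'are you busy?' to Bob
00:11, Bob says 'yes' to Alice
00:12, Alice says 'ok' to Bob
00:18, Alice ends a conversation with Bob
"""

# Hypothetical patterns: (regex capturing the acting user, event template).
PATTERNS = [
    (re.compile(r"^(\w+) starts a conversation with \w+$"),
     "USER starts a conversation with USER"),
    (re.compile(r"^(\w+) says .+ to \w+$"),
     "USER says MSG to USER"),
    (re.compile(r"^(\w+) ends a conversation with \w+$"),
     "USER ends a conversation with USER"),
]


def user_signatures(log_text):
    """Map each acting user to a Counter of abstracted event templates."""
    signatures = defaultdict(Counter)
    for line in log_text.splitlines():
        # Drop the timestamp; keep the message after the first ", ".
        _, _, message = line.partition(", ")
        for pattern, template in PATTERNS:
            match = pattern.match(message)
            if match:
                signatures[match.group(1)][template] += 1
                break
    return signatures


signatures = user_signatures(LOG)
```

Each user's counter can then serve as a workload signature, so a test plan can replay distinct kinds of users rather than only aggregate event rates.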
8. Using data analytics to assist different
phases in load testing
Reducing needed
test execution
resources
Assisting load
test analysis
Designing realistic
load tests
We include user-
centric behaviors
when designing
load tests.
10. Running load tests requires lots of
resources
Running load tests on new
versions of the system
Dedicated resources
used for testing
Running a load test may take several
days or even weeks!
11. Our approach: measuring performance counter
repetitiveness for early test termination
(1) We select a random time period (e.g., 30 min).
[Figure: test timeline (axis: test time; marker: current time)]
(2) We exhaustively search for any potentially similar time period.
(3) We repeat this 1000 times to estimate the repetitiveness of the test.
We can significantly reduce the test time, because continuing to
run the test generates almost no new data.
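The sampling procedure on this slide can be sketched as follows. This is a simplified illustration under stated assumptions: the counter is a list of per-minute samples, and two periods count as "similar" when their mean values differ by at most `tolerance`. That similarity measure is a stand-in chosen for brevity, not necessarily the one used in the study, and the names `has_similar_window` and `repetitiveness` are hypothetical.

```python
import random


def has_similar_window(series, start, width, tolerance):
    """True if a non-overlapping window has a mean within `tolerance`."""
    target = sum(series[start:start + width]) / width
    for other in range(len(series) - width + 1):
        if abs(other - start) < width:  # skip overlapping windows
            continue
        mean = sum(series[other:other + width]) / width
        if abs(mean - target) <= tolerance:
            return True
    return False


def repetitiveness(series, width=30, trials=1000, tolerance=1.0, seed=0):
    """Fraction of randomly chosen windows that have a similar window."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        start = rng.randrange(len(series) - width + 1)
        if has_similar_window(series, start, width, tolerance):
            hits += 1
    return hits / trials


# Hypothetical counter data: one sample per minute over ten hours.
periodic = [50 + (t % 60) for t in range(600)]  # repeats every hour
ramp = [float(t) for t in range(600)]           # never repeats
periodic_score = repetitiveness(periodic, trials=200)
ramp_score = repetitiveness(ramp, trials=200)
```

A score close to 1 suggests the test has stopped producing new counter behavior and could be stopped early, while a score close to 0 suggests it is still exploring new states.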
12. Using data analytics to assist different
phases in load testing
Reducing needed
test execution
resources
Assisting load
test analysis
Designing realistic
load tests
We include user-
centric behaviors
when designing
load tests.
We measure test
repetitiveness to
suggest when to
stop a test early.
14. Helping performance analysts quickly
understand the result of load tests
Apply automated
anomaly detection to
determine if a test needs
further analysis.
Use load test repositories
and visualization to help
uncover root causes.
I will focus on test
repositories and
visualization in this talk.
15. Storing logs, counters, and application
metrics in a central repository
Load test
repository
Load test result on a new version
Extracted
application
metrics (e.g., login
response time)
Post-test log
analysis
Storing logs,
counters, and
extracted metrics
16. Using counters, application metrics, and
dashboards to help diagnose problems
Load test
repository
Performance counters in new version
Machine   Counter        Value    Value in prior version
WorkerA   Average CPU    30%      5%
WorkerB   Average Mem.   30%      30%
WorkerB   Disk I/O       20MB/s   21MB/s
…
[Chart: average CPU over time, prior version vs. new version]
[Chart: login response time over time, prior version vs. new version]
Potential problems with login performance
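A comparison like the counter table on this slide can be automated once both versions' results sit in the same repository. The following is a minimal sketch, assuming counters are available as numeric values per machine; the 50% relative-increase threshold and the function `flag_regressions` are hypothetical choices for illustration, not part of the described tooling.

```python
# Rows mirror the slide's table: (machine, counter, new value, prior value).
# The prior Disk I/O value is read as 21 MB/s, which is an assumption.
ROWS = [
    ("WorkerA", "Average CPU (%)", 30.0, 5.0),
    ("WorkerB", "Average Mem. (%)", 30.0, 30.0),
    ("WorkerB", "Disk I/O (MB/s)", 20.0, 21.0),
]


def flag_regressions(rows, max_relative_increase=0.5):
    """Return (machine, counter) pairs that grew beyond the threshold."""
    flagged = []
    for machine, counter, new, prior in rows:
        if prior > 0 and (new - prior) / prior > max_relative_increase:
            flagged.append((machine, counter))
    return flagged


flagged = flag_regressions(ROWS)
```

With the slide's values, only the CPU usage on WorkerA is flagged, which is the kind of lead an analyst would then follow up on in the dashboards.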
17. Using data analytics to assist different
phases in load testing
Reducing needed
test execution
resources
Assisting load
test analysis
Designing realistic
load tests
We include user-
centric behaviors
when designing
load tests.
We measure test
repetitiveness to
suggest when to
stop a test early.
We store data in
repositories for fast
diagnosis using
dashboards.
18. Road Ahead: Open research
challenges in load testing and analysis
19. Is having no performance anomalies
or slowdowns good enough?
System under test
Controlled test environment
There may still be many potential improvements
that can be made to the system.
20. Help improve system performance by
leveraging DevOps
User
How the system is
executed in the
field.
Proposed techniques
System
Source
Code
Abstraction
Frameworks
System deployed
in production
Finding performance
hotspots in the code
Optimizing perf.
configuration
…