PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Monir Mozumder
1. BIG DATA WORKLOAD ANALYSIS USING SWAT AND IPYTHON NOTEBOOKS
MONIR MOZUMDER
AMD RESEARCH
THANKS:
JAY OWEN (AMD RESEARCH)
LEONARDO PIGA (AMD RESEARCH)
MAURICIO BRETERNITZ (AMD RESEARCH)
SABARISHYAM SRINIVASARAJU (AMD RESEARCH)
* I also want to acknowledge and thank KEITH LOWERY for
his contributions to the development of the SWAT tool at AMD.
3. SYNTHETIC WORKLOAD ANALYSIS TOOLKIT (SWAT)
OVERVIEW
SWAT
Software platform for automating the creation, deployment, execution, and data gathering of synthetic compute workloads on clusters of arbitrary size
Allows deployment of workloads on virtual clusters (Amazon EC2) or physical in-house clusters (SeaMicro server, bare hardware cluster…)
Supports benchmark workloads from CloudSuite and some research workloads such as GraphLab, HadoopCL, etc.
Gathers various system statistics during a run and collects them, along with the workload logs, in a batch folder that is stored on the UI box for later analysis/visualization
3 | PRESENTATION TITLE | NOVEMBER 14, 2013 | CONFIDENTIAL
4. SYNTHETIC WORKLOAD ANALYSIS TOOLKIT (SWAT)
COMPONENTS
5. SYNTHETIC WORKLOAD ANALYSIS TOOLKIT (SWAT)
Front End
Houses the Control Box of the SWAT tool
Separate from the actual cluster that runs the workloads
Does not need the benchmark workloads to be installed
Manages the cluster nodes and, if needed, boots them with configuration options chosen by the user
Stores logs generated for the runs for later analysis
Cluster nodes
Run the actual workloads as directed by the Front End
Need the workloads installed on each node
Can reside in a cloud service provider’s data center (Amazon) or in a local internal cluster
6. SYNTHETIC WORKLOAD ANALYSIS TOOLKIT (SWAT)
FRONT END (UI)
Major steps in running a workload:
Select nodes (instances)
Select Workload Container
  ‒ Hadoop flavor
  ‒ Other container (memcached)
Select Actual Workload:
  ‒ Basic Hadoop jobs
  ‒ CloudSuite benchmarks
    ‒ Data Analytics (Mahout)
    ‒ Data Serving (Cassandra)
    ‒ Media Streaming (Darwin)
    ‒ Software Testing (Cloud9)
    ‒ Web Search (Nutch)
    ‒ Web Serving (Olio, Faban)
  ‒ GraphLab
  ‒ McBlaster (memcached)
Start selected job in batch mode or standalone mode
7. SWAT WORKFLOW STEPS
1. Cluster selection:
8. SWAT WORKFLOW STEPS
2. Workload Container Selection:
9. SWAT WORKFLOW STEPS
3. Workload selection:
10. SWAT WORKFLOW STEPS
4. Job initiation, progress and termination:
11. IPYTHON BASICS
What is IPython?
‒ IPython stands for Interactive Python
‒ A better Python interpreter
‒ Interactive IDE for Python (QTConsole…)
‒ Better web-based front end for interactive analysis of scientific data (IPython notebooks)
‒ Also has a parallel execution engine for running workloads on a cluster (not as full-featured as SWAT)
Installing
‒ On Windows®, install as an all-in-one package: Enthought Canopy, Anaconda, ActivePython, pythonxy
‒ On Linux®, install the components separately (assuming setuptools is already installed):
  ‒ easy_install ipython[zmq,qtconsole,notebook]
  ‒ easy_install pandas
  ‒ easy_install matplotlib
12. IPYTHON
INTERFACES
Starting IPython
ipython
‒ opens the default shell-like console
ipython qtconsole --pylab=inline
‒ opens a graphical Qt-based console
ipython notebook --pylab=inline
‒ starts a server and opens a browser window pointing to the server instance
‒ The server dashboard lists the notebooks that exist on the machine and offers ways to create new ones
‒ Users can connect to this server instance remotely and collaborate by working on the same notebook
‒ We use this feature for our SWAT log analysis and bottleneck-finding experiments
13. IPYTHON
NOTEBOOK DASHBOARD
Note:
All three modes offer similar functionality.
We use the notebook interface because it makes sharing and remote collaboration among multiple users easy.
14. IPYTHON NOTEBOOK EXAMPLE SESSION
Any python expression can be run at the console prompt:
In [1]: a=range(1,10)
In [2]: a
Out[2]: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Also run basic shell commands:
In [3]: pwd
Out [3]: u'c:monirapu13sumatra'
In [4]: cd ..
c:monir
You can even capture the output of any shell command or script in a Python variable
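In IPython itself, shell output can be captured with the `!` syntax, for example `files = !ls`, which stores the output lines in a list-like Python variable. The sketch below shows the same idea in plain Python with the standard `subprocess` module, so it also runs outside IPython. The helper name `capture_lines` is hypothetical and is not part of IPython or SWAT.

```python
import subprocess
import sys


def capture_lines(command):
    """Run a command (a list of arguments) and return its stdout as a list of lines."""
    result = subprocess.run(
        command,
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.splitlines()


# Use the current Python interpreter as a portable stand-in for a shell command.
lines = capture_lines([sys.executable, "-c", "print('node1'); print('node2')"])
print(lines)
```

Once the output is an ordinary list of strings, it can be filtered or parsed like any other Python data.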
15. A COMMON PATTERN IN OUR DATA ANALYSIS
Read Data from sources:
• CSV, JSON, Excel, Raw log file, DB connection….
Munge data into data structure suitable for plotting
• Get it into a DataFrame (pandas library)
Do final plotting
• df.plot()
Do any sub-range plotting for further details
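The first three steps of this pattern can be sketched as follows. This is a minimal illustration, not the actual SWAT analysis code: the sample data is made up, an in-memory string stands in for a real log file, and the image is written to a memory buffer (a file name would work as well).

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Made-up sample data standing in for a real log file.
raw_csv = io.StringIO(
    "date,time,%idle\n"
    "2013-02-05,10:00:00,95\n"
    "2013-02-05,10:00:05,40\n"
    "2013-02-05,10:00:10,12\n"
    "2013-02-05,10:00:15,88\n"
)

# 1. Read the data from its source.
df = pd.read_csv(raw_csv)

# 2. Munge: combine date and time into one timestamp and use it as the index.
df["date_time"] = pd.to_datetime(df["date"] + " " + df["time"])
df = df.set_index("date_time")[["%idle"]]

# 3. Plot the whole series and save it as a PNG image.
ax = df.plot()
ax.set_ylabel("% idle")
buf = io.BytesIO()
ax.figure.savefig(buf, format="png")
```

With a time-based index in place, pandas labels the x axis with timestamps automatically.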
16. TYPICAL DATA ANALYSIS SESSION
import pandas as pd
import matplotlib.pyplot as plt
……….
swat_archive = '/var/www/html/repo'
……….
cd $swat_archive/Job_TerasortExperiment_Feb5/job_00_00
……….
df1 = pd.read_csv('vmstat.csv', parse_dates=[['date','time']], usecols=['date', 'time', '%idle'])
df1 = df1.set_index('date_time')
df1.plot()
17. TYPICAL DATA ANALYSIS SESSION - 2
18. REUSE INTERACTIONS AS GENERIC FUNCTION
Once you are satisfied with your interactions, put them in a script
The script has to use ipython as its #! interpreter
#!/usr/bin/ipython
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd
# ……….
task_list = [
    "CPU_Timeline_User_plus_Sys",
    "Network_Timeline_Tx_plus_Rx",
    "DISK_Timeline_Read_Write_MBps",
    "Compare_metrics_bottleneck",
    "Compare_metrics_bottleneck_smoothed",
    "Webserver_metrics",
    # ………., about 20 total tasks
]

def bottleneck(job, mode='cpu', img_name=None, nodes='All',
               alt_repo=None, nb=True, cat=None, arg=17, ifc='eth0',
               smooth=None):
    # create a custom graph and put it under images folder
    …

def timeline(args……..):
    # similar function
    ….

def main():
    # parse command line args and call appropriate graphing function
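One possible shape for such a `main()` is sketched below, using the standard `argparse` module and a dictionary that maps task names to graphing functions. This is an assumption about the structure, not the actual SWAT script: the two graphing functions are placeholders that only return a string, and the `--task` option is invented for illustration.

```python
import argparse


def cpu_timeline(batch, job):
    # Placeholder: a real version would read the logs and write an image.
    return "CPU timeline for batch {} job {}".format(batch, job)


def network_timeline(batch, job):
    # Placeholder, same idea as cpu_timeline.
    return "Network timeline for batch {} job {}".format(batch, job)


# Map task names to the functions that produce them.
TASKS = {
    "CPU_Timeline_User_plus_Sys": cpu_timeline,
    "Network_Timeline_Tx_plus_Rx": network_timeline,
}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Create graphs for one SWAT job")
    parser.add_argument("batch_num")
    parser.add_argument("job_num")
    parser.add_argument("--task", choices=sorted(TASKS), action="append",
                        help="task to run; may be repeated; default is all tasks")
    args = parser.parse_args(argv)
    selected = args.task or sorted(TASKS)
    return [TASKS[name](args.batch_num, args.job_num) for name in selected]


# Passing a list instead of reading sys.argv makes the sketch easy to try out.
print(main(["00", "00", "--task", "CPU_Timeline_User_plus_Sys"]))
```

Keeping the name-to-function mapping in one dictionary means adding a new graph type only requires one new function and one new entry.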
19. SWAT POST RUN
The SWAT post-job-completion script now calls our graphing script
‒ graph_command_line.ipy current_batch_num current_job_num
‒ Graphing script can also be run manually to create more fine tuned graphs
‒ Same can be done from Ipython notebook:
20. DATA IS BEAUTIFUL
CPU UTILIZATION FROM ONE RUN
21. DATA IS BEAUTIFUL - 2
CPU UTILIZATION DATA FROM TWO RUNS
22. DATA IS BEAUTIFUL - 3
NETWORK UTILIZATION
23. DATA IS BEAUTIFUL - 4
PERFORMANCE COUNTER DATA
24. DATA IS BEAUTIFUL ?
PER-CORE CPU UTILIZATION
25. LIMITATIONS
Images are static: no zooming
Zooming can be approximated by interactively plotting a chosen sub-range of the series
Example sub-range plot:
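A minimal sketch of selecting such a sub-range with pandas, assuming the series is indexed by timestamp. The utilization values and the series name `cpu_busy_pct` are made up for illustration.

```python
import pandas as pd

# Made-up CPU utilization samples, one every 10 seconds.
index = pd.date_range("2013-02-05 10:00:00", periods=12,
                      freq=pd.Timedelta(seconds=10))
cpu = pd.Series([5, 10, 20, 45, 80, 97, 99, 98, 60, 30, 10, 5],
                index=index, name="cpu_busy_pct")

# Label-based slicing on a DatetimeIndex includes both endpoints.
zoomed = cpu.loc["2013-02-05 10:00:30":"2013-02-05 10:01:00"]
print(zoomed)

# In a notebook, zoomed.plot() would then draw only this window.
```

Re-running the cell with different start and end labels gives the effect of zooming in or out.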
26. HOW IS THIS HELPING US?
27. BOTTLENECK ANALYSIS
Workloads utilize the system resources
Certain workloads are CPU bound, while others are disk, network, or memory bound
IPython helps explore system logs and create graphs showing resource utilization
‒ A resource that is near 100% utilization is the bottleneck
‒ If the system is not at a bottleneck, increase the load by reconfiguring the workload and repeat
‒ If at a bottleneck, try adding more resources and repeat
‒ Example scenarios in following slides
Enables us to optimize/characterize systems for certain workloads
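The decision rule above can be sketched in a few lines. This is an illustration only: the 90% threshold is an assumed stand-in for "near 100%", the utilization figures are made up, and `find_bottleneck` is a hypothetical helper, not a function from the SWAT graphing script.

```python
def find_bottleneck(utilization, threshold=90.0):
    """Return the name of the most utilized resource if it is at or above
    the threshold (in percent), otherwise None."""
    name = max(utilization, key=utilization.get)
    return name if utilization[name] >= threshold else None


# Made-up average utilization figures (percent) for two runs.
run_a = {"cpu": 97.5, "disk": 41.0, "network": 22.3, "memory": 63.0}
run_b = {"cpu": 55.0, "disk": 38.0, "network": 17.9, "memory": 60.2}

for label, run in (("run_a", run_a), ("run_b", run_b)):
    resource = find_bottleneck(run)
    if resource is None:
        print(label, "no bottleneck yet: increase the load and repeat")
    else:
        print(label, "bottleneck:", resource, "- add resources and repeat")
```

In practice the utilization numbers would come from the per-resource timelines rather than a single average.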
30. WORKLOAD TUNING
(WORK IN PROGRESS)
1. Analyze workload logs of past SWAT runs from an IPython notebook
2. Check resource utilization/performance of a run
3. Analyze the associated workload configuration and correlate
4. Create a new workload configuration based on observations and tuning insights
5. Push new config to SWAT template library
6. Initiate new run from SWAT using the new config
7. Repeat as needed to find optimal configuration
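The repeat-until-optimal loop in these steps could look roughly like the sketch below. Everything here is hypothetical: `run_workload` is a stand-in that returns a made-up runtime instead of launching a SWAT job, and `map_tasks` is an invented configuration parameter used only to make the example concrete.

```python
def run_workload(config):
    """Hypothetical stand-in for a SWAT run: returns a made-up runtime in seconds.

    A real version would launch the job and read the runtime from its logs.
    """
    # Toy model: the runtime is lowest when map_tasks is 8.
    return 100 + 5 * abs(config["map_tasks"] - 8)


def tune(candidates):
    """Run every candidate configuration and return (best_config, history)."""
    history = [(config, run_workload(config)) for config in candidates]
    best_config, _ = min(history, key=lambda item: item[1])
    return best_config, history


candidates = [{"map_tasks": n} for n in (2, 4, 8, 16)]
best, history = tune(candidates)
for config, runtime in history:
    print(config, runtime)
print("best:", best)
```

Keeping the full history, not just the best result, makes it possible to correlate each configuration with its logs afterwards.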
31. CONCLUSION
IPython notebook is a great tool for interactive data analysis
‒ Best for short exploratory sessions, not for developing a large code base
‒ Once done exploring, put the code into scripts for reuse
It has limitations, and there are alternatives:
‒ Tableau®: interactive à la carte plots; needs licensing; no custom graphing, but the menu has lots of choices
‒ d3.js: needs coding
‒ OpenTSDB: time series charts updated dynamically
‒ etc.
32. REFERENCES
Ipython tutorials
Pycon 2013: “IPython in-depth: high-productivity interactive and parallel python”
http://www.youtube.com/watch?v=bP8ydKBCZiY
SciPy2013: “IPython in Depth”:
http://www.youtube.com/watch?v=xe_ATRmw0KM
IPython website: http://ipython.org/
Check the videos link for more tutorials
Support: Stack Overflow tags: ipython, ipython-notebook