Proposed Talk Outline for PyCon 2017
1. NAVIGATING THE PYTHON ECOSYSTEM FOR DATA SCIENCE
Ananth Krishnamoorthy, Ph.D.
Outline Slides for Talk at PyCon 2017
2. Summary
• In their day-to-day work, data scientists and data science teams face challenges in many overlapping yet distinct areas, such as Reporting, Data Processing & Storage, Scientific Computing, ML Modelling, and Application Development. To succeed, data science teams, especially small ones, need a deep appreciation of how these areas depend on one another and affect their success.
• The Python ecosystem for data science offers a number of tools and libraries for various aspects of data science, including Machine Learning, Cluster Computing, and Scientific Computing.
• The idea of this talk is to understand what the Python data science ecosystem offers (so that you don't reinvent it) and what some common gaps are (so that you don't go blue in the face looking for answers).
• In this talk, we describe how different tools and libraries fit into the machine learning model development and deployment workflow, and how they work (and don't work) with each other. It is intended as a landscape survey of the Python data science ecosystem, along with a mention of some common gaps that practitioners may notice as they put together a stack and/or an application for their company.
3. Evolving Role of Data Science Teams
The most important trait of the Analytics 3.0 era is that not only online firms, but virtually any type of firm in any industry, can participate in the data economy. Banks, industrial manufacturers, health care providers, retailers: any company in any industry that is willing to exploit the possibilities can develop data-based offerings for customers, as well as support internal decisions with big data.

Analytics 1.0
• Data: enterprise data; structured transactional data
• Tools: spreadsheets; BI; OLAP; ETL; on-premise servers
• Activity: the majority of analytical activity was descriptive analytics, or reporting; creating analytical models was a time-consuming "batch" process

Analytics 2.0
• Data: bring in web and social data; complex, large, semistructured data sources
• Tools: visualization; NoSQL; Hadoop; machine learning; artificial intelligence
• Activity: visual analytics dominates; predictive and prescriptive techniques; develop products, not PowerPoints or reports

Analytics 3.0
• Data: GPS, mobile device, clickstream, and sensor data; unstructured, real-time, streaming
• Tools: on-demand everything; analytical apps; integrated, embedded models
• Activity: analytics integral to running the business, a strategic asset; rapid and agile insight delivery; analytical tools available at the point of decision

Source: The Rise of Analytics 3.0, Thomas H. Davenport, IIA, 2013
4. Machine Learning vs Real-World Data Science
Real-world data science spans much more than machine learning:
• Machine Learning
• Deployment
• Application Development
• Big Data Processing
• Data Storage
• ETL
5. Challenges Faced by Data Science Teams
• Data science requires many more competencies than can reasonably be expected of one person
• Challenges are greater for smaller teams and smaller companies, e.g. startups
• These challenges create dependencies on other teams, e.g. Development
• Dependencies slow down execution and the realization of benefits
6. Plethora of Choices
• Reporting: SQL, Charting
• Data Processing & Storage: SQL, NoSQL, Graph DB, OLAP, ETL, Cluster Computing, Stream Processing, MapReduce
• Scientific Computing: Statistics, Signal Processing, Optimization, Time Series Analysis, Simulation
• ML Modelling: ML, Deep Learning, Dimensionality Reduction
• Application Development: Cloud, Front End, Back End, Microservices
7. Data Science Workflow
ETL → Process → Model → Store → Deploy
Data scientist skills span these stages. Infrastructure and provisioning: ???
8. Python Ecosystem
ETL → Process → Model → Store → Deploy
• ETL: Odo, Blaze
• Process: Pandas, Dask, Spark
• Model: sklearn_pandas, Scikit-learn, Keras, Spark MLlib
• Deploy: Bokeh, Jupyter
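As a taste of how the ETL end of this stack chains together, here is a minimal sketch using odo's odo(source, target) call; the file and table names are hypothetical.

    import pandas as pd
    from odo import odo

    # Load a CSV file (hypothetical name) into an in-memory DataFrame
    df = odo('transactions.csv', pd.DataFrame)

    # Push the same data into a SQL table; odo infers the backend from the
    # URI, so the call shape stays identical across storage targets
    odo(df, 'sqlite:///warehouse.db::transactions')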
9. Review of Key Tools
(50% of talk time spent here, more slides to be added)
• Jupyter
• Pandas
• Scikit-Learn
• Keras / TensorFlow / Theano
• Matplotlib/Bokeh
• Blaze
• Odo
• Dask
• pySpark
We shall see some code snippets here to illustrate a few ideas. The goal is to know enough to pick the right components for the job at hand; a first example is sketched below.
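For instance, a first-look snippet of the kind this section would walk through, using Pandas inside Jupyter; the file and column names are hypothetical.

    import pandas as pd

    df = pd.read_csv('sales.csv', parse_dates=['date'])  # hypothetical dataset
    df.info()                  # column types and memory usage
    print(df.describe())       # summary statistics for numeric columns
    # One-line chart via pandas' matplotlib integration
    df.groupby('region')['amount'].sum().plot(kind='bar')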
10. Use Case 1: Small Data
This use case illustrates Small Data, i.e. desktop / in-memory processing (see the sketch below).
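A minimal sketch of the Small Data case, assuming a dataset with numeric features that fits comfortably in RAM; the file, column, and model choices are illustrative.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical dataset with numeric feature columns and a 'churned' label
    df = pd.read_csv('customers.csv')
    X = df.drop('churned', axis=1)
    y = df['churned']

    # Everything fits in memory, so a plain train/test split is enough
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print('held-out accuracy:', model.score(X_test, y_test))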
11. Use Case 2: 'Medium' Data
This use case illustrates 'Medium' Data, handled with out-of-core processing (see the sketch below).
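A minimal sketch of the Medium Data case using dask.dataframe, which mirrors a subset of the Pandas API and executes out-of-core; the file and column names are hypothetical.

    import dask.dataframe as dd

    # Hypothetical set of CSV shards, too large for RAM in total
    df = dd.read_csv('logs-*.csv')

    # This builds a lazy task graph; nothing is read yet
    daily = df.groupby('date')['bytes'].sum()

    # .compute() streams the chunks through memory and returns a pandas Series
    print(daily.compute())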
12. Use Case 3: Big Data
This use case illustrates Big Data, i.e. cluster computing (see the sketch below).
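A minimal sketch of the Big Data case with pySpark: the same logical groupby/aggregate workflow, now executed across a cluster; the paths and column names are hypothetical.

    from pyspark.sql import SparkSession

    # Spark 2.x entry point; cluster configuration comes from the environment
    spark = SparkSession.builder.appName('usecase3-big-data').getOrCreate()

    # Hypothetical event logs on HDFS
    df = spark.read.csv('hdfs:///data/events/*.csv', header=True, inferSchema=True)

    # Same groupby/aggregate idea as before, distributed across the cluster
    df.groupBy('user_id').count().orderBy('count', ascending=False).show(10)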
13. What Works
• Scikit-learn's consistent API and wide variety of ML algorithms
• Scikit-learn Pipelines
• Scikit-learn / Keras integration (sketched after this list)
• Pandas for Data Analysis
• ….
• ….
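A sketch of the scikit-learn / Keras integration point, assuming the 2017-era keras.wrappers.scikit_learn.KerasClassifier wrapper; the network shape and parameters are illustrative.

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    def build_model():
        # Tiny illustrative network; 20 input features assumed
        model = Sequential()
        model.add(Dense(16, activation='relu', input_dim=20))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    # The wrapped network behaves like any scikit-learn estimator, so it can
    # sit at the end of a Pipeline and go through cross-validation/grid search
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('net', KerasClassifier(build_fn=build_model, epochs=10, batch_size=32, verbose=0)),
    ])
    # pipe.fit(X_train, y_train); pipe.predict(X_test)  # same calls as any estimator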
14. Gaps – A Data Scientist's Perspective
• Uniform API across activities
• Separation of data, processing, and instructions
• Single data-structure paradigm
  • Support for in-memory, out-of-core, and distributed computing in the same paradigm, e.g. SFrame
• ETL
  • Push heavy lifting to backend systems
  • Monitoring workflows
• UI development
  • Bokeh
• Deployment
  • Applications
  • Web services