PyData: The Next Generation | Data Day Texas 2015

Speaker: Wes McKinney
Data Day Texas 2015

It's 2015 and the data system landscape is continuing to evolve at a rapid pace. This talk will give an overview of where Python and the "PyData" stack of software stand right now, where they're headed, and where more industry and community energy is needed.


  1. PyData: The Next Generation
     Wes McKinney (@wesmckinn), Data Day Texas 2015, #ddtx15
     © Cloudera, Inc. All rights reserved.
  2. PyData: Everything’s awesome… or is it?
     Wes McKinney (@wesmckinn), Data Day Texas 2015, #ddtx15
  3. Me
     • Data systems, tools, Python guru at Cloudera
     • Formerly Founder/CEO of DataPad (visual analytics startup)
     • Created pandas in 2008, lead developer until 2013
     • Python for Data Analysis, published 10/2012
     • O’Reilly’s best-selling data book of 2014
     • Pythonista since 2007
  4. What’s this about?
     • Hopes and fears for the community and ecosystem
     • Why do I care?
       • Python is fun!
       • Leverage
       • Accessibility for newbies
       • Community: smart, nice, humble people
  5. Python at Cloudera
     • Want Cloudera platform users to be successful with Python
     • Spark/PySpark part of the Enterprise Data Hub / CDH
     • Actively investing in Python tooling
     • (p.s. we’re hiring?)
     • (p.p.s. we have an Austin office now!)
  6. Historical perspective and background
     • 20 years of fast numerical computing in Python (Numeric 1995)
     • 10 years of NumPy
     • PyData becomes a thing in 2012
     • Python as a data language goes mainstream
     • Job descriptions tell all
     • Shift in larger Python community from web towards data
     • PyCon 2015 committee reported substantial growth in data-related submissions!
  7. How’d this happen?
     • Data, data everywhere
     • Science! scikit-learn, statsmodels, and friends
     • Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
     • IPython Notebook
     • Learning resources (books, conferences, blogs, etc.)
     • Python environment/library management that “just works”
  8. Put a Python (interface) on it!
     Something no one got fired for, ever.
  9. Meanwhile…
     • Hadoop and Big Data go mainstream from 2009 onward
     • First Hadoop World: Fall 2009
     • First Strata conference: Spring 2011
     • Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
     • Solutions built, frameworks developed, companies founded
     • Python was generally not a central part of those solutions
     • A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)
  10. We’re lucky to have lots of nice things
      • What a language!
      • IPython: interactive computing and collaboration
      • Libraries to solve nearly any (non-big data) problem
      • Trustworthy (medium) data wrangling, statistics, machine learning
      • HPC / GPU / parallel computing frameworks
      • FFI tools
      • … and much more
  11. “If this isn’t nice, what is?” (Kurt Vonnegut)
  12. So, what kind of big data?
      • Big multidimensional arrays / linear algebra
      • Big tables (structured data)
      • Big text data (unstructured data)
      • Empirically I personally am mostly interested in big tables
  13. What kind of big data problems?
      • ETL / Data Wrangling
      • Python has been used here for years with Hadoop Streaming
      • BI / Analytics (“things you can do in SQL”)
      • Advanced Analytics / Machine Learning
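
As a rough illustration of the Hadoop Streaming style of ETL mentioned on this slide, here is a minimal word-count sketch. Streaming jobs exchange tab-separated text lines over stdin/stdout; the function names `mapper` and `reducer` and the sample input are assumptions made for this example, and the local `sorted` call only stands in for the shuffle that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the shuffle: sort the mapper output before reducing.
sample_input = ["to be or", "not to be"]
word_counts = list(reducer(sorted(mapper(sample_input))))
```

In a real job the two functions would live in separate scripts that read `sys.stdin` and print their output; keeping them as generators here makes the logic easy to test without a cluster.
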
  14. Some ways we are #winning
      • Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
      • Huge uptake in the financial sector
      • Many current and upcoming generations of data scientists learning Python as a first language
      • Python in HPC / scientific computing
  15. Some ways we are not #winning
      • Python still doesn’t have a great “big data story”
      • Little venture capital trickling down to Python projects
      • Data structures and programming APIs lagging modern realities
      • Weak support for emerging data formats
      • Many companies with Python big data successes have not open-sourced their work
  16. Python in big data workflows in practice
      [Diagram. Big data, many machines: HDFS, Hadoop-MR, Spark, SQL. Small/medium data, one machine: pandas, viz tools, ML / stats. Other labels: more counting / ETL; more insights / reporting; DSLs.]
  17. Big data storage formats
      • JSON and CSV are not a good way to warehouse data
      • Apache Avro
        • Compact binary data serialization format
        • RPC framework
      • Apache Parquet
        • Efficient columnar data format optimized for HDFS
        • Supports nested and repeated fields, compression, encoding schemes
        • Co-developed by Twitter and Cloudera
        • Reference implementations in Impala (C++) and standalone Java/Scala (used in Spark)
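
To make the columnar idea on this slide concrete, the sketch below pivots row-oriented records into one list per field and run-length encodes a repetitive column. It is a toy illustration in plain Python, not Parquet's actual file layout: the helper names `to_columns` and `run_length_encode` and the sample rows are assumptions, and run-length encoding is used only as a simple stand-in for the encoding schemes the slide mentions.

```python
from itertools import groupby

# Row-oriented records, as they might arrive from JSON or CSV.
rows = [
    {"country": "US", "clicks": 3},
    {"country": "US", "clicks": 5},
    {"country": "DE", "clicks": 2},
    {"country": "DE", "clicks": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per field (a columnar layout)."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)
# An aggregate over one field only touches that field's column.
total_clicks = sum(columns["clicks"])
# Repetitive columns compress well once their values are stored together.
encoded_country = run_length_encode(columns["country"])
```

The point of the layout is that an analytical scan reads only the columns it needs, and values of one type stored side by side are much easier to encode compactly than interleaved rows.
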
  18. We’re living in a JVM world
      • Scala rapidly taking over big data analytics
      • Functional, concise, good for building high level DSLs
      • Build nice Scala APIs to clunkier Java frameworks
      • JVM legitimately good for concurrent, distributed systems
      • Binary interface with Python a major issue
  19. Dremel, baby, Dremel…
      • VLDB 2010: “Dremel: Interactive Analysis of Web-Scale Datasets”
      • Inspiration for Parquet (cf. blog “Dremel made easy with Parquet”)
      • Peta-scale analytics directly on nested data
      • Google BigQuery said to be an IaaS-ification of Dremel
      • Supports SQL variant + new user-defined functions with JavaScript + V8

        SELECT COUNT(c1 > c2) FROM
          (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
                  SUM(a.b.p.q.r) WITHIN RECORD AS c2
           FROM T3)
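
The query on this slide aggregates over repeated fields inside each record. The sketch below spells out one plausible reading of it in plain Python: for every record, sum all `a.b.c.d` values and all `a.b.p.q.r` values, then count the records where the first sum exceeds the second. Treating `COUNT(c1 > c2)` as "number of records where the comparison holds" is an assumption about the dialect, and the helper `leaf_values` and the sample table `T3` are made up for illustration.

```python
def leaf_values(node, path):
    """Yield every value reachable along `path`, descending into repeated (list) fields."""
    if isinstance(node, list):
        for item in node:
            yield from leaf_values(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from leaf_values(node[path[0]], path[1:])


def count_c1_gt_c2(records):
    """Count records whose within-record sum of a.b.c.d exceeds that of a.b.p.q.r."""
    count = 0
    for record in records:
        # Note: a missing field sums to 0 here, whereas SQL SUM would give NULL.
        c1 = sum(leaf_values(record, ["a", "b", "c", "d"]))
        c2 = sum(leaf_values(record, ["a", "b", "p", "q", "r"]))
        if c1 > c2:
            count += 1
    return count


# Hypothetical nested records; lists model repeated fields.
T3 = [
    {"a": {"b": [{"c": [{"d": 4}, {"d": 6}], "p": {"q": [{"r": 3}]}}]}},
    {"a": {"b": [{"c": [{"d": 1}], "p": {"q": [{"r": 2}, {"r": 5}]}}]}},
]
matching_records = count_c1_gt_c2(T3)
```

Walking nested dictionaries like this is exactly the kind of per-record Python work that does not scale, which is why the talk argues for data structures that represent Dremel-style data natively.
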
  20. Cloudera Impala
      • Open-source interactive SQL for Hadoop
      • Analytical query processor written in C++ with LLVM code generation
      • Optimized to scan tables (best as Parquet format) in HDFS
      • SQL front-end and query optimizer / planner
      • User-defined function API (C++)
      • impyla enables Python UDFs to be compiled with Numba to LLVM IR
  21. Cloudera Impala (cont’d)
      • For high performance big data analytics, Impala could be Python’s best friend
      • C++/LLVM backend is lower-level than SQL
      • Nested data support is coming
  22. Some interesting things in recent times
  23. Set point: Hadley Wickham
      • R has upped its game with dplyr, tidyr, and other new projects
      • New standard for a uniform interface to either in-memory or in-database data processing
      • Composable table primitive operations
      • Multiple major versions shipped, getting adopted

        80dc69b 2012-10-28 | Initial commit of dplyr [hadley]

        tbl %>%
          filter(c == 'bar') %>%
          group_by(a, b) %>%
          summarise(metric = mean(d - f)) %>%
          arrange(desc(metric))
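
For readers who do not use R, the dplyr pipeline on this slide can be read step by step as filter, group, aggregate, sort. The sketch below mirrors those four verbs using only the Python standard library so that it runs anywhere; the sample table `tbl` and the function name `summarise_metric` are assumptions for the example, and pandas users would normally express the same thing with boolean indexing and `groupby`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical table with the columns the slide's pipeline refers to.
tbl = [
    {"a": "x", "b": 1, "c": "bar", "d": 10.0, "f": 4.0},
    {"a": "x", "b": 1, "c": "bar", "d": 8.0, "f": 2.0},
    {"a": "y", "b": 2, "c": "bar", "d": 9.0, "f": 1.0},
    {"a": "y", "b": 2, "c": "foo", "d": 100.0, "f": 0.0},
]


def summarise_metric(rows):
    # filter(c == 'bar')
    kept = (row for row in rows if row["c"] == "bar")
    # group_by(a, b), collecting d - f for each group
    groups = defaultdict(list)
    for row in kept:
        groups[(row["a"], row["b"])].append(row["d"] - row["f"])
    # summarise(metric = mean(d - f))
    summary = [
        {"a": a, "b": b, "metric": mean(diffs)} for (a, b), diffs in groups.items()
    ]
    # arrange(desc(metric))
    return sorted(summary, key=lambda row: row["metric"], reverse=True)


result = summarise_metric(tbl)
```

Each comment maps one dplyr verb to its Python counterpart, which shows how much bookkeeping the composable verbs hide.
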
  24. Blaze
      • Shares some semantics with dplyr
      • Uses a generalized datashape protocol
      • Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
      • Deferred expression API
      • Support for piping data between storage systems
      • Multiple backends (pandas, SQL, MongoDB, PySpark, …)
      • Growing support for out-of-core analytics
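
The combination of a deferred expression API and multiple backends can be sketched in a few lines: building an expression only records what was asked for, and separate backends decide how to run it. Everything below (`Column`, `Greater`, `Filter`, `to_sql`, `run_in_memory`, the `trades` table) is a hypothetical miniature written for this example; it is not Blaze's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str

    def __gt__(self, other):
        # Comparing a column builds an expression node instead of computing a result.
        return Greater(self, other)


@dataclass(frozen=True)
class Greater:
    column: Column
    value: float


@dataclass(frozen=True)
class Filter:
    table: str
    predicate: Greater


def to_sql(expr):
    """Backend 1: render the expression as a SQL string."""
    p = expr.predicate
    return f"SELECT * FROM {expr.table} WHERE {p.column.name} > {p.value}"


def run_in_memory(expr, tables):
    """Backend 2: evaluate the same expression over lists of dicts."""
    p = expr.predicate
    return [row for row in tables[expr.table] if row[p.column.name] > p.value]


# Nothing is computed when the expression is built.
expr = Filter("trades", Column("price") > 100)

sql_text = to_sql(expr)
in_memory_rows = run_in_memory(expr, {"trades": [{"price": 90}, {"price": 120}]})
```

Because the expression is plain data, the same user code can target a database, an in-memory table, or a cluster without being rewritten.
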
  25. libdynd
      • Led by Mark Wiebe at Continuum Analytics
      • Pure C++11 modern reimagining of NumPy
      • Python bindings
      • Supports variadic data cells and nested types (datashape protocol)
      • Development has focused on the data container design over analytics
  26. PySpark
      • Popularity may exceed official Scala API
      • Spark was not exactly designed to be an ideal companion to Python
      • General architecture
        • Users build Spark deferred expression graphs in Python
        • User-supplied functions are serialized and broadcast around the cluster
        • Spark plans job and breaks work into tasks executed by Python worker jobs
        • Data is managed / shuffled by the Spark Scala master process
        • Python used largely as a black box to transform input to output
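
The architecture described on this slide (build a deferred graph, split the job into per-partition tasks, run user functions as black boxes) can be mimicked in a single process. The class `MiniRDD` below is a hypothetical stand-in written for this example, not PySpark's implementation; it runs tasks one after another where Spark would ship them to Python workers across the cluster.

```python
class MiniRDD:
    """Hypothetical stand-in for an RDD: records steps lazily, runs them per partition."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions
        self.steps = steps  # tuple of (kind, function); nothing has run yet

    def map(self, fn):
        return MiniRDD(self.partitions, self.steps + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self.partitions, self.steps + (("filter", fn),))

    def _run_task(self, partition):
        # One "task": apply every recorded step to a single partition.
        data = iter(partition)
        for kind, fn in self.steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def collect(self):
        # Spark would execute these tasks in Python worker processes;
        # here they run sequentially in this process.
        return [item for part in self.partitions for item in self._run_task(part)]


# Building the pipeline only records the steps.
rdd = MiniRDD([[1, 2, 3], [4, 5, 6]]).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
even_squares = rdd.collect()
```

The user functions are opaque to the scheduler: it knows only that each one turns an input stream into an output stream, which is the "black box" role the slide describes.
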
  27. PySpark: Some more gory details
      • Spark master controlled using Py4J
      • Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
      • Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
      • Does not natively interface with NumPy (yet)
      • But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides

        # pass large object by py4j is very slow and need much memory
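
To show what marshalling data with pickle implies, the sketch below passes a batch of rows through a worker as bytes: the batch is serialized, deserialized inside the worker, transformed, and serialized again on the way out. The function names `serialize_batch`, `worker`, and `double_amount` are assumptions for this example; it illustrates the round trip only, not PySpark's actual wire format.

```python
import pickle


def serialize_batch(rows):
    """Sender-side stand-in: turn a batch of rows into bytes."""
    return pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)


def worker(payload, fn):
    """Python worker stand-in: bytes in, bytes out."""
    rows = pickle.loads(payload)
    return pickle.dumps([fn(row) for row in rows], protocol=pickle.HIGHEST_PROTOCOL)


def double_amount(row):
    """Example user function applied to each row."""
    return {**row, "amount": row["amount"] * 2}


payload = serialize_batch([{"id": 1, "amount": 10}, {"id": 2, "amount": 15}])
result = pickle.loads(worker(payload, double_amount))
```

Every row crosses the process boundary twice as generic Python objects, which is the overhead that motivates the call for more efficient binary protocols later in the talk.
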
  28. Spartan
      • http://github.com/spartan-array/spartan
      • Python distributed array expression evaluator (“distributed NumPy”)
      • Developed by Russell Power & others at NYU
      • Uses ZeroMQ and custom RPC implementation
  29. Things I think we should do
      • Create high fidelity data structures for Dremel-style data
      • Get serious about Avro, Parquet, and other new data format standards
      • Invest in the Python-Impala-LLVM relationship
      • Efficient binary protocols to receive and emit data from Python processes
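
As one possible reading of the last bullet, an efficient binary protocol could move a whole column as a length-prefixed block of fixed-width values rather than as serialized Python objects. The sketch below frames a float64 column with the standard `struct` module; the framing (an 8-byte count followed by little-endian doubles) and the function names are assumptions invented for this example, not an existing standard.

```python
import struct

# Little-endian uint64: the number of float64 values that follow.
HEADER = struct.Struct("<Q")


def encode_float64_column(values):
    """Frame a column as an 8-byte count followed by packed little-endian float64s."""
    return HEADER.pack(len(values)) + struct.pack(f"<{len(values)}d", *values)


def decode_float64_column(payload):
    """Inverse of encode_float64_column."""
    (count,) = HEADER.unpack_from(payload, 0)
    return list(struct.unpack_from(f"<{count}d", payload, HEADER.size))


column = [1.5, 2.25, -3.0]
payload = encode_float64_column(column)
round_tripped = decode_float64_column(payload)
```

A fixed layout like this can be produced and consumed by C++, the JVM, or Python without any language-specific object serialization, which is the property the bullet asks for.
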
  30. Conclusions
      • Python + PyData stack is as strong as ever, and still gaining momentum
      • The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
      • Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding.
  31. Thank you
      Wes McKinney (@wesmckinn), wes@cloudera.com
