PyData: The Next Generation

A state of the union, with open questions, for Python, big data, and analytics from 2015 onward

Published in: Technology

  1. PyData: The Next Generation
     Wes McKinney (@wesmckinn)
     Data Day Texas 2015 (#ddtx15)
     © Cloudera, Inc. All rights reserved.
  2. PyData: Everything’s awesome… or is it?
     Wes McKinney (@wesmckinn), Data Day Texas 2015 (#ddtx15)
  3. Me
     • Data systems, tools, Python guru at Cloudera
     • Formerly Founder/CEO of DataPad (visual analytics startup)
     • Created pandas in 2008, lead developer until 2013
     • Python for Data Analysis, published 10/2012
         • O’Reilly’s best-selling data book of 2014
     • Pythonista since 2007
  4. What’s this about?
     • Hopes and fears for the community and ecosystem
     • Why do I care?
         • Python is fun!
         • Leverage
         • Accessibility for newbies
         • Community: smart, nice, humble people
  5. Python at Cloudera
     • Want Cloudera platform users to be successful with Python
     • Spark/PySpark part of the Enterprise Data Hub / CDH
     • Actively investing in Python tooling
         • (p.s. we’re hiring?)
         • (p.p.s. we have an Austin office now!)
  6. Historical perspective and background
     • 20 years of fast numerical computing in Python (Numeric 1995)
     • 10 years of NumPy
     • PyData becomes a thing in 2012
     • Python as a data language goes mainstream
         • Job descriptions tell all
         • Shift in larger Python community from web towards data
     • PyCon 2015 committee reported substantial growth in data-related submissions!
  7. How’d this happen?
     • Data, data everywhere
     • Science! scikit-learn, statsmodels, and friends
     • Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
     • IPython Notebook
     • Learning resources (books, conferences, blogs, etc.)
     • Python environment/library management that “just works”
  8. Put a Python (interface) on it!
     Something no one got fired for, ever.
  9. Meanwhile…
     • Hadoop and Big Data go mainstream from 2009 onward
         • First Hadoop World: Fall 2009
         • First Strata conference: Spring 2011
     • Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
     • Solutions built, frameworks developed, companies founded
     • Python was generally not a central part of those solutions
         • A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)
  10. We’re lucky to have lots of nice things
     • What a language!
     • IPython: interactive computing and collaboration
     • Libraries to solve nearly any (non-big data) problem
     • Trustworthy (medium) data wrangling, statistics, machine learning
     • HPC / GPU / parallel computing frameworks
     • FFI tools
     • … and much more
  11. “If this isn’t nice, what is?”
     (Kurt Vonnegut)
  12. So, what kind of big data?
     • Big multidimensional arrays / linear algebra
     • Big tables (structured data)
     • Big text data (unstructured data)
     • Personally, I am mostly interested in big tables
  13. What kind of big data problems?
     • ETL / Data Wrangling
         • Python has been used here for years with Hadoop Streaming
     • BI / Analytics (“things you can do in SQL”)
     • Advanced Analytics / Machine Learning
  14. Some ways we are #winning
     • Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
     • Huge uptake in the financial sector
     • Many current and upcoming generations of data scientists learning Python as a first language
     • Python in HPC / scientific computing
  15. Some ways we are not #winning
     • Python still doesn’t have a great “big data story”
     • Little venture capital trickling down to Python projects
     • Data structures and programming APIs lagging modern realities
     • Weak support for emerging data formats
     • Many companies with Python big data successes have not open-sourced their work
  16. Python in big data workflows in practice
     [Diagram; recoverable labels only]
     • Big data, many machines: HDFS, Hadoop-MR, Spark, SQL (more counting / ETL)
     • Small/medium data, one machine: pandas, viz tools, ML / stats (more insights / reporting)
     • Additional label: DSLs
  17. Big data storage formats
     • JSON and CSV are not a good way to warehouse data
     • Apache Avro
         • Compact binary data serialization format
         • RPC framework
     • Apache Parquet
         • Efficient columnar data format optimized for HDFS
         • Supports nested and repeated fields, compression, encoding schemes
         • Co-developed by Twitter and Cloudera
         • Reference implementations in Impala (C++) and standalone Java/Scala (used in Spark)
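To make the row-oriented versus columnar distinction on the storage-formats slide concrete, here is a minimal sketch using only the Python standard library. It is not Avro or Parquet: the two-field table, the fixed-width layout, and every function name are hypothetical, chosen only to show why an aggregate over one column touches fewer bytes in a columnar layout.

```python
import struct
from array import array

# Hypothetical table with two fields: (user_id: int64, amount: float64).
rows = [(1, 9.5), (2, 0.25), (3, 4.0)]

ROW = struct.Struct("<qd")  # one fixed-width record: int64 followed by float64


def encode_rows(rows):
    """Row-oriented layout: complete records stored one after another."""
    return b"".join(ROW.pack(uid, amt) for uid, amt in rows)


def encode_columns(rows):
    """Columnar layout: all user_ids in one buffer, all amounts in another."""
    ids = array("q", (uid for uid, _ in rows))
    amounts = array("d", (amt for _, amt in rows))
    return ids.tobytes(), amounts.tobytes()


def sum_amount_from_rows(buf):
    # Every record is decoded even though only one field is needed.
    return sum(amt for _, amt in ROW.iter_unpack(buf))


def sum_amount_from_columns(amount_buf):
    # Only the bytes of the aggregated column are read.
    col = array("d")
    col.frombytes(amount_buf)
    return sum(col)


row_buf = encode_rows(rows)
id_buf, amount_buf = encode_columns(rows)
print(len(row_buf), len(amount_buf))
print(sum_amount_from_rows(row_buf), sum_amount_from_columns(amount_buf))
```

Real columnar formats add compression, encodings, and nested-field handling on top of this idea; the sketch only isolates the layout difference.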
  18. We’re living in a JVM world
     • Scala rapidly taking over big data analytics
         • Functional, concise, good for building high level DSLs
         • Build nice Scala APIs to clunkier Java frameworks
     • JVM legitimately good for concurrent, distributed systems
     • Binary interface with Python a major issue
  19. Dremel, baby, Dremel…
     • VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
     • Inspiration for Parquet (cf. blog “Dremel made easy with Parquet”)
     • Peta-scale analytics directly on nested data
     • Google BigQuery said to be an IaaS-ification of Dremel
         • Supports SQL variant + new user-defined functions with JavaScript + V8

       SELECT COUNT(c1 > c2) FROM
         (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
                 SUM(a.b.p.q.r) WITHIN RECORD AS c2
          FROM T3)
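The Dremel query above aggregates over repeated, nested fields inside each record. As a rough illustration in plain Python, the sketch below walks nested dicts and lists to sum every leaf along a dotted path. The sample records, the helper names, and the reading of `COUNT(c1 > c2)` as "number of records where c1 exceeds c2" are all assumptions made for this example, not Dremel's or BigQuery's actual semantics.

```python
def values_at(node, path):
    """Yield every leaf reached by following `path` through nested dicts,
    fanning out over lists (repeated fields) along the way."""
    if isinstance(node, list):
        for item in node:
            yield from values_at(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from values_at(node[path[0]], path[1:])


def sum_within_record(record, dotted_path):
    """Rough analogue of SUM(path) WITHIN RECORD for a single record.

    A missing path sums to 0 here, whereas a SQL engine would likely
    return NULL; that simplification is deliberate.
    """
    return sum(values_at(record, dotted_path.split(".")))


# Hypothetical records standing in for table T3.
T3 = [
    {"a": {"b": [{"c": [{"d": 5}, {"d": 7}], "p": {"q": [{"r": 3}]}}]}},
    {"a": {"b": [{"c": [{"d": 1}], "p": {"q": [{"r": 4}, {"r": 6}]}}]}},
]

pairs = [
    (sum_within_record(rec, "a.b.c.d"), sum_within_record(rec, "a.b.p.q.r"))
    for rec in T3
]
count = sum(1 for c1, c2 in pairs if c1 > c2)
print(pairs, count)
```

A flat DataFrame cannot hold these records without first exploding the repeated fields, which is the motivation for the "high fidelity data structures for Dremel-style data" item later in the talk.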
  20. Cloudera Impala
     • Open-source interactive SQL for Hadoop
     • Analytical query processor written in C++ with LLVM code generation
     • Optimized to scan tables (best as Parquet format) in HDFS
     • SQL front-end and query optimizer / planner
     • User-defined function API (C++)
         • impyla enables Python UDFs to be compiled with Numba to LLVM IR
  21. Cloudera Impala (cont’d)
     • For high performance big data analytics, Impala could be Python’s best friend
     • C++/LLVM backend is lower-level than SQL
     • Nested data support is coming
  22. Some interesting things in recent times
  23. Set point: Hadley Wickham
     • R has upped its game with dplyr, tidyr, and other new projects
     • New standard for a uniform interface to either in-memory or in-database data processing
     • Composable table primitive operations
     • Multiple major versions shipped, getting adopted

       80dc69b 2012-10-28 | Initial commit of dplyr [hadley]

       tbl %>%
         filter(c == 'bar') %>%
         group_by(a, b) %>%
         summarise(metric = mean(d - f)) %>%
         arrange(desc(metric))
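For readers who think in Python, the "composable table primitive operations" idea behind the dplyr pipeline can be sketched with the standard library alone. This is not pandas or dplyr: the step constructors (`filter_rows`, `group_summarise`, `arrange_desc`), the `pipe` helper, and the sample table are hypothetical names invented for this illustration, operating on a list of dicts.

```python
from collections import defaultdict
from functools import reduce
from statistics import mean


def filter_rows(pred):
    """Keep only the rows for which pred(row) is true."""
    return lambda rows: [r for r in rows if pred(r)]


def group_summarise(keys, name, fn):
    """Group rows by `keys`, then reduce each group to one value called `name`."""
    def step(rows):
        groups = defaultdict(list)
        for r in rows:
            groups[tuple(r[k] for k in keys)].append(r)
        return [
            dict(zip(keys, key), **{name: fn(members)})
            for key, members in groups.items()
        ]
    return step


def arrange_desc(column):
    """Sort rows by `column`, largest first."""
    return lambda rows: sorted(rows, key=lambda r: r[column], reverse=True)


def pipe(rows, *steps):
    """Apply each step to the output of the previous one, like %>%."""
    return reduce(lambda acc, step: step(acc), steps, rows)


tbl = [
    {"a": "x", "b": 1, "c": "bar", "d": 10.0, "f": 4.0},
    {"a": "x", "b": 1, "c": "bar", "d": 8.0, "f": 4.0},
    {"a": "y", "b": 2, "c": "bar", "d": 20.0, "f": 5.0},
    {"a": "y", "b": 2, "c": "foo", "d": 99.0, "f": 1.0},
]

result = pipe(
    tbl,
    filter_rows(lambda r: r["c"] == "bar"),
    group_summarise(["a", "b"], "metric", lambda g: mean(r["d"] - r["f"] for r in g)),
    arrange_desc("metric"),
)
print(result)
```

Because each step is just a function from table to table, the same pipeline shape could in principle be handed to a different backend, which is the property the slide highlights.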
  24. Blaze
     • Shares some semantics with dplyr
     • Uses a generalized datashape protocol
     • Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
         • Deferred expression API
         • Support for piping data between storage systems
         • Multiple backends (pandas, SQL, MongoDB, PySpark, …)
         • Growing support for out-of-core analytics
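A deferred expression API with multiple backends can be illustrated in a few lines. The sketch below is not Blaze's API; the `Expr`, `Field`, and `BinOp` classes and the two tiny "backends" (`evaluate` for in-memory rows, `to_sql` for rendering a SQL fragment) are hypothetical, and only show the general idea of building an expression tree first and choosing how to run it later.

```python
import operator


class Expr:
    """A node in a deferred expression tree; building one does no work."""

    def __gt__(self, other):
        return BinOp(">", self, other)

    def __add__(self, other):
        return BinOp("+", self, other)


class Field(Expr):
    """A reference to a named column."""

    def __init__(self, name):
        self.name = name


class BinOp(Expr):
    """A binary operation over two sub-expressions or literals."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right


_OPS = {">": operator.gt, "+": operator.add}


def evaluate(expr, row):
    """In-memory backend: evaluate the tree against one dict-shaped row."""
    if isinstance(expr, Field):
        return row[expr.name]
    if isinstance(expr, BinOp):
        return _OPS[expr.op](evaluate(expr.left, row), evaluate(expr.right, row))
    return expr  # a plain Python literal


def to_sql(expr):
    """SQL backend: render the same tree as a SQL fragment."""
    if isinstance(expr, Field):
        return expr.name
    if isinstance(expr, BinOp):
        return f"({to_sql(expr.left)} {expr.op} {to_sql(expr.right)})"
    return repr(expr)


expr = (Field("d") + Field("f")) > 10   # nothing is computed yet
rows = [{"d": 3, "f": 4}, {"d": 8, "f": 5}]
selected = [r for r in rows if evaluate(expr, r)]
sql = to_sql(expr)
print(selected, sql)
```

The same `expr` object drives both backends, so the user-facing code does not change when the data moves from memory into a database.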
  25. libdynd
     • Led by Mark Wiebe at Continuum Analytics
     • Pure C++11 modern reimagining of NumPy
     • Python bindings
     • Supports variadic data cells and nested types (datashape protocol)
     • Development has focused on the data container design over analytics
  26. PySpark
     • Popularity may exceed official Scala API
     • Spark was not exactly designed to be an ideal companion to Python
     • General architecture
         • Users build Spark deferred expression graphs in Python
         • User-supplied functions are serialized and broadcast around the cluster
         • Spark plans job and breaks work into tasks executed by Python worker jobs
     • Data is managed / shuffled by the Spark Scala master process
     • Python used largely as a black box to transform input to output
  27. PySpark: Some more gory details
     • Spark master controlled using py4j
         • Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
     • Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
     • Does not natively interface with NumPy (yet)
     • But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides

       # pass large object by py4j is very slow and need much memory
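The marshalling point above, and the later call for "efficient binary protocols to receive and emit data from Python processes", can be illustrated with a small standard-library comparison. This is not PySpark's serializer: the length-prefixed frame format and the `emit` / `receive` names are hypothetical. The sketch contrasts generic pickling of a list of Python floats with a packed float64 buffer that can be handed to a consumer without rebuilding one Python object per value.

```python
import pickle
import struct
from array import array

HEADER = struct.Struct("<Q")  # hypothetical frame header: element count as uint64


def emit(values):
    """Pack floats as a count header followed by raw float64 bytes.

    array uses the machine's native byte order, so this frame is only
    meant for processes on the same host.
    """
    return HEADER.pack(len(values)) + array("d", values).tobytes()


def receive(buf):
    """Decode a frame produced by emit() into a typed array of doubles."""
    (count,) = HEADER.unpack_from(buf, 0)
    col = array("d")
    col.frombytes(buf[HEADER.size:HEADER.size + count * col.itemsize])
    return col


values = [float(i) for i in range(1000)]

# Generic object serialization: each element is rebuilt as a Python float.
pickled = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)
from_pickle = pickle.loads(pickled)

# Packed binary frame: fixed 8 bytes per value plus an 8-byte header.
frame = emit(values)
from_frame = receive(frame)

print(len(pickled), len(frame))
```

The typed buffer is the kind of payload a NumPy-aware consumer could wrap directly, which is why the slide flags the lack of native NumPy interfacing as a cost.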
  28. Spartan
     • http://github.com/spartan-array/spartan
     • Python distributed array expression evaluator (“distributed NumPy”)
     • Developed by Russell Power & others at NYU
     • Uses ZeroMQ and custom RPC implementation
  29. Things I think we should do
     • Create high fidelity data structures for Dremel-style data
     • Get serious about Avro, Parquet, and other new data format standards
     • Invest in the Python-Impala-LLVM relationship
     • Efficient binary protocols to receive and emit data from Python processes
  30. Conclusions
     • Python + PyData stack is as strong as ever, and still gaining momentum
     • The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
     • Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding
  31. Thank you
     Wes McKinney (@wesmckinn)
     wes@cloudera.com
