
High Performance Hadoop with Python - Webinar




Scale Up & Scale Out with Anaconda

Python is the fastest growing Open Data Science language & is used more than 50% of the time to extract value from Big Data in Spark.

However, both PySpark & SparkR involve JVM overhead and Python/Java serialization when interacting with Spark which negatively impacts the time-to-value from your Big Data. What if there was a way to leverage the entire Python ecosystem without refactoring your Hadoop-based data science investments & get high performance?

Anaconda, the leading Open Data Science Platform, delivers high performance Python for Hadoop. You get to leverage your existing Python-based data science investments with your existing Hadoop or HPC clusters. Anaconda bypasses the typical Hadoop performance issues, leverages existing high performance scientific and array-based computing in Python and now leverages Dask, the powerful parallel execution framework, to deliver fast results on any enterprise Hadoop distribution such as Cloudera & Hortonworks.

On April 13th, Dr. Kristopher Overholt & Dr. Matthew Rocklin of Continuum Analytics present a webinar on High Performance Hadoop with Python.

In this webinar, you'll learn to:
- Analyze NYC taxi data through distributed DataFrames on a cluster on HDFS
- Create interactive distributed visualizations of global temperature data
- Distribute in-memory natural language processing & interactive queries on text data in HDFS
- Wrap and parallelize existing legacy code on custom file formats


  1. High Performance Hadoop with Python
  2. Presenter Bio: Kristopher Overholt, Solution Architect, Continuum Analytics
     Kristopher Overholt received his Ph.D. in Civil Engineering from The University of Texas at Austin. Prior to joining Continuum, he worked at the National Institute of Standards and Technology (NIST), Southwest Research Institute (SwRI), and The University of Texas at Austin. Kristopher has 10+ years of experience in areas including applied research, scientific and parallel computing, system administration, open-source software development, and computational modeling.
  3. Presenter Bio: Matthew Rocklin, Computational Scientist, Continuum Analytics
     Matthew Rocklin received his Ph.D. in computer science from the University of Chicago and is currently employed at Continuum Analytics as a computational scientist. He is an active contributor to many open source projects in the PyData ecosystem and is the lead developer of Dask.
  4. Overview
     • Overview of Continuum and Anaconda
     • Overview of Dask (Distributed Processing Framework)
     • Example parallel workflows with Anaconda and Dask
       – Distributed dataframes on a cluster with CSV data
       – Distributed natural language processing with text data
       – Analyzing array-based global temperature data
       – Parallelizing custom code and workflows
     • Using Anaconda with Dask
     • Solutions with Anaconda and Dask
  5. Overview of Continuum and Anaconda
  6. The Platform to Accelerate, Connect & Empower
     Anaconda is the leading open data science platform powered by Python, the fastest growing open data science language.
     Continuum Analytics is the company behind Anaconda and offers:
     • Consulting
     • Training
     • Open-Source Software
     • Enterprise Software
  7. Deep Domain & Python Knowledge
     Founders
     – Travis Oliphant, creator of NumPy and SciPy
     – Peter Wang, creator of Chaco & Bokeh visualization libraries
     Engineers
     – Antoine Pitrou, Python core developer
     – Jeff Reback, Pandas maintainer and core developer
     – Carlos Córdoba, Spyder maintainer and core developer
     – Damián Avila, Chris Colbert, Jupyter core team members
     – Michael Droettboom, Matplotlib maintainer and core developer
  8. Trusted by Industry Leaders
     • Financial Services – Risk mgmt., quant modeling, data exploration and processing, algorithmic trading, compliance reporting
     • Government – Fraud detection, data crawling, web & cyber data analytics, statistical modeling
     • Healthcare & Life Sciences – Genomics data processing, cancer research, natural language processing for health data science
     • High Tech – Customer behavior, recommendations, ad bidding, retargeting, social media analytics
     • Retail & CPG – Engineering simulation, supply chain modeling, scientific analysis
     • Oil & Gas – Pipeline monitoring, noise logging, seismic data processing, geophysics
  9. Anaconda: Leading Open Data Science Platform powered by Python
     Quickly Engage w/ Your Data
     • 720+ Popular Packages
     • Optimized & Compiled
     • Free for Everyone
     • Extensible via conda Package Manager
     • Sandbox Packages & Libraries
     • Cross-Platform - Windows, Linux, Mac
     • Not just Python - over 230 R packages
     • Foundation of our Enterprise Products
  10. Anaconda: Accelerating Adoption of Python for Enterprises
     • Jupyter/IPython: COLLABORATIVE NOTEBOOKS with publication, authentication, & search
     • Spark: PYTHON & PACKAGE MANAGEMENT for Hadoop & Apache stack
     • Numba: PERFORMANCE with compiled Python for lightning fast execution
     • Bokeh: VISUAL APPS for interactivity, streaming, & Big
     • Conda: SECURE & ROBUST REPOSITORY of data science libraries, scripts, & notebooks
     • NumPy & Pandas: ENTERPRISE DATA INTEGRATION with optimized connectors & out-of-core processing
     • Dask: PARALLEL COMPUTING scaling up Python analytics on your cluster for interactivity and streaming data
  11. Bottom Line: 10-100X faster performance
     • Interact with data in HDFS and Amazon S3 natively from Python
     • Distributed computations without the JVM & Python/Java serialization
     • Framework for easy, flexible parallelism using directed acyclic graphs (DAGs)
     • Interactive, distributed computing with in-memory persistence/caching
     Bottom Line
     • Leverage Python & R with Spark
     [Diagram labels: YARN, JVM, HDFS, MPI, Ibis, Impala, PySpark & SparkR, Python & R ecosystem, Batch Processing, Interactive Processing, High Performance, Native read & write, NumPy, Pandas, … 720+ packages]
  12. Overview of Dask as a Distributed Processing Framework
  13. Overview of Dask
     Dask is a Python parallel computing library that is:
     • Familiar: Implements parallel NumPy and Pandas objects
     • Fast: Optimized for demanding numerical applications
     • Flexible: for sophisticated and messy algorithms
     • Scales up: Runs resiliently on clusters of 100s of machines
     • Scales down: Pragmatic in a single process on a laptop
     • Interactive: Responsive and fast for interactive data science
     Dask complements the rest of Anaconda. It was developed with NumPy, Pandas, and scikit-learn developers.
  14. Spectrum of Parallelization
     From explicit control (fast but hard) to implicit control (restrictive but easy):
     Threads, Processes, MPI, ZeroMQ, Dask, Hadoop, Spark, SQL (Hive, Pig, Impala)
  15. Dask: From User Interaction to Execution
  16. Dask Collections: Familiar Expressions and API
     • Dask array (mimics NumPy): x.T - x.mean(axis=0)
     • Dask dataframe (mimics Pandas): df.groupby(df.index).value.mean()
     • Dask bag (collection of data): b.map(json.loads).foldby(...)
     • Dask imperative (wraps custom code): def load(filename): def clean(data): def analyze(result):
  17. Dask Graphs: Example Machine Learning Pipeline
  18. Dask Graphs: Example Machine Learning Pipeline + Grid Search
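
The graph slides above show pipelines as directed acyclic graphs of tasks. As a rough illustration of that idea, the sketch below stores a graph as a plain dictionary whose values are either data or tuples of the form (function, argument, ...), in the spirit of the dictionary-and-tuple convention that Dask documents for its graphs. The `get` function here is a hypothetical serial evaluator written for this example only; it is not Dask's scheduler and does nothing in parallel.

```python
from operator import add, mul


def get(graph, key):
    """Evaluate `key` by recursively resolving the tasks it depends on.

    Toy, serial evaluator for illustration only (not Dask's scheduler).
    """
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that name other keys in the graph are computed first.
        resolved = [
            get(graph, arg) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        return func(*resolved)
    return task  # plain data, nothing to compute


# A small graph: "sum" depends on "x" and "y", "scaled" depends on "sum".
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "scaled": (mul, "sum", 10),
}

result = get(graph, "scaled")
print(result)
```

Because each task only names its inputs, a real scheduler is free to run independent tasks at the same time, which is what makes this representation convenient for parallel execution.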
  19. Dask Schedulers: Example - Distributed Scheduler
     [Diagram: a Client on the user machine (laptop) and a Client on the same network connect to a Scheduler, which coordinates several Workers]
  20. Example Parallel Workflows with Anaconda and Dask
  21. Examples
     1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames
        • Demonstrate Pandas at scale
        • Observe responsive user interface
     2. Distributed language processing with text data using Dask Bags
        • Explore data using a distributed memory cluster
        • Interactively query data using libraries from Anaconda
     3. Analyzing global temperature data using Dask Arrays
        • Visualize complex algorithms
        • Learn about dask collections and tasks
     4. Handle custom code and workflows using Dask Imperative
        • Deal with messy situations
        • Learn about scheduling
  22. Example 1: Using Dask DataFrames on a cluster with CSV data
     • Built from Pandas DataFrames
     • Match Pandas interface
     • Access data from HDFS, S3, local, etc.
     • Fast, low latency
     • Responsive user interface
     [Diagram: one Pandas DataFrame per month (January, 2016 through May, 2016) stacked to form a single Dask DataFrame]
  23. Example 2: Using Dask Bags on a cluster with text data
     • Distributed natural language processing with text data stored in HDFS
     • Handles standard computations
     • Looks like other parallel frameworks (Spark, Hive, etc.)
     • Access data from HDFS, S3, local, etc.
     • Handles the common case
     [Diagram: a function applied to each block of data, with the partial results merged into one result]
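
The `b.map(json.loads).foldby(...)` pattern from the collections slide can be sketched as follows. The records are invented for illustration, and the example uses the threaded scheduler so that it stays in a single process; on a cluster the same expression would be handed to the distributed scheduler instead.

```python
import json
import dask.bag as db

# Hypothetical JSON lines standing in for text records stored in HDFS.
lines = [
    '{"user": "a", "words": 3}',
    '{"user": "b", "words": 5}',
    '{"user": "a", "words": 4}',
    '{"user": "b", "words": 1}',
]

bag = db.from_sequence(lines, npartitions=2)
records = bag.map(json.loads)

# foldby groups by key and reduces within each partition (binop),
# then merges the per-partition results (combine).
totals = records.foldby(
    key=lambda rec: rec["user"],
    binop=lambda total, rec: total + rec["words"],
    initial=0,
    combine=lambda left, right: left + right,
    combine_initial=0,
)

result = dict(totals.compute(scheduler="threads"))
print(result)
```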
  24. Example 3: Using Dask Arrays with global temperature data
     • Built from NumPy n-dimensional arrays
     • Matches NumPy interface (subset)
     • Solve medium-large problems
     • Complex algorithms
     [Diagram: many NumPy arrays tiled together to form one Dask Array]
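
To make the "built from NumPy arrays, matches the NumPy interface" point concrete, here is a small sketch that chunks an in-memory array and subtracts the mean along the first axis. The array is synthetic and its (time, latitude, longitude) layout is an assumption made for this example, not the webinar's temperature dataset.

```python
import numpy as np
import dask.array as da

# Synthetic stand-in for a temperature cube with axes (time, lat, lon).
temps = np.arange(4 * 6 * 8, dtype="float64").reshape(4, 6, 8)

# Wrap it as a Dask array made of 2 x 3 x 4 NumPy blocks.
x = da.from_array(temps, chunks=(2, 3, 4))

# NumPy-style expression: deviation of each time step from the time mean.
anomaly = x - x.mean(axis=0)

result = anomaly.compute()  # a regular NumPy array
print(result.shape)
```

The chunk shape controls how much work each task does; larger chunks mean fewer tasks but more memory per task.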
  25. Example 4: Using Dask Delayed to handle custom workflows
     • Manually handle functions to support messy situations
     • Life saver when collections aren't flexible enough
     • Combine futures with collections for best of both worlds
     • Scheduler provides resilient and elastic execution
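
The slides refer to this interface both as "Dask Imperative" and "Dask Delayed"; the sketch below uses `dask.delayed`. The `load`, `clean`, and `analyze` function names come from the collections slide, but their bodies here are hypothetical placeholders for real custom code that reads a custom file format.

```python
from dask import delayed


def load(filename):
    # Placeholder loader: fabricates a few numbers instead of reading a file.
    return [len(filename), -1, 2]


def clean(data):
    # Drop invalid (negative) values.
    return [value for value in data if value >= 0]


def analyze(cleaned_parts):
    # Combine the cleaned pieces into one summary number.
    return sum(sum(part) for part in cleaned_parts)


filenames = ["a.dat", "bb.dat", "ccc.dat"]  # hypothetical file names

# Wrapping the calls builds a task graph instead of running them immediately.
cleaned = [delayed(clean)(delayed(load)(name)) for name in filenames]
total = delayed(analyze)(cleaned)

result = total.compute()  # load/clean for each file can run in parallel
print(result)
```

Only the call sites change; the functions themselves stay ordinary Python, which is why this approach suits existing legacy code.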
  26. Precursors to Parallelism
     • Consider the following approaches first:
        1. Use better algorithms
        2. Try Numba or C/Cython
        3. Store data in efficient formats
        4. Subsample your data
     • If you have to parallelize:
        1. Start with your laptop (4 cores, 16 GB RAM, 1 TB disk)
        2. Then a large workstation (24 cores, 1 TB RAM)
        3. Finally, scale out to a cluster
  27. Using Anaconda with Dask
  28. Cluster Architecture Diagram
     [Diagram: a Client Machine connected to a Head Node and three Compute Nodes]
  29. Using Anaconda and Dask on your Cluster
     • Single machine with multiple threads or processes
     • On a cluster with SSH (dcluster)
     • Resource management: YARN (knit), SGE, Slurm
     • On the cloud with Amazon EC2 (dec2)
     • On a cluster with Anaconda for cluster management
     • Manage multiple conda environments and packages on bare-metal or cloud-based clusters
  30. Anaconda for Cluster Management
     • Dynamically manage Python and conda environments across a cluster
     • Works with enterprise Hadoop distributions and HPC clusters
     • Integrates with on-premises Anaconda repository
     • Cluster management features are available with Anaconda subscriptions
  31. Cluster Deployment & Operations
     Before Anaconda for cluster management
     • Head Node: 1. Manually install Python, packages & dependencies 2. Manually install R, packages & dependencies
     • Compute Nodes: 1. Manually install Python, packages & dependencies 2. Manually install R, packages & dependencies
     After Anaconda for cluster management
     • Head Node and Compute Nodes: Easily install conda environments and packages (including Python and R) across cluster nodes
     Bottom Line
     • Empower IT with scalable and supported Anaconda deployments
     • Fast, secure and scalable Python & R package management on tens or thousands of nodes
     • Backed by an enterprise configuration management system
     • Scalable Anaconda deployments tested in enterprise Hadoop and HPC environments
  32. Using Dask and Anaconda Enterprise on your Cluster
     [Diagram: Analyst Machine, Admin, and Anaconda Repository alongside a Hadoop or HPC Cluster (Edge Node and Compute Nodes) running Dask and Anaconda]
  33. Using Dask and Anaconda Enterprise on your Cluster
     1. Analyst ships packages and environments to on-premises repository
  34. Using Dask and Anaconda Enterprise on your Cluster
     2. Admin deploys conda packages and environments to cluster nodes
  35. Using Dask and Anaconda Enterprise on your Cluster
     3. Analyst submits distributed jobs and utilizes Anaconda on the cluster
  36. Service Architecture Diagram
     [Diagram: services placed on the Head Node, Secondary Head Node, Edge Nodes, and Compute Nodes. Legend: Anaconda Cluster Head (ACH), Anaconda Cluster Compute (ACC), Dask Scheduler (DS), Dask Worker (DW), Anaconda Repository, Jupyter Notebook, Hadoop Manager, Zookeeper Server, Impala Daemon, History Server (Spark), Spark Gateway, Resource Manager (YARN), NodeManager, Yarn Gateway, NameNode (HDFS), Secondary NameNode, DataNode, HttpFS, Hue, Hive Metastore, HiveServer2, WebHCat Server, Gateway, Other Services]
  37. Remote Conda Commands
     • Install packages: acluster conda install numpy scipy pandas numba
     • Create environment: acluster conda create -n py34 python=3.4 numpy scipy pandas
     • List packages: acluster conda list
     • Conda information: acluster conda info
     • Push environment: acluster conda push environment.yml
  38. Cluster Management Commands
     • Create cluster: acluster create dask-cluster -p dask-profile
     • List active clusters: acluster list
     • Install plugins: acluster install notebook distributed
     • SSH to nodes: acluster ssh
     • Put/get files: acluster put data.hdf5 /home/ubuntu/data.hdf5
     • Run command: acluster 'cmd apt-get install ...'
  39. Solutions with Anaconda and Dask
  40. Working with Anaconda and Dask
     • Open source foundational components
       – Dask, Distributed scheduler, HDFS reading/writing, YARN interoperability, S3 integration, EC2 provisioning
     • Enterprise products / subscriptions
       – Anaconda Workgroup and Anaconda Enterprise
       – Package management on Hadoop and HPC clusters
       – Integration with on-premises repository
       – Provisioning and managing Dask workers on a cluster
  41. [Diagram: Anaconda software stack, top to bottom]
     • Application: Jupyter/IPython Notebook
     • Analytics: pandas, NumPy, SciPy, Numba, NLTK, scikit-learn, scikit-image, and more from Anaconda …
     • Parallel Computation: Dask, Spark, Hive / Impala
     • Data and Resource Management: HDFS, YARN, SGE, Slurm or other distributed systems
     • Server: Bare-metal or Cloud-based Cluster
  42. Recent Work using Dask
     • DISTRIBUTED: Distributed + Remote Query & Computation. Interactive querying, exploration and browser visualization
     • DATA SHADING: Interactive Big Data Visualization in Browser. Automatic & flexible visualization of billions of points in real-time
     • INSIDE HADOOP: Distributed High-Performance Analytics. High Performance with Anaconda, including Dask. Use all available cores/GPUs for distributed & threaded analysis
  43. Use Cases with Anaconda and Dask
     • Runs on a single machine or 100s of nodes
     • Works on cloud-based or bare-metal clusters
     • Works with enterprise Hadoop distributions and HPC environments
     • Develop workflows with text processing, statistics, machine learning, image processing, etc.
     • Works with data in various formats and storage solutions
  44. Solutions with Anaconda and Dask
     • Architecture consulting and review
     • Manage Python packages and environments on a cluster
     • Develop custom package management solutions on existing clusters
     • Migrate and parallelize existing code with Python and Dask
     • Architect parallel workflows and data pipelines with Dask
     • Build proof of concepts and interactive applications with Dask
     • Custom product/OSS core development
     • Training on parallel development with Dask
  45. Anaconda Subscriptions
  46. Additional Resources
  47. Test-Drive Anaconda and Dask on your Cluster
     1. Register for an Anaconda Cloud account at Anaconda.org
        $ conda install anaconda-client
        $ anaconda login
     2. Download Anaconda for cluster management using Conda
        $ conda install anaconda-cluster -c anaconda-cluster
     3. Create a sandbox/demo cluster
        $ acluster create cluster-dask -p cluster-dask
     4. Install Dask and the distributed scheduler
        $ acluster install distributed
  48. Contact Information and Additional Details
     • Contact sales@continuum.io for information about Anaconda subscriptions, consulting, or training and support@continuum.io for product support
     • More information about Anaconda Subscriptions: continuum.io/anaconda-subscriptions
     • View Dask documentation and additional examples at dask.pydata.org
  49. Thank you
     • Kristopher Overholt, Twitter: @koverholt
     • Matthew Rocklin, Twitter: @mrocklin
     • Email: sales@continuum.io, Twitter: @ContinuumIO