Part 2: A Visual Dive into Machine Learning and Deep Learning 


3 Things to Learn About:

* An introduction to machine learning and deep learning
* Common practices and tools
* An introduction to a new tool from Cloudera

Published in: Software


  1. A Visual Dive into Machine Learning and Deep Learning. Vartika Singh, Solutions Architect, Cloudera; Sean Anderson, Product Marketing, Cloudera. © Cloudera, Inc. All rights reserved.
  2. Age of Machine Learning. [Chart labels: Cost of compute; Data volume; Time (1950s to 2010s); regions "Machine Learning" and "No Machine Learning".]
  3. The Enterprise Platform for Machine Learning. The data is now here: 30B connected devices, 440x more data. Cloudera first to integrate Spark. Modern platform for machine learning and advanced analytics. Leading adoption among enterprises: 400 customers run Spark on
  4. Machine Learning and Deep Learning are part of AI
  5. Apache Spark: fast and flexible general-purpose data processing for Hadoop. Data engineering; stream processing; data science and machine learning. Unified API and processing engine for large-scale data.
  6. Spark from Cloudera. 57% have adopted Cloudera Spark for their most important use case, vs. 26% Hortonworks, 22% an Apache download, and 7% Databricks. 48% of respondents said they most commonly use Spark with HBase, and 41% said they use Spark with Kafka. Source: Taneja Group, Apache Spark Market Survey 2016, http://tanejagroup.com/profiles-reports/request/apache-spark-market-survey-cloudera-sponsored-research#.WCCdPC0rK70
  7. Spark Use Cases. Top use cases: data processing (55%), real-time stream processing (44%), exploratory data science (33%), and machine learning (33%). 3 out of 8 are employing Spark in data science research.
  8. Apache Spark
     Apache Spark is at the core of our data science experience
     • Libraries for common machine learning
     • Trusted in production by our customers
     • Delivered with expert support and training
     • A requirement for our Data Science Workbench
     Apache Spark is a huge driver for machine learning
     • Native language development tools
     • Reliable operation at big data scale
     • Native access to Hadoop data for testing and training
     Spark 2.1 is here
     • Separate parcel for easy implementation of multiple Spark instances
     • Better streaming performance
     • Machine learning persistence
  9. Solving Data Science is a Full-Stack Problem (requirement: ✓ component)
     • Leverage Big Data: ✓ Hadoop
     • Enable real-time use cases: ✓ Kafka, Spark Streaming
     • Provide sufficient toolset for the Data Analysts: ✓ Spark, Hive, Hue
     • Provide sufficient toolset for the Data Scientists + Data Engineers: ✓ Data Science Workbench (beta)
     • Provide standard data governance capabilities: ✓ Navigator + Partners
     • Provide standard security across the stack: ✓ Kerberos, Sentry, Record Service, KMS/KTS
     • Provide flexible deployment options: ✓ Cloudera Director
     • Integrate with partner tools: ✓ Rich Ecosystem
     • Provide management tools that make it easy for IT to deploy/maintain: ✓ Cloudera Manager/Director
  10. Data Science Workbench (Beta): self-service data science for the enterprise
  11. Open data science in the enterprise. IT: drive adoption while maintaining compliance. Data scientists: explore, experiment, iterate.
  12. Introducing Cloudera Data Science Workbench: self-service data science for the enterprise. Accelerates data science from development to production with:
      • Secure self-service environments for data scientists to work against Cloudera clusters
      • Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
      • Workflow automation, version control, collaboration and sharing
  13. The importance of an open ecosystem. [Diagram labels: Open Ecosystem; Black Box.]
  14. Deep Learning Frameworks in CDH and CDSW (Vartika's section)
  16. Machine Learning on Hadoop [workflow diagram labels]
      • Raw data: many sources, many formats, varying validity
      • Data engineering: cleaning, merging, filtering
      • Well-formatted data; training, validation, and test data
      • Data science: model building, model training, hyper-parameter tuning
      • Validated ML models
      • Data engineering: pipeline execution, production operation
      • End user: consumption for analysis
      • Ongoing data ingestion
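
The hand-off from data engineering to data science on this slide can be sketched in a few lines. This is a minimal standard-library illustration, not the presenters' code: the record layout, the `clean` and `split_dataset` helpers, and the 60/20/20 split are all assumptions made for the example. On a cluster the same steps would more likely run in Spark.

```python
import random

def clean(rows):
    """Data-engineering step (hypothetical): drop records with missing fields."""
    return [r for r in rows if all(v is not None for v in r.values())]

def split_dataset(rows, train_pct=60, val_pct=20, seed=42):
    """Shuffle rows and split them into training, validation, and test sets.

    Percentages are integers so the split sizes use exact integer arithmetic;
    whatever is left after training and validation becomes the test set.
    """
    if train_pct <= 0 or val_pct < 0 or train_pct + val_pct >= 100:
        raise ValueError("train_pct and val_pct must leave room for a test set")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Raw data of "varying validity": every tenth record has a missing value.
raw = [{"id": i, "value": None if i % 10 == 0 else i * 0.5} for i in range(100)]
valid = clean(raw)
train, val, test = split_dataset(valid)
```

Keeping the test set untouched until the end is what lets the "validated ML models" step on the slide mean something: hyper-parameters are tuned against the validation set only.
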
  17. Random Forest Classifier Demo
      • GitHub
      • Run simple Spark code
      • Show results
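
The demo's Spark code is not part of this transcript. As a stand-in, the sketch below shows the idea behind a random forest in plain Python: train many small trees, each on a bootstrap sample and a randomly chosen feature, then take a majority vote. It assumes one-split trees (stumps) and a tiny made-up dataset, so it is a toy illustration of the technique rather than the Spark implementation used in the demo.

```python
import random
from collections import Counter

def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, feature):
    """Fit a one-split tree on a single feature, minimizing training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = majority(left)  # never empty: threshold is an observed value
        right_label = majority(right) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature choice
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Classify a row by majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

# Made-up two-feature dataset: class 0 clusters low, class 1 clusters high.
rows = [(1.0, 1.2), (1.5, 0.8), (0.7, 1.9), (2.0, 1.1),
        (5.0, 5.5), (4.6, 6.1), (5.8, 4.9), (6.2, 5.2)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
low_prediction = predict(forest, (0.5, 0.5))
high_prediction = predict(forest, (6.5, 6.5))
```

A production forest grows deeper trees and samples a feature subset at every split; the bootstrap-plus-vote structure is the part this sketch keeps.
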
  18. Big Data and Deep Learning
      • Traditional machine learning and feature engineering algorithms are not efficient enough to extract the complex, nonlinear patterns that are hallmarks of big data.
      • Deep learning, on the other hand, addresses this central problem via representation learning: it introduces representations that are expressed in terms of other, simpler representations.
      • Many approachable, easy-to-use deep learning frameworks are available.
      • Compute power
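
A classic small case of "representations expressed in terms of simpler representations" is XOR. The sketch below is an illustration added here, not material from the slides: it hand-sets the weights of a two-layer network whose hidden units compute OR and AND, and whose output combines them. It also searches a coarse grid of weights (an arbitrary choice for the example) to show that no single threshold unit on that grid reproduces XOR.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(x):
    """Threshold activation: fire only for strictly positive input."""
    return 1 if x > 0 else 0

def unit(weights, bias, inputs):
    """A single linear threshold unit."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_network(a, b):
    """Two-layer network with hand-set weights (no training involved)."""
    h_or = unit((1, 1), -0.5, (a, b))    # simpler representation: a OR b
    h_and = unit((1, 1), -1.5, (a, b))   # simpler representation: a AND b
    return unit((1, -1), -0.5, (h_or, h_and))  # OR but not AND

def single_unit_fits_xor(grid):
    """Check whether any single unit with weights from the grid computes XOR."""
    return any(
        all(unit((w1, w2), b, x) == y for x, y in XOR.items())
        for w1, w2, b in product(grid, repeat=3)
    )

grid = [i / 2 for i in range(-8, 9)]  # weights and bias from -4.0 to 4.0
layered_ok = all(xor_network(a, b) == y for (a, b), y in XOR.items())
single_ok = single_unit_fits_xor(grid)
```

The grid search is only a demonstration; the general result that XOR is not linearly separable holds for any weights, which is why the intermediate OR and AND features are needed.
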
  19. [Architecture diagram labels] Security, Lineage and Governance; Ingestion: Flume/Sqoop/Kafka; Analytics: Hive/Impala/Spark/Search; ML: spark.mllib; Deep Learning Frameworks; HDFS; Cloudera Manager
  20. Where does CDSW fit in? Visualize results; change and compile source code; retrain and redeploy; extensible engines; configurable sessions; trivial to tweak parameters; multiple users; roles/governance; CDH
  21. [Architecture diagram labels] Security, Lineage and Governance; Ingestion: Flume/Sqoop/Kafka; Analytics: Hive/Impala/Spark/Search; ML: spark.mllib; Deep Learning Frameworks; HDFS; CDSW with GW roles for HDFS/Spark and YARN, and optionally more; Session A, Session B, Session N; Cloudera Manager
  22. Compute power: how does it help?
      • Deep Neural Networks (DNNs) require large amounts of computation
      • The computation is inherently parallel in nature
      • It translates naturally to efficient computation on GPUs
      • Accelerated primitive libraries (cuDNN and MKL) further boost performance
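
To see why this computation is "inherently parallel", consider one dense layer: each output value is a dot product of one weight row with the same input vector, so no output depends on another. The sketch below is an assumed illustration using a thread pool from the Python standard library. It demonstrates the independence of the work items only; it is not a performance claim, and real speedups come from GPUs and libraries such as those named on the slide.

```python
from concurrent.futures import ThreadPoolExecutor

def row_dot(weights_row, inputs):
    """One output neuron: dot product of a weight row with the input vector."""
    return sum(w * x for w, x in zip(weights_row, inputs))

def dense_layer_serial(weights, inputs):
    """Compute every output one after another."""
    return [row_dot(row, inputs) for row in weights]

def dense_layer_parallel(weights, inputs, workers=4):
    """Compute every output as an independent task; map() preserves row order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: row_dot(row, inputs), weights))

# Small made-up layer: 4 outputs, 3 inputs.
weights = [[1, 2, 3],
           [0, -1, 4],
           [2, 2, 2],
           [5, 0, -3]]
inputs = [1, 0, 2]

serial_out = dense_layer_serial(weights, inputs)
parallel_out = dense_layer_parallel(weights, inputs)
```

Because the tasks share no state, the parallel version needs no locking and returns the same values as the serial one.
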
  23. Dataset: MNIST
  24. TensorFlowOnSpark
      • In sequence, Google released TensorFlow, then enhanced distributed deep learning capabilities in TensorFlow, and then HDFS support.
      • Supports direct tensor communication between processes.
      • Scales easily by adding more machines.
      • TensorFlow ingests data using QueueRunners or feed_dict; it does not leverage Spark for data ingestion.
  25. TensorFlowOnSpark on CDSW
      • Python environment
      • Protobuf
      • Shaded libraries
      • Changes to the code due to the Spark driver and executor gap
      • Show environment variables
      • Run the four programs one by one
  26. CaffeOnSpark
      • Caffe is a deep learning framework from the Berkeley Vision Lab, implemented in C++, where models and optimizations are defined as plaintext schemas instead of code. It has a command-line interface as well as a Python interface and has been widely adopted, especially for vision-related tasks.
      • Yahoo released a Spark interface for Caffe, which lets you run the DNN model within the same cluster where your ingested data and other analytical frameworks reside, conforming to company-wide security and governance policies.
  27. CaffeOnSpark on CDSW
      • Extensible engines
      • Disk mounts
      • Terminal
  28. What is BigDL? GitHub: github.com/intel-analytics/BigDL, http://software.intel.com/ai
      • Open source deep learning framework for Apache Spark*
      • Easy customer and developer experience
      • High performance and efficient scale-out leveraging the Spark architecture
      • Feature parity with Caffe, Torch, etc.
  29. BigDL on CDSW
      • Show how easy it is
      • (Note: do not compile in the environment; do show Maven, though)
      • Run!
  30. Key Benefits: how is Cloudera Data Science different?
      • Works with fully secured clusters
      • One tool for multiple languages (Python, R, Scala)
      • Multi-tenant architecture
      • Common platform
  31. Don't Forget Training
      Apache Spark Developer Training
      • For: Application Architects, Data Engineers
      • Cloudera University's three-day Spark course enables participants to build complete, unified big data applications.
      Data Science with Spark and Hadoop
      • For: Data Scientists, Data Engineers
      • Spark and Hadoop are transforming how data scientists work by allowing interactive and iterative data analysis at scale.
      Introduction to Machine Learning
      • For: Data Scientists
      • Includes coverage of collaborative filtering, clustering, classification, algorithms, and data volume.
  32. Thank You. This presentation will be available on demand.
