Analytics in the cloud

Today, data is everywhere. As more data streams into cloud-based systems, the combination of data and computing resources gives us an unprecedented opportunity to perform sophisticated data analysis and to explore advanced machine learning methods such as deep learning.

Clouds pack very large amounts of computing and storage resources, which can be dynamically allocated to create powerful analytical environments. By accessing those analytics clusters, data analysts and data scientists can evaluate more hypotheses and scenarios in parallel, quickly and cost-effectively.

The number of analytical tools supported on the various clouds is increasing by the day. The list spans from traditional RDBMS databases provided by vendors to open source analytics projects such as Hadoop, Hive, Spark, and H2O. Besides provisioning tools and solutions on the cloud, managed services for Data Science, Big Data, and Analytics are becoming a popular offering of many clouds.

Analytics in the cloud provides whole new ways for data analysts, data scientists, and business developers to interact with each other, share data and experiments, and develop relevant insights towards improved business processes and results. In this talk, I will describe a number of data analytics solutions for the cloud and how they can be added to your current cloud and on-premise landscape.

Published in: Data & Analytics

Analytics in the cloud

  1. Analytics in the Cloud. Natalino Busa, Head of Data Science
  2. Natalino Busa (@natbusa), Head of Applied Data Science at Teradata. Distributed computing, machine learning, statistics, big/fast data, streaming computing. On most networks: @natbusa
  3. Let’s define Cloud Services
  4. Analytics in the cloud: stacking layers. Bare Metal: Physical Machines
  5. Analytics in the cloud: stacking layers. Bare Metal: Physical Machines; IaaS: Virtual Resources
  6. Analytics in the cloud: stacking layers. Bare Metal: Physical Machines; IaaS: Virtual Resources; CaaS: Containers
  7. Analytics in the cloud: stacking layers. Bare Metal: Physical Machines; IaaS: Virtual Resources; CaaS: Containers; dPaaS: Datastores, Data Engines; iPaaS: Tools Integration, Flows & Processes
  8. Analytics in the cloud: stacking layers. Bare Metal: Physical Machines; IaaS: Virtual Resources; CaaS: Containers; dPaaS: Datastores, Data Engines; iPaaS: Tools Integration, Flows & Processes; DAaaS: Data Analytics as a Service (Watson Services, Azure ML, Google Cloud ML, BigML)
  9. Analytics in the cloud: today’s talk. Bare Metal: Physical Machines; IaaS: Virtual Resources; CaaS: Containers; dPaaS: Datastores, Data Engines; iPaaS: Tools Integration, Flows & Processes; DAaaS: Data Analytics as a Service
  10. “We live in an age of open source datacenters, so we can stack all these things together and we have open source from the ground to ceiling.” Sam Ramji, CEO of Cloud Foundry. https://www.youtube.com/watch?v=7oCSFcUW-Qk
  11. Containers vs VMs
  12. Techs based on Containers: YARN
  13. Containers as a Service. For example: Amazon ECS, https://aws.amazon.com/ecs/
  14. CaaS: 6 offerings. Project Magnum, Amazon ECS, Docker DataCenter, Google Container Engine. https://www.linux.com/news/5-container-service-tools-you-should-know-about
  15. Most new PaaS solutions are containerized
  16. PaaS: Big Data. SQL Queries, Batch Oriented, Large Aggregations; Interactive Queries, Data Exploration; Interactive Queries, Machine Learning, Streaming: Micro-batching; Interactive Queries, Machine Learning, Streaming: Event-driven
  17. Advanced Analytics: models and algorithms
  18. PaaS: Advanced Analytics. Graph analytics: cluster items, extract similarities, detect patterns
  19. PaaS: Advanced Analytics. Text analytics: sentiment analysis, language detection, summarization, entity extraction
  20. PaaS: Advanced Analytics. Machine Learning: classification, regression, clustering, forecasting, anomaly detection
  21. PaaS: Advanced Analytics. AI and Deep Learning: unstructured data, object detection, natural language processing, video summarization, speech recognition
  22. PaaS: Advanced Analytics. SQL + Graph + Text + Machine Learning + Voice/Image/Video
  23. dPaaS: Machine (deep) Learning … these are just a few examples ...
  24. Analytics Everywhere: Public Cloud, Managed Cloud, Private Cloud, Private Infra
  25. iPaaS: Components for Analytics in the Cloud. SQL (Big Data, Data Warehousing); NoSQL; Machine Learning; Object Stores; Streaming Computing; SQL (Relational Transactional DB)
  26. iPaaS, dPaaS. Object Stores: HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis, S3 (AWS), Storage (GCP), ...
  27. iPaaS, dPaaS. Object Stores: HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis, S3 (AWS), Storage (GCP), ... NoSQL: Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase, BigTable (GCP), DynamoDB
  28. iPaaS, dPaaS. Object Stores: HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis, S3 (AWS), Storage (GCP), ... NoSQL: Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase, BigTable (GCP), DynamoDB. SQL (Relational Transactional DB): MySQL, PostgreSQL, MariaDB, Oracle (AWS MP)
  29. iPaaS, dPaaS. Object Stores: HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis, S3 (AWS), Storage (GCP), ... NoSQL: Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase, BigTable (GCP), DynamoDB. SQL (Relational Transactional DB): MySQL, PostgreSQL, MariaDB, Oracle (AWS MP). SQL (Big Data, Data Warehousing): Hive, Presto, Spark SQL, Impala, Redshift (AWS), BigQuery (GCP), Big SQL (IBM), Teradata (AWS MP), SAP Hana (AWS MP), Vertica (AWS MP)
  30. iPaaS, dPaaS. Object Stores: HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis, S3 (AWS), Storage (GCP), ... NoSQL: Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase, BigTable (GCP), DynamoDB. SQL (Relational Transactional DB): MySQL, PostgreSQL, MariaDB, Oracle (AWS MP). SQL (Big Data, Data Warehousing): Hive, Presto, Spark SQL, Impala, Redshift (AWS), BigQuery (GCP), Big SQL (IBM), Teradata (AWS MP), SAP Hana (AWS MP), Vertica (AWS MP). Machine Learning: Spark ML, H2O, Flink, Aerosolve, Theano, TensorFlow, XGBoost, Azure ML, AWS ML, Google ML, IBM Watson
  31. iPaaS, dPaaS. Object Stores: HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis, S3 (AWS), Storage (GCP), ... NoSQL: Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase, BigTable (GCP), DynamoDB. SQL (Relational Transactional DB): MySQL, PostgreSQL, MariaDB, Oracle (AWS MP). SQL (Big Data, Data Warehousing): Hive, Presto, Spark SQL, Impala, Redshift (AWS), BigQuery (GCP), Big SQL (IBM), Teradata (AWS MP), SAP Hana (AWS MP), Vertica (AWS MP). Machine Learning: Spark ML, H2O, Flink, Aerosolve, Theano, TensorFlow, XGBoost, Azure ML, AWS ML, Google ML, IBM Watson. Streaming Computing: Heron (Storm), NiFi, Spark Streaming, Flink, Kafka Streams, Logstash, StreamSQL, Google DataFlow (GCP)
  32. iPaaS: Selecting your Analytical Stack. Flexible, powerful: combinations for this example: 8 * 3 * 4 * 8 * 7 * 7 = 37632. Right tool for the right job: fit for purpose, multi-genre analytics. Hard to maintain and upgrade: extended skills and know-how, component upgrades must be compatible. Hard to configure: no matter whether cloud, bare metal, or VMs; complex stacks with many tools and services.
  33. iPaaS: Deploy & Manage your own Analytics. How to simplify? Select a bundle!
  34. iPaaS: bundled recipes & stacks. Select a recipe: Hortonworks Data Platform, Cloudera Data Platform, Reactive Platform, SMACK Stack, PANCAKE Stack, ELK Stack, or select your own
  35. iPaaS: my favorite analytical stacks (columns: Object Stores | NoSQL | SQL: Big Data, Data Warehousing | Machine Learning | Streaming Computing). All Hadoop (5): HDFS | HBase | Hive | Spark | Storm. SMACK stack (2): Cassandra | Cassandra | Spark | Spark | Spark. Elastic (5): HDFS | ElasticSearch | Hive | H2O | Kafka. Data Science (8): HDFS | ElasticSearch | Hive, Presto | Spark, H2O, TensorFlow | Flink. Real Time (2): Cassandra | Cassandra | Flink | Flink | Flink.
  36. dPaaS: Managed Analytics. This is hard! Can we access it as a service?
  37. dPaaS: Managed Hadoop & Spark. HDInsight: Hadoop, Spark, and R as services. Managed Spark Clusters, BigInsight (Hadoop). DataFlow and DataProc: Flink, Spark and Hadoop Clusters as a Service. EMR: Hadoop components à la carte
  38. PaaS: Analytical clusters. Ephemeral: create then dispose; clusters are short-lived; data exploration; isolated, personal; simple access management; interactive analytics. Versus permanent: clusters are long-lived; scheduled operations; production ETL; coordinated; complex access management; batch analytics.
  39. DAaaS: Microsoft’s Cortana and ML Studio
  40. DAaaS: IBM Watson
  41. DAaaS: Google ML and AI as a service. Cloud Computing for Deep Neural Networks > Train, Score, Data. AI and ML models for: speech (audio), language (text), vision (images/video)
  42. Summary • Analytics in the Cloud: the dawn of a new computing era • iPaaS, dPaaS: complexity vs flexibility, it’s a tradeoff • Computing clusters: ephemeral and persistent
  43. Natalino Busa, Head of Applied Data Science at Teradata. Distributed computing, machine learning, statistics, big/fast data, streaming computing. LinkedIn and Twitter: natbusa
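
The arithmetic on slides 32 and 35 can be checked with a short Python sketch. It rests on two assumptions inferred from the slide text rather than stated by the author: that the six factors on slide 32 are the number of candidate tools per layer, and that the number in parentheses next to each stack on slide 35 is the count of distinct tools in that stack. All function and variable names are illustrative.

```python
from math import prod

# Factors quoted on slide 32 (assumed: candidate tools per layer).
OPTIONS_PER_LAYER = [8, 3, 4, 8, 7, 7]

# Stacks from slide 35, one inner list per column:
# object store, NoSQL, big-data SQL, machine learning, streaming.
STACKS = {
    "All Hadoop": [["HDFS"], ["HBase"], ["Hive"], ["Spark"], ["Storm"]],
    "SMACK stack": [["Cassandra"], ["Cassandra"], ["Spark"], ["Spark"], ["Spark"]],
    "Elastic": [["HDFS"], ["ElasticSearch"], ["Hive"], ["H2O"], ["Kafka"]],
    "Data Science": [
        ["HDFS"],
        ["ElasticSearch"],
        ["Hive", "Presto"],
        ["Spark", "H2O", "TensorFlow"],
        ["Flink"],
    ],
    "Real Time": [["Cassandra"], ["Cassandra"], ["Flink"], ["Flink"], ["Flink"]],
}


def count_combinations(options_per_layer):
    """Number of stacks obtained by picking one tool per layer."""
    return prod(options_per_layer)


def distinct_tools(layers):
    """Number of different tools a stack needs across all its layers."""
    return len({tool for layer in layers for tool in layer})


if __name__ == "__main__":
    print("combinations:", count_combinations(OPTIONS_PER_LAYER))
    for name, layers in STACKS.items():
        print(f"{name}: {distinct_tools(layers)} distinct tools")
```

Under these assumptions the product reproduces the 37632 quoted on slide 32, and the distinct-tool counts match the figures in parentheses on slide 35. Fewer distinct tools means fewer components to upgrade and keep compatible, which is the maintenance cost slide 32 warns about.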
