O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

Nisha talagala keynote_inflow_2016

540 visualizações

Publicada em

INFLOW Keynote 2016

Publicada em: Dados e análise
  • Seja o primeiro a comentar

  • Seja a primeira pessoa a gostar disto

Nisha talagala keynote_inflow_2016

  1. 1. 1 The  New  Storage  Applications: Lots  of  Data,  New  Hardware  and   Machine  Intelligence Nisha  Talagala Parallel  Machines INFLOW  2016
  2. 2. 2 Storage  Evolution  &  Application  Evolution  Combined Disk  &  Tape Flash DRAM Persistent   Memory Geographically Distributed Clustered Local Key-­Value File,  Object Block Data  Management Classic  Enterprise Transactions Business  Intelligence Search  etc. Advanced  Analytics (Machine  Learning,  Cognitive   Functions)
  3. 3. 3 In  this  talk • What are the new data apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning • What are their salient characteristics when it comes to storage and memory? • How is storage optimized for these apps today? • Opportunities for the storage stack?
  4. 4. 4 Teaching  Assistants Elderly  Companions Service  Robots Personal  Social  Robots Smart  Cities Robot  Drones Smart  Homes Intelligent  vehicles Personal  Assistants  (bots) Smart  Enterprise Edited  version  of  slide  from  Balint Fleischer’s  talk:  Flash  Memory   Summit  2016,  Santa  Clara,  CA X Growing  Sources  of  Data
  5. 5. 5 Classic Enterprise Transactions,  Business   Intelligence Advanced  Analytics “Killer”  use  cases OTLP ERP Email eCommerce Messaging Social Networks Content  Delivery Discovery  of    solutions,  capabilities Risk  Assessment Improving  customer  experience Comprehending    sensory  data Key  functions RDBMS BI Fraud  detection Databases Social  Graphs SQL  and  ML  Analytics Streaming Natural  Language  Understanding Object Recognition Probabilistic  Reasoning Content  Analytics Data  Types Structured Transactional Structured Unstructured Transactional Streaming Mixed Graphs,  Matrices Storage  Types Enterprise Scale Standards  driven SAN/NAS,  etc Cloud Scale Open  source File/Object ??? Edited  version  of  slide  from  Balint Fleischer’s  talk:  Flash  Memory   Summit  2016 Santa  Clara,  CA The  Application  Evolution
  6. 6. 6 Libraries Libraries Machine  Learning,  Deep   Learning,  SQL,  Graph,  CEP    etc. Data LakeData  Repositories SQL NoSQL Data LakeData  Streams A  Sample  Analytics  Stack Processing  Engine Data  from   Repositories  or   Live  Streams Optimizers/Schedulers Language  Bindings,  APIs Frequently  in   memory Python,  Scala,   Java  etc
  7. 7. 7 Data LakeData  Repositories SQL NoSQL Data LakeData  Streams Machine  Learning  Software  Ecosystem  – a    Partial   View Data  from   Repositories  or   Live  Streams Flink /  Apex Spark  Streaming Storm  /  Samza /  NiFi Caffe Theano Tensor  Flow Hadoop  /  Spark Flink Tensor  Flow Mahout,  Samsara,  Mllib,  FlinkML,  Caffe,  TensorFlow Stream   Processing   Engine Batch Processing   Engine Domain   focused  back   end  engines Algorithms  and  Libraries Beam  (Data  Flow),  StreamSQL,  Keras Layered  API  Providers
  8. 8. 8 In  this  talk • What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning • What are their salient characteristics when it comes to storage and memory? • How is storage optimized for these apps today? • Opportunities?
  9. 9. 9 How  ML/DL  Workloads  think  about  Data  – Part  1 • Data Sizes • Incoming datasets can range from MB to TB • Models are typically small. Largest models tend to be in deep neural networks and range from 10s MB to single digit GB • Common Data Types • Time series and Streams • Multi-dimensional Arrays, Matrices and Vectors • DataFrames • Common distributed patterns • Data Parallel, periodic synchronization • Model Parallel • Network sensitivity varies between algorithms. Straggler performance issues can be significant • 2x performance difference between IB and 40Gbit Ethernet for some algorithms like KMeans and SVM
  10. 10. 10 The  Growth  of  Streaming  Data • Continuous data flows and continuous processing • Enabled & driven by sensor data, real time information feeds • Enables native time component “event time” • Allows complex computations that can combine new and old data in deterministic ways • Several variants with varied functionality • True Streams, Micro-Batch (an incremental batch emulation) • Possible with existing models like SQL, supported natively by models like Google DataFlow / Apache Beam • The performance of in-memory streaming enables a convergence between stream analytics (aggregation) and Complex Event Processing (CEP)
  11. 11. 11 Convergence  of  RDBMS  and  Analytics • In-Memory DBs are moving to continuous queries • Ex: StreamSQL interfaces, Pipeline DB (based on PostgreSQL) • Stream and batch analytic engines support SQL interfaces • Ex: SQL support on Spark, Flink • SQL parsers with pluggable back ends – Apache Calcite • Good for basic analytics but need extensions to support machine learning and deep learning • Joins, sorts, etc. good for feature engineering, data cleansing • Many core machine & deep learning operations require linear algebra ops If the idea of a standard database is "durable data, ephemeral queries" the idea of a streaming database is "durable queries, ephemeral data” http://www.databasesoup.com/2015/07/pipelinedb-streaming-postgres.html
  12. 12. 12 The  Growing  Role  of  the  Edge • Closest to data ingest, lowest latency. • Benefits to real time processing • Highly varied connectivity to data centers • Varied hardware architectures and resource constraints • Differs from geographically distributed data center architecture • Asymmetry of hardware • Unpredictable connectivity • Unpredictable device uptime ioT Reference  Model
  13. 13. 13 How  ML/DL  Workloads  think  about  Data  – Part  2 • The older data gets – the more its “role” changes • Older data for batch- historical analytics and model reboots • Used for model training (sort of), not for inference • Guarantees can be “flexible” on older data • Availability can be reduced (most algorithms can deal with some data loss) • A few data corruptions don’t really hurt J • Data is evaluated in aggregate and algorithms are tolerant of outliers • Holes are a fact of real life data – algorithms deal with it • Quality of service exists but is different • Random access is very rare • Heavily patterned access (most operations are some form of array/matrix) • Shuffle phase in some analytic engines
  14. 14. 14 Correctness,  Determinism,  Accuracy  and  Speed • More complex evaluation metrics than traditional transactional workloads • Correctness is hard to measure • Even two implementations of the “same algorithm” can generate different results • Determinism/Repeatability is not always present for streaming data • Ex: Micro-batch processing can produce different results depending on arrival time Vs event time • Accuracy to time tradeoff is non-linear • Exploratory models can generate massive parallelism for the same data set used repeatedly (hyper-parameter search) 0 0.2 0.4 0.6 0.8 1 1.2 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 Error Time SVM  V1   0 0.2 0.4 0.6 0.8 1 1.2 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14Error Time SVM  V2  
  15. 15. 15 The  Role  of  Persistence • For ML functions, most computations today are in-memory • Data flows from data lake to analytic engine and results flow back • Persistent checkpoints can generate large write traffic for very long running computations (streams, large neural network training, etc.) • Persistent message storage to enforce exactly once semantics and determinism, latency sensitive write traffic • For in-memory databases, persistence is part of the core engine • Log based persistence is common • Loading & cleaning of data is still a very large fraction of the pipeline time • Most of this involves manipulating stored data
  16. 16. 16 In  this  talk • What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning • What are their salient characteristics when it comes to storage and memory? • How is storage/memory optimized for these apps today? • Opportunities?
  17. 17. 17 Abstractions  and  the  Stack • ML/DL applications use common abstractions that combine linear algebra, tables, streams etc • These are stored as independent entities inside Key-Value pairs, Objects or Files • File system used as common namespace • Information is lost at each level down, along with opportunities to optimize layout, tiering, caching etc Data copies (or transfers denoted by red lines) occur frequently, sometimes more than once! Block File Key-­Value  and  Object Matrices,  Tables,  Streams,  etc
  18. 18. 18 Optimizing  Storage:  Some  Examples • Time series optimized databases • Examples BTrDB (FAST 2016) and Gorrilla DB (Facebook/VLDB 2015) • Streamlined data types, specialized indexing, tiering optimized for access patterns • API pushdown techniques • Iguazio.io • Streams and Spark RDDs as native access APIs • Lineage • Alluxio (Formerly Tachyon) • Link data history & compute history, cache intermediate stages in machine learning pipelines • Memory expansion • Many studies on DRAM/Persistent Memory/Flash tiering for analytics
  19. 19. 19 Opportunities:  Places  to  Start   • Persistent Memory and Flash offer several opportunities to improve ML/DL capacity and efficiency • Fast/Frequent Checkpointing for long running jobs • Note: will put pressure on write endurance • Low latency logging for exactly-once semantics • Memory expansion: DRAM/Persistent Memory/Flash hierarchies • exploit the highly predictable access patterns of ML algorithms • Accelerate data load/save stages of ML/DL pipelines
  20. 20. 20 Opportunities  – More  Fundamental  Shifts • Role of storage types in analytics optimizers and schedulers – superficially similar to DB query optimization • Exploit the more relaxed set of requirements on persistence • Even correctness can be relaxed • Example in compute land for flexibility in synchronization (HogWild! approach to SGD, plus Asynchronous SGD etc.) • Leverage Persistent Memory to unify low latency streaming data requirements and high throughput batch data requirements • New(er) data types and repeatable access patterns • Converged systems with analytics and storage management for cross stack efficiency
  21. 21. 21 Takeaways • The use of ML/DL in enterprise is at its infancy and expanding furiously • These apps put ever larger pressure on data management, latency, and throughput requirements • These apps also introduce another layer of abstraction and another layer of workload intelligence • Further away from block and file • Opportunities exist to significantly improve storage and memory for these use cases by understanding and exploiting their priorities and non-priorities for data
  22. 22. 22 Thank  You