SlideShare uma empresa Scribd logo
1 de 25
Baixar para ler offline
A H i g h Pe r f o r m a n c e D a t a P l a t f o r m U s i n g F l a s h
M A R C H 2 0 1 8
S u p e r C o m p u t i n g A s i a
• Edit Master text styles
• Second level
• Third level
• Fourth level
• Fifth level
$48M in Funding20+ Deployments
Industrial IoT
Financial Services
Telecommunications
Cyber Security
2
100+ Employees
NA
APAC
EMEA
iguazio
• Disk and Network were still slow
• Network was in general slower than disk
When Hadoop started
Data at scale
Compute
Data
Storage
Distribute data
• Disk is faster than Network
Compute is where data is
Compute
Data
StorageDistribute compute
• Scalable but inherently batch and slow
• Disk and Network I/O (Map, reduce, shuffle … )
• Runs on JVM
• Not suitable for all type of workloads (real-time, interactive, iterative M/L)
• Cannot scale compute independent of storage and vice versa
Big Data Workloads –
Performance Characteristics
Big Data Workloads –
Performance Characteristics
Big Data Workload Disk I/O Network I/O Compute
Iterative Machine Learning
Jobs
Interactive Analytics at
scale
Lambda & Kappa
Architecture at scale
Typically Disk or Network maxes out during a massively parallel job, before Compute
High
Medium
Low
Big Data Workloads –
Performance Characteristics
Big Data Workload Disk I/O Network I/O Compute
Iterative Machine Learning
Jobs
Interactive Analytics at
scale
Lambda & Kappa
Architecture at scale
Traditional Database
Workload
Compare this to traditional RDBBS – CPU or Disk I/O maxes out
Typically Disk or Network maxes out during a massively parallel job, before Compute
High
Medium
Low
Big Data Workloads –
Performance Characteristics
Big Data Workload Disk I/O Network I/O Compute
Iterative Machine Learning
Jobs
Interactive Analytics at
scale
Lambda & Kappa
Architecture at scale
Traditional Database
Workload
in-memory and high-speed networking ?
High
Medium
Low
• L1 cache reference 0.5 ns
• Branch mispredict 5 ns
• L2 cache reference 7 ns
• Mutex lock/unlock 100 ns
• Main memory reference 100 ns
• Send 2K bytes over 1 Gbps network 20,000 ns
• SSD seek 80,000 ns
• Read 1 MB sequentially from memory 250,000 ns
• Round trip within same datacenter 500,000 ns
• Disk seek 10,000,000 ns
• Read 1 MB sequentially from network 10,000,000 ns
• Read 1 MB sequentially from disk 30,000,000 ns
• Send packet CA->Netherlands->CA 150,000,000 ns
Designs, Lessons and Advice from Building Large Distributed Systems by Dr Jeff Dean of Google Source http://www.slideshare.net/ikewu83/dean-
keynoteladis2009-4885081
Optimizing Big Data
Performance
RAM is 10-20x more expensive than Flash in $Cost / GB
RAM is expensive
SSDs prices are falling
• L1 cache reference 0.5 ns
• Branch mispredict 5 ns
• L2 cache reference 7 ns
• Mutex lock/unlock 100 ns
• Main memory reference 100 ns
• Send 2K bytes over 1 Gbps network 20,000 ns
• SSD seek 80,000 ns
• Read 1 MB sequentially from memory 250,000 ns
• Round trip within same datacenter 500,000 ns
• Disk seek 10,000,000 ns
• Read 1 MB sequentially from network 10,000,000 ns
• Read 1 MB sequentially from disk 30,000,000 ns
• Send packet CA->Netherlands->CA 150,000,000 ns
Designs, Lessons and Advice from Building Large Distributed Systems by Dr Jeff Dean of Google Source http://www.slideshare.net/ikewu83/dean-
keynoteladis2009-4885081
Optimizing Big Data
Performance
0
1
2
3
4
5
6
7
8
9
DRAM NVMe SSD SATA SSD SAS HDD
8.5
1.2
0.85
0.04
Cost in $ per GB (Raw)
Cost in $ per GB (Raw)
Price of Memory
Technology 2016/17
• NVMe is a software based standard that was specifically optimized for SSDs connected through the
PCIe interface.
What is NVMe?
• Shorter hardware data access path – directly connected via PCIe, Faster compared to SATA
• NVMe Completely redesigned software - bypasses conventional block layer request queue –
• Asynchronous Submission Queue for requests
• Asynchronous Completion Queue
Why is NVMe faster
(than SATA) ?
Source
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=2CA58A08BFB2821F59A7996DAC8742F1?doi=10.1.1.697.1493&rep=rep1&type=pdf
17
Micro-Services &
Serverless Functions
Unified DB for
Real-Time & Analytics
EXTERNAL
EVENTS AND
SOURCES
IMMEDIATE
INSIGHTS &
ACTIONS
EXTERNAL
DATA
Enrich
Reinforce
Learning
iguazio
• Combined fresh data
• and historical data
• Immediate Insights
• Rapid time to production
• Supporting common APIs
• Deployment Everywhere
Unified DB for Real-Time & Analytics
FUNCTIONS &
MICRO-SERVICES
SQL & TS
QUERY APIS
ANALYTICS &
AI TOOLS AWS APIS
iguazio Data platform
for real time and analytics
Single platform
All in One
S3
Kinesis
DynamoDB
Secure Intelligent Multi-model
UNIFIED & REAL-TIME DATABASE ENGINE
EXTERNAL DATA LAKES & CLOUDS
Architecture
• In-mem DB performance with
Flash economies and density
• Access data concurrently through
multiple standard APIs
• Fully integrated PaaS
19
LOCAL NVMe
AND OS BYPASS
Tr a d i t i o n a l L a y e r e d A p p r o a c h i g u a z i o
How to take full
advantage of NVMe?
And Advanced
Networking
iguazio © 2016
21
The tests were performed using the Yahoo! Cloud Serving Benchmark (YCSB), an open
source framework for evaluating and comparing the performance of multiple types of
NoSQL database management systems, the de facto industry standard for this purpose.
Performance Results
Independently scaling
compute and storage
iguazio Scale Outx2
Storage
Data nodes
(Processing)
CPU/Drive ratio
Performance
Capacity
Smart Rebuild
Scale Out System
CPU/Drive ratio
Performance
Capacity
Smart Rebuild
Processing + Storage
Scale Up System
Processing
CPU/Drive ratio
Performance
Capacity
Smart Rebuild
Storage
iguazio © 2016
24
Why iguazio?
Very high ingestion rate
Low latency for faster dashboards – in memory speed at the cost of SSD
Real time analytics on both fresh and historical data
Scale out architecture – enabling real time analytics on large data sets
Platform as a service providing cloud experience
iguazio © 2016
25
Business benefits - what’s in it for me?
Increase operational efficiency
Reduce TCO
Faster time to market for new services
Reduce cost of traditional data center operations while constraining growth of expensive cloud services such as
Amazon DynamoDB, Kinesis , S3 , redshift and EMR
Bring new services on-line in less time with greater reliability & security
Optimize business processes
Increase data engineering efficiency
Simplify the overall data pipe-line helping engineering to focus on building applications
Thank You
@Santanu_Dey

Mais conteúdo relacionado

Mais procurados

Hardware Provisioning for MongoDB
Hardware Provisioning for MongoDBHardware Provisioning for MongoDB
Hardware Provisioning for MongoDB
MongoDB
 
GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File System
tutchiio
 

Mais procurados (20)

Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 
IMC Summit 2016 Breakout - Girish Kathalagiri - Decision Making with MLLIB, S...
IMC Summit 2016 Breakout - Girish Kathalagiri - Decision Making with MLLIB, S...IMC Summit 2016 Breakout - Girish Kathalagiri - Decision Making with MLLIB, S...
IMC Summit 2016 Breakout - Girish Kathalagiri - Decision Making with MLLIB, S...
 
Hardware Provisioning for MongoDB
Hardware Provisioning for MongoDBHardware Provisioning for MongoDB
Hardware Provisioning for MongoDB
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
 
GFS - Google File System
GFS - Google File SystemGFS - Google File System
GFS - Google File System
 
WebLogic Stability; Detect and Analyse Stuck Threads
WebLogic Stability; Detect and Analyse Stuck ThreadsWebLogic Stability; Detect and Analyse Stuck Threads
WebLogic Stability; Detect and Analyse Stuck Threads
 
Capacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB ClusterCapacity Planning For Your Growing MongoDB Cluster
Capacity Planning For Your Growing MongoDB Cluster
 
Redis for horizontally scaled data processing at jFrog bintray
Redis for horizontally scaled data processing at jFrog bintrayRedis for horizontally scaled data processing at jFrog bintray
Redis for horizontally scaled data processing at jFrog bintray
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
 
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GCHadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
 
Loadays MySQL
Loadays MySQLLoadays MySQL
Loadays MySQL
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
25 snowflake
25 snowflake25 snowflake
25 snowflake
 
Володимир Цап "Constraint driven infrastructure - scale or tune?"
Володимир Цап "Constraint driven infrastructure - scale or tune?"Володимир Цап "Constraint driven infrastructure - scale or tune?"
Володимир Цап "Constraint driven infrastructure - scale or tune?"
 
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiHBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
 
Accordion HBaseCon 2017
Accordion HBaseCon 2017Accordion HBaseCon 2017
Accordion HBaseCon 2017
 
Performance Analysis and Troubleshooting Methodologies for Databases
Performance Analysis and Troubleshooting Methodologies for DatabasesPerformance Analysis and Troubleshooting Methodologies for Databases
Performance Analysis and Troubleshooting Methodologies for Databases
 
Scylla Summit 2018: Scylla Feature Talks - Scylla Streaming and Repair Updates
Scylla Summit 2018: Scylla Feature Talks - Scylla Streaming and Repair UpdatesScylla Summit 2018: Scylla Feature Talks - Scylla Streaming and Repair Updates
Scylla Summit 2018: Scylla Feature Talks - Scylla Streaming and Repair Updates
 
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationZero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter Migration
 
GFS
GFSGFS
GFS
 

Semelhante a Building a High Performance Analytics Platform

Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Kognitio
 
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
Splunk
 
How to Choose a Host for a Big Data Project
How to Choose a Host for a Big Data ProjectHow to Choose a Host for a Big Data Project
How to Choose a Host for a Big Data Project
Peak Hosting
 

Semelhante a Building a High Performance Analytics Platform (20)

Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph Ceph Community Talk on High-Performance Solid Sate Ceph
Ceph Community Talk on High-Performance Solid Sate Ceph
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
 
Best storage engine for MySQL
Best storage engine for MySQLBest storage engine for MySQL
Best storage engine for MySQL
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
 
Oracle real application_cluster
Oracle real application_clusterOracle real application_cluster
Oracle real application_cluster
 
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
SQream DB - Bigger Data On GPUs: Approaches, Challenges, SuccessesSQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
SQream DB - Bigger Data On GPUs: Approaches, Challenges, Successes
 
How to Choose a Host for a Big Data Project
How to Choose a Host for a Big Data ProjectHow to Choose a Host for a Big Data Project
How to Choose a Host for a Big Data Project
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
Using SAS GRID v 9 with Isilon F810
Using SAS GRID v 9 with Isilon F810Using SAS GRID v 9 with Isilon F810
Using SAS GRID v 9 with Isilon F810
 
The Last Frontier- Virtualization, Hybrid Management and the Cloud
The Last Frontier-  Virtualization, Hybrid Management and the CloudThe Last Frontier-  Virtualization, Hybrid Management and the Cloud
The Last Frontier- Virtualization, Hybrid Management and the Cloud
 

Último

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 

Último (20)

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 

Building a High Performance Analytics Platform

  • 1. A H i g h Pe r f o r m a n c e D a t a P l a t f o r m U s i n g F l a s h M A R C H 2 0 1 8 S u p e r C o m p u t i n g A s i a
  • 2. • Edit Master text styles • Second level • Third level • Fourth level • Fifth level $48M in Funding20+ Deployments Industrial IoT Financial Services Telecommunications Cyber Security 2 100+ Employees NA APAC EMEA iguazio
  • 3. • Disk and Network were still slow • Network was in general slower than disk When Hadoop started
  • 5. • Disk is faster than Network Compute is where data is Compute Data StorageDistribute compute
  • 6. • Scalable but inherently batch and slow • Disk and Network I/O (Map, reduce, shuffle … ) • Runs on JVM • Not suitable for all type of workloads (real-time, interactive, iterative M/L) • Cannot scale compute independent of storage and vice versa Big Data Workloads – Performance Characteristics
  • 7. Big Data Workloads – Performance Characteristics Big Data Workload Disk I/O Network I/O Compute Iterative Machine Learning Jobs Interactive Analytics at scale Lambda & Kappa Architecture at scale Typically Disk or Network maxes out during a massively parallel job, before Compute High Medium Low
  • 8. Big Data Workloads – Performance Characteristics Big Data Workload Disk I/O Network I/O Compute Iterative Machine Learning Jobs Interactive Analytics at scale Lambda & Kappa Architecture at scale Traditional Database Workload Compare this to traditional RDBBS – CPU or Disk I/O maxes out Typically Disk or Network maxes out during a massively parallel job, before Compute High Medium Low
  • 9. Big Data Workloads – Performance Characteristics Big Data Workload Disk I/O Network I/O Compute Iterative Machine Learning Jobs Interactive Analytics at scale Lambda & Kappa Architecture at scale Traditional Database Workload in-memory and high-speed networking ? High Medium Low
  • 10. • L1 cache reference 0.5 ns • Branch mispredict 5 ns • L2 cache reference 7 ns • Mutex lock/unlock 100 ns • Main memory reference 100 ns • Send 2K bytes over 1 Gbps network 20,000 ns • SSD seek 80,000 ns • Read 1 MB sequentially from memory 250,000 ns • Round trip within same datacenter 500,000 ns • Disk seek 10,000,000 ns • Read 1 MB sequentially from network 10,000,000 ns • Read 1 MB sequentially from disk 30,000,000 ns • Send packet CA->Netherlands->CA 150,000,000 ns Designs, Lessons and Advice from Building Large Distributed Systems by Dr Jeff Dean of Google Source http://www.slideshare.net/ikewu83/dean- keynoteladis2009-4885081 Optimizing Big Data Performance
  • 11. RAM is 10-20x more expensive than Flash in $Cost / GB RAM is expensive
  • 12. SSDs prices are falling
  • 13. • L1 cache reference 0.5 ns • Branch mispredict 5 ns • L2 cache reference 7 ns • Mutex lock/unlock 100 ns • Main memory reference 100 ns • Send 2K bytes over 1 Gbps network 20,000 ns • SSD seek 80,000 ns • Read 1 MB sequentially from memory 250,000 ns • Round trip within same datacenter 500,000 ns • Disk seek 10,000,000 ns • Read 1 MB sequentially from network 10,000,000 ns • Read 1 MB sequentially from disk 30,000,000 ns • Send packet CA->Netherlands->CA 150,000,000 ns Designs, Lessons and Advice from Building Large Distributed Systems by Dr Jeff Dean of Google Source http://www.slideshare.net/ikewu83/dean- keynoteladis2009-4885081 Optimizing Big Data Performance
  • 14. 0 1 2 3 4 5 6 7 8 9 DRAM NVMe SSD SATA SSD SAS HDD 8.5 1.2 0.85 0.04 Cost in $ per GB (Raw) Cost in $ per GB (Raw) Price of Memory Technology 2016/17
  • 15. • NVMe is a software based standard that was specifically optimized for SSDs connected through the PCIe interface. What is NVMe?
  • 16. • Shorter hardware data access path – directly connected via PCIe, Faster compared to SATA • NVMe Completely redesigned software - bypasses conventional block layer request queue – • Asynchronous Submission Queue for requests • Asynchronous Completion Queue Why is NVMe faster (than SATA) ? Source http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=2CA58A08BFB2821F59A7996DAC8742F1?doi=10.1.1.697.1493&rep=rep1&type=pdf
  • 17. 17 Micro-Services & Serverless Functions Unified DB for Real-Time & Analytics EXTERNAL EVENTS AND SOURCES IMMEDIATE INSIGHTS & ACTIONS EXTERNAL DATA Enrich Reinforce Learning iguazio • Combined fresh data • and historical data • Immediate Insights • Rapid time to production • Supporting common APIs • Deployment Everywhere Unified DB for Real-Time & Analytics
  • 18. FUNCTIONS & MICRO-SERVICES SQL & TS QUERY APIS ANALYTICS & AI TOOLS AWS APIS iguazio Data platform for real time and analytics Single platform All in One S3 Kinesis DynamoDB Secure Intelligent Multi-model UNIFIED & REAL-TIME DATABASE ENGINE EXTERNAL DATA LAKES & CLOUDS Architecture • In-mem DB performance with Flash economies and density • Access data concurrently through multiple standard APIs • Fully integrated PaaS
  • 19. 19 LOCAL NVMe AND OS BYPASS Tr a d i t i o n a l L a y e r e d A p p r o a c h i g u a z i o How to take full advantage of NVMe?
  • 21. iguazio © 2016 21 The tests were performed using the Yahoo! Cloud Serving Benchmark (YCSB), an open source framework for evaluating and comparing the performance of multiple types of NoSQL database management systems, the de facto industry standard for this purpose. Performance Results
  • 22. Independently scaling compute and storage iguazio Scale Outx2 Storage Data nodes (Processing) CPU/Drive ratio Performance Capacity Smart Rebuild Scale Out System CPU/Drive ratio Performance Capacity Smart Rebuild Processing + Storage Scale Up System Processing CPU/Drive ratio Performance Capacity Smart Rebuild Storage
  • 23. iguazio © 2016 24 Why iguazio? Very high ingestion rate Low latency for faster dashboards – in memory speed at the cost of SSD Real time analytics on both fresh and historical data Scale out architecture – enabling real time analytics on large data sets Platform as a service providing cloud experience
  • 24. iguazio © 2016 25 Business benefits - what’s in it for me? Increase operational efficiency Reduce TCO Faster time to market for new services Reduce cost of traditional data center operations while constraining growth of expensive cloud services such as Amazon DynamoDB, Kinesis , S3 , redshift and EMR Bring new services on-line in less time with greater reliability & security Optimize business processes Increase data engineering efficiency Simplify the overall data pipe-line helping engineering to focus on building applications