1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
IoT: what about data storage?
Vladimir Rodionov
Staff Software Engineer
IoT data stream
• Sequence of data points
• Triplet: [ID][TIME][VALUE] – basic time series
• Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE] – time series with tags
• Sometimes with location – spatial data
• But strictly time series
• Do we have a good time-series data store?
• Open source?
• But commercially supported?
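As a rough illustration of the two record shapes listed above, the sketch below models a data point in Python. The class and field names (`DataPoint`, `series_id`, `tags`) are hypothetical and do not come from the talk; a triplet is treated simply as a multiplet with no tags.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class DataPoint:
    """One element of an IoT data stream (hypothetical model)."""
    series_id: int   # [ID]
    timestamp: int   # [TIME], e.g. epoch seconds
    value: float     # [VALUE]
    tags: Dict[str, str] = field(default_factory=dict)  # [TAG1]…[TAGN]; empty for a triplet

    def is_triplet(self) -> bool:
        # A basic time-series point carries no tags.
        return not self.tags


basic = DataPoint(series_id=11, timestamp=143897653, value=10.0)
tagged = DataPoint(12, 143897753, 11.3, tags={"MFG": "SA", "Version": "1.3"})
```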
Apache HBase
• Open source
• Scalable
• Distributed
• NoSQL data store
• Commercially supported
• Temporal? Sure, you can do temporal stuff!
• Out of the box?
Time Series DB requirements
• Data store MUST preserve temporal locality of data for better in-memory caching
• Data store MUST provide efficient compression
  – Time series are highly compressible (less than 2 bytes per data point in some cases)
  – Facebook's custom compression codec produces less than 1.4 bytes per data point
• Data store MUST provide automatic, configurable time-based rollup aggregations: sum, count, avg, min, max, etc., by minute, hour, day and so on. Most of the time it is the aggregated data we are interested in.
• Efficient caching policy (RAM/SSD)
• SQL API (nice to have, but optional)
• Support for IoT use cases (write/read ratio up to 99/1, millions of ops)
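The time-based rollup requirement can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the talk's implementation: the function name `rollup`, the tuple layout and the dictionary keys are all assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple


def rollup(points: Iterable[Tuple[int, int, float]],
           bucket_seconds: int) -> Dict[Tuple[int, int], dict]:
    """Aggregate (series_id, epoch_seconds, value) triplets into fixed time buckets."""
    buckets: Dict[Tuple[int, int], dict] = {}
    for series_id, ts, value in points:
        bucket_start = ts - ts % bucket_seconds  # align to the bucket boundary
        key = (series_id, bucket_start)
        agg = buckets.get(key)
        if agg is None:
            buckets[key] = {"sum": value, "count": 1, "min": value, "max": value}
        else:
            agg["sum"] += value
            agg["count"] += 1
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    # avg is derived from sum and count once all points are seen.
    for agg in buckets.values():
        agg["avg"] = agg["sum"] / agg["count"]
    return buckets


points = [(11, 60, 10.0), (11, 90, 14.0), (11, 125, 12.0)]
by_min = rollup(points, 60)      # one bucket per minute
by_hour = rollup(points, 3600)   # one bucket per hour
```

Keeping `sum` and `count` rather than only `avg` means coarser resolutions can be built from finer ones without going back to the raw points.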
Ideal HBase Time Series DB
• Keeps raw data for hours
• Does not compact raw data at all
• Preserves raw data in the memory cache for periodic compactions and time-based rollup aggregations
• Stores full-resolution data only in compressed form
• Has different TTLs for different aggregation resolutions:
  – Days for by_min, by_10min, etc.
  – Months or years for by_hour
• Compaction should preserve temporal locality of both full-resolution data and aggregated data
• Integration with Phoenix (SQL)
Write Path (for 99%)
Time Series DB HBase
[Diagram: raw events flow into a Region Server backed by HDFS. The Region Server holds the column families CF:Raw, CF:Compressed and CF:Aggregates; a Compressor Coprocessor (C) and an Aggregator Coprocessor (A) run inside it.]
• CF:Raw – TTL hours
• CF:Compressed – TTL days/months
• CF:Aggregates – TTL months/years (CF per resolution)
HBASE-14468 FIFO compaction
• First-In-First-Out
• No compaction at all
• TTL-expired data just gets archived
• Ideal for raw data storage
• No compaction means no block cache thrashing
• Raw data can be cached on write or on read
• Sustains hundreds of MB/s of write throughput per RS
• Available in 0.98.17, 1.1+, 1.2+, HDP-2.4+
• Can easily be backported to 1.0 (do we need this?)
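The FIFO idea can be simulated without HBase. The sketch below is a toy model under stated assumptions: `StoreFile` and `fifo_compact` are hypothetical names, and the rule shown (drop a file once even its newest cell is older than the TTL, never merge anything) is a simplified reading of the bullets above, not the HBase source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StoreFile:
    name: str
    max_timestamp: int  # newest cell in the file, epoch seconds


def fifo_compact(files: List[StoreFile], now: int,
                 ttl_seconds: int) -> Tuple[List[StoreFile], List[StoreFile]]:
    """Return (kept, expired). Files are never merged or rewritten."""
    kept, expired = [], []
    for f in files:
        # A file can go only when even its newest cell is past the TTL.
        if now - f.max_timestamp > ttl_seconds:
            expired.append(f)
        else:
            kept.append(f)
    return kept, expired


files = [StoreFile("f1", 1_000), StoreFile("f2", 5_000), StoreFile("f3", 9_000)]
kept, expired = fifo_compact(files, now=10_000, ttl_seconds=3_600)
```

Because nothing is rewritten, the only IO cost is removing whole files, which is why the approach suits short-lived raw data.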
Exploring (Size-Tiered) Compaction
• Does not preserve temporal locality of data
• Compaction thrashes the block cache
• No efficient caching of data is possible
• It hurts the most-recent-most-valuable data access pattern
• Compression/aggregation is very heavy
• Reading recent raw data back and running it through the compressor requires many IO operations, because …
• We can't guarantee that recent data is in the block cache
HBASE-15181 Date Tiered Compaction
• DateTieredCompactionPolicy
• CASSANDRA-6602
• Works better for time series than ExploringCompactionPolicy
• Better temporal locality helps with reads
• Good choice for compressed full-resolution and aggregated data
• Available in 0.98.17 and 1.2+; HDP-2.4 has it as well
Exploring Compaction + Max Size
• Set hbase.hstore.compaction.max.size
• This emulates Date-Tiered Compaction
• Preserves temporal locality of data: data points that are close in time are stored in the same file, distant ones in separate files
• Compaction works better with the block cache
• More efficient caching of recent data is possible
• Good for the most-recent-most-valuable data access pattern
• Use it for compressed and aggregated data
• Helps to keep recent data in the block cache
• Abbreviated here as ECPM
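The effect of a maximum compaction size can be seen in a toy simulation. This is a deliberately simplified sketch: it ignores the size-ratio logic of the real exploring policy, and `StoreFile`, `compact` and the `min_files` threshold are assumptions for illustration. The one behaviour it models is that files larger than the cap are excluded from further compaction.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoreFile:
    size_mb: int
    min_ts: int  # oldest data in the file (hours, for this toy example)
    max_ts: int  # newest data in the file


def compact(files: List[StoreFile], max_size_mb: int, min_files: int = 3) -> List[StoreFile]:
    """Merge all files at or under the size cap once there are enough of them."""
    candidates = [f for f in files if f.size_mb <= max_size_mb]
    if len(candidates) < min_files:
        return files
    merged = StoreFile(
        size_mb=sum(f.size_mb for f in candidates),
        min_ts=min(f.min_ts for f in candidates),
        max_ts=max(f.max_ts for f in candidates),
    )
    # Files over the cap are never rewritten, so their time window is frozen.
    untouched = [f for f in files if f.size_mb > max_size_mb]
    return sorted(untouched + [merged], key=lambda f: f.min_ts)


files: List[StoreFile] = []
for hour in range(9):
    # One 100 MB flush per hour, then a compaction check.
    files.append(StoreFile(size_mb=100, min_ts=hour, max_ts=hour + 1))
    files = compact(files, max_size_mb=250)

windows = [(f.min_ts, f.max_ts) for f in files]
```

After nine flushes the store holds three files covering hours 0-3, 3-6 and 6-9: each file spans one contiguous time window, which is the temporal locality the slide describes.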
HBASE-14496 Delayed compaction
• Files are eligible for minor compaction if their age > delay
• Good for applications where the most recent data is the most valuable
• Prevents block cache thrashing for recent data caused by frequent minor compactions of fresh store files
• Will enable this feature for the Exploring Compaction Policy
• Improves read latency for the most recent data
• ECP + Max + Delay (1-2 days), or ECPMD, is a good option for compressed full-resolution and aggregated data
• Patch available
• HBase 1.0+ (can be backported to 0.98)
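The eligibility rule in the first bullet amounts to a simple age filter. The snippet below is only a sketch of that rule; the names `StoreFile` and `eligible_for_minor_compaction` are hypothetical, and the actual patch may measure file age differently.

```python
from typing import List, NamedTuple


class StoreFile(NamedTuple):
    name: str
    created_at: int  # epoch seconds when the file was flushed


def eligible_for_minor_compaction(files: List[StoreFile], now: int,
                                  delay_seconds: int) -> List[StoreFile]:
    """Keep only files older than the delay; fresh files are left alone."""
    return [f for f in files if now - f.created_at > delay_seconds]


DAY = 24 * 3600
now = 10 * DAY
files = [
    StoreFile("old", now - 3 * DAY),
    StoreFile("yesterday", now - 1 * DAY),
    StoreFile("fresh", now - 600),
]
eligible = eligible_for_minor_compaction(files, now, delay_seconds=2 * DAY)
```

With a two-day delay only the three-day-old file is a candidate, so blocks from the most recent files stay valid in the block cache.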
Time Series DB HBase
[Diagram: the Region Server layout from the earlier diagram (raw events, Compressor Coprocessor (C), Aggregator Coprocessor (A), HDFS), now annotated with a compaction policy per column family.]
• CF:Raw – TTL hours – FIFO
• CF:Compressed – TTL days/months – ECPM or DTCP
• CF:Aggregates – TTL months/years (CF per resolution) – ECPM or DTCP
HBase Block Cache and Time Series
• The current policy (LRU) is not optimal for time-series applications
• We need something similar to FIFO (both in RAM and on SSD)
• We need support for TB-size RAM/SSD-based caches
• The current off-heap bucket cache does not scale well (it keeps keys in the Java heap)
• For an SSD cache we could mirror the most recent store files, thus providing FIFO semantics without the complexity of disk-based cache management
• All of the above are work items for the future, but today …
  – Disable the cache for raw data (prevents extreme cache churn)
  – Enable cache on write/read for compressed data and aggregations
Flexible Retention Policies
[Chart: retention period by data type]
• Raw: hours
• Compressed: months
• Aggregates: years
Read/Write IO Reduction (estimate for 250K data points/sec)
[Chart: relative read/write IO by configuration]
Configuration   Relative IO   IO rate
Base            100           50-100 MB/s
FIFO+ECPM       ~50           25-50 MB/s
+Compaction     ~10           5-10 MB/s
Summary
• Disable major compaction
• Do not run the HDFS balancer
• Disable HBase automatic region balancing: balance_switch false
• Disable region splits (DisabledRegionSplitPolicy)
• Pre-split the table in advance
• Have separate column families for raw, compressed and aggregated data (each aggregate resolution gets its own family)
• Increase hbase.hstore.blockingStoreFiles for all column families
• FIFO for raw data; ECPM(D) or DTCP (next session) for compressed and aggregated data
Summary (continued)
• Periodically run an internal job (coprocessor) to compress data and produce time-based rollup aggregations
• Do not cache raw data; use cache on write/read for the others (if ECPM(D))
• Enable WAL compression to decrease write IO
• Use maximum compression for raw data (GZ) to decrease write IO
Read Path (for 1%)
SQL (Phoenix) integration
• Each time series has a set of named attributes, which we call meta (tags in OpenTSDB)
• Keep time-series meta in Phoenix table(s)
• Adding, deleting or updating a time series is a DML/DDL operation on a Phoenix table
• Meta is (mostly) static
• Define the set of meta attributes that form the PK
• Translate the PK to a unique ID
• Store ID, RTS (reversed timestamp), VALUE in HBase
• Now you can index time series by any attribute(s) in Phoenix
• A query is a two-step process: first Phoenix to select the list of IDs, then HBase to run the query on the ID list
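One common way to build an [ID][RTS] key is to subtract the timestamp from the largest signed 64-bit value, so that newer points sort first within a series. The sketch below assumes that convention and an 8-byte big-endian layout; neither detail is confirmed by the talk, and `row_key` and `parse_row_key` are hypothetical helpers.

```python
import struct
from typing import Tuple

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def row_key(series_id: int, timestamp_ms: int) -> bytes:
    """[ID][RTS] as two 8-byte big-endian fields.

    For non-negative inputs, byte-wise ordering matches numeric ordering.
    """
    reversed_ts = LONG_MAX - timestamp_ms
    return struct.pack(">qq", series_id, reversed_ts)


def parse_row_key(key: bytes) -> Tuple[int, int]:
    series_id, reversed_ts = struct.unpack(">qq", key)
    return series_id, LONG_MAX - reversed_ts


# Sorting keys the way a byte-ordered store would puts the newest point first.
keys = sorted(row_key(11, ts) for ts in (1000, 3000, 2000))
newest_first = [parse_row_key(k)[1] for k in keys]
```

With this layout a scan that starts at the beginning of a series reaches the most recent data first, which suits the most-recent-most-valuable access pattern.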
Query Flow
Step 1. Phoenix SQL over the time-series definitions (META):

ID    Active   Version   …   MFG
11    true     1.1       …   SA
12    true     1.3       …   SA
15    true     1.4       …   GE
17    true     1.1       …   GE
…     …        …         …   …
345   false    1.0       …   SA

  SELECT ID FROM META WHERE MFG = 'SA' AND Version = '1.1'

The resulting ID set is passed to step 2.

Step 2. HBase Time Series DB over the time-series data:

ID    Timestamp   Value
11    143897653   10.0
12    143897753   11.3
15    143897953   11.6
17    143897853   11.9
…     …           …
345   143897753   11.0

  GetAvgByIdSet(ID set, now(), now() - 24h)
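The two-step flow can be mimicked with in-memory stand-ins for both stores. This is a sketch only: the lists `META` and `DATA` replace Phoenix and HBase, the rows are the ones visible on the slide, and `get_avg_by_id_set` is a guess at what GetAvgByIdSet might return (one average per ID). Note that it takes its time range as (start, end), the reverse of the argument order shown on the slide.

```python
from statistics import mean
from typing import Dict, List, Set, Tuple

# Step 1 stand-in: the META table, normally queried through Phoenix SQL.
META: List[dict] = [
    {"ID": 11, "Active": True, "Version": "1.1", "MFG": "SA"},
    {"ID": 12, "Active": True, "Version": "1.3", "MFG": "SA"},
    {"ID": 15, "Active": True, "Version": "1.4", "MFG": "GE"},
    {"ID": 17, "Active": True, "Version": "1.1", "MFG": "GE"},
]

# Step 2 stand-in: (ID, timestamp, value) rows, normally stored in HBase.
DATA: List[Tuple[int, int, float]] = [
    (11, 143897653, 10.0),
    (12, 143897753, 11.3),
    (15, 143897953, 11.6),
    (17, 143897853, 11.9),
]


def select_ids(mfg: str, version: str) -> Set[int]:
    # Equivalent of: SELECT ID FROM META WHERE MFG = ? AND Version = ?
    return {row["ID"] for row in META
            if row["MFG"] == mfg and row["Version"] == version}


def get_avg_by_id_set(ids: Set[int], start: int, end: int) -> Dict[int, float]:
    # Average value per ID over [start, end]; IDs with no points are omitted.
    values: Dict[int, List[float]] = {}
    for series_id, ts, value in DATA:
        if series_id in ids and start <= ts <= end:
            values.setdefault(series_id, []).append(value)
    return {series_id: mean(vs) for series_id, vs in values.items()}


now = 143897953
ids = select_ids("SA", "1.1")
averages = get_avg_by_id_set(ids, now - 24 * 3600, now)
```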
Time-Series DB API
• Group operations on ID sets by time range
  – Min, Max, Avg, Count, Sum, other aggregations
• Pluggable aggregation functions
• Support for different time resolutions
• With different approximations (linear, cubic, bi-cubic)
• Batch load support (for writes)
• Can be implemented in an HBase coprocessor layer
• Can work much, much faster than a regular SQL DBMS
• Because we already have aggregated data
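One possible reading of "different time resolutions with different approximations" is resampling a series onto a regular grid by interpolation. The sketch below shows the linear case only and assumes strictly increasing timestamps; `resample_linear` is a hypothetical name, and the talk does not specify how its API would do this.

```python
from bisect import bisect_left
from typing import List, Tuple


def resample_linear(points: List[Tuple[int, float]], start: int, end: int,
                    step: int) -> List[Tuple[int, float]]:
    """Resample time-sorted (timestamp, value) points onto a regular grid.

    Grid timestamps outside the observed range are skipped, not extrapolated.
    """
    times = [t for t, _ in points]
    out: List[Tuple[int, float]] = []
    for t in range(start, end + 1, step):
        if not times or t < times[0] or t > times[-1]:
            continue
        i = bisect_left(times, t)
        if times[i] == t:
            out.append((t, points[i][1]))  # exact hit, no interpolation needed
            continue
        (t0, v0), (t1, v1) = points[i - 1], points[i]
        out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
    return out


raw = [(0, 10.0), (40, 14.0), (100, 8.0)]
grid = resample_linear(raw, start=0, end=100, step=20)
```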
Thank you
• Q&A

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Chapter 1 big data
Chapter 1 big dataChapter 1 big data
Chapter 1 big data
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Open Data in a Day - Introduction to Open Data
Open Data in a Day - Introduction to Open DataOpen Data in a Day - Introduction to Open Data
Open Data in a Day - Introduction to Open Data
 
The rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computingThe rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computing
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streams
 
Big data lecture notes
Big data lecture notesBig data lecture notes
Big data lecture notes
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
IoT Security Challenges and Solutions
IoT Security Challenges and SolutionsIoT Security Challenges and Solutions
IoT Security Challenges and Solutions
 
Artificial intelligence of things(AIoT): What is AIoT: AIoT applications
Artificial intelligence of things(AIoT): What is AIoT: AIoT applicationsArtificial intelligence of things(AIoT): What is AIoT: AIoT applications
Artificial intelligence of things(AIoT): What is AIoT: AIoT applications
 
IOT DATA MANAGEMENT AND COMPUTE STACK.pptx
IOT DATA MANAGEMENT AND COMPUTE STACK.pptxIOT DATA MANAGEMENT AND COMPUTE STACK.pptx
IOT DATA MANAGEMENT AND COMPUTE STACK.pptx
 
Edge Computing and 5G - SDN/NFV London meetup
Edge Computing and 5G - SDN/NFV London meetupEdge Computing and 5G - SDN/NFV London meetup
Edge Computing and 5G - SDN/NFV London meetup
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptxEX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
 
Embedded Systems and IoT
Embedded Systems and IoTEmbedded Systems and IoT
Embedded Systems and IoT
 
IoT Platforms and Architecture
IoT Platforms and ArchitectureIoT Platforms and Architecture
IoT Platforms and Architecture
 
Ibm big data-platform
Ibm big data-platformIbm big data-platform
Ibm big data-platform
 
Introduction to IoT Architectures and Protocols
Introduction to IoT Architectures and ProtocolsIntroduction to IoT Architectures and Protocols
Introduction to IoT Architectures and Protocols
 
IoT Arduino UNO, RaspberryPi with Python, RaspberryPi Programming using Pytho...
IoT Arduino UNO, RaspberryPi with Python, RaspberryPi Programming using Pytho...IoT Arduino UNO, RaspberryPi with Python, RaspberryPi Programming using Pytho...
IoT Arduino UNO, RaspberryPi with Python, RaspberryPi Programming using Pytho...
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analytics
 

Destaque

Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Daniel Madrigal
 
HBase Consistency and Performance Improvements
HBase Consistency and Performance ImprovementsHBase Consistency and Performance Improvements
HBase Consistency and Performance Improvements
DataWorks Summit
 

Destaque (20)

Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Timeline service V2 at the Hadoop Summit SJ 2016
Timeline service V2 at the Hadoop Summit SJ 2016Timeline service V2 at the Hadoop Summit SJ 2016
Timeline service V2 at the Hadoop Summit SJ 2016
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & TrifactaExtend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
 
Apache Phoenix + Apache HBase
Apache Phoenix + Apache HBaseApache Phoenix + Apache HBase
Apache Phoenix + Apache HBase
 
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
MongoDB IoT City Tour LONDON: Managing the Database Complexity, by Arthur Vie...
 
Streaming map reduce
Streaming map reduceStreaming map reduce
Streaming map reduce
 
Zero Downtime App Deployment using Hadoop
Zero Downtime App Deployment using HadoopZero Downtime App Deployment using Hadoop
Zero Downtime App Deployment using Hadoop
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
 
HBase Consistency and Performance Improvements
HBase Consistency and Performance ImprovementsHBase Consistency and Performance Improvements
HBase Consistency and Performance Improvements
 
Apache HBase 0.98
Apache HBase 0.98Apache HBase 0.98
Apache HBase 0.98
 
JustWatch - Culture & Core Values
JustWatch - Culture & Core ValuesJustWatch - Culture & Core Values
JustWatch - Culture & Core Values
 
Hbase Nosql
Hbase NosqlHbase Nosql
Hbase Nosql
 
Launching your advanced analytics program for success in a mature industry
Launching your advanced analytics program for success in a mature industryLaunching your advanced analytics program for success in a mature industry
Launching your advanced analytics program for success in a mature industry
 
IOT Paris Seminar 2015 - Storage Challenges in IOT
IOT Paris Seminar 2015 - Storage Challenges in IOTIOT Paris Seminar 2015 - Storage Challenges in IOT
IOT Paris Seminar 2015 - Storage Challenges in IOT
 
Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn
 
Big Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on SparkBig Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on Spark
 
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance ImprovementsHadoop Summit 2012 | HBase Consistency and Performance Improvements
Hadoop Summit 2012 | HBase Consistency and Performance Improvements
 

Semelhante a IoT:what about data storage?

Semelhante a IoT:what about data storage? (20)

Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
ACID Transactions in Hive
ACID Transactions in HiveACID Transactions in Hive
ACID Transactions in Hive
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Put is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit EditionPut is the new rename: San Jose Summit Edition
Put is the new rename: San Jose Summit Edition
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
 

Mais de DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

Mais de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
IoT: what about data storage?

  • 1. IoT: what about data storage? Vladimir Rodionov, Staff Software Engineer. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 2-9. IoT data stream
      • Sequence of data points
      • Triplet: [ID][TIME][VALUE] (basic time series)
      • Multiplet: [ID][TIME][TAG1][…][TAGN][VALUE] (time series with tags)
      • Sometimes with location (spatial data)
      • But strictly time series
      • Do we have a good time-series data store?
      • Open source?
      • But commercially supported?
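The triplet and multiplet layouts above can be pictured as one record type. This is a minimal sketch for illustration only: the class name `DataPoint`, the field names, and the use of epoch seconds for [TIME] are assumptions, not something the slides specify.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DataPoint:
    """One element of an IoT stream: [ID][TIME][TAG1]...[TAGN][VALUE]."""
    series_id: int                      # [ID]
    timestamp: int                      # [TIME]; epoch seconds is an assumed unit
    value: float                        # [VALUE]
    tags: Dict[str, str] = field(default_factory=dict)  # empty for a plain triplet

    def is_triplet(self) -> bool:
        # A point without tags is the basic [ID][TIME][VALUE] form.
        return not self.tags


triplet = DataPoint(11, 1438976530, 10.0)
multiplet = DataPoint(11, 1438976530, 10.0, {"mfg": "SA", "version": "1.1"})
```

A point with no tags is the basic triplet; adding tags (or a location tag) turns it into the multiplet form without changing its time-series nature.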
  • 12-14. Apache HBase
      • Open source
      • Scalable
      • Distributed
      • NoSQL data store
      • Commercially supported
      • Temporal? Sure, you can do temporal stuff!
      • Out of the box?
  • 17. Time Series DB requirements
      • Data store MUST preserve temporal locality of data for better in-memory caching
      • Data store MUST provide efficient compression
          - Time series are highly compressible (less than 2 bytes per data point in some cases)
          - Facebook's custom compression codec produces less than 1.4 bytes per data point
      • Data store MUST provide automatic, configurable time-based rollup aggregations: sum, count, avg, min, max, etc., by minute, hour, day and so on. Most of the time it is the aggregated data we are interested in.
      • Efficient caching policy (RAM/SSD)
      • SQL API (nice to have, but optional)
      • Support for IoT use cases (write/read ratio up to 99/1, millions of ops)
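The time-based rollup requirement above can be sketched in a few lines. This is an illustrative in-memory version only; the function name `rollup`, the `(timestamp, value)` input shape, and the sample numbers are assumptions, and a real store would run this inside the server rather than on a Python list.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Group (timestamp, value) pairs into fixed windows and aggregate each window."""
    buckets = defaultdict(list)
    for ts, value in points:
        # The window is identified by its start time.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "sum": sum(vals),
            "count": len(vals),
            "avg": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for start, vals in sorted(buckets.items())
    }


# Made-up sample: five points spread over two one-minute windows.
points = [(0, 10.0), (20, 12.0), (59, 14.0), (60, 20.0), (119, 22.0)]
by_min = rollup(points, 60)
```

Changing `bucket_seconds` to 3600 or 86400 gives the by-hour and by-day resolutions, which is what makes the rollup "configurable" in the sense the slide describes.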
  • 18. Ideal HBase Time Series DB
      • Keeps raw data for hours
      • Does not compact raw data at all
      • Preserves raw data in memory cache for periodic compactions and time-based rollup aggregations
      • Stores full-resolution data only in compressed form
      • Has different TTLs for different aggregation resolutions:
          - Days for by_min, by_10min, etc.
          - Months or years for by_hour
      • Compaction should preserve temporal locality of both full-resolution and aggregated data
      • Integration with Phoenix (SQL)
  • 19. Write Path (for 99%)
  • 20. Time Series DB (architecture diagram)
      • Diagram labels: Raw Events; HBase Region Server; HDFS; Compressor Coprocessor (C); Aggregator Coprocessor (A); column families CF:Raw, CF:Compressed, CF:Aggregates
      • CF:Raw: TTL hours
      • CF:Compressed: TTL days/months
      • CF:Aggregates: TTL months/years (CF per resolution)
  • 21. HBASE-14468 FIFO compaction
      • First-In-First-Out
      • No compaction at all
      • TTL-expired data just gets archived
      • Ideal for raw data storage
      • No compaction, so no block cache thrashing
      • Raw data can be cached on write or on read
      • Sustains 100s of MB/s write throughput per RS
      • Available in 0.98.17, 1.1+, 1.2+, HDP-2.4+
      • Can easily be backported to 1.0 (do we need this?)
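The FIFO idea above (never rewrite files, only drop the ones whose data has fully expired) can be sketched as follows. This is a toy model, not the HBase implementation: `StoreFile`, `fifo_compact`, and the timestamps are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoreFile:
    name: str
    max_timestamp: int  # newest cell timestamp in the file, in seconds


def fifo_compact(files, now, ttl_seconds):
    """Split files into (kept, expired).

    Nothing is merged or rewritten. A file is dropped only when even its
    newest cell is older than the TTL, i.e. the whole file has expired.
    """
    kept, expired = [], []
    for f in files:
        if now - f.max_timestamp > ttl_seconds:
            expired.append(f)
        else:
            kept.append(f)
    return kept, expired


files = [StoreFile("f1", 1000), StoreFile("f2", 5000), StoreFile("f3", 9000)]
kept, expired = fifo_compact(files, now=10000, ttl_seconds=3600)
```

Because surviving files are never rewritten, the only write IO for raw data is the initial flush, which is the property the slide relies on.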
  • 22. Exploring (Size-Tiered) Compaction
      • Does not preserve temporal locality of data
      • Compaction thrashes the block cache
      • No efficient caching of data is possible
      • It hurts the most-recent-most-valuable data access pattern
      • Compression/aggregation is very heavy
      • Reading back recent raw data and running it through the compressor requires many IO operations, because we can't guarantee that recent data is in the block cache
  • 23. HBASE-15181 Date Tiered Compaction
      • DateTieredCompactionPolicy
      • CASSANDRA-6602
      • Works better for time series than ExploringCompactionPolicy
      • Better temporal locality helps with reads
      • Good choice for compressed full-resolution and aggregated data
      • Available in 0.98.17 and 1.2+; HDP-2.4 has it as well
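To make the notion of date tiering concrete, here is a deliberately simplified sketch: files are grouped by age into tiers whose upper age bound grows geometrically, and only files in the same tier would be merged together. It is not the DateTieredCompactionPolicy algorithm; the function names, the one-hour base window, and the growth factor of 4 are all assumptions.

```python
def tier_of(age_seconds, base_window=3600, factor=4):
    """Tier 0 holds ages below base_window; each next tier's upper bound is factor times larger."""
    tier, upper = 0, base_window
    while age_seconds >= upper:
        tier += 1
        upper *= factor
    return tier


def group_by_tier(files, now, base_window=3600, factor=4):
    """files: iterable of (name, max_timestamp). Returns {tier: [names]}."""
    groups = {}
    for name, max_ts in files:
        tier = tier_of(now - max_ts, base_window, factor)
        groups.setdefault(tier, []).append(name)
    return groups


# Made-up store files with their newest timestamps (seconds).
files = [("f1", 99000), ("f2", 98000), ("f3", 95000), ("f4", 60000)]
groups = group_by_tier(files, now=100000)
```

Since a merge never crosses a tier boundary in this model, data points that are close in time end up in the same file, which is the temporal-locality property the slide is after.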
  • 24. Exploring Compaction + Max Size (ECPM)
      • Set hbase.hstore.compaction.max.size
      • This emulates Date-Tiered Compaction
      • Preserves temporal locality of data: data points that are close in time are stored in the same file, distant ones in separate files
      • Compaction works better with the block cache
      • More efficient caching of recent data is possible
      • Good for the most-recent-most-valuable data access pattern
      • Use it for compressed and aggregated data
      • Helps to keep recent data in the block cache
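The effect of hbase.hstore.compaction.max.size can be illustrated with a toy filter: once a store file grows past the limit it is excluded from further compactions, so it is never rewritten or mixed with newer data. The function name and the file sizes below are made up for illustration; only the property name comes from the slide.

```python
def compaction_candidates(file_sizes_mb, max_size_mb):
    """Return the files that may still be compacted.

    Files larger than max_size_mb are excluded, so old, already-large files
    stay untouched and keep their time range intact.
    """
    return {name: size for name, size in file_sizes_mb.items() if size <= max_size_mb}


file_sizes_mb = {"old-1": 900, "old-2": 750, "recent-1": 120, "recent-2": 40}
candidates = compaction_candidates(file_sizes_mb, max_size_mb=512)
```

Only the small, recent files remain candidates here, which is how a size cap ends up approximating time-based tiering for append-mostly time-series data.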
  • 25. HBASE-14496 Delayed compaction
      • Files are eligible for minor compaction if their age > delay
      • Good for applications where the most recent data is the most valuable
      • Prevents block cache thrashing for recent data caused by frequent minor compactions of fresh store files
      • Will enable this feature for Exploring Compaction Policy
      • Improves read latency for the most recent data
      • ECP + Max + Delay (1-2 days), or ECPMD, is a good option for compressed full-resolution and aggregated data
      • Patch available
      • HBase 1.0+ (can be backported to 0.98)
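The eligibility rule "age > delay" is small enough to write down directly. This is a sketch under assumed names (`eligible_for_minor_compaction`, files given as `(name, created_at)` pairs) with a one-day delay picked from the 1-2 day range mentioned on the slide.

```python
ONE_DAY = 24 * 60 * 60  # seconds


def eligible_for_minor_compaction(files, now, delay_seconds):
    """files: iterable of (name, created_at). Keep only files older than the delay."""
    return [name for name, created_at in files if now - created_at > delay_seconds]


now = 1_000_000
files = [
    ("fresh", now - 3600),       # one hour old: left alone, stays hot in the block cache
    ("older", now - 90_000),     # a bit over one day old: may be compacted
]
eligible = eligible_for_minor_compaction(files, now, delay_seconds=ONE_DAY)
```

Fresh files are skipped, so blocks of recent data that are already cached are not invalidated by a rewrite.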
  • 26. Time Series DB (architecture diagram as on slide 20, annotated with compaction policies)
      • CF:Raw: TTL hours, FIFO
      • CF:Compressed: TTL days/months, ECPM or DTCP
      • CF:Aggregates: TTL months/years (CF per resolution), ECPM or DTCP
  • 27. HBase Block Cache and Time Series
      • The current policy (LRU) is not optimal for time-series applications
      • We need something similar to FIFO (both in RAM and on SSD)
      • We need support for TB-size RAM/SSD-based caches
      • The current off-heap bucket cache does not scale well (it keeps keys in the Java heap)
      • For an SSD cache we could mirror the most recent store files, providing FIFO semantics without the complexity of disk-based cache management
      • All of the above are work items for the future, but today:
          - Disable the cache for raw data (prevents extreme cache churn)
          - Enable cache on write/read for compressed data and aggregations
  • 28. Flexible Retention Policies
      • Raw: hours
      • Compressed: months
      • Aggregates: years
  • 29-31. Read/Write IO Reduction (estimate for 250K data points/sec)
      • Base: 100 (50-100 MB/s)
      • FIFO+ECPM: ~50 (25-50 MB/s)
      • +Compaction: ~10 (5-10 MB/s)
  • 32. Summary
      • Disable major compaction
      • Do not run the HDFS balancer
      • Disable HBase automatic region balancing: balance_switch false
      • Disable region splits (DisabledRegionSplitPolicy)
      • Presplit the table in advance
      • Have separate column families for raw, compressed and aggregated data (each aggregate resolution gets its own family)
      • Increase hbase.hstore.blockingStoreFiles for all column families
      • FIFO for raw; ECPM(D) or DTCP (next session) for compressed and aggregated data
  • 33. Summary (continued)
      • Periodically run an internal job (coprocessor) to compress data and produce time-based rollup aggregations
      • Do not cache raw data; use write/read cache for the others (if ECPM(D))
      • Enable WAL compression to decrease write IO
      • Use maximum compression (GZ) for raw data to decrease write IO
  • 34. Read Path (for 1%)
  • 35. SQL (Phoenix) integration
      • Each time series has a set of named attributes, which we call meta (tags in OpenTSDB)
      • Keep time-series meta in Phoenix table(s)
      • Adding, deleting or updating a time series is a DML/DDL operation on a Phoenix table
      • Meta is (mostly) static
      • Define the set of attributes in meta that form the PK
      • Have a translation from PK to a unique ID
      • Store ID, RTS (reversed timestamp), VALUE in HBase
      • Now you can index time series by any attribute(s) in Phoenix
      • A query is a two-step process: Phoenix first to select the list of IDs, then HBase to run the query on the ID list
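The "ID, RTS, VALUE" layout can be illustrated with a row-key encoder. The slides do not give the exact key format, so the 8-byte big-endian ID, the 8-byte reversed timestamp computed as `Long.MAX_VALUE - timestamp`, and the millisecond unit are assumptions made for this sketch.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def row_key(series_id: int, timestamp_ms: int) -> bytes:
    """Row key = big-endian ID followed by big-endian reversed timestamp (RTS)."""
    return struct.pack(">qq", series_id, LONG_MAX - timestamp_ms)


older = row_key(11, 1_438_976_530_000)
newer = row_key(11, 1_438_976_590_000)

# Byte-wise sorting (the order HBase keeps rows in) puts the newest point first.
ordered = sorted([older, newer])
```

With the timestamp reversed, a scan that starts at the series ID prefix returns the most recent data points first, which suits the most-recent-most-valuable access pattern described earlier.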
  • 36. Query Flow
      • Phoenix SQL: Time-Series Definition (META)

            ID   Active  Version  …  MFG
            11   true    1.1         SA
            12   true    1.3         SA
            15   true    1.4         GE
            17   true    1.1         GE
            …    …       …        …  …
            345  false   1.0         SA

      • HBase Time Series DB: Time-Series Data

            ID   Timestamp  Value
            11   143897653  10.0
            12   143897753  11.3
            15   143897953  11.6
            17   143897853  11.9
            …    …          …
            345  143897753  11.0

      • Step 1: SELECT ID FROM META WHERE MFG='SA' AND Version = '1.1' (returns the ID set)
      • Step 2: GetAvgByIdSet(ID set, now(), now() - 24h)
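The two-step query flow can be traced with the sample rows from the slide held in plain Python lists. This is only a stand-in for the real Phoenix and HBase calls: `select_ids` and `get_avg_by_id_set` are hypothetical helpers, the explicit time bounds replace `now()` and `now() - 24h`, and averaging across the whole ID set (rather than per ID) is an assumption.

```python
# Step-1 source: time-series definitions (the META table on the slide).
META = [
    {"id": 11, "active": True, "version": "1.1", "mfg": "SA"},
    {"id": 12, "active": True, "version": "1.3", "mfg": "SA"},
    {"id": 15, "active": True, "version": "1.4", "mfg": "GE"},
    {"id": 17, "active": True, "version": "1.1", "mfg": "GE"},
    {"id": 345, "active": False, "version": "1.0", "mfg": "SA"},
]

# Step-2 source: (ID, timestamp, value) rows from the time-series table.
DATA = [
    (11, 143897653, 10.0),
    (12, 143897753, 11.3),
    (15, 143897953, 11.6),
    (17, 143897853, 11.9),
    (345, 143897753, 11.0),
]


def select_ids(mfg, version):
    """Stand-in for: SELECT ID FROM META WHERE MFG=? AND Version=?"""
    return {row["id"] for row in META if row["mfg"] == mfg and row["version"] == version}


def get_avg_by_id_set(ids, start, end):
    """Average of all values for the given IDs with start <= timestamp <= end."""
    values = [v for sid, ts, v in DATA if sid in ids and start <= ts <= end]
    return sum(values) / len(values) if values else None


ids = select_ids("SA", "1.1")
avg = get_avg_by_id_set(ids, 143897600, 143898000)
```

With the slide's sample rows, the first step narrows the search to a single series (ID 11), so the second step only touches that series' rows.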
  • 37-38. Time-Series DB API
      • Group operations on ID sets by time range: min, max, avg, count, sum, other aggregations
      • Pluggable aggregation functions
      • Support for different time resolutions
      • With different approximations (linear, cubic, bi-cubic)
      • Batch load support (for writes)
      • Can be implemented in an HBase coprocessor layer
      • Can work much faster than a regular SQL DBMS, because the data is already aggregated
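A group operation with pluggable aggregation functions could look roughly like the sketch below. None of these names come from the slides: `AGGREGATORS`, `register`, and `aggregate` are hypothetical, and the data lives in a Python list rather than behind an HBase coprocessor.

```python
from typing import Callable, Dict, List

# Built-in aggregations; callers can plug in more via register().
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "min": min,
    "max": max,
    "sum": sum,
    "count": len,
    "avg": lambda vals: sum(vals) / len(vals),
}


def register(name, fn):
    """Plug in a custom aggregation function under the given name."""
    AGGREGATORS[name] = fn


def aggregate(points, ids, start, end, func):
    """points: (id, ts, value) tuples. Returns {id: aggregate} over start <= ts < end."""
    per_id = {}
    for sid, ts, value in points:
        if sid in ids and start <= ts < end:
            per_id.setdefault(sid, []).append(value)
    fn = AGGREGATORS[func]
    return {sid: fn(vals) for sid, vals in per_id.items()}


# A custom aggregation: spread between the largest and smallest value.
register("range", lambda vals: max(vals) - min(vals))

points = [(11, 0, 10.0), (11, 30, 14.0), (12, 10, 5.0), (12, 90, 7.0)]
result = aggregate(points, {11, 12}, 0, 60, "range")
```

Keeping the aggregation functions in a registry means a new operation only needs a name and a callable, without touching the scan logic.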
  • 39. Thank you. Q&A