From the Hadoop Summit 2015 Session with Ted Dunning:
Just when we thought the last mile problem was solved, the Internet of Things is turning the last mile problem of the consumer internet into the first mile problem of the industrial internet. This inversion impacts every aspect of the design of networked applications. I will show how to use existing Hadoop ecosystem tools, such as Spark, Drill and others, to deal successfully with this inversion. I will present real examples of how data from things leads to real business benefits and describe real techniques for how these examples work.
Talk track: 2nd in a series; the first was on how to build a simple recommender. This one, on anomaly detection, is being sold by O’Reilly on Amazon, but for a limited time MapR is giving away the e-book for free. Here’s the link where you can register to get one.
Ted’s original talk notes: OpenTSDB consists of a Time Series Daemon (TSD) as well as a set of command line utilities. Interaction with OpenTSDB is primarily achieved by running one or more of the TSDs. Each TSD is independent: there is no master and no shared state, so you can run as many TSDs as required to handle any load you throw at it. Each TSD uses the open source database HBase to store and retrieve time-series data. The HBase schema is highly optimized for fast aggregations of similar time series and to minimize storage space. Users of the TSD never need to access HBase directly. You can communicate with the TSD via a simple telnet-style protocol, an HTTP API, or a simple built-in GUI. All communications happen on the same port (the TSD figures out the protocol of the client by looking at the first few bytes it receives).
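For instance, a single data point can be pushed over the telnet-style interface with one put line, or posted as JSON to the HTTP API (the metric name, timestamp, and tags below are illustrative; the endpoints shown are those of OpenTSDB 2.x):

    put sys.cpu.user 1356998400 42.5 host=webserver01

    POST /api/put
    {"metric": "sys.cpu.user", "timestamp": 1356998400, "value": 42.5, "tags": {"host": "webserver01"}}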
Key ideas: A unique row key is built from an id for each time series (looked up from a separate look-up table); an important part of the efficiency of the design is to have each column be a time offset from the start time encoded in the row key. Note that data is stored point-by-point in this wide table design.
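As a rough illustration, the row key and column layout might look like the sketch below. The UID widths, the one-hour row width, and the helper names are illustrative assumptions, not OpenTSDB’s exact code:

    import java.nio.ByteBuffer;

    // Sketch of an OpenTSDB-style row key and wide-table column layout.
    // One row holds one series for one time window (here: one hour).
    public class RowKeySketch {
        static final int ROW_WIDTH_SECONDS = 3600;

        // Row key: [metric UID][base timestamp][tag UIDs...]
        static byte[] rowKey(byte[] metricUid, long epochSeconds, byte[] tagUids) {
            long baseTime = epochSeconds - (epochSeconds % ROW_WIDTH_SECONDS);
            ByteBuffer key = ByteBuffer.allocate(metricUid.length + 4 + tagUids.length);
            key.put(metricUid);
            key.putInt((int) baseTime);  // start time shown in the row key
            key.put(tagUids);
            return key.array();
        }

        // Column qualifier: the point's offset from the row's start time,
        // so each data point occupies its own column of the wide row.
        static short qualifier(long epochSeconds) {
            return (short) (epochSeconds % ROW_WIDTH_SECONDS);
        }
    }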
Ted’s notes from his original slide:
One technique for increasing the rate at which data can be retrieved from a time series database is to store many values in each row; a single read then returns many data points at once.
Because both HBase and MapR-DB store data ordered by the primary key, this design will cause rows containing data from a single time series to wind up near one another on disk. Retrieving data from a particular time series for a time range will involve largely sequential disk operations and therefore will be much faster than would be the case if the rows were widely scattered.
Typically, the time window is adjusted so that 100–1,000 samples are in each row.
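A minimal sketch of what this buys you at query time, assuming the hypothetical rowKey helper from the earlier sketch, the standard HBase 2.x client API, and free variables (metricUid, tagUids, startEpoch, endEpoch) supplied by the caller; a time-range query for one series becomes a single contiguous scan:

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;

    // Because rows for one series sort together, a time-range read is one
    // contiguous scan over neighboring rows ("tsdb" is an illustrative table name).
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("tsdb"));
         ResultScanner scanner = table.getScanner(
             new Scan().withStartRow(rowKey(metricUid, startEpoch, tagUids))
                       .withStopRow(rowKey(metricUid, endEpoch, tagUids), true))) {
        for (Result row : scanner) {
            // each Result carries one row-width's worth of points for the series,
            // read back largely sequentially from disk
        }
    }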
Ted’s notes from original slide:
The table design is improved by collapsing all of the data for a row into a single data structure known as a blob. This blob can be highly compressed so that less data needs to be read from disk. Also, having a single column per row decreases the per-column overhead incurred by the on-disk format that HBase uses, which further increases performance.
Data can be progressively converted to the compressed format as soon as it is known that little or no new data is likely to arrive for that time series and time window. Commonly, once the time window ends, new data will only arrive for a few more seconds, and the compression of the data can begin. Since compressed and uncompressed data can coexist in the same row, if a few samples arrive after the row is compressed, the row can simply be compressed again to merge the blob and the late-arriving samples.
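A minimal sketch of such a blob maker, under assumed details (gzip as the compressor and a simple count-prefixed offset/value record format; real implementations would choose more specialized encodings):

    import java.io.*;
    import java.util.*;
    import java.util.zip.*;

    // Illustrative blob maker: packs all (offset, value) points of one row into
    // a single compressed value, and re-compresses when late samples arrive.
    public class BlobMaker {
        static byte[] compress(SortedMap<Short, Double> points) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
                out.writeInt(points.size());
                for (Map.Entry<Short, Double> p : points.entrySet()) {
                    out.writeShort(p.getKey());    // offset from the row's start time
                    out.writeDouble(p.getValue()); // sample value
                }
            }
            return bytes.toByteArray();
        }

        static SortedMap<Short, Double> decompress(byte[] blob) throws IOException {
            SortedMap<Short, Double> points = new TreeMap<>();
            try (DataInputStream in = new DataInputStream(
                    new GZIPInputStream(new ByteArrayInputStream(blob)))) {
                int n = in.readInt();
                for (int i = 0; i < n; i++) {
                    points.put(in.readShort(), in.readDouble());
                }
            }
            return points;
        }

        // Late-arriving samples: merge into the existing blob and compress again.
        static byte[] merge(byte[] blob, SortedMap<Short, Double> lateSamples) throws IOException {
            SortedMap<Short, Double> points = decompress(blob);
            points.putAll(lateSamples);
            return compress(points);
        }
    }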
Richard: This is based on a figure from Chapter 3 of our book. The point here is to show that with standard OpenTSDB, data is loaded into the wide table point-by-point, then pulled out and compressed into blobs, then reloaded to form the hybrid table. This is a fairly efficient arrangement. The next slide will show how this is sped up with the MapR open source extensions.
Here are Ted’s original notes: Since data is inserted in the uncompressed format, the arrival of each data point requires a row update operation to insert the value into the database.
The data is then read back by the blob maker, so reads are approximately equal to writes. Once data is compressed to blobs, it is written to the database again.
This row update can limit the insertion rate for data to as little as 20,000 data points per second per node in the cluster.
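To make that cost concrete, here is a hedged sketch of the standard path's per-point write, reusing the hypothetical rowKey/qualifier helpers from the earlier sketch; the column family name "t" and the free variables (table, metricUid, epochSeconds, tagUids, value) are illustrative:

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Standard path: one small update per arriving data point, which is what
    // caps ingest at roughly 20,000 points/second/node before blobs are made.
    Put put = new Put(rowKey(metricUid, epochSeconds, tagUids));
    put.addColumn(Bytes.toBytes("t"),
                  Bytes.toBytes(qualifier(epochSeconds)),
                  Bytes.toBytes(value));
    table.put(put);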
Richard: Also based on a figure from Chapter 3 of the book: This slide shows the increased performance using the open source code MapR released on GitHub. I’ve added the GitHub link. The key difference is that blob production occurs upstream, before the data is ever loaded into the table. The restart logs are useful so that if there were ever a glitch in the process of compressing data to blobs and inserting it, you would not lose the original data. Note that there is still a delay while blobs are made… see the explanation in the book, chapters 3 and 4.
Richard: Please preserve the rest of the material on fast ingestion with MapR extensions (direct blob loading) for Ted’s talk on Sat. Use this slide as a preview and mention that Ted will be talking about this on Friday.
Ted’s original notes: The direct blob insertion data flow allows the insertion rate to be increased by as much as roughly 1,000-fold. Against the ~20,000 points per second per node figure above, that is on the order of 20 million points per second per node.
How does the direct blob approach get this bump in performance? The essential difference is that the blob maker has been moved into the data flow between the catcher and the NoSQL time series database. This way, the blob maker can use incoming data from a memory cache rather than extracting its input from wide table rows already stored in the storage tier.
The full data stream is only written to the memory cache, which is fast, rather than to the database. Data is not written to the storage tier until it’s compressed into blobs, so writing can be much faster. The number of database operations is decreased by the average number of data points in each of the compressed data blobs. This decrease can easily be a factor in the thousands.
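A minimal sketch of that flow, with illustrative names throughout (SeriesWindow, writeBlob) and reusing the hypothetical BlobMaker from the earlier sketch:

    import java.io.IOException;
    import java.util.*;

    // Illustrative direct-blob flow: the catcher feeds points into a memory
    // cache; only finished, compressed blobs ever reach the storage tier.
    public class DirectBlobInserter {
        // one in-memory buffer per (series, time window) pair
        private final Map<SeriesWindow, SortedMap<Short, Double>> cache = new HashMap<>();

        // Called by the catcher for each point: a cheap in-memory insert, no DB write.
        void accept(SeriesWindow window, short offset, double value) {
            cache.computeIfAbsent(window, w -> new TreeMap<>()).put(offset, value);
        }

        // Once a window is unlikely to see more data, write the whole row as one
        // blob: thousands of points per database operation instead of one each.
        void flush(SeriesWindow window) throws IOException {
            SortedMap<Short, Double> points = cache.remove(window);
            if (points != null) {
                writeBlob(window, BlobMaker.compress(points)); // single DB write
            }
        }

        void writeBlob(SeriesWindow window, byte[] blob) { /* one Put per row */ }

        record SeriesWindow(String seriesId, long baseTime) {}
    }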