Smooth is a distributed storage system for managing structured time series data at Two Sigma. Smooth’s design emphasizes scale (both in data size and aggregate request bandwidth), reliability, and storage efficiency. It is optimized for large parallel streaming read/write accesses over specified time ranges. Smooth has a clear separation between the metadata and data layers, and supports multiple pluggable object stores for storing data files. Data can be replicated or moved between different stores and data centers to support availability, performance, and storage-tiering objectives. Smooth is widely used at Two Sigma by various applications, including modeling research workflows, data pipelines, and data analysis jobs. Smooth has been in development for about 5 years, currently stores multiple PBs of compressed data, and serves peak aggregate throughput in excess of 100 GB/s. In this talk I will discuss the design and implementation of Smooth, our experience running it over the past two years, ongoing challenges, and future directions.
Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma
1. www.twosigma.com
Smooth Storage
September 13, 2018
A storage system for managing structured time
series data at Two Sigma
Saurabh Goel
saurabh.goel@twosigma.com
2. Disclaimer
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer
to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon
for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates
(collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without
notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of
such information and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two
Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark
does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
3. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
4. Motivation
• Why have specialized storage for time series data?
  • Extremely common at Two Sigma
  • Time is one of the primary dimensions along which applications want to partition and filter data
  • Scale – in terms of both size and access
  • Optimizing for the target application workload and requirements
5. Smooth’s design emphasis
• Optimized for range queries and range updates executed in parallel per table
• File-system-like operations, but with database-like properties such as atomicity and an isolation model for concurrent access
• Centrally managed service at Two Sigma
  • Higher expectations around reliability, availability, and multi-tenancy (security, access control, fair sharing of resources, etc.)
• Storage efficiency is also a major concern given the overall size of data stored
File system --------------- Smooth --------------- Database
6. Target Application characteristics
• Parallel time-partitioned jobs that move a lot of data
• Tend to be batch oriented; care more about throughput than latency
• New use cases are demanding better latency, smaller I/O, and more query power
• Not a good fit for workloads that require very low latencies or issue large numbers of small reads and writes
7. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
8. Data Model
• Tables with schema; mandatory time column
• Rows ordered and indexed by time
• Not relational – duplicate timestamps/rows are allowed; there is no notion of a primary key, but users can enforce PK constraints in their applications
• Easy to update schema
• Can store wide, sparse schemas efficiently
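To make the data model concrete, here is a minimal sketch of declaring such a table schema. This is a hypothetical API – the type and field names are assumptions for illustration, not Smooth's actual client interface.

import java.util.List;

// Hypothetical types – not Smooth's actual client API.
record Column(String name, String type) {}
record TableSchema(Column timeColumn, List<Column> valueColumns) {}

class SchemaExample {
    // A table of per-ticker prices: the time column is mandatory and rows are
    // ordered and indexed by it; duplicate timestamps are allowed.
    static final TableSchema PRICES = new TableSchema(
        new Column("time", "timestamp"),
        List.of(new Column("ticker", "string"),
                new Column("price", "double")));
}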
9. Write API
Updates a given time range atomically; the existing rows belonging to the range are replaced by the given set of new rows

WriteSession s = write(table, [10, 42));
s.addRow(<10, ..>);
s.addRow(<15, ..>);
// a repeated timestamp is ok
s.addRow(<15, ..>);
// error: rows must be added in non-decreasing time order
s.addRow(<10, ..>);
// error: rows must lie within the given time range [10, 42)
s.addRow(<50, ..>);
s.commit();
10. Write API
• Set of write operations to a table forms a total order; internally each write gets a unique, strictly monotonically increasing logical commit timestamp
• Distributed atomic writes are possible
• Delete is just a special case of update where no new rows are written (see the example below)
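For example, a range delete in the same pseudocode style as the Write API example above (illustrative only):

// Delete all rows in [10, 42): an update that adds no new rows.
WriteSession s = write(table, [10, 42));
s.commit();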
11. Read API
• Snapshot reads over a given time range
• Rows returned are based on the latest committed view of the table at the start of the read operation; the read remains isolated from concurrent writes

Iterator<Row> i = read(table, timeRange);
while (i.hasNext()) {
    doSomething(i.next());
}
12. Other Operations
• Some operations that are not officially supported but are a natural fit for Smooth
  • Distributed snapshot reads
  • Reads in the past, permanent snapshots
  • Atomic read-modify-write operations using optimistic concurrency control (OCC) on the commit time (see the sketch below)
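A sketch of how such an OCC read-modify-write could look, in the same pseudocode style as the earlier API examples. The latestCommitTime and commitIfUnchanged calls are hypothetical names used for illustration, not documented Smooth APIs.

boolean committed = false;
while (!committed) {
    long observed = latestCommitTime(table);        // hypothetical: latest commit timestamp
    Iterator<Row> i = read(table, [10, 42));
    WriteSession s = write(table, [10, 42));
    while (i.hasNext()) {
        s.addRow(modify(i.next()));
    }
    // hypothetical conditional commit: succeeds only if the table's latest
    // commit time is still `observed`, i.e. no concurrent write intervened
    committed = s.commitIfUnchanged(observed);
}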
13. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
14. Table Implementation
[Figure: a table laid out along two axes – the time column and the commit time. Shard 1 (committed at c1) and shard 2 (committed at c2) each cover a time range; the later shard overwrites the time range it covers. The metadata layer tracks the shards; the data layer holds the data files and their replicas.]
• A shard is the internal representation of an update operation; it is semantically immutable, i.e. it always returns the same set of rows, although the physical representation of the underlying data can change in format or storage location, or be replicated
• A data file contains the new set of ordered rows for a shard; it is immutable and indexed, and potentially replicated
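To make the metadata/data split concrete, here is a minimal sketch of what a shard record in the metadata layer might contain, based on the figure above. The class and field names are assumptions for illustration, not Smooth's actual schema.

// Hypothetical shard record in the metadata layer.
final class ShardRecord {
    final long tableId;          // table this shard belongs to
    final long startTime;        // inclusive start of the overwritten time range
    final long endTime;          // exclusive end of the overwritten time range
    final long commitTime;       // logical commit timestamp (total order per table)
    final String dataFileId;     // reference to the immutable data file holding the new rows
                                 // (the file itself may be replicated or moved between stores)

    ShardRecord(long tableId, long startTime, long endTime,
                long commitTime, String dataFileId) {
        this.tableId = tableId;
        this.startTime = startTime;
        this.endTime = endTime;
        this.commitTime = commitTime;
        this.dataFileId = dataFileId;
    }
}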
15. Read Algorithm
[Figure: shards 1–4 plotted along the time-column and commit-time axes, with the requested read range and the start-of-read commit time marked.]
• Reads are implemented by concatenating together the visible sub-ranges of overlapping shards – we call this the "read plan"
• The underlying data file per shard is ordered and indexed and can efficiently select the rows belonging to the visible sub-ranges
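A minimal sketch of the read-plan construction the figure implies, with assumed types and method names rather than Smooth's implementation: walk the shards committed before the read started, newest first, and keep only the parts of each shard's range not already hidden by a newer shard.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical read-plan construction (illustrative, not Smooth's code).
final class ReadPlanner {
    record Shard(long startTime, long endTime, long commitTime) {}      // [startTime, endTime)
    record Range(long start, long end) {}                               // [start, end)
    record Fragment(Shard shard, Range visible) {}

    static List<Fragment> plan(List<Shard> shards, Range query, long readStartCommitTime) {
        List<Fragment> plan = new ArrayList<>();
        List<Range> covered = new ArrayList<>();                        // ranges claimed by newer shards
        List<Shard> visibleShards = new ArrayList<>();
        for (Shard s : shards) {
            if (s.commitTime() <= readStartCommitTime) visibleShards.add(s);  // snapshot isolation
        }
        visibleShards.sort(Comparator.comparingLong(Shard::commitTime).reversed());
        for (Shard s : visibleShards) {
            Range r = clip(new Range(s.startTime(), s.endTime()), query);
            if (r == null) continue;
            for (Range vis : subtract(r, covered)) plan.add(new Fragment(s, vis));
            covered.add(r);                                             // this shard hides older data in its range
        }
        plan.sort(Comparator.comparingLong((Fragment f) -> f.visible().start()));  // concatenate by time
        return plan;
    }

    private static Range clip(Range r, Range q) {
        long s = Math.max(r.start(), q.start()), e = Math.min(r.end(), q.end());
        return s < e ? new Range(s, e) : null;
    }

    // Subtract every covered range from r, returning the pieces still visible.
    private static List<Range> subtract(Range r, List<Range> covered) {
        List<Range> remaining = new ArrayList<>(List.of(r));
        for (Range c : covered) {
            List<Range> next = new ArrayList<>();
            for (Range piece : remaining) {
                long s = Math.max(piece.start(), c.start()), e = Math.min(piece.end(), c.end());
                if (s >= e) { next.add(piece); continue; }              // no overlap with this covered range
                if (piece.start() < s) next.add(new Range(piece.start(), s));
                if (e < piece.end()) next.add(new Range(e, piece.end()));
            }
            remaining = next;
        }
        return remaining;
    }
}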
16. Data File format
The underlying data file is indexed using a simple two level static B+Tree
17. Data File format
A data file has one index block and individually compressed data blocks laid out contiguously
• A data block is the unit of read; variable sized and compressed; typically a small number of MBs; allows random access and parallelization
• We currently use lz4 for most of the files; very low overhead but still gives us about 2x compression on average; we have used gzip for some of the cold data files
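A minimal sketch of what the layout above implies, with illustrative field names rather than Smooth's actual on-disk format: the index block maps time ranges to the offsets of individually compressed data blocks, so a range read fetches and decompresses only the blocks it needs.

// Hypothetical index entry for one compressed data block (illustrative names;
// the relative placement of the index block within the file is also assumed).
final class DataFileIndexEntry {
    final long minTime;            // smallest row timestamp in the block
    final long maxTime;            // largest row timestamp in the block
    final long offset;             // byte offset of the compressed block within the file
    final int compressedLength;    // length of the compressed block (e.g. lz4 or gzip)

    DataFileIndexEntry(long minTime, long maxTime, long offset, int compressedLength) {
        this.minTime = minTime;
        this.maxTime = maxTime;
        this.offset = offset;
        this.compressedLength = compressedLength;
    }
}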
18. Compaction
Problem: overwrites of random time ranges and small writes
• Excessive fragmentation of the read plan; leads to slow reads and excessive seeks on the backend data stores, reducing overall serving capacity
• Metadata bloat; small shards/files mean larger metadata on Smooth and the object stores
• Garbage; data under hidden ranges can be garbage collected
19. Compaction Process
[Figure: shards 1–4 along the time-column and commit-time axes; the fragments they contribute to the read plan are rewritten as a single new compacted shard, and the old shards are deleted after the new shard is committed.]
• Compaction gets the read plan for the entire time range and finds areas with excessive fragmentation (many small fragments)
• It selects a contiguous segment of the read plan containing the fragments to be fixed and rewrites them as a single new shard – only contiguous fragments can be combined together
• The commit time of the new shard is the max of the participating input shards; this makes sure the compaction process does not interfere with ongoing writes
• The underlying data files of the deleted shards are not immediately removed, so that references from the read plans of ongoing reads remain valid
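A minimal sketch of the commit-time rule described above, with assumed types rather than Smooth's implementation: the compacted shard spans the contiguous fragments it replaces and takes the maximum commit time of its inputs, so it can never mask a write that commits while compaction is running.

import java.util.List;

// Hypothetical compaction step (illustrative, not Smooth's code).
final class Compactor {
    record Fragment(long startTime, long endTime, long shardCommitTime) {}  // visible piece of one shard
    record Shard(long startTime, long endTime, long commitTime) {}

    // Rewrite a contiguous run of read-plan fragments as a single new shard.
    static Shard compact(List<Fragment> contiguousFragments) {
        long start = contiguousFragments.stream().mapToLong(Fragment::startTime).min().orElseThrow();
        long end   = contiguousFragments.stream().mapToLong(Fragment::endTime).max().orElseThrow();
        // Commit time = max over the participating input shards, so the new shard
        // never hides data from a write that commits after compaction started.
        long commitTime = contiguousFragments.stream().mapToLong(Fragment::shardCommitTime).max().orElseThrow();
        // The caller would write the merged rows to a new data file, commit this shard,
        // then delete the old shards (keeping their data files until ongoing reads finish).
        return new Shard(start, end, commitTime);
    }
}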
20. Comparing with LSM
Similar to a Log-Structured Merge (LSM) tree
• The Smooth implementation is log structured
  • Immutable shards with embedded B-trees are similar to "sstables"
  • Both have compaction processes aimed at similar objectives
• Differs in the details – each shard carries with it a "bulk delete" tombstone whose handling is deferred until compaction time
  • The read algorithm is different – no row-level comparison for the "next" operation
• Key-value stores can use similar ideas to optimize bulk deletes
21. Write Amplification
• Write amplification = actual bytes written to storage / bytes written by user
• Has not been an issue in practice – less than 10 on average
• If the write workload gets more challenging (e.g. a higher rate of small random writes):
  • Use leveled compaction similar to traditional key-value LSM storage engines, by allowing non-contiguous shards to be combined – shards essentially get moved into data files
  • This would make our read algorithm more complex – we would need to merge read plans from all levels
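As a concrete illustration of the definition above (hypothetical numbers): if an application writes 1 TB of rows and compaction later rewrites that data three more times, the system has written 4 TB to storage in total, for a write amplification of 4 TB / 1 TB = 4 – comfortably under the observed average of 10.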
22. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
24. System Architecture
• All Smooth metadata is stored in Microsoft SQL Server, which is replicated to backup servers in a remote data center
• Stateless metadata servers front the database, providing functions like authorization, quota enforcement, and QoS (fair sharing of resources)
• Applications link with a Smooth client library in order to access Smooth
25. System Architecture
• Data files are stored in object stores
• Multiple different types of object stores can be plugged into Smooth and federated together for scaling, replicated across for geo-redundancy/availability, or used for storage tiering
• Currently we use HDFS for warm data and CELFS for cold data; CELFS is an internal archival file system at Two Sigma
26. Virtues of Immutability
• A design principle we have been using is immutability – both physical (write-once data files) and semantic (shards)
• The combination of linear metadata (i.e. strictly increasing commit timestamps) and immutable elements means that user reads and updates, the shard compaction process, and the physical data movement process can operate in parallel with no interference and with minimal coordination
• Data files can be cached without worrying about consistency
This simple model has been central to keeping the system simple, robust and scalable.
27. Some Statistics
• Multiple PBs of unique compressed data
• Read peaks in excess of 100 GB/s (before decompressing)
• 100s of millions of files/shards
• 10s of millions of tables
• 10s of thousands of concurrent requests
28. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
29. Looking Forward
• Multi-datacenter and public cloud read scaling
  • CDN-like distributed caching layer that spans even to sites that don't store data
  • Encryption at rest may be important for cloud use cases
• More cost-efficient multi-DC replication and cold data storage
  • Data stores that use erasure coding
  • More efficient data encoding and compression
  • Data stores that can replicate data across data centers and support desirable failover semantics
30. Looking Forward
• Performance
  • Performance consistency is a major concern – tail latencies are a major issue with HDFS
  • Issues with slow serialization and parsing of rows
• More challenging workloads
  • Interactive workloads are becoming common – latency sensitive
  • Column filtering
  • Complex read queries
31. Looking Forward
Complex queries
• It is common for time series datasets to have multiple sub-series merged together by time, like prices per stock ticker. The sub-series is typically identified by another column, whose cardinality is generally in the 10k to 20k range
• Example query: given an arbitrary subset of tickers and a time range, return all matching rows ordered by time
• In reality each ticker has its own time range, and there are several variations of this query
• Looking at new kinds of indexing
32. Looking Forward
• Moving away from a "thick" Smooth client
  • Enables quick iteration and bug fixes
  • Multi-language support
  • Opens up many architectural possibilities like caching, easier access control, QoS, etc.
• Various other reliability, multi-tenancy, metadata scaling, security and operability improvements