Smooth is a distributed storage system for managing structured time series data at Two Sigma. Smooth’s design emphasizes scale (both in data size and aggregate request bandwidth), reliability, and storage efficiency. It is optimized for large parallel streaming read/write accesses over specified time ranges. Smooth has a clear separation between the metadata and data layers, and supports multiple pluggable object stores for storing data files. Data can be replicated or moved between different stores and data centers to support availability, performance, and storage-tiering objectives. Smooth is widely used at Two Sigma by various applications, including modeling research workflows, data pipelines, and data analysis jobs. Smooth has been in development for about 5 years, currently stores multiple PBs of compressed data, and serves peak aggregate throughput in excess of 100 GB/s. In this talk I will discuss the design and implementation of Smooth, our experience running it over the past two years, ongoing challenges, and future directions.
Smooth Storage - A distributed storage system for managing structured time series data at Two Sigma
1. www.twosigma.com
Smooth Storage
September 13, 2018
A storage system for managing structured time
series data at Two Sigma
Saurabh Goel
saurabh.goel@twosigma.com
2. Disclaimer
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer
to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon
for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates
(collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without
notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of
such information and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two
Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark
does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
3. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
4. Motivation
• Why have specialized storage for time series data?
  • Extremely common at Two Sigma
  • Time is one of the primary dimensions along which applications want to partition and filter data
  • Scale – in terms of both size and access
  • Optimizing for the target application workload and requirements
5. Smooth’s design emphasis
• Optimized for range queries and range updates executed in parallel per table
• File-system-like operations, but with database-like properties such as atomicity and an isolation model for concurrent access
• Centrally managed service at Two Sigma
  • Higher expectations around reliability, availability, and multi-tenancy (security, access control, fair sharing of resources, etc.)
• Storage efficiency is also a major concern given the overall size of data stored
File system --------------- Smooth --------------- Database
6. Target Application characteristics
• Parallel time-partitioned jobs that move a lot of data
• Tend to be batch oriented; care more about throughput than latency
• New use cases are demanding better latency, smaller I/O, and more query power
• Not a good fit for workloads that require very low latencies or issue large numbers of small reads and writes
7. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
8. Data Model
• Tables with schema; mandatory time column
• Rows ordered and indexed by time
• Not relational – duplicate timestamps/rows are allowed; there is no notion of a primary key, but users can enforce PK constraints in their applications
• Easy to update schema
• Can store wide, sparse schemas efficiently
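To make the data model concrete, here is a minimal sketch of declaring such a table schema. This is a hypothetical API – the type and field names are assumptions for illustration, not Smooth's actual client interface.

import java.util.List;

// Hypothetical types – not Smooth's actual client API.
record Column(String name, String type) {}
record TableSchema(Column timeColumn, List<Column> valueColumns) {}

class SchemaExample {
    // A table of per-ticker prices: the time column is mandatory and rows are
    // ordered and indexed by it; duplicate timestamps are allowed.
    static final TableSchema PRICES = new TableSchema(
        new Column("time", "timestamp"),
        List.of(new Column("ticker", "string"),
                new Column("price", "double")));
}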
9. Write API
Updates a given time range atomically; the existing rows belonging to the range are replaced by the given set of new rows

WriteSession s = write(table, [10, 42));
s.addRow(<10, ..>);
s.addRow(<15, ..>);
// a repeated timestamp is ok
s.addRow(<15, ..>);
// error: rows must be added in non-decreasing time order
s.addRow(<10, ..>);
// error: rows must lie within the given time range [10, 42)
s.addRow(<50, ..>);
s.commit();
10. Write API
• Set of write operations to a table forms a total order; internally each write gets a unique, strictly monotonically increasing logical commit timestamp
• Distributed atomic writes are possible
• Delete is just a special case of update where no new rows are written (see the example below)
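For example, a range delete in the same pseudocode style as the Write API example above (illustrative only):

// Delete all rows in [10, 42): an update that adds no new rows.
WriteSession s = write(table, [10, 42));
s.commit();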
11. Read API
• Snapshot reads over a given time range
• Rows returned are based on the latest committed view of the table at the start of the read operation; the read remains isolated from concurrent writes

Iterator<Row> i = read(table, timeRange);
while (i.hasNext()) {
    doSomething(i.next());
}
12. Other Operations
• Some operations that are not officially supported but are a natural fit for Smooth
  • Distributed snapshot reads
  • Reads in the past, permanent snapshots
  • Atomic read-modify-write operations using optimistic concurrency control (OCC) on the commit time (see the sketch below)
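A sketch of how such an OCC read-modify-write could look, in the same pseudocode style as the earlier API examples. The latestCommitTime and commitIfUnchanged calls are hypothetical names used for illustration, not documented Smooth APIs.

boolean committed = false;
while (!committed) {
    long observed = latestCommitTime(table);        // hypothetical: latest commit timestamp
    Iterator<Row> i = read(table, [10, 42));
    WriteSession s = write(table, [10, 42));
    while (i.hasNext()) {
        s.addRow(modify(i.next()));
    }
    // hypothetical conditional commit: succeeds only if the table's latest
    // commit time is still `observed`, i.e. no concurrent write intervened
    committed = s.commitIfUnchanged(observed);
}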
13. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
14. Table Implementation
[Figure: a table laid out along two axes – the time column and the commit time. Shard 1 (committed at c1) and shard 2 (committed at c2) each cover a time range; the later shard overwrites the time range it covers. The metadata layer tracks the shards; the data layer holds the data files and their replicas.]
• A shard is the internal representation of an update operation; it is semantically immutable, i.e. it always returns the same set of rows, although the physical representation of the underlying data can change in format or storage location, or be replicated
• A data file contains the new set of ordered rows for a shard; it is immutable and indexed, and potentially replicated
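To make the metadata/data split concrete, here is a minimal sketch of what a shard record in the metadata layer might contain, based on the figure above. The class and field names are assumptions for illustration, not Smooth's actual schema.

// Hypothetical shard record in the metadata layer.
final class ShardRecord {
    final long tableId;          // table this shard belongs to
    final long startTime;        // inclusive start of the overwritten time range
    final long endTime;          // exclusive end of the overwritten time range
    final long commitTime;       // logical commit timestamp (total order per table)
    final String dataFileId;     // reference to the immutable data file holding the new rows
                                 // (the file itself may be replicated or moved between stores)

    ShardRecord(long tableId, long startTime, long endTime,
                long commitTime, String dataFileId) {
        this.tableId = tableId;
        this.startTime = startTime;
        this.endTime = endTime;
        this.commitTime = commitTime;
        this.dataFileId = dataFileId;
    }
}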
15. Read Algorithm
[Figure: shards 1–4 plotted along the time-column and commit-time axes, with the requested read range and the start-of-read commit time marked.]
• Reads are implemented by concatenating together the visible sub-ranges of overlapping shards – we call this the "read plan"
• The underlying data file per shard is ordered and indexed and can efficiently select the rows belonging to the visible sub-ranges
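A minimal sketch of the read-plan construction the figure implies, with assumed types and method names rather than Smooth's implementation: walk the shards committed before the read started, newest first, and keep only the parts of each shard's range not already hidden by a newer shard.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical read-plan construction (illustrative, not Smooth's code).
final class ReadPlanner {
    record Shard(long startTime, long endTime, long commitTime) {}      // [startTime, endTime)
    record Range(long start, long end) {}                               // [start, end)
    record Fragment(Shard shard, Range visible) {}

    static List<Fragment> plan(List<Shard> shards, Range query, long readStartCommitTime) {
        List<Fragment> plan = new ArrayList<>();
        List<Range> covered = new ArrayList<>();                        // ranges claimed by newer shards
        List<Shard> visibleShards = new ArrayList<>();
        for (Shard s : shards) {
            if (s.commitTime() <= readStartCommitTime) visibleShards.add(s);  // snapshot isolation
        }
        visibleShards.sort(Comparator.comparingLong(Shard::commitTime).reversed());
        for (Shard s : visibleShards) {
            Range r = clip(new Range(s.startTime(), s.endTime()), query);
            if (r == null) continue;
            for (Range vis : subtract(r, covered)) plan.add(new Fragment(s, vis));
            covered.add(r);                                             // this shard hides older data in its range
        }
        plan.sort(Comparator.comparingLong((Fragment f) -> f.visible().start()));  // concatenate by time
        return plan;
    }

    private static Range clip(Range r, Range q) {
        long s = Math.max(r.start(), q.start()), e = Math.min(r.end(), q.end());
        return s < e ? new Range(s, e) : null;
    }

    // Subtract every covered range from r, returning the pieces still visible.
    private static List<Range> subtract(Range r, List<Range> covered) {
        List<Range> remaining = new ArrayList<>(List.of(r));
        for (Range c : covered) {
            List<Range> next = new ArrayList<>();
            for (Range piece : remaining) {
                long s = Math.max(piece.start(), c.start()), e = Math.min(piece.end(), c.end());
                if (s >= e) { next.add(piece); continue; }              // no overlap with this covered range
                if (piece.start() < s) next.add(new Range(piece.start(), s));
                if (e < piece.end()) next.add(new Range(e, piece.end()));
            }
            remaining = next;
        }
        return remaining;
    }
}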
16. Data File format
The underlying data file is indexed using a simple two level static B+Tree
17. Data File format
A data file has one index block and individually compressed data blocks laid out contiguously
• A data block is the unit of read; variable sized and compressed; typically a small number of MBs; allows random access and parallelization
• We currently use lz4 for most of the files; very low overhead but still gives us about 2x compression on average; we have used gzip for some of the cold data files
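A minimal sketch of what the layout above implies, with illustrative field names rather than Smooth's actual on-disk format: the index block maps time ranges to the offsets of individually compressed data blocks, so a range read fetches and decompresses only the blocks it needs.

// Hypothetical index entry for one compressed data block (illustrative names;
// the relative placement of the index block within the file is also assumed).
final class DataFileIndexEntry {
    final long minTime;            // smallest row timestamp in the block
    final long maxTime;            // largest row timestamp in the block
    final long offset;             // byte offset of the compressed block within the file
    final int compressedLength;    // length of the compressed block (e.g. lz4 or gzip)

    DataFileIndexEntry(long minTime, long maxTime, long offset, int compressedLength) {
        this.minTime = minTime;
        this.maxTime = maxTime;
        this.offset = offset;
        this.compressedLength = compressedLength;
    }
}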
18. Compaction
Problem: overwrites of random time ranges and small writes
• Excessive fragmentation of the read plan; leads to slow reads and excessive seeks on the backend data stores, reducing overall serving capacity
• Metadata bloat; small shards/files mean larger metadata on Smooth and the object stores
• Garbage; data under hidden ranges can be garbage collected
19. Compaction Process
[Figure: shards 1–4 along the time-column and commit-time axes; the fragments they contribute to the read plan are rewritten as a single new compacted shard, and the old shards are deleted after the new shard is committed.]
• Compaction gets the read plan for the entire time range and finds areas with excessive fragmentation (many small fragments)
• It selects a contiguous segment of the read plan containing the fragments to be fixed and rewrites them as a single new shard – only contiguous fragments can be combined together
• The commit time of the new shard is the max of the participating input shards; this makes sure the compaction process does not interfere with ongoing writes
• The underlying data files of the deleted shards are not immediately removed, so that references from the read plans of ongoing reads remain valid
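A minimal sketch of the commit-time rule described above, with assumed types rather than Smooth's implementation: the compacted shard spans the contiguous fragments it replaces and takes the maximum commit time of its inputs, so it can never mask a write that commits while compaction is running.

import java.util.List;

// Hypothetical compaction step (illustrative, not Smooth's code).
final class Compactor {
    record Fragment(long startTime, long endTime, long shardCommitTime) {}  // visible piece of one shard
    record Shard(long startTime, long endTime, long commitTime) {}

    // Rewrite a contiguous run of read-plan fragments as a single new shard.
    static Shard compact(List<Fragment> contiguousFragments) {
        long start = contiguousFragments.stream().mapToLong(Fragment::startTime).min().orElseThrow();
        long end   = contiguousFragments.stream().mapToLong(Fragment::endTime).max().orElseThrow();
        // Commit time = max over the participating input shards, so the new shard
        // never hides data from a write that commits after compaction started.
        long commitTime = contiguousFragments.stream().mapToLong(Fragment::shardCommitTime).max().orElseThrow();
        // The caller would write the merged rows to a new data file, commit this shard,
        // then delete the old shards (keeping their data files until ongoing reads finish).
        return new Shard(start, end, commitTime);
    }
}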
20. Comparing with LSM
Similar to a Log-Structured Merge (LSM) tree
• The Smooth implementation is log structured
  • Immutable shards with embedded B-trees are similar to "sstables"
  • Both have compaction processes aimed at similar objectives
• Differs in the details – each shard carries with it a "bulk delete" tombstone whose handling is deferred until compaction time
  • The read algorithm is different – no row-level comparison for the "next" operation
• Key-value stores can use similar ideas to optimize bulk deletes
21. Write Amplification
• Write amplification = actual bytes written to storage / bytes written by user
• Has not been an issue in practice – less than 10 on average
• If the write workload gets more challenging (e.g. a higher rate of small random writes):
  • Use leveled compaction similar to traditional key-value LSM storage engines, by allowing non-contiguous shards to be combined – shards essentially get moved into data files
  • This would make our read algorithm more complex – we would need to merge read plans from all levels
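As a concrete illustration of the definition above (hypothetical numbers): if an application writes 1 TB of rows and compaction later rewrites that data three more times, the system has written 4 TB to storage in total, for a write amplification of 4 TB / 1 TB = 4 – comfortably under the observed average of 10.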
22. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
24. System Architecture
• All Smooth metadata is stored in Microsoft SQL Server, which is replicated to backup servers in a remote data center
• Stateless metadata servers front the database, providing functions like authorization, quota enforcement, and QoS (fair sharing of resources)
• Applications link with a Smooth client library in order to access Smooth
25. System Architecture
• Data files are stored in object stores
• Multiple different types of object stores can be plugged into Smooth and federated together for scaling, replicated across for geo-redundancy/availability, or used for storage tiering
• Currently we use HDFS for warm data and CELFS for cold data; CELFS is an internal archival file system at Two Sigma
26. Virtues of Immutability
• A design principle we have been using is immutability – both physical (write-once data files) and semantic (shards)
• The combination of linear metadata (i.e. strictly increasing commit timestamps) and immutable elements means that user reads and updates, the shard compaction process, and the physical data movement process can operate in parallel with no interference and with minimal coordination
• Data files can be cached without worrying about consistency
This simple model has been central to keeping the system simple, robust and scalable.
27. Some Statistics
• Multiple PBs of unique compressed data
• Read peaks in excess of 100 GB/s (before decompressing)
• 100s of millions of files/shards
• 10s of millions of tables
• 10s of thousands of concurrent requests
28. Outline
• Motivation and design emphasis
• Data Model and API
• Implementation of the data model
• System Architecture
• Looking Forward
29. Looking Forward
• Multi-datacenter and public cloud read scaling
  • CDN-like distributed caching layer that spans even to sites that don't store data
  • Encryption at rest may be important for cloud use cases
• More cost-efficient multi-DC replication and cold data storage
  • Data stores that use erasure coding
  • More efficient data encoding and compression
  • Data stores that can replicate data across data centers and support desirable failover semantics
30. Looking Forward
• Performance
  • Performance consistency is a major concern – tail latencies are a major issue with HDFS
  • Issues with slow serialization and parsing of rows
• More challenging workloads
  • Interactive workloads are becoming common – latency sensitive
  • Column filtering
  • Complex read queries
31. Looking Forward
Complex queries
• It is common for time series datasets to have multiple sub-series merged together by time, like prices per stock ticker. The sub-series is typically identified by another column, whose cardinality is generally in the 10k to 20k range
• Example query: given an arbitrary subset of tickers and a time range, return all matching rows ordered by time
• In reality each ticker has its own time range, and there are several variations of this query
• Looking at new kinds of indexing
32. Looking Forward
• Moving away from a "thick" Smooth client
  • Enables quick iteration and bug fixes
  • Multi-language support
  • Opens up many architectural possibilities like caching, easier access control, QoS, etc.
• Various other reliability, multi-tenancy, metadata scaling, security and operability improvements