This presentation contextualizes the need for Apache Spark in Splice Machine and describes the main challenges of integrating it into an existing ACID, distributed database. We highlight novel contributions to the Spark/HBase ecosystem, such as hybrid scanners, a custom InputFormat, out-of-JVM compactions, and more. We end with some roadmap items under development involving new row-based and column-based storage encodings.
Splice Machine: Architecture of an Open Source RDBMS powered by HBase and Spark
1. Architecture of an Open Source RDBMS powered by HBase and Spark
January 12, 2017
Spark Meetup Barcelona
Daniel Gómez Ferro
2. ▪ Introduction to Splice Machine
▪ 1.x: the need for Spark
▪ 2.0: Spark introduction, challenges and wins
▪ Future
2
Agenda
3. ▪ Splice Machine
▪ Distributed database company
▪ Open source
▪ VC-backed
▪ Offices in San Francisco and St Louis (MO)
3
Who are we?
4. 4
What do we do?
The Open Source RDBMS Powered By Hadoop & Spark
SQL · Scale Out · Speed
▪ ANSI SQL: no retraining or rewrites for SQL-based analysts, reports, and applications
▪ Transactions: ensure reliable updates across multiple rows
▪ Mixed Workloads: simultaneously support OLTP and OLAP workloads
▪ ¼ the Cost: scales out on commodity hardware
▪ Elastic: increase scale in just a few minutes
▪ 10x Faster: leverages Spark in-memory technology
15. 15
HBase Architecture
▪ HRegion
▪ MemStore
▪ One or more HFiles
▪ HLog
▪ Writes
▪ Write the record to the HLog (write-ahead log)
▪ Add it to the MemStore
▪ When the MemStore gets big enough
▪ Flush: dump the MemStore into a new HFile
▪ Reads
▪ In parallel from the MemStore and all HFiles (toy model sketched below)
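To make the mechanics concrete, here is a toy model of that write/flush/read path in Scala. It is illustrative only (no versions, no deletes, no WAL replay), not HBase code:

import scala.collection.immutable.TreeMap
import scala.collection.mutable.ArrayBuffer

class ToyRegion(flushThreshold: Int = 1000) {
  private var memStore = TreeMap.empty[String, String]       // sorted, in-memory
  private var hfiles   = List.empty[TreeMap[String, String]] // immutable "HFiles", newest first
  private val hlog     = ArrayBuffer.empty[(String, String)] // stand-in for the WAL

  def put(key: String, value: String): Unit = {
    hlog += ((key, value))                 // durability: append to the log
    memStore += (key -> value)             // visibility: add to the MemStore
    if (memStore.size >= flushThreshold) flush()
  }

  // Flush: dump the MemStore into a new immutable "HFile"
  def flush(): Unit =
    if (memStore.nonEmpty) { hfiles = memStore :: hfiles; memStore = TreeMap.empty }

  // Read: consult the MemStore and all HFiles; the newest value wins
  def get(key: String): Option[String] =
    memStore.get(key).orElse(
      hfiles.iterator.map(_.get(key)).collectFirst { case Some(v) => v })
}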
16. ▪ We reused several Derby components
▪ JDBC driver
▪ SQL Parser/Planner/Optimizer
▪ In-memory data formats
▪ Bytecode generation
▪ Developed some custom solutions
▪ TEMP table for transient data (joins, aggregates, etc.)
▪ Task framework (using HBase’s coprocessors)
▪ Connection pooling
▪ Swapped Derby’s datastore for HBase
▪ Primary Keys and Indexes make use of HBase’s sort order (key encoding sketched below)
▪ Removed Derby’s assumptions about running on a single machine...
16
Derby - HBase integration
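A minimal sketch of why that sort order matters: HBase orders rows by raw bytes, so a composite primary key must be encoded so that byte order matches logical order. The encoding below is illustrative, not Splice Machine's actual format:

import org.apache.hadoop.hbase.util.Bytes

// Encode (customerId, orderDate) so that byte comparison order
// equals the logical (Long, String) order.
def encodeKey(customerId: Long, orderDate: String): Array[Byte] = {
  // Flip the sign bit so negative longs sort before positive ones
  val idBytes = Bytes.toBytes(customerId ^ Long.MinValue)
  Bytes.add(idBytes, Bytes.toBytes(orderDate)) // ISO dates already sort lexicographically
}

With such an encoding, a primary-key range scan or an index lookup becomes a contiguous HBase scan.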
18. ▪ Great for
▪ Operational workloads
▪ Replacing non-scalable RDBMS solutions
▪ SQL support
▪ SQL 99, Indexes, Triggers, Foreign Keys, cost-based optimizer...
▪ But...
▪ Struggled with analytical queries
▪ HBase’s compactions created instabilities
▪ Minimum latency was too high (due to the Task Framework)
18
Splice Machine 1.x
21. ▪ Challenging but natural
▪ Matched tree of database operators with RDD transformations
21
Spark Integration
Operator tree → RDD transformations (sketched below):
▪ Aggregate → reduceByKey()
▪ Join → join()
▪ Scan + Restriction → newAPIHadoopRDD() + filter()
▪ Scan → newAPIHadoopRDD()
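A rough sketch of that mapping in Scala. Here sc, confA, and confB are assumed to be in scope, and the key projection and predicate are invented; Splice generates such pipelines from the optimizer's plan rather than writing them by hand:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Scan: one HBase table scan per leaf of the plan
def scan(conf: Configuration) =
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    .map { case (k, _) => Bytes.toString(k.get) }          // project the row key

val left   = scan(confA).filter(_.startsWith("A"))         // Scan + Restriction -> filter()
val right  = scan(confB)                                   // Scan
val joined = left.map((_, ())).join(right.map((_, ())))    // Join -> join()
val counts = joined.mapValues(_ => 1L).reduceByKey(_ + _)  // Aggregate -> reduceByKey()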
22. 22
▪ Abstracted away the Spark API
▪ Two implementations (sketched below)
▪ In-memory using Guava’s FluentIterable APIs
▪ Distributed using Spark
▪ SQL operations have a single implementation
▪ In-memory use case:
▪ OLTP workloads
▪ Very low latency
▪ Bring data in, perform computation locally
▪ Anti-pattern in distributed systems, but it works
Unified API
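A minimal sketch of that abstraction, with invented names (DataSet, LocalDataSet, SparkDataSet); Splice's real local engine used Guava's FluentIterable rather than Scala collections:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// One API for SQL operators to program against...
trait DataSet[T] {
  def map[U: ClassTag](f: T => U): DataSet[U]
  def filter(p: T => Boolean): DataSet[T]
  def collect(): Seq[T]
}

// ...an in-memory engine for low-latency OLTP queries...
class LocalDataSet[T](data: Iterable[T]) extends DataSet[T] {
  def map[U: ClassTag](f: T => U) = new LocalDataSet(data.map(f))
  def filter(p: T => Boolean)     = new LocalDataSet(data.filter(p))
  def collect()                   = data.toSeq
}

// ...and a distributed engine for OLAP queries.
class SparkDataSet[T](rdd: RDD[T]) extends DataSet[T] {
  def map[U: ClassTag](f: T => U) = new SparkDataSet(rdd.map(f))
  def filter(p: T => Boolean)     = new SparkDataSet(rdd.filter(p))
  def collect()                   = rdd.collect().toSeq
}

Each SQL operator is then written once against DataSet[T], and the engine is chosen per query.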
23. ▪ Got rid of TEMP table
▪ Spark maintains temporary data in memory
▪ Got rid of Task Framework
▪ Spark performs the same job, less complexity
▪ Resource isolation
▪ HBase and Spark in separate processes
▪ Analytical queries have less impact on HBase stability
23
Spark Integration Benefits
25. ▪ Remove serialization boundaries
▪ Hybrid scanners:
▪ Custom InputFormat that reads HFiles directly from HDFS into Spark
▪ Merges those values with a fast scanner on the MemStore (merge step sketched below)
▪ Most data: HDFS -> Spark
▪ Small part: HBase (in-memory) -> Spark
▪ Requires some hooks in HBase
▪ Compactions remove HFiles that Spark might still be reading
▪ Flushes add HFiles
▪ Much better read performance
25
Solving: serialization boundaries
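The merge step could look like the following sketch, with keys simplified to strings and the HBase coordination hooks left out:

// Merge two key-sorted streams: bulk rows read from HFiles on HDFS and fresh
// rows from the MemStore. On a key collision the MemStore (newer) value wins.
def hybridScan(hfileRows: Iterator[(String, String)],
               memStoreRows: Iterator[(String, String)]): Iterator[(String, String)] = {
  val h = hfileRows.buffered
  val m = memStoreRows.buffered
  new Iterator[(String, String)] {
    def hasNext = h.hasNext || m.hasNext
    def next() =
      if (!m.hasNext) h.next()
      else if (!h.hasNext) m.next()
      else if (h.head._1 < m.head._1) h.next()
      else if (h.head._1 > m.head._1) m.next()
      else { h.next(); m.next() }          // same key: keep the MemStore version
  }
}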
26. ▪ Increase task granularity
▪ HTableInputFormat default is:
▪ 1 region = 1 partition
▪ Each region could be 1 GB or more
▪ SpliceInputFormat subdivides regions into blocks (32 MB by default; sketched below)
▪ Better parallelism
▪ Better performance
▪ This also needs hooks in HBase (coprocessors)
26
Solving: task granularity
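A sketch of the subdivision itself; the byte offsets are illustrative, since the real SpliceInputFormat derives boundaries from HFiles via coprocessor hooks:

val BlockSize = 32L * 1024 * 1024                    // 32 MB, the default

case class SubSplit(start: Long, end: Long)          // a slice of one region

def subdivide(regionSizeBytes: Long): Seq[SubSplit] =
  (0L until regionSizeBytes by BlockSize).map { s =>
    SubSplit(s, math.min(s + BlockSize, regionSizeBytes))
  }

subdivide(1L << 30).size   // a 1 GB region yields 32 partitions instead of 1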
27. ▪ Single shared Spark context
▪ JobServer wasn’t good enough
▪ It became a bottleneck: all results flowed through it
▪ Custom JobServer (called OLAPServer)
▪ Single Spark context on this server
▪ Currently colocated with the HMaster (fault tolerance for free)
▪ Makes Spark jobs stream results directly to the client (sketched below)
▪ Runs several partitions in parallel
▪ Starts streaming as soon as there’s data
27
Solving: multiple Spark contexts
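A sketch of the streaming idea, assuming rows can be rendered as strings and the client listens on a socket; host, port, and row format are invented:

import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.rdd.RDD

// Each task opens a connection and pushes rows as they are produced,
// so the client starts receiving data before any partition finishes.
def streamToClient(results: RDD[String], host: String, port: Int): Unit =
  results.foreachPartition { rows =>
    val socket = new Socket(host, port)
    val out = new PrintWriter(socket.getOutputStream, true)
    try rows.foreach(r => out.println(r))
    finally { out.close(); socket.close() }
  }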
28. 28
JobServer vs OLAPServer (timeline)
▪ JobServer: start partition 1 → next row, next row, … → end partition 1, send → start partition 2 → next row, next row, … → end partition 2, send
▪ Client: run partition 1 → get results → run partition 2 → get results
▪ While each partition runs, the client is blocked waiting for more data
29. 29
JobServer vs OLAPServer (timeline, continued)
▪ OLAPServer: start partitions 1, 2, 3 → get and send each row as it is produced → end partition 1 → start partition 4 → … → end partitions 2, 3 → start partitions 5, 6
▪ Client: run partitions 1, 2, 3 → get result, get result, … → run partition 4 → run partitions 5, 6
▪ Results stream to the client as soon as there is data, with several partitions running in parallel
31. ▪ Custom datatypes:
▪ Custom Kryo serializers for Derby objects
▪ Thread contexts
▪ Not completely solved
▪ Use TaskContext.addTaskCompletionListener() to clean up after ourselves (sketched below)
▪ Still finding resource leaks from time to time...
31
Solving: Derby legacy issues
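Two sketches of those techniques. The BigDecimal serializer stands in for Splice's serializers for Derby types, and releaseThreadContexts() is a placeholder for the actual cleanup:

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.TaskContext

// A custom Kryo serializer (illustrative; Splice registers these for
// Derby value types rather than java.math.BigDecimal)
class BigDecimalSerializer extends Serializer[java.math.BigDecimal] {
  override def write(kryo: Kryo, out: Output, v: java.math.BigDecimal): Unit =
    out.writeString(v.toPlainString)
  override def read(kryo: Kryo, in: Input, t: Class[java.math.BigDecimal]): java.math.BigDecimal =
    new java.math.BigDecimal(in.readString())
}

// Placeholder for releasing Derby thread-local contexts, closing scanners, etc.
def releaseThreadContexts(): Unit = ()

// Inside task code: register cleanup that runs when the task completes
def setupTaskCleanup(): Unit =
  TaskContext.get().addTaskCompletionListener { _ => releaseThreadContexts() }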
32. ▪ HBase compactions in Spark:
▪ HBase compactions can be expensive
▪ Reading and writing lots of data
▪ If they happen in the HBase JVM they can kill OLTP performance
▪ We made it possible to run them in Spark (sketched below)
▪ Maintaining data locality
▪ Scheduled among other jobs
▪ Falls back to HBase if the Spark scheduler doesn’t have resources
32
Other Spark goodies
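A sketch of the idea under heavy assumptions: compactFiles() is a placeholder for the actual HFile merge, and the resource-aware fallback is reduced to a plain failure handler:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Admin
import org.apache.spark.SparkContext

def compactFiles(files: Seq[String]): Unit = ???   // placeholder: merge HFiles into one

def compactRegion(sc: SparkContext, admin: Admin, hfiles: Seq[String],
                  regionHost: String, table: TableName): Unit =
  try {
    // makeRDD's (value, preferredLocations) form keeps the task data-local
    sc.makeRDD(Seq((hfiles, Seq(regionHost)))).foreach(compactFiles)
  } catch {
    case _: Exception => admin.majorCompact(table) // fall back to HBase's own compaction
  }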
33. 33
Other Spark goodies
▪ Integration with Spark Streaming:
▪ We can ingest data directly from Spark Streaming
▪ Easy to write to Splice Machine from Kafka this way (sketched below)
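A sketch of that ingestion path using the 0.8-era Spark Streaming Kafka receiver; the broker address, topic, table, and Splice JDBC URL are illustrative:

import java.sql.DriverManager
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(5))
val events = KafkaUtils.createStream(ssc, "zkhost:2181", "splice-ingest",
                                     Map("events" -> 1)).map(_._2)

events.foreachRDD { rdd =>
  rdd.foreachPartition { rows =>
    val conn = DriverManager.getConnection(
      "jdbc:splice://host:1527/splicedb;user=app;password=app")
    val ps = conn.prepareStatement("INSERT INTO EVENTS VALUES (?)")
    try rows.foreach { r => ps.setString(1, r); ps.executeUpdate() }
    finally { ps.close(); conn.close() }
  }
}
ssc.start()
ssc.awaitTermination()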
35. 35
▪ Move to DataFrame APIs
▪ Catalyst optimizer
▪ Whole stage code generation (better than Derby’s codegen)
▪ Already transitioned some operations
▪ Requires good UnsafeRow support
▪ UnsafeRow
▪ Compact in-memory representation
▪ Rows are serialized in a contiguous block of memory (sketched below)
▪ Better memory management
▪ Less GC time
Future Spark work
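For illustration, an UnsafeRow can be produced through Spark's internal Catalyst API (unstable internals, not a public interface):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

val toUnsafe = UnsafeProjection.create(schema)
val row = toUnsafe(InternalRow(42L, UTF8String.fromString("splice")))
// The whole row now lives in one contiguous byte region: cheap to copy,
// cheap to spill, and largely invisible to the garbage collector.
println(row.getSizeInBytes)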
37. 37
▪ Columnar storage format
▪ We already have ‘Pinned’ tables:
▪ Create Parquet snapshot of table
▪ Get columnar access
▪ Good for read-only data
▪ Planning to maintain a dual representation
▪ Row-oriented for recently written values
▪ Column-oriented for historical data
▪ Merge the two on the fly (sketched below)
Future Spark work
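A sketch of the merged read path, with the pin in Parquet and recent rows pulled over JDBC as a stand-in for Splice's internal row-store scanner (spark is an assumed SparkSession; paths and table names are invented):

// Columnar history: the Parquet "pin" of the table
val historical = spark.read.parquet("/splice/pins/SALES")

// Row-oriented recent data (illustrative: Splice would read this from HBase)
val recent = spark.read.format("jdbc")
  .option("url", "jdbc:splice://host:1527/splicedb")
  .option("dbtable", "SALES_RECENT")   // hypothetical view of recently written rows
  .load()

// Merge the two representations on the fly; the schemas must match
val sales = historical.union(recent)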
38. 38
▪ Better Spark shell integration
▪ Our SparkContext resides in the OLAPServer
▪ Getting data to a Spark shell incurs a serialization boundary
▪ From Splice’s SparkContext to the shell context
▪ We want to achieve transparent conversion
▪ ResultSet -> DataFrame (a JDBC-based workaround is sketched below)
Future Spark work
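Until then, a Spark shell can already pull a Splice table into a DataFrame over JDBC, at the cost of the serialization hop described above; the URL and driver class below are assumptions:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:splice://host:1527/splicedb;user=app;password=app")
  .option("driver", "com.splicemachine.db.jdbc.ClientDriver")  // assumed driver class
  .option("dbtable", "MYTABLE")
  .load()

df.printSchema()   // from here on it is a regular shell-side DataFrame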
39. 39
▪ Performance increases across the board
▪ TPC-C, TPC-H, Backup/Restore, ODBC driver…
▪ Incremental backup
▪ Native PL/SQL support (in Beta)
▪ No excuses left for not migrating those Oracle databases
▪ Client load balancing/failover
▪ Via HAProxy
▪ Statistics improvements
▪ Histograms, sketching libraries
▪ RDD caching (pinning)
2.5 Roadmap