This presentation contextualizes the need for Apache Spark in Splice Machine and describes the main challenges of integrating it into an existing ACID, distributed database. We highlight novel contributions to the Spark/HBase ecosystem, such as hybrid scanners, a custom InputFormat, out-of-JVM compactions, and more. We end with some roadmap items under development involving new row-based and column-based storage encodings.
Splice Machine: Architecture of an Open Source RDBMS powered by HBase and Spark
1. Architecture of an Open Source RDBMS powered by HBase and Spark
January 12, 2017
Spark Meetup Barcelona
Daniel Gómez Ferro
2. ▪ Introduction to Splice Machine
▪ 1.x: the need for Spark
▪ 2.0: Spark introduction, challenges and wins
▪ Future
2
Agenda
3. ▪ Splice Machine
▪ Distributed database company
▪ Open source
▪ VC-backed
▪ Offices in San Francisco and St Louis (MO)
3
Who are we?
4. 4
What do we do?
The Open Source RDBMS Powered By Hadoop & Spark
SQL · Scale Out · Speed
▪ ANSI SQL: no retraining or rewrites for SQL-based analysts, reports, and applications
▪ Transactions: ensure reliable updates across multiple rows
▪ Mixed Workloads: simultaneously support OLTP and OLAP workloads
▪ ¼ the Cost: scales out on commodity hardware
▪ Elastic: increase scale in just a few minutes
▪ 10x Faster: leverages Spark in-memory technology
15. 15
HBase Architecture
▪ HRegion
▪ MemStore
▪ One or more HFiles
▪ HLog
▪ Writes
▪ Write the record to the HLog (write-ahead log)
▪ Add it to the MemStore
▪ When the MemStore gets big enough
▪ Flush: dump the MemStore into a new HFile
▪ Reads
▪ In parallel from the MemStore and all HFiles (toy model sketched below)
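To make the mechanics concrete, here is a toy model of that write/flush/read path in Scala. It is illustrative only (no versions, no deletes, no WAL replay), not HBase code:

import scala.collection.immutable.TreeMap
import scala.collection.mutable.ArrayBuffer

class ToyRegion(flushThreshold: Int = 1000) {
  private var memStore = TreeMap.empty[String, String]       // sorted, in-memory
  private var hfiles   = List.empty[TreeMap[String, String]] // immutable "HFiles", newest first
  private val hlog     = ArrayBuffer.empty[(String, String)] // stand-in for the WAL

  def put(key: String, value: String): Unit = {
    hlog += ((key, value))                 // durability: append to the log
    memStore += (key -> value)             // visibility: add to the MemStore
    if (memStore.size >= flushThreshold) flush()
  }

  // Flush: dump the MemStore into a new immutable "HFile"
  def flush(): Unit =
    if (memStore.nonEmpty) { hfiles = memStore :: hfiles; memStore = TreeMap.empty }

  // Read: consult the MemStore and all HFiles; the newest value wins
  def get(key: String): Option[String] =
    memStore.get(key).orElse(
      hfiles.iterator.map(_.get(key)).collectFirst { case Some(v) => v })
}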
16. ▪ We reused several Derby components
▪ JDBC driver
▪ SQL Parser/Planner/Optimizer
▪ In-memory data formats
▪ Bytecode generation
▪ Developed some custom solutions
▪ TEMP table for transient data (joins, aggregates, etc.)
▪ Task framework (using HBase’s coprocessors)
▪ Connection pooling
▪ Swapped Derby’s datastore for HBase
▪ Primary Keys and Indexes make use of HBase’s sort order (key encoding sketched below)
▪ Removed Derby’s assumptions about running on a single machine...
16
Derby - HBase integration
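A minimal sketch of why that sort order matters: HBase orders rows by raw bytes, so a composite primary key must be encoded so that byte order matches logical order. The encoding below is illustrative, not Splice Machine's actual format:

import org.apache.hadoop.hbase.util.Bytes

// Encode (customerId, orderDate) so that byte comparison order
// equals the logical (Long, String) order.
def encodeKey(customerId: Long, orderDate: String): Array[Byte] = {
  // Flip the sign bit so negative longs sort before positive ones
  val idBytes = Bytes.toBytes(customerId ^ Long.MinValue)
  Bytes.add(idBytes, Bytes.toBytes(orderDate)) // ISO dates already sort lexicographically
}

With such an encoding, a primary-key range scan or an index lookup becomes a contiguous HBase scan.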
18. ▪ Great for
▪ Operational workloads
▪ Replacing non-scalable RDBMS solutions
▪ SQL support
▪ SQL 99, Indexes, Triggers, Foreign Keys, cost-based optimizer...
▪ But...
▪ Struggled with analytical queries
▪ HBase’s compactions created instabilities
▪ Minimum latency was too high (due to the Task Framework)
18
Splice Machine 1.x
21. ▪ Challenging but natural
▪ Matched tree of database operators with RDD transformations
21
Spark Integration
Operator tree → RDD transformations (sketched below):
▪ Aggregate → reduceByKey()
▪ Join → join()
▪ Scan + Restriction → newAPIHadoopRDD() + filter()
▪ Scan → newAPIHadoopRDD()
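A rough sketch of that mapping in Scala. Here sc, confA, and confB are assumed to be in scope, and the key projection and predicate are invented; Splice generates such pipelines from the optimizer's plan rather than writing them by hand:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Scan: one HBase table scan per leaf of the plan
def scan(conf: Configuration) =
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    .map { case (k, _) => Bytes.toString(k.get) }          // project the row key

val left   = scan(confA).filter(_.startsWith("A"))         // Scan + Restriction -> filter()
val right  = scan(confB)                                   // Scan
val joined = left.map((_, ())).join(right.map((_, ())))    // Join -> join()
val counts = joined.mapValues(_ => 1L).reduceByKey(_ + _)  // Aggregate -> reduceByKey()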
22. 22
▪ Abstracted away the Spark API
▪ Two implementations (sketched below)
▪ In-memory using Guava’s FluentIterable APIs
▪ Distributed using Spark
▪ SQL operations have a single implementation
▪ In-memory use case:
▪ OLTP workloads
▪ Very low latency
▪ Bring data in, perform computation locally
▪ Anti-pattern in distributed systems, but it works
Unified API
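A minimal sketch of that abstraction, with invented names (DataSet, LocalDataSet, SparkDataSet); Splice's real local engine used Guava's FluentIterable rather than Scala collections:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// One API for SQL operators to program against...
trait DataSet[T] {
  def map[U: ClassTag](f: T => U): DataSet[U]
  def filter(p: T => Boolean): DataSet[T]
  def collect(): Seq[T]
}

// ...an in-memory engine for low-latency OLTP queries...
class LocalDataSet[T](data: Iterable[T]) extends DataSet[T] {
  def map[U: ClassTag](f: T => U) = new LocalDataSet(data.map(f))
  def filter(p: T => Boolean)     = new LocalDataSet(data.filter(p))
  def collect()                   = data.toSeq
}

// ...and a distributed engine for OLAP queries.
class SparkDataSet[T](rdd: RDD[T]) extends DataSet[T] {
  def map[U: ClassTag](f: T => U) = new SparkDataSet(rdd.map(f))
  def filter(p: T => Boolean)     = new SparkDataSet(rdd.filter(p))
  def collect()                   = rdd.collect().toSeq
}

Each SQL operator is then written once against DataSet[T], and the engine is chosen per query.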
23. ▪ Got rid of TEMP table
▪ Spark maintains temporary data in memory
▪ Got rid of Task Framework
▪ Spark performs the same job, less complexity
▪ Resource isolation
▪ HBase and Spark in separate processes
▪ Analytical queries have less impact on HBase stability
23
Spark Integration Benefits
25. ▪ Remove serialization boundaries
▪ Hybrid scanners:
▪ Custom InputFormat that reads HFiles directly from HDFS into Spark
▪ Merges those values with a fast scanner on the MemStore (merge step sketched below)
▪ Most data: HDFS -> Spark
▪ Small part: HBase (in-memory) -> Spark
▪ Requires some hooks in HBase
▪ Compactions remove HFiles that Spark might still be reading
▪ Flushes add HFiles
▪ Much better read performance
25
Solving: serialization boundaries
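The merge step could look like the following sketch, with keys simplified to strings and the HBase coordination hooks left out:

// Merge two key-sorted streams: bulk rows read from HFiles on HDFS and fresh
// rows from the MemStore. On a key collision the MemStore (newer) value wins.
def hybridScan(hfileRows: Iterator[(String, String)],
               memStoreRows: Iterator[(String, String)]): Iterator[(String, String)] = {
  val h = hfileRows.buffered
  val m = memStoreRows.buffered
  new Iterator[(String, String)] {
    def hasNext = h.hasNext || m.hasNext
    def next() =
      if (!m.hasNext) h.next()
      else if (!h.hasNext) m.next()
      else if (h.head._1 < m.head._1) h.next()
      else if (h.head._1 > m.head._1) m.next()
      else { h.next(); m.next() }          // same key: keep the MemStore version
  }
}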
26. ▪ Increase task granularity
▪ HTableInputFormat default is:
▪ 1 region = 1 partition
▪ Each region could be 1 GB or more
▪ SpliceInputFormat subdivides regions into blocks (32 MB by default; sketched below)
▪ Better parallelism
▪ Better performance
▪ This also needs hooks in HBase (coprocessors)
26
Solving: task granularity
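A sketch of the subdivision itself; the byte offsets are illustrative, since the real SpliceInputFormat derives boundaries from HFiles via coprocessor hooks:

val BlockSize = 32L * 1024 * 1024                    // 32 MB, the default

case class SubSplit(start: Long, end: Long)          // a slice of one region

def subdivide(regionSizeBytes: Long): Seq[SubSplit] =
  (0L until regionSizeBytes by BlockSize).map { s =>
    SubSplit(s, math.min(s + BlockSize, regionSizeBytes))
  }

subdivide(1L << 30).size   // a 1 GB region yields 32 partitions instead of 1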
27. ▪ Single shared Spark context
▪ JobServer wasn’t good enough
▪ It became a bottleneck: all results flowed through it
▪ Custom JobServer (called OLAPServer)
▪ Single Spark context on this server
▪ Currently colocated with the HMaster (fault tolerance for free)
▪ Makes Spark jobs stream results directly to the client (sketched below)
▪ Runs several partitions in parallel
▪ Starts streaming as soon as there’s data
27
Solving: multiple Spark contexts
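A sketch of the streaming idea, assuming rows can be rendered as strings and the client listens on a socket; host, port, and row format are invented:

import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.rdd.RDD

// Each task opens a connection and pushes rows as they are produced,
// so the client starts receiving data before any partition finishes.
def streamToClient(results: RDD[String], host: String, port: Int): Unit =
  results.foreachPartition { rows =>
    val socket = new Socket(host, port)
    val out = new PrintWriter(socket.getOutputStream, true)
    try rows.foreach(r => out.println(r))
    finally { out.close(); socket.close() }
  }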
28. 28
JobServer vs OLAPServer (timeline)
▪ JobServer: start partition 1 → next row, next row, … → end partition 1, send → start partition 2 → next row, next row, … → end partition 2, send
▪ Client: run partition 1 → get results → run partition 2 → get results
▪ While each partition runs, the client is blocked waiting for more data
29. 29
JobServer vs OLAPServer (timeline, continued)
▪ OLAPServer: start partitions 1, 2, 3 → get and send each row as it is produced → end partition 1 → start partition 4 → … → end partitions 2, 3 → start partitions 5, 6
▪ Client: run partitions 1, 2, 3 → get result, get result, … → run partition 4 → run partitions 5, 6
▪ Results stream to the client as soon as there is data, with several partitions running in parallel
31. ▪ Custom datatypes:
▪ Custom Kryo serializers for Derby objects
▪ Thread contexts
▪ Not completely solved
▪ Use TaskContext.addTaskCompletionListener() to clean up after ourselves (sketched below)
▪ Still finding resource leaks from time to time...
31
Solving: Derby legacy issues
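Two sketches of those techniques. The BigDecimal serializer stands in for Splice's serializers for Derby types, and releaseThreadContexts() is a placeholder for the actual cleanup:

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.TaskContext

// A custom Kryo serializer (illustrative; Splice registers these for
// Derby value types rather than java.math.BigDecimal)
class BigDecimalSerializer extends Serializer[java.math.BigDecimal] {
  override def write(kryo: Kryo, out: Output, v: java.math.BigDecimal): Unit =
    out.writeString(v.toPlainString)
  override def read(kryo: Kryo, in: Input, t: Class[java.math.BigDecimal]): java.math.BigDecimal =
    new java.math.BigDecimal(in.readString())
}

// Placeholder for releasing Derby thread-local contexts, closing scanners, etc.
def releaseThreadContexts(): Unit = ()

// Inside task code: register cleanup that runs when the task completes
def setupTaskCleanup(): Unit =
  TaskContext.get().addTaskCompletionListener { _ => releaseThreadContexts() }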
32. ▪ HBase compactions in Spark:
▪ HBase compactions can be expensive
▪ Reading and writing lots of data
▪ If they happen in the HBase JVM they can kill OLTP performance
▪ We made it possible to run them in Spark (sketched below)
▪ Maintaining data locality
▪ Scheduled among other jobs
▪ Falls back to HBase if the Spark scheduler doesn’t have resources
32
Other Spark goodies
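A sketch of the idea under heavy assumptions: compactFiles() is a placeholder for the actual HFile merge, and the resource-aware fallback is reduced to a plain failure handler:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Admin
import org.apache.spark.SparkContext

def compactFiles(files: Seq[String]): Unit = ???   // placeholder: merge HFiles into one

def compactRegion(sc: SparkContext, admin: Admin, hfiles: Seq[String],
                  regionHost: String, table: TableName): Unit =
  try {
    // makeRDD's (value, preferredLocations) form keeps the task data-local
    sc.makeRDD(Seq((hfiles, Seq(regionHost)))).foreach(compactFiles)
  } catch {
    case _: Exception => admin.majorCompact(table) // fall back to HBase's own compaction
  }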
33. 33
Other Spark goodies
▪ Integration with Spark Streaming:
▪ We can ingest data directly from Spark Streaming
▪ Easy to write to Splice Machine from Kafka this way (sketched below)
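A sketch of that ingestion path using the 0.8-era Spark Streaming Kafka receiver; the broker address, topic, table, and Splice JDBC URL are illustrative:

import java.sql.DriverManager
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(5))
val events = KafkaUtils.createStream(ssc, "zkhost:2181", "splice-ingest",
                                     Map("events" -> 1)).map(_._2)

events.foreachRDD { rdd =>
  rdd.foreachPartition { rows =>
    val conn = DriverManager.getConnection(
      "jdbc:splice://host:1527/splicedb;user=app;password=app")
    val ps = conn.prepareStatement("INSERT INTO EVENTS VALUES (?)")
    try rows.foreach { r => ps.setString(1, r); ps.executeUpdate() }
    finally { ps.close(); conn.close() }
  }
}
ssc.start()
ssc.awaitTermination()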
35. 35
▪ Move to DataFrame APIs
▪ Catalyst optimizer
▪ Whole stage code generation (better than Derby’s codegen)
▪ Already transitioned some operations
▪ Requires good UnsafeRow support
▪ UnsafeRow
▪ Compact in-memory representation
▪ Rows are serialized in a contiguous block of memory (sketched below)
▪ Better memory management
▪ Less GC time
Future Spark work
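For illustration, an UnsafeRow can be produced through Spark's internal Catalyst API (unstable internals, not a public interface):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

val toUnsafe = UnsafeProjection.create(schema)
val row = toUnsafe(InternalRow(42L, UTF8String.fromString("splice")))
// The whole row now lives in one contiguous byte region: cheap to copy,
// cheap to spill, and largely invisible to the garbage collector.
println(row.getSizeInBytes)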
37. 37
▪ Columnar storage format
▪ We already have ‘Pinned’ tables:
▪ Create Parquet snapshot of table
▪ Get columnar access
▪ Good for read-only data
▪ Planning to maintain a dual representation
▪ Row-oriented for recently written values
▪ Column-oriented for historical data
▪ Merge the two on the fly (sketched below)
Future Spark work
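A sketch of the merged read path, with the pin in Parquet and recent rows pulled over JDBC as a stand-in for Splice's internal row-store scanner (spark is an assumed SparkSession; paths and table names are invented):

// Columnar history: the Parquet "pin" of the table
val historical = spark.read.parquet("/splice/pins/SALES")

// Row-oriented recent data (illustrative: Splice would read this from HBase)
val recent = spark.read.format("jdbc")
  .option("url", "jdbc:splice://host:1527/splicedb")
  .option("dbtable", "SALES_RECENT")   // hypothetical view of recently written rows
  .load()

// Merge the two representations on the fly; the schemas must match
val sales = historical.union(recent)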
38. 38
▪ Better Spark shell integration
▪ Our SparkContext resides in the OLAPServer
▪ Getting data to a Spark shell incurs a serialization boundary
▪ From Splice’s SparkContext to the shell context
▪ We want to achieve transparent conversion
▪ ResultSet -> DataFrame (a JDBC-based workaround is sketched below)
Future Spark work
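Until then, a Spark shell can already pull a Splice table into a DataFrame over JDBC, at the cost of the serialization hop described above; the URL and driver class below are assumptions:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:splice://host:1527/splicedb;user=app;password=app")
  .option("driver", "com.splicemachine.db.jdbc.ClientDriver")  // assumed driver class
  .option("dbtable", "MYTABLE")
  .load()

df.printSchema()   // from here on it is a regular shell-side DataFrame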
39. 39
▪ Performance increases across the board
▪ TPC-C, TPC-H, Backup/Restore, ODBC driver…
▪ Incremental backup
▪ Native PL/SQL support (in Beta)
▪ No excuses left for not migrating those Oracle databases
▪ Client load balancing/failover
▪ Via HAProxy
▪ Statistics improvements
▪ Histograms, sketching libraries
▪ RDD caching (pinning)
2.5 Roadmap