Splice Machine is an open-source database that combines the benefits of modern lambda architectures with the full expressiveness of ANSI SQL. Like lambda architectures, it employs separate compute engines for different workloads; some call this an HTAP database (Hybrid Transactional/Analytical Processing). This talk describes the architecture and implementation of Splice Machine V2.0. The system is powered by a sharded key-value store (Apache HBase) for fast short reads, writes, and short range scans, and an in-memory, clustered dataflow engine (Apache Spark) for analytics. It differs from most other clustered SQL systems such as Impala, SparkSQL, and Hive because it combines analytical processing with distributed Multi-Version Concurrency Control (MVCC), which provides the fine-grained concurrency required to power real-time applications. This talk will highlight the Splice Machine storage representation, transaction engine, and cost-based optimizer, and present the detailed execution of operational queries on HBase and of analytical queries on Spark. We will compare and contrast how Splice Machine executes queries with other HTAP systems such as Apache Phoenix and Apache Trafodion. We will end with some roadmap items under development involving new row-based and column-based storage encodings.
Speakers:
Monte Zweben is a technology industry veteran. His early career was spent at the NASA Ames Research Center as Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. He then founded and was Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager of the Manufacturing Business Unit. In 1998, he was the founder and CEO of Blue Martini Software, the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000 and is now part of JDA. Following Blue Martini, he was chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. He currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie Mellon’s School of Computer Science.
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark
1. Splice Machine Proprietary and Confidential
Open Source RDBMS
For Mixed Operational and Analytical Workloads
Monte Zweben
October 20, 2016
2. Who We Are
The Open Source RDBMS Powered By Hadoop & Spark
SQL • Scale Out • Speed
ANSI SQL: no retraining or rewrites for SQL-based analysts, reports, and applications
¼ the cost: scales out on commodity hardware
Transactions: ensure reliable updates across multiple rows
Mixed workloads: simultaneously support OLTP and OLAP workloads
Elastic: increase scale in just a few minutes
10x faster: leverages Spark in-memory technology
3. DECISIONS IN THE MOMENT
Life Sciences • Digital Marketing • Financial Services • Supply Chain Optimization
4. Today’s Reality: Stale Data, Backward-Looking Decisions
How old is the data in your reports?
1 day +
1 day
4 hours +
1 hour +
Real-time
5. Today’s Reality: Stale Data, Backward-Looking Decisions
How old is the data in your reports?
1 day +: 24%
1 day: 50%
4 hours +: 7%
1 hour +: 9%
Real-time: 9%
* Source: Webinars on 11-3-15 and 12-10-15, 237 respondents
6. Legacy ETL Architectures Unable to Keep Up
[Diagram: OLTP systems (ERP, CRM, Supply Chain, HR, …) feed an ODS and, via Extract-Transform-Load, a Data Warehouse and Datamarts serving ad hoc analytics, executive business reports, and operational reports; mixed-workload apps require stream or batch updates.]
Pain: separate OLTP & OLAP systems; messy ETL “glue”
Why? Different workloads; different data structures; hard to isolate workloads
No longer adequate: can’t afford to wait days or hours to analyze data
7. Recent Approach: Lambda Architecture
Complex to set up and maintain
Developer integrates specialized compute engines: speed layer, batch layer, serving layer
8. New Approach: Lambda-In-A-Box Architecture
Easy to use with SQL
SQL optimizer selects pre-integrated compute engines: speed layer, batch layer, serving layer
9. Simultaneous OLTP & OLAP Workloads
Unique dual-engine architecture isolates workloads
[Diagram: traditional RDBMSs run OLTP and OLAP through a single engine, causing bottlenecks and delays; Splice Machine isolates OLTP on the HBase engine and OLAP on the Spark engine.]
10. Simultaneous OLTP & OLAP Workloads
Unique dual-engine architecture isolates workloads
[Chart: in traditional RDBMSs, OLTP response time increases as OLAP load rises; in Splice Machine, OLTP response time remains flat as OLAP load rises.]
12. Proven Building Blocks: Spark, Hadoop and Derby
Apache Derby: ANSI SQL-99 RDBMS; Java-based; ODBC/JDBC compliant
Apache HBase/Hadoop: auto-sharding; high availability; scalability to 100s of PBs
Apache Spark: analytical engine; fast, in-memory technology; memory resilient to node failure
13. HBase: Proven Scale-Out
Auto-sharding
Scales with commodity hardware
Cost-effective from GBs to PBs
High availability through failover and replication
LSM-trees
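The LSM-tree storage model behind HBase can be sketched in a few lines: writes buffer in a sorted in-memory "memstore" that is periodically flushed to immutable sorted "storefiles", and reads merge the layers newest-first. This is a minimal illustrative sketch in Python; the class and thresholds are invented for the example and are not HBase internals.

```python
# Minimal LSM-tree sketch: writes buffer in an in-memory "memstore";
# when it fills, it is flushed to an immutable sorted "storefile".
# Reads check the memstore first, then storefiles newest-first.
# All names and sizes are illustrative, not HBase internals.

class MiniLSM:
    def __init__(self, flush_threshold=4):
        self.memstore = {}        # mutable, in-memory
        self.storefiles = []      # immutable sorted snapshots, newest last
        self.flush_threshold = flush_threshold

    def put(self, rowkey, value):
        self.memstore[rowkey] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # storefiles are written once and never updated (HDFS is immutable)
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, rowkey):
        if rowkey in self.memstore:
            return self.memstore[rowkey]
        for sf in reversed(self.storefiles):   # newest file wins
            if rowkey in sf:
                return sf[rowkey]
        return None

    def scan(self, start, stop):
        # short range scan over sorted rowkeys, merging all layers
        merged = {}
        for sf in self.storefiles:
            merged.update(sf)
        merged.update(self.memstore)           # memstore is newest
        return [(k, merged[k]) for k in sorted(merged) if start <= k < stop]

lsm = MiniLSM()
for i in range(6):
    lsm.put(f"row{i}", f"v{i}")
print(lsm.get("row2"))              # "v2", served from a flushed storefile
print(lsm.scan("row1", "row4"))     # rows 1-3 in sorted rowkey order
```

Real compactions, write-ahead logging, and block caching are omitted; the point is only the memstore-plus-immutable-files shape the slides rely on.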
14. Apache Spark
Unmatched performance: fastest sort of 1PB of data
Advanced in-memory technology: spill-to-disk for large datasets; resilient against node failures; pipelining for computation parallelism
Most active Apache community: almost 1,000 contributors
Extensive libraries: over 140 and growing, including libraries for machine learning, streaming, and graph processing
15. Splice Machine: Advanced Spark Integration
Innovative, high-performance RDD creation: fast access to HFiles in HDFS, merged with deltas from the Memstore; avoids the slower HBase API
Universal execution plan and byte code: optimizer, plan, and code shared across Spark and HBase execution
[Diagram: on each physical node, an HBase region server (regions with Memstores, HFiles in HDFS) runs alongside a Spark worker that builds RDDs directly from the HFiles merged with Memstore deltas.]
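The "fast RDD creation" idea above boils down to reading the immutable files directly and overlaying the not-yet-flushed deltas, rather than going through the slower server read API. A hedged sketch, with plain Python lists standing in for HFiles and the Memstore (no actual HDFS or Spark here):

```python
# Sketch of RDD creation from HFiles + Memstore deltas. Each record
# is (rowkey, value, timestamp); later timestamps win. The function
# name and shapes are invented for illustration.

def merge_hfiles_with_memstore(hfile_rows, memstore_rows):
    """hfile_rows: records from immutable on-disk files.
       memstore_rows: same shape, for not-yet-flushed writes.
       Returns the latest value per rowkey as a sorted list."""
    latest = {}
    for rowkey, value, ts in sorted(hfile_rows + memstore_rows,
                                    key=lambda r: r[2]):
        latest[rowkey] = value      # later timestamps overwrite earlier
    return sorted(latest.items())

hfiles = [("a", "old", 1), ("b", "stable", 2)]
memstore = [("a", "new", 5)]        # a delta not yet flushed to disk
print(merge_hfiles_with_memstore(hfiles, memstore))
# [('a', 'new'), ('b', 'stable')]
```

The merged view is what each Spark partition would see, so the analytical engine reads a transactionally consistent snapshot without issuing per-row calls to the region server.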
16. Splice Machine Architecture
1. Standard install of HBase cluster (HBase, HDFS, ZooKeeper) with Spark
2. Distribute Splice Machine JAR to each region server
3. Automatically invoke co-processors on each region
[Diagram: on each node, the Splice parser, planner, optimizer, and executor (snapshot isolation, indexes) run as HBase co-processors inside the region server, alongside a Spark worker with executors, tasks, caches, and RDDs; a Spark master, HMaster, and ZooKeeper coordinate the cluster.]
21. Splice Machine: Query Execution
OLTP Execution on HBase
1. Parse SQL
2. Optimize query plan
3. Generate optimal byte code
4a. Execute OLTP query from byte code
5a. Use block cache and bloom filters to optimize data access
6a. Return results
22. Splice Machine: Query Execution
1. Parse SQL
2. Optimize query plan
3. Generate optimal byte code
OLTP Execution on HBase
4a. Execute OLTP query from byte code
5a. Use block cache and bloom filters to optimize data access
6a. Return results
OLAP Execution on Spark
4b. Generate Spark execution plan
5b. Submit Spark plan with byte code
6b. Fair scheduling of distributed tasks
7b. Generate RDD from HFiles and Memstore
8b. Execute query and return results
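The branch between steps 4a and 4b is made by the cost-based optimizer: short, selective plans go to the row store, scan- and aggregation-heavy plans to the batch engine. A toy sketch of that routing decision; the threshold and engine names are illustrative assumptions, not real Splice Machine parameters:

```python
# Dual-engine routing sketch: the optimizer estimates the work in a
# plan and picks the engine accordingly. Numbers are made up for
# illustration only.

OLTP_ROW_LIMIT = 10_000   # assumed cutoff, not a real Splice setting

def choose_engine(estimated_rows, has_aggregation):
    """Route OLTP-shaped work to HBase, OLAP-shaped work to Spark."""
    if has_aggregation or estimated_rows > OLTP_ROW_LIMIT:
        return "spark"    # large scans/aggregations: in-memory batch engine
    return "hbase"        # point reads / short scans: row store

print(choose_engine(estimated_rows=3, has_aggregation=False))         # hbase
print(choose_engine(estimated_rows=5_000_000, has_aggregation=True))  # spark
```

Because the byte code is shared (the "universal execution plan" of slide 15), the same compiled operators run on whichever engine is chosen.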
23. Isolated Resource Management
Isolate Spark & HBase resources through Linux Cgroups
24. Isolated Resource Management
Isolate Spark & HBase resources through Linux Cgroups
25. Configurable Spark Resource Management
Prioritize Spark resources between Query, Admin & Import jobs
Custom resource pools through XML
26. Spark Query Management
Visualization of active and completed queries
27. Spark Query Management (cont’d)
Visualization of stages for each query, plus kill function
28. Spark Query Management (cont’d)
Visualization of stages for query plan, plus kill function
29. Spark Query Management (cont’d)
Detailed metrics for tasks in each stage
31. Working With External Data and Compute Engines
Virtual Table Interface (VTI): execute federated queries against external files, libraries, or databases
External databases: use JDBC to access data in DBs such as Oracle and DB2
External libraries: access over 140 Spark libraries for machine learning and streaming
External files: pre-defined or dynamic schema; access local FS, HDFS, AWS S3
MapReduce I/O formats: accept federated queries from MapReduce, Pig, and Hive; register Splice Machine schema in HCatalog; merge structured (Splice) and unstructured data in ad-hoc queries; seamless integration with the Hadoop ecosystem
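The virtual-table idea is easiest to see with a concrete federated-style join: expose an external flat file to SQL so it can be joined against a regular table. The sketch below uses Python's sqlite3 as a stand-in engine; it illustrates the concept only and does not reproduce Splice Machine's actual VTI syntax.

```python
# Federated-query sketch: load an "external" CSV into a table so SQL
# can join it against regular rows. sqlite3 stands in for the RDBMS;
# table names and data are invented for the example.
import csv
import io
import sqlite3

external_csv = "id,region\n1,west\n2,east\n"   # pretend this lives on HDFS/S3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 4.0)])

# "Register" the external file as a queryable table
conn.execute("CREATE TABLE ext_regions (id INTEGER, region TEXT)")
rows = csv.DictReader(io.StringIO(external_csv))
conn.executemany("INSERT INTO ext_regions VALUES (?, ?)",
                 [(int(r["id"]), r["region"]) for r in rows])

# Federated-style join between the table and the external data
for row in conn.execute("""SELECT o.id, o.amount, e.region
                           FROM orders o JOIN ext_regions e ON o.id = e.id
                           ORDER BY o.id"""):
    print(row)
# (1, 9.5, 'west')
# (2, 4.0, 'east')
```

In Splice Machine the external source is wrapped at query time rather than copied in, but the resulting plan, a join between internal and external relations, has this shape.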
33. High Concurrency, ACID Transactions
Required to support OLTP applications

share_quantity          share_price
TIMESTAMP  VALUE        TIMESTAMP  VALUE
T12        4,000        T7         $15.11
T7         2,000        T5         $15.65
T3         5,000        T2         $15.74
T1         3,000        T0         $15.27

A transaction starting at T6 sees a “virtual” snapshot: share_quantity = 5,000 (committed at T3) and share_price = $15.65 (committed at T5).

value_held = share_quantity * share_price
@T6: value_held = 5,000 * $15.65
@T3: value_held = 5,000 * $15.74

State-of-the-art, distributed snapshot isolation
A form of Multi-Version Concurrency Control (MVCC)
Writers do not block readers
Fast, high concurrency
Delivers performance for small reads/writes & batch loads
Extends research from Google Percolator & Yahoo Labs
Patent-pending technology
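The snapshot-visibility rule on this slide is simple enough to state in code: every write is a new version tagged with a commit timestamp, and a transaction that began at time T sees, per cell, the newest version committed at or before T. A minimal sketch mirroring the share_quantity/share_price example (the function is illustrative, not Splice Machine's engine):

```python
# MVCC snapshot visibility sketch: versions are (commit_ts, value)
# pairs; a reader sees the newest version committed at or before its
# own start timestamp. Writers never block readers, since they only
# append new versions.

def visible(versions, txn_start):
    """Return the value a transaction starting at txn_start sees."""
    candidates = [(ts, v) for ts, v in versions if ts <= txn_start]
    return max(candidates)[1] if candidates else None

# Version chains from the slide (timestamps as integers)
share_quantity = [(1, 3000), (3, 5000), (7, 2000), (12, 4000)]
share_price    = [(0, 15.27), (2, 15.74), (5, 15.65), (7, 15.11)]

# A transaction starting at T6 sees the T3 quantity and the T5 price
qty = visible(share_quantity, 6)        # 5000
price = visible(share_price, 6)         # 15.65
print(round(qty * price, 2))            # 78250.0 -- value_held @T6
```

The later writes at T7 and T12 are invisible to the T6 snapshot, which is exactly why concurrent writers do not disturb in-flight readers.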
34. BI and SQL tool support via ODBC/JDBC
No application rewrites needed
35. Open Source
Both Community and Enterprise Editions:
Scale-out Architecture, ANSI SQL & Concurrent ACID Transactions
OLAP and OLTP Resource Isolation
Distributed In-Memory Joins, Aggregations, Scans and Groupings
Cost-Based Statistics, Query Optimizer, Management Console
Compaction Optimization
Apache Kafka-enabled Streaming
Virtual Table Interfaces
New Releases and Maintenance Updates
Tutorials, Forums, Videos, Documentation, Community Support
Enterprise Edition only:
Backup and Restore, Column Access Control
Encryption, Kerberos, LDAP Support
24/7 Support via Web and Phone
Complimentary Account Management Services
36. Try It at Scale Immediately on AWS Sandbox
5-click sandbox
Cluster has the full system deployed
SSH for CLI
URL to management consoles
Open a SQL connection on any node
Customize the template
37. Community
Slack channel - #splicecommunity
Video and code tutorials
GitHub
38. Advisory Board
Advisory Board includes luminaries in databases and technology
Roger Bamford: Former Principal Architect at Oracle; Father of Oracle RAC
Mike Franklin: Chair, Dept. of Computer Science, UChicago; Director, UC Berkeley AMPLab; Founder of Apache Spark
Marie-Anne Neimat: Co-Founder, TimesTen Database; Former VP, Database Eng. at Oracle
Ken Rudin: Head of Growth and Analytics for Google Search; Head of Analytics at Facebook
Abhinav Gupta: Co-Founder, Rocket Fuel; Runs 15PB HBase Cluster
40. Seasoned Team
Monte Zweben: Co-Founder & Chief Executive Officer
John Leach: Co-Founder & Chief Technology Officer (St. Louis Hadoop User Group)
Krishnan Parasuraman: VP of Sales and Business Development
Eran Pilovsky: Chief Financial Officer
Gene Davis: Co-Founder & VP of Products & Operations
Eric Kalabacos: VP of Customer Solutions
41. Next Steps
Try Us!
splicemachine.com/get-started
GitHub • Tutorials • Sandbox
42. Powering Real-Time Applications & Analytics
Enabling Decisions in the Moment
October 20, 2016
Editor’s Notes
The Hadoop RDBMS is designed to scale-out from a single server to thousands of machines, with a high degree of fault tolerance. Rather than relying on high-end hardware, Splice Machine uses the proven scale-out and high availability of Hadoop, proven in production clusters of dozens of petabytes at large scale leaders like Yahoo, Facebook, and Twitter.
The Hadoop RDBMS benefits include:
Affordability – scale-out -- using commodity hardware
Elasticity -- expand or scale back easily
Transactional – execute real time updates and ACID transactions
ANSI SQL -- leverage existing SQL code, tools, and skills
Flexibility -- support both operational and analytical workloads
Notes:
SQL: Structured Query Language. SQL is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS).
Splice Machine has focused on the orange blocks to maximize the value of our R&D investment
Derby's database engine is a full-featured embedded relational database engine, supporting JDBC and SQL as programming APIs. It uses IBM DB2 SQL syntax.
Apache Derby originated at Cloudscape Inc, an Oakland, California, start-up founded in 1996
In 1999 Informix Software, Inc., acquired Cloudscape, Inc. In 2001 IBM acquired the database assets of Informix Software, including Cloudscape
In August 2004 IBM contributed the code to the Apache Software Foundation as Derby
Splice Machine has focused the middle of the stack to maximize the value created by our R&D
Our parallelization engine to execute SQL
Secondary indexes
Join strategies
Query optimizers
Performance
High concurrency, lockless programming
HBase is a “distributed, versioned, non-relational database modeled after Google's Bigtable, a distributed storage system for structured data”. HBase can handle very high throughput scaling.
Fully leverage HBase as a storage engine for horizontal scale-out
Auto-sharding in HBase is based on regions, with regions assigned to region servers
HBase does region balancing/re-balancing
Recall that regions are ranges of rowkeys in sorted order
The RDBMS primary key maps to the HBase rowkey
Fast single-row selects
Fast range-based scans
Dense secondary indices stored in separate tables
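Because regions cover contiguous ranges of sorted rowkeys, locating the data for a single-row select or a short range scan is just an interval lookup over region start keys. A small sketch; the start keys are invented for the example:

```python
# Region-routing sketch: regions cover contiguous, sorted rowkey
# ranges, so region i covers [starts[i], starts[i+1]). A point read
# touches one region; a short scan touches the few regions whose
# ranges overlap it. Start keys here are made up for illustration.
import bisect

region_starts = ["", "g", "p"]   # 3 regions: ["", "g"), ["g", "p"), ["p", ...)

def region_for(rowkey):
    """Index of the region whose range contains rowkey."""
    return bisect.bisect_right(region_starts, rowkey) - 1

print(region_for("apple"))   # 0
print(region_for("grape"))   # 1
print(region_for("zebra"))   # 2

def regions_for_scan(start_key, stop_key):
    """Indices of all regions a range scan must touch."""
    return list(range(region_for(start_key), region_for(stop_key) + 1))

print(regions_for_scan("f", "h"))   # [0, 1] -- crosses one region boundary
```

This is why primary-key lookups and short range scans stay fast as the table grows: the number of regions touched depends on the scan width, not on the table size.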
Write pipeline
Writes go to the WAL first (not shown) for durability
Then writes go to the Memstore
Memstores are eventually flushed to disk as storefiles
Read pipeline
Blocks are read from storefiles and Memstores
Blocks are cached in the block cache (not shown)
Remember that HDFS is an immutable file system
Storefiles are written once and never updated
Updates are really inserts (upserts)
Key points
A theme in Distributed computing is moving the code to where the data is because the data is big
Splice Machine has its own query execution and task parallelization engine
Secret sauce
Not based on map/reduce
Predicates pushed to shards and locally applied
Each region executes local HBase operations
Results are returned to the controlling node and “spliced together” hence the “splice” in the company name
Serialization
Highly compressed storage format for table row
Snappy compression
Reduces network traffic!
Join strategies:
Nested Loop, SortMerge, Merge, Broadcast
Rely on HBase co-processors: end-points and observers
Performance
Speed of the read and async write pipelines
20 msecs for read
30 msecs for write
This is a big area of focus for us
Based on MVCC “snapshot isolation”
Lockless is key here
Patent pending in this area
Based on timestamps
Transaction C will see changes from transaction A
Transaction C won’t see changes from transaction B
Here's an explanation of what is depicted in Figure 1:
Transaction T1 bumps up the Qty for A by 10 twice, then commits at time t6. At the commit, A's Qty is 30, which is now visible to other transactions.
Note, however, that T2 started at time t4 before T1's commit, so its value for A is still 10. Thus when it computes C = A+10, this results in 20.
T3 starts at t7, as an overlap to T2, and attempts to update B just as T2 did, resulting in a write-write conflict. T3 rolls back, and attempts a reissue with T3'. This succeeds with the previously committed value from T2.
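The T2/T3 scenario above is write-write conflict detection under snapshot isolation: a commit fails if some other transaction committed a write to the same key after this transaction started (first-committer-wins), and the loser rolls back and retries. A simplified sketch of that rule, not Splice Machine's actual transaction engine:

```python
# First-committer-wins sketch: a transaction records its start
# timestamp and write set; at commit time it aborts if any committed
# write to one of its keys happened after its start. Simplified: no
# values, no reads, single-threaded clock.

class Store:
    def __init__(self):
        self.commits = []      # committed writes as (commit_ts, key)
        self.clock = 0

    def begin(self):
        self.clock += 1
        return {"start": self.clock, "writes": set()}

    def commit(self, txn):
        self.clock += 1
        # conflict: someone committed to one of our keys after we started
        for ts, key in self.commits:
            if key in txn["writes"] and ts > txn["start"]:
                return False   # roll back; caller may reissue
        self.commits.extend((self.clock, k) for k in txn["writes"])
        return True

s = Store()
t2 = s.begin(); t2["writes"].add("B")
t3 = s.begin(); t3["writes"].add("B")       # overlaps t2
assert s.commit(t2) is True
assert s.commit(t3) is False                # write-write conflict: rollback
t3_retry = s.begin(); t3_retry["writes"].add("B")
assert s.commit(t3_retry) is True           # reissue after t2's commit
print("conflict detected and retry succeeded")
```

This matches the narrative: T3 overlaps T2, conflicts on B, rolls back, and its reissue T3' succeeds against T2's committed value. The lockless quality comes from doing all validation at commit time instead of taking row locks up front.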