The End of a Myth: Ultra-Scalable Transactional Management
1. The End of a Myth: Ultra-Scalable Transactional Management
Presented by: Ricardo Jimenez-Peris, CEO & Co-founder @ LeanXcale
2. About the Speaker
Top researcher in scalable transactional management and distributed data management, with 100+ publications in top conferences and journals.
Co-author of a book on database replication.
Professor of distributed systems and data management for over 25 years.
Co-inventor of two granted patents and eight new patent applications.
Invited speaker at top tech companies in Silicon Valley, such as Facebook, Twitter, Salesforce, Heroku, EMC-Pivotal (formerly EMC-Greenplum), HP, and Microsoft.
3. About LeanXcale
Vendor of a NewSQL ultra-scalable database: full ACID, full SQL.
LeanXcale is an HTAP database, blending operational and analytical capabilities and delivering real-time data.
LeanXcale leverages an ultra-efficient storage engine, which is a relational key-value data store.
[Infographic: Product Team — PhD holders, top engineers from industry with 10–25 years of industry expertise, and top researchers from academia; figures shown: 45%, 30%, 15 awards]
4. The Myth
"Operational databases cannot scale."
WHY? Nobody managed to scale them in three decades.
Some say it is due to the CAP theorem, which has led to vendors that do not provide ACID properties.
5. The CAP Theorem
C – Consistency, A – Availability, P – Partition tolerance
The CAP theorem states something very well known in distributed systems: if you want to tolerate partitions in a replicated system, you must choose either:
Availability at all nodes and no consistency (partitionable system),
OR
Consistency and no availability at all nodes (primary component).
Q: Where is the S of Scalability?
A: Nowhere.
6. The End of the Myth: Ultra-Scalable Transactions
We solved how to scale transactions to very large scale (e.g., 100 million update transactions per second) in a fully seamless way.
A breakthrough result of 15+ years of research by a tenacious team.
8. Transactional Processing
Transactional management provides ultra-scalability.
Fully transparent:
• No sharding.
• No a priori knowledge required about the rows to be accessed.
• Syntactically: no changes required in the application.
• Semantically: behavior equivalent to a centralized system.
Provides Snapshot Isolation (the isolation level provided by Oracle when set to "Serializable" isolation).
10. Scalability
Evaluation without the data manager/logging, to see how much throughput the transactional processing can attain: 2.35 million transactions per second.
12. Snapshot Isolation vs. Serializability
Snapshot isolation splits atomicity into two points: one at the beginning of the transaction, where all reads happen, and one at the end of the transaction, where all writes happen.
Serializability provides a fully atomic view of a transaction: reads and writes happen atomically at a single point in time.
[Diagram: a transaction timeline with reads at Start and writes at End under snapshot isolation, versus reads & writes at a single point under serializability]
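The difference matters in practice: snapshot isolation permits the classic write-skew anomaly that serializability forbids. A minimal Python sketch (our own illustration, not LeanXcale code): two transactions read the same start snapshot and write disjoint keys, so both commit even though together they break an invariant.

```python
def run_under_si(db, txns):
    """Run transactions under snapshot isolation (minimal sketch):
    all reads see the same start snapshot; a transaction aborts only
    on a write-write conflict with a previously committed one."""
    snapshot = dict(db)              # all reads happen at the start TS
    committed_keys = set()
    for txn in txns:
        writeset = txn(snapshot)     # computed from the snapshot, not the current db
        if committed_keys & writeset.keys():
            continue                 # write-write conflict: abort
        committed_keys |= writeset.keys()
        db.update(writeset)          # all writes happen at the commit TS
    return db

# Write skew: invariant x + y >= 0; each txn checks it on its snapshot,
# then withdraws 100 from a *different* account, so writesets don't overlap.
withdraw_x = lambda s: {"x": s["x"] - 100} if s["x"] + s["y"] >= 100 else {}
withdraw_y = lambda s: {"y": s["y"] - 100} if s["x"] + s["y"] >= 100 else {}
```

Under serializability the second withdrawal would observe the first one's write and skip; under SI both commit, because only write-write conflicts are checked, and the invariant x + y ≥ 0 ends up violated.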
17. Main Principles
Separation of commit from the visibility of committed data.
Proactive pre-assignment of commit timestamps to committing transactions.
Transactions can commit in parallel because:
• They do not conflict.
• Their commit timestamp is already assigned and determines their serialization order.
• Visibility is regulated separately to guarantee that only fully consistent states are read.
Detection and resolution of conflicts before commit.
18. Transactional Life Cycle: Start
The local transaction manager gets the start timestamp ("start TS") from the Snapshot Server.
[Diagram: Local Txn Manager → Get start TS → Snapshot Server]
19. Transactional Life Cycle: Execution
The transaction reads the state as of "start TS".
Write-write conflicts are detected by conflict managers on the fly.
[Diagram: Local Txn Manager — get start TS, run on the start-TS snapshot — with conflicts checked at the Conflict Manager]
20. Transactional Life Cycle: Commit
The local transaction manager orchestrates the commit.
[Diagram: Local Txn Manager — get start TS, run on the start-TS snapshot, commit]
21. Transactional Life Cycle: Commit
[Diagram: the Local Txn Manager gets the commit TS from the Commit Sequencer, sends the writeset with its commit TS to the Logger (log) and to the Data Store (public updates), and reports to the Snapshot Server]
22. Transactional Life Cycle: Commit
The Snapshot Server keeps track of the most recent snapshot that is consistent:
• Its TS must be such that every previous commit TS is either already durable and readable or has been discarded.
• That is, it keeps the longest prefix of used/discarded TSs such that there are no gaps.
The Snapshot Server gets reports of durable & readable TSs and of discarded TSs, and keeps track of and reports the most recent consistent TS.
In this way transactions can commit in parallel while consistency is preserved.
23. Transactional Life Cycle: Commit
Sequence of timestamps received by the Snapshot Server over time: 11, 15, 12, 14, 13
Evolution of the current snapshot at the Snapshot Server: 11, 11, 12, 12, 15
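The snapshot evolution above follows directly from the gap-free-prefix rule. A minimal sketch of such a snapshot server (class and method names are ours, not LeanXcale's API):

```python
class SnapshotServer:
    """Tracks the most recent consistent snapshot: the highest commit
    timestamp TS such that every timestamp up to TS has been reported
    as durable-and-readable (or discarded), i.e. the longest gap-free
    prefix of reported timestamps."""

    def __init__(self, last_consistent_ts):
        self.snapshot = last_consistent_ts   # highest TS of the gap-free prefix
        self.pending = set()                 # reported TSs beyond the prefix

    def report(self, ts):
        """A local txn manager reports TS as durable & readable (or discarded)."""
        self.pending.add(ts)
        # Advance the snapshot while the prefix stays gap-free.
        while self.snapshot + 1 in self.pending:
            self.snapshot += 1
            self.pending.remove(self.snapshot)
        return self.snapshot
```

Feeding it the sequence 11, 15, 12, 14, 13 (with 10 as the last consistent TS) reproduces the evolution 11, 11, 12, 12, 15 from the slide: 15, 14, and 13 stay pending until 13 arrives and closes the gap.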
24. Conflict Managers
Each conflict manager takes care of a set of keys.
There can be as many conflict managers as needed; they scale in the same way as hash-based key-value data stores.
Because concurrency control is done at the conflict managers, which are much fewer in number than the data managers, batching is much more effective.
With TPC-C, the ratio between the nodes devoted to the query engine/region servers and those devoted to concurrency management is 20 to 1, resulting in 20 times more effective batching.
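A sketch of how conflict managers can partition the key space by hashing and detect write-write conflicts on the fly. The names and the first-writer-wins policy are our simplifications, not LeanXcale's implementation:

```python
import hashlib

class ConflictManager:
    """Detects write-write conflicts for its partition of the key space.
    Minimal sketch: the first transaction to write a key holds it; a
    later transaction writing the same key gets a conflict (abort)."""

    def __init__(self):
        self.writers = {}   # key -> txn id currently writing it

    def check_write(self, txn_id, key):
        owner = self.writers.setdefault(key, txn_id)
        return owner == txn_id   # False => write-write conflict

    def release(self, txn_id):
        """Drop the keys of a committed or aborted transaction."""
        self.writers = {k: t for k, t in self.writers.items() if t != txn_id}

def manager_for(key, managers):
    """Route a key to its conflict manager by hashing, so managers scale
    like a hash-partitioned key-value store."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return managers[h % len(managers)]
```

Because routing depends only on the key hash, adding conflict managers redistributes the key space exactly as adding nodes to a hash-based key-value store would.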
25. Loggers
Each logger takes care of a fraction of the log records.
Loggers log in parallel and are uncoordinated.
There can be as many loggers as needed to provide the IO bandwidth necessary to log the rate of updates.
Loggers can be replicated. In that case, durability can be configured as:
• In the memory of a majority of logger replicas (replicated-memory durability)
• In the persistent storage of one logger replica (1-safe durability)
• In the persistent storage of a majority of logger replicas (n-safe durability)
The client gets the commit reply only after the writeset is durable with respect to the configured durability.
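The three durability options amount to an acknowledgment policy over the logger replicas. A hypothetical sketch (the function and policy names are ours, not LeanXcale configuration keys):

```python
def is_durable(acks, n_replicas, policy):
    """Decide whether a writeset is durable under the configured policy.
    `acks` maps each logger replica that has acknowledged the record to
    'memory' or 'disk'. Illustrative only."""
    majority = n_replicas // 2 + 1
    if policy == "replicated-memory":    # held in memory by a majority
        return len(acks) >= majority
    if policy == "1-safe":               # persisted by at least one replica
        return any(kind == "disk" for kind in acks.values())
    if policy == "n-safe":               # persisted by a majority
        return sum(kind == "disk" for kind in acks.values()) >= majority
    raise ValueError(policy)
```

The commit reply to the client would be sent only once this predicate holds for the transaction's writeset.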
26. Increasing Efficiency
The approach described so far is the original, reactive approach. It results in multiple messages per update transaction.
The adopted approach is proactive:
• The local transaction managers periodically report the number of committed update transactions per second.
• The commit sequencer distributes batches of commit timestamps to the local transaction managers.
• The snapshot server periodically gets batches of timestamps (both used and discarded) from the local transaction managers.
• The snapshot server periodically reports the most recent consistent snapshot to the local transaction managers.
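The proactive batching of commit timestamps can be sketched in a few lines: the sequencer reserves a contiguous range per local transaction manager, sized from the reported commit rate, instead of answering one request per commit. This is a simplified illustration; real batch sizing is more elaborate.

```python
class CommitSequencer:
    """Hands out batches of commit timestamps rather than one per commit,
    amortizing one message over many transactions (sketch of the
    proactive scheme; names are ours)."""

    def __init__(self):
        self.next_ts = 0

    def get_batch(self, reported_rate):
        """Reserve a contiguous range of commit TSs sized to the
        commit rate reported by a local transaction manager."""
        start = self.next_ts
        self.next_ts += max(1, reported_rate)
        return range(start, self.next_ts)
```

Unused timestamps from a batch are the "discarded" TSs that the local transaction managers later report to the snapshot server, so the gap-free prefix can still advance.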
27. Increasing Efficiency
SQL processing is performed at the SQL engine tier.
A SQL engine instance:
• Transforms SQL code into a query plan
• Optimizes the query plan according to the collected statistics (e.g., cardinality of keys)
• Orchestrates the query plan execution on top of the distributed data store
• Returns the result of the SQL execution to the client
• Keeps the statistics in the data store up to date
The SQL engine was obtained by forking the query engine from Apache Derby (same SQL dialect as DB2).
The scan operators have been modified to access KiVi instead of local storage.
The metadata is stored in KiVi instead of local storage.
28. Query Engine
SQL is translated into a query plan represented as a tree of algebraic operators. Algebraic operators are written in Java plus bytecode.

SELECT s.id, s.location
FROM Store s
INNER JOIN Catalog c ON s.id = c.s_id
INNER JOIN Widget w ON c.w_id = w.id
WHERE s.location = 'Rome' AND w.color = 'red'

At the leaves of the query plan there are scan operators that have predicate filtering, aggregation, grouping, and sorting capabilities. They have been rewritten to access KiVi instead of local storage. They make it possible to push down all algebraic operators below a join.
[Diagram: query plan tree — a join on s_id = id above a join on id = w_id, with the selection σ location = 'Rome' and color = 'red', over the Store, Catalog, and Widget tables]
29. Selection Push Down

SELECT *
FROM Store s, Inventory i, Catalog c
WHERE i.cat_id = c.id
  AND s.inv_id = i.id
  AND s.location = 'Rome'
  AND c.color = 'red'

Selections are pushed down below the joins.
[Diagram: the Query Engine Instance executes the joins on cat_id = id and inv_id = id, while the selections σ location = 'Rome' (Store) and σ color = 'red' (Catalog) are pushed down to Data Engine Instances 1 and 2]
30. Aggregation Push Down
select sum(i.units) from inventory i
Without push down, all values travel from the data engine instances to the query engine, which performs the global aggregation Σ(units).
[Diagram: Inventory partitions at Data Engine Instances 1 and 2; global aggregation at the Query Engine Instance]
31. Aggregation Push Down
select sum(i.units) from inventory i
With push down, each data engine instance performs a local aggregation Σ(units), and a single value travels from each data engine instance to the query engine, which performs the global aggregation.
[Diagram: local aggregation Σ(units) at Data Engine Instances 1 and 2; global aggregation at the Query Engine Instance]
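The saving can be shown in a few lines: each data engine instance reduces its partition locally and ships one value instead of every row. Illustrative code, not LeanXcale's operators:

```python
def local_sum(partition):
    """Local aggregation at a data engine instance: reduce its partition
    of inventory.units to a single value (the pushed-down Σ(units))."""
    return sum(partition)

def global_sum(local_values):
    """Global aggregation at the query engine: combine one value per
    data engine instance instead of shipping every row."""
    return sum(local_values)
```

With two partitions of sizes m and n, push down moves 2 values across the network instead of m + n, while the result is identical.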
33. What is LeanXcale?
An ultra-scalable SQL database for any size and any workload:
• Real-time big data
• Full SQL, full ACID DB
• OLAP over operational data
• Ultra-scalable OLTP
• Elastic & ultra-efficient: non-disruptive data migration and continuous load balancing
• Polyglot: queries across SQL, HBase, MongoDB, Neo4j & Hadoop files; integration with data streaming
35. Reducing Cost of Ownership at Telcos
Enables implementing Customer Experience Management (CEM) with half the number of nodes.
Leverages the computation of aggregates in real time as raw KPIs are inserted: analytical aggregation queries become simple single-row queries.
Elasticity enables substantially reducing operations personnel cost during non-working hours with low loads.
36. Offloading/Substituting the Mainframe
LeanXcale is the first database technology that can substitute the mainframe. It can bear the operational workloads of a mainframe while at the same time providing real-time analytics over the operational data.
It can be deployed alongside the mainframe, loaded/updated in real time, and applications can be offloaded from the mainframe one by one.
LeanXcale is partnering with Bull Atos to provide a database appliance that will serve as a substitute for the mainframe.
37. Large IoT Applications
Using the key-value interface for large data ingestion in IoT applications, while the data remains accessible through SQL, reducing the needed infrastructure severalfold.
Real-time analytics.
Computation of aggregates in real time to reduce the cost of analytical aggregation queries, e.g., for the smart grid.
Elasticity enables adjusting resource consumption to the load received.
38. Disrupting Travel Tech
Using the key-value interface to reduce the footprint needed to handle clicks.
Real-time analytics for implementing availability checking.
Elasticity enables adjusting resource consumption to the load received.
Full ACIDity guarantees consistency between recorded sales and actual availability.