But it isn’t easy
Changing your company is not easy.
Give an example: you’ve just invested $1M in a data warehouse, but the business now wants to … It will now cost you tenfold.
We are announcing Vector on Hadoop: industrial-strength SQL on Hadoop with atom-smashing speed never before seen in the industry. This is a core part of our Actian Analytics Platform – Hadoop SQL Edition. Let me tell you about it (details below) and show you a few things.
What are we announcing?
Highest-performing, most industrialized SQL in Hadoop; turns Hadoop into a high-performance, fully functional analytics database
Actian Analytics Platform – Hadoop SQL Edition includes our hardened (patented) X100 vector processing engine, combined with Actian’s visual data and analytics workflow, all running natively in Hadoop via YARN
How is this unique?
Highest performing, most industrialized SQL access to Hadoop data
Only end-to-end analytic processing natively in Hadoop (covers the full analytics process: data blending & enrichment, discovery & data science, analytics & operational BI)
Most consumable, accessible, manageable Hadoop analytics
What does this mean to our customers?
Removes all barriers for business access to big data analytics
Unleashes millions of business-savvy SQL users, with no constraints on Hadoop data, to improve the accuracy of their analytical predictions and decision-making
Accelerates time to value and turns Hadoop data into transformational value: customer delight, competitive advantage, world-class risk management, disruptive business models
I’m going to show you three things: how fast it is, how easy it is to get started, and how it can be used in real-world scenarios.
1: We use vectorized processing to exploit modern CPU architecture. We execute one operation at a time on a vector of data, which allows tight inner code loops without branching. This lets us use SIMD instructions and, because there is no branching, keeps the CPU pipelines from stalling.
A vector is typically 1024 rows of a single column, so it’s a manageable amount of data while the overhead per row is still negligible.
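The vector-at-a-time idea above can be sketched in a few lines. This is a minimal illustration, not the engine’s actual code (X100’s primitives are compiled native kernels); the function names and the choice of addition as the operation are ours.

```python
VECTOR_SIZE = 1024  # rows per vector, as described above

def add_vectors(a, b, out, n):
    """One execution primitive: out[i] = a[i] + b[i] for a vector of n values.
    The inner loop has no branches and no per-row interpretation overhead,
    which is what lets a real engine apply SIMD and keep pipelines full."""
    for i in range(n):
        out[i] = a[i] + b[i]
    return out

def scan_in_vectors(column, vector_size=VECTOR_SIZE):
    """Feed a column to the engine one vector-sized chunk at a time."""
    for off in range(0, len(column), vector_size):
        yield column[off:off + vector_size]
```

The point of the 1024-row granularity is visible here: interpretation overhead (function dispatch, chunking) is paid once per vector rather than once per row, while each chunk stays small enough to be cache-resident.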
2: A vector will fit in the CPU cache together with the code for a particular operation, so all execution is in-cache.
3: To feed this engine with enough data, we also apply the vectorized paradigm to the storage subsystem. First of all, we use a column store, so only the relevant columns are read from disk. Data is stored in blocks of typically 512 MB, and a single block contains data from only one column (with some exceptions). Blocks of different columns can be interleaved, but typically multiple blocks of the same column are grouped together.
To keep the stable storage fast and defragmented, we use in-memory overlays to store updates to the data. These overlays are automatically flushed to stable storage when needed.
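The overlay mechanism can be sketched as follows. This is a simplified model of the idea, with hypothetical names, not the actual update structure: updates land in memory, reads merge the overlay over the base data, and a bulk flush keeps stable storage sequential.

```python
class ColumnWithOverlay:
    """Toy model: an immutable base column plus an in-memory update overlay."""

    def __init__(self, base_values):
        self.base = list(base_values)  # stands in for stable block storage
        self.overlay = {}              # row -> updated value, held in memory

    def update(self, row, value):
        # No random write ever hits stable storage; the update is buffered.
        self.overlay[row] = value

    def read(self, row):
        # Reads see the overlay first, then fall through to the base.
        return self.overlay.get(row, self.base[row])

    def flush(self):
        # Periodic bulk rewrite keeps stable storage fast and defragmented.
        for row, value in self.overlay.items():
            self.base[row] = value
        self.overlay.clear()
```

The design choice this illustrates: scan performance depends on large sequential reads, so small in-place updates are deferred and merged at read time instead of fragmenting the on-disk blocks.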
4: The blocks are stored compressed on-disk. We’ve got a number of lightweight compression algorithms and the most efficient one is chosen per block, depending on the data characteristics. The decompression takes place per vector and can be done in the CPU cache, which neatly ties in with our in-cache execution.
We have a buffer manager that predicts what blocks are needed when and makes sure no blocks that will be used in the near future are evicted from the buffer cache.
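The per-block compression choice can be sketched like this. It is an illustration under assumptions, not the engine’s actual codec set: we show only one lightweight scheme (run-length encoding) competed against storing the block raw, with invented function names.

```python
def rle_encode(values):
    """Run-length encode a non-empty block: [(value, run_length), ...]."""
    runs, prev, count = [], values[0], 1
    for v in values[1:]:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

def choose_encoding(block):
    """Pick the scheme that represents THIS block most compactly,
    based on the block's own data characteristics."""
    rle = rle_encode(block)
    # Cost model: 2 slots per run vs. 1 slot per raw value.
    if 2 * len(rle) < len(block):
        return ("rle", rle)
    return ("raw", block)
```

A column of repeated status codes would come out `"rle"`, while a column of unique IDs would stay `"raw"`; a real engine makes this choice per block among several lightweight schemes, so decompression stays cheap enough to run per vector inside the CPU cache.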
5: We have min-max indexes on the disk blocks, so when data is not completely random we can narrow down the ranges of blocks we need to read from disk, per column.
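Min-max block skipping amounts to a simple interval check; here is a minimal sketch (illustrative names, range predicate only) of how the per-block (min, max) metadata prunes reads.

```python
def blocks_to_read(minmax, lo, hi):
    """minmax: list of (block_min, block_max) per on-disk block.
    Return indices of blocks whose value range overlaps [lo, hi];
    all other blocks are skipped without touching the disk."""
    return [i for i, (bmin, bmax) in enumerate(minmax)
            if bmax >= lo and bmin <= hi]
```

When data is not completely random (e.g. roughly ordered by date), most blocks fall entirely outside the queried range, and the scan reads only a small fraction of the column.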
All in all, the execution engine sustains about 1.5 GB/s per core, and high-end I/O subsystems are able to keep up with this.
Execution
Subset of TPC-DS as chosen by Impala
Data size is 3TB (SF3000)
Executed on 5-node “rushcluster” in Austin
Both Impala and Vector numbers are on the same hardware
Comparison with Impala
Verified that Impala plans are sensible
Currently observed average speedup is 11x
Optimal query plans (manually written) give us a 16x speedup
These are real numbers! We executed manual plans directly
Changes in the cost model would get us to this performance
Performance improvements
Cost model changes will get us to 16x speedup
Pipeline of query execution changes
Well into H2
Estimated to get us 2x improvement
So, estimated speedup vs Impala would be ~30x (no guarantees)
Planning to run TPC-H SF1000 and SF3000
With all planned improvements (end of the year) we should be able to beat the EXASOL cluster numbers.
What are we announcing?
Actian Analytics Platform – Hadoop SQL Edition, the first offering that turns Hadoop into a fully-functioning analytics platform.
This new edition introduces the highest-performing, most industrialized SQL in Hadoop, powered by our hardened (patented) X100 vector processing engine, combined with Actian’s visual data and analytics workflow, all running natively in Hadoop via YARN.
How is this unique?
Provides the only end-to-end analytic processing natively in Hadoop (covers the full analytics processes: data blending & enrichment, discovery & data science, analytics & operational BI)
Delivers the highest performing, most industrialized SQL access to Hadoop data
Makes the entire analytic process more consumable, easier to access, and easier to manage than on any other platform
What does this mean to our customers?
Industrialized SQL in Hadoop removes all barriers for business access to big data analytics
Broad SQL access unleashes millions of business-savvy SQL users, with no constraints on Hadoop data, to improve the accuracy of their analytical predictions and decision-making
Turbocharged Hadoop analytics and SQL in Hadoop accelerates time to value and turns Hadoop data into transformational value: customer delight, competitive advantage, world-class risk management, disruptive business models
We want to partner with you to identify the most obvious places where big data analytics could be applied in your organization.