2. Who Am I?
• Andrew Hutchings, aka "LinuxJedi"
• Lead Software Engineer for MariaDB's ColumnStore
• Previously worked for:
  • NGINX - Senior Developer Advocate / Technical Product Manager
  • HP - Principal Software Engineer (HP Cloud / ATG)
  • SkySQL - Senior Sustaining Engineer
  • Rackspace - Senior Software Engineer
  • Sun/Oracle - MySQL Senior Support Engineer
• Co-author of MySQL 5.1 Plugin Development
• IRC/Twitter: LinuxJedi
• Email: linuxjedi@mariadb.com
3. Overview
• History of MariaDB ColumnStore
• Technical Use Case
• Components of MariaDB ColumnStore
• Disk Storage
• Writing Data
• Querying Data
• Optimizing for MariaDB ColumnStore
• Closing Notes
• Questions
4. History of MariaDB ColumnStore
• March 2010 - Calpont launches InfiniDB
• September 2014 - Calpont (by then itself called InfiniDB) closes down
  • MariaDB (then SkySQL) supports InfiniDB customers
• April 2016 - MariaDB announces development of MariaDB ColumnStore
• August 2016 - I joined MariaDB and jumped straight into ColumnStore
• December 2016 - MariaDB ColumnStore 1.0 GA
  • InfiniDB + MariaDB 10.1 + many fixes and improvements
• November 2017 - MariaDB ColumnStore 1.1 GA
  • MariaDB 10.2 + APIs + even more improvements
6. Technical Use Case
MariaDB ColumnStore
• Very large data sets
• Many columns
• Many millions of rows
• Complex joins and aggregates
• Rapid bulk data insertion
  • The larger the batch the better
Traditional OLTP Engines
• Smaller data sets
• Basic queries
• Lots of DML queries
• Complex data types
7. Data Types
• INT types - usable range is 2 values short of the usual maximum (unsigned) or minimum (signed)
• CHAR - max 255 bytes
• VARCHAR - max 8000 bytes
• DECIMAL - max 18 digits
• DOUBLE/FLOAT
• DATETIME - no sub-seconds (coming in 1.2)
• DATE
• BLOB/TEXT
• Empty string is the same as NULL
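The integer range bullet can be made concrete with a little arithmetic. This is a minimal sketch, assuming the slide means that the two values at the unsigned maximum, or the two at the signed minimum, are unavailable. The type-to-width table and the function name are illustrative, not part of ColumnStore.

```python
# Usable ranges for integer columns, assuming two values are unavailable
# at the unsigned maximum or the signed minimum (our reading of the slide).
INT_WIDTH_BYTES = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def usable_range(width_bytes, signed=True):
    """Return (lowest, highest) usable value for an integer of this width."""
    bits = width_bytes * 8
    if signed:
        # Normal minimum is -2**(bits-1); skip the two lowest values.
        return (-(2 ** (bits - 1)) + 2, 2 ** (bits - 1) - 1)
    # Normal maximum is 2**bits - 1; skip the two highest values.
    return (0, 2 ** bits - 1 - 2)


for name, width in INT_WIDTH_BYTES.items():
    print(name, usable_range(width), usable_range(width, signed=False))
```

Under that assumption a signed TINYINT runs from -126 to 127 and an unsigned one from 0 to 253.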
8. Other DDL Differences
• No indexes
  • Columns are somewhat self-indexing
• Auto increment is handled differently (via a table comment)
• No constraints
• PARTITION syntax not supported
  • Columns are partitioned automatically
9. Row-oriented vs. Column-oriented Format
Row-oriented:
ID  Fname     Lname  State  Zip    Phone           Age  Sex
1   Bugs      Bunny  NY     11217  (718) 938-3235  34   M
2   Yosemite  Sam    CA     95389  (209) 375-6572  52   M
3   Daffy     Duck   NY     10013  (212) 227-1810  35   M
4   Elmer     Fudd   ME     04578  (207) 882-7323  43   M
5   Witch     Hazel  MA     01970  (978) 744-0991  57   F
Column-oriented:
ID:     1, 2, 3, 4, 5
Fname:  Bugs, Yosemite, Daffy, Elmer, Witch
Lname:  Bunny, Sam, Duck, Fudd, Hazel
State:  NY, CA, NY, ME, MA
Zip:    11217, 95389, 10013, 04578, 01970
Phone:  (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age:    34, 52, 35, 43, 57
Sex:    M, M, M, M, F
SELECT Fname FROM People WHERE State = 'NY'
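The point of the query on this slide is that a column-oriented layout only has to read the State and Fname columns. Here is a toy sketch of that idea in Python, using the slide's data. The dictionary-of-lists layout and the function name are illustrative only and say nothing about how ColumnStore lays out its files.

```python
names = ("ID", "Fname", "Lname", "State", "Zip", "Phone", "Age", "Sex")
rows = [
    (1, "Bugs", "Bunny", "NY", "11217", "(718) 938-3235", 34, "M"),
    (2, "Yosemite", "Sam", "CA", "95389", "(209) 375-6572", 52, "M"),
    (3, "Daffy", "Duck", "NY", "10013", "(212) 227-1810", 35, "M"),
    (4, "Elmer", "Fudd", "ME", "04578", "(207) 882-7323", 43, "M"),
    (5, "Witch", "Hazel", "MA", "01970", "(978) 744-0991", 57, "F"),
]

# Column-oriented layout: one list per column, aligned by row position.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}


def select_fname_where_state(columns, state):
    """SELECT Fname FROM People WHERE State = <state>, touching two columns."""
    # Scan only the State column to find matching row positions.
    positions = [pos for pos, value in enumerate(columns["State"]) if value == state]
    # Fetch those positions from the Fname column; other columns are never read.
    return [columns["Fname"][pos] for pos in positions]


print(select_fname_where_state(columns, "NY"))  # prints ['Bugs', 'Daffy']
```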
12. Query Processing
Shared Nothing Distributed Data Storage
[Diagram: SQL arrives at the User Module (UM); the UM sends column primitives to the Performance Modules (PM); the PMs send intermediate results back to the UM.]
13. Hardware Requirements
• Lots of RAM
  • Minimum 32GB for UM, 16GB for PM
  • Minimum 4GB for trying out a single server on a VM
• Optimised for HDD spindles, will still work with SSD
  • We are looking into SSD optimisation soon
• More cores typically better
  • 8 core minimum recommendation
• For AWS, m4.4xlarge is the recommended minimum
15. Column Types
Fixed-length fields:
• 1-byte field: 8192 values per 8k block
• 2-byte field: 4096 values per 8k block
• 4-byte field: 2048 values per 8k block
• 8-byte field: 1024 values per 8k block
Dictionary structure made up of 2 files/extents with:
• 8-byte fixed length token (pointer).
• A variable length value stored at the location identified by the pointer.
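The values-per-block figures on this slide are the 8k block size divided by the field width. A short sketch of that arithmetic follows; the constant and function names are ours.

```python
BLOCK_SIZE_BYTES = 8 * 1024  # the "8k block" from the slide


def values_per_block(field_width_bytes):
    """How many fixed-width values fit in one block."""
    return BLOCK_SIZE_BYTES // field_width_bytes


for width in (1, 2, 4, 8):
    print(f"{width}-byte field: {values_per_block(width)} values per 8k block")
```

For a dictionary column, the same arithmetic applies to the 8-byte token file, so each block holds 1024 pointers.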
16. Extent Map
Object ID                                 The ID for the column (or dictionary)
Object Type                               Column or Dictionary
LBID Start / End                          Start / End Logical Block Pointer
Minimum Value                             Lowest value in the extent
Maximum Value                             Highest value in the extent
Width                                     Column Width
DBRoot                                    DBRoot (disk partition) number
Partition ID / Segment ID / Block Offset  The extent number
High Water Mark                           Atomic last block pointer
19. Inserting Data
• Multiple methods
  • Single INSERTs
  • INSERT...SELECT
  • LOAD DATA INFILE
  • cpimport
  • Bulk Write API
• Designed for large bulk inserts
• Inserts are appended at the end of extents (or new extents created)
  • This means reads are not affected
• A High Water Mark pointing to the last block is moved at the end of the insert
20. cpimport
• Uses CSV files or piped CSV data
• Fastest way to get data into ColumnStore
• Does minimal data conversion and pipes it straight into the PMs
• Works by appending new blocks to the table and moving an atomic block pointer (HWM)
  • No UNDO log needed (atomic pointer not moved on rollback)
  • Therefore it can cause a gap of 0-64KB in a column
• Can load multiple tables simultaneously
• Can load into multiple PMs for the same table simultaneously
• Can load into specific PMs for physical partitioning by PM
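The append-then-move-the-pointer idea can be sketched with a toy model. This is not ColumnStore code: the class and its methods are hypothetical, and it simplifies by overwriting blocks left past the pointer by a rolled-back load. It only illustrates why readers are unaffected during a load and why no undo log is needed.

```python
class ColumnFile:
    """Toy model of one column's blocks with a high water mark (HWM)."""

    def __init__(self):
        self.blocks = []  # every block physically written so far
        self.hwm = 0      # number of blocks visible to readers

    def read(self):
        # Readers never look past the HWM, so in-flight loads are invisible.
        return self.blocks[: self.hwm]

    def bulk_append(self, new_blocks, commit=True):
        start = self.hwm
        # Simplification: blocks left past the HWM by a rolled-back load
        # are overwritten here.
        self.blocks[start:] = new_blocks
        if commit:
            # One pointer move publishes the whole load.
            self.hwm = start + len(new_blocks)
        # On rollback the HWM stays put, so nothing has to be undone.


col = ColumnFile()
col.bulk_append(["block0", "block1"])
col.bulk_append(["block2"], commit=False)  # rolled-back load
print(col.read())  # prints ['block0', 'block1']
```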
21. Bulk Write API
• A simple C++ API to inject data into the PMs
• Bindings in Python and Java available
• Works in a similar way to cpimport
  • Appends new blocks and moves an atomic block pointer (HWM)
• LGPL licensed
22. DML Writes
• Regular INSERT / UPDATE / DELETE
  • Also INSERT...SELECT and LOAD DATA INFILE when autocommit is off
• Slow compared to other engines
  • INSERT is very slow compared to cpimport
• Requires the use of a version buffer for an undo log
  • But INSERT appends to data blocks so no wasted storage
• Data sent to DMLProc to process
23. A Note About DELETE
• Needs to touch every column and the undo log
  • So very slow
• Also leaves a gap in the column that won't be filled
• Using a flag column that is set with an UPDATE query is faster
• Dropping entire partitions is instantaneous
  • Partitions can be disabled first
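The flag-column alternative to DELETE can be illustrated with a toy columnar table. Everything here is hypothetical, including the is_deleted column name and the helper functions. The point is only that marking a row writes one column, while readers filter on the flag.

```python
# Hypothetical columnar table: one list per column plus a flag column.
table = {
    "id": [1, 2, 3, 4],
    "item": ["Laptop", "Monitor", "Mouse", "Laptop"],
    "is_deleted": [0, 0, 0, 0],
}


def soft_delete(table, row_id):
    """In spirit: UPDATE t SET is_deleted = 1 WHERE id = <row_id>."""
    pos = table["id"].index(row_id)
    table["is_deleted"][pos] = 1  # only the flag column is written


def live_items(table):
    """In spirit: SELECT item FROM t WHERE is_deleted = 0."""
    return [item for item, flag in zip(table["item"], table["is_deleted"]) if flag == 0]


soft_delete(table, 3)
print(live_items(table))  # prints ['Laptop', 'Monitor', 'Laptop']
```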
24. INSERT...SELECT / LOAD DATA INFILE
• Injects the binary row data from MariaDB into cpimport
  • Good for backwards compatibility with tools and remote loading
• cpimport then injects this data into the column extent files
  • In 1.2 it will use the write API instead
• If autocommit is turned off this will behave like regular DML instead (slow)
27. Extent Elimination
Storage architecture reduces I/O
• Only touch column files that are in filter, projection, group by, and join conditions
• Eliminate disk block touches to partitions outside filter and join conditions
SELECT Item, sum(Quantity) FROM Orders
WHERE ShipDate between '2016-01-01' and '2016-01-31'
GROUP BY Item
Id     OrderId  Line  Item     Quantity  Price  Supplier  ShipDate    ShipMode
1      1        1     Laptop   5         1000   Dell      2016-01-12  G
2      1        2     Monitor  5         200    LG        2016-01-13  G
3      2        1     Mouse    1         20     Logitech  2016-02-05  M
4      3        1     Laptop   3         1600   Apple     2016-01-31  P
...    ...      ...   ...      ...       ...    ...       ...         ...
8M                                                        2016-03-05
8M+1                                                      2016-03-05
...    ...      ...   ...      ...       ...    ...       ...         ...
16M                                                       2016-09-23
16M+1                                                     2016-09-24
...    ...      ...   ...      ...       ...    ...       ...         ...
24M                                                       2017-01-06
Extent 1 (horizontal partition, 8 million rows): ShipDate 2016-01-12 - 2016-03-05
Extent 2 (horizontal partition, 8 million rows): ShipDate 2016-03-05 - 2016-09-23 (ELIMINATED PARTITION)
Extent 3 (horizontal partition, 8 million rows): ShipDate 2016-09-24 - 2017-01-06 (ELIMINATED PARTITION)
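The elimination decision on this slide comes down to a range-overlap test between the query filter and each extent's minimum and maximum. Here is a minimal sketch using the slide's ShipDate ranges. The dictionary and function names are illustrative, not ColumnStore internals.

```python
from datetime import date

# Min/max ShipDate per extent, taken from the slide.
extents = {
    1: (date(2016, 1, 12), date(2016, 3, 5)),
    2: (date(2016, 3, 5), date(2016, 9, 23)),
    3: (date(2016, 9, 24), date(2017, 1, 6)),
}


def extents_to_scan(extents, low, high):
    """Keep an extent only if its [min, max] overlaps the filter [low, high]."""
    return [
        extent_id
        for extent_id, (min_value, max_value) in extents.items()
        if min_value <= high and max_value >= low
    ]


# WHERE ShipDate between '2016-01-01' and '2016-01-31'
print(extents_to_scan(extents, date(2016, 1, 1), date(2016, 1, 31)))  # prints [1]
```

Extents 2 and 3 fail the overlap test, so their blocks are never read.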
28. Query Analysis
MariaDB [tpch1]> select calsettrace(1);
...
MariaDB [tpch1]> select c_count, count(*) as custdist
-> from ( select c_custkey, count(o_orderkey) c_count
-> from v_customer left outer join v_orders on c_custkey = o_custkey
-> and o_comment not like '%special%requests%'
-> group by c_custkey ) c_orders
-> group by c_count
-> order by custdist desc, c_count desc;
...
42 rows in set, 1 warning (9.07 sec)
MariaDB [tpch1]> select calgetstats()\G
*************************** 1. row ***************************
calgetstats(): Query Stats: MaxMemPct-4; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-0; CacheI/O-12503;
BlocksTouched-12503; PartitionBlocksEliminated-812; MsgBytesIn-102MB; MsgBytesOut-3KB; Mode-Distributed
1 row in set (0.00 sec)
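When comparing runs it can help to pull the counters out of the calgetstats() string. The sketch below parses the output shown above. The parser is ours, and treating BlocksTouched plus PartitionBlocksEliminated as the total candidate blocks is an assumption made for a rough ratio, not a documented formula.

```python
stats_line = (
    "Query Stats: MaxMemPct-4; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-0; "
    "CacheI/O-12503; BlocksTouched-12503; PartitionBlocksEliminated-812; "
    "MsgBytesIn-102MB; MsgBytesOut-3KB; Mode-Distributed"
)


def parse_query_stats(line):
    """Turn 'Name-value; Name-value; ...' into a dict of strings."""
    body = line.split(":", 1)[1]  # drop the "Query Stats" prefix
    stats = {}
    for field in body.split(";"):
        field = field.strip()
        if field:
            # Split on the last hyphen so names like ApproxPhyI/O stay intact.
            key, value = field.rsplit("-", 1)
            stats[key] = value
    return stats


stats = parse_query_stats(stats_line)
touched = int(stats["BlocksTouched"])
eliminated = int(stats["PartitionBlocksEliminated"])
# Assumption: touched + eliminated approximates all candidate blocks.
print(f"{eliminated / (touched + eliminated):.1%} of candidate blocks eliminated")
```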
29. Query Analysis
MariaDB [tpch1]> select calgettrace()\G
*************************** 1. row ***************************
calgettrace():
Desc Mode Table               TableOID ReferencedColumns                PIO LIO   PBE Elapsed Rows
BPS  PM   customer            7254     (c_custkey)                      0   75    0   0.032   150000
TNS  UM   -                   -        -                                -   -     -   0.045   150000
BPS  PM   customer            7254     (c_custkey)                      0   0     75  0.000   0
TNS  UM   -                   -        -                                -   -     -   0.000   0
TUS  UM   -                   -        -                                -   -     -   0.303   150000
BPS  PM   orders              7268     (o_comment,o_custkey,o_orderkey) 0   12428 0   2.293   1500000
TNS  UM   -                   -        -                                -   -     -   2.967   1500000
BPS  PM   orders              7268     (o_comment,o_custkey,o_orderkey) 0   0     737 0.000   0
TNS  UM   -                   -        -                                -   -     -   0.000   0
TUS  UM   -                   -        -                                -   -     -   3.796   1500000
HJS  UM   v_customer-v_orders -        -                                -   -     -   -----   -
TAS  UM   -                   -        -                                -   -     -   1.658   150000
TNS  UM   -                   -        -                                -   -     -   0.044   150000
TAS  UM   -                   -        -                                -   -     -   0.050   42
1 row in set (0.01 sec)
30. Cross Engine Joins
• Allows non-ColumnStore tables to join with ColumnStore
• The whole query is processed by ColumnStore
• Cross Engine makes new MariaDB connections to retrieve data from non-ColumnStore tables
[Diagram: the original query goes from MariaDB Server to ExeMgr; ExeMgr sends the non-ColumnStore query (Cross Engine) back to MariaDB Server.]
32. Data Modeling
• Star-schema optimizations are generally a good idea
• Conservative data typing is very important
  • Especially around the fixed-length vs. dictionary boundary (8 bytes)
  • IP Address vs. IP Number
• Break down compound fields into individual fields:
  • Trivializes searching for sub-fields
  • Can avoid dictionary overhead
  • Cost to re-assemble is generally small
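The "IP Address vs. IP Number" bullet is about storing an IPv4 address as a 4-byte integer instead of a string that crosses the 8-byte dictionary boundary. Here is a small sketch of the conversion using Python's standard ipaddress module. The helper names are ours, and doing the conversion in the loading application is an assumption about where it would happen.

```python
import ipaddress

FIXED_LENGTH_LIMIT_BYTES = 8  # fixed-length vs. dictionary boundary from the slide


def ip_to_number(ip_text):
    """Dotted-quad IPv4 text to an integer that fits a 4-byte column."""
    return int(ipaddress.IPv4Address(ip_text))


def ip_from_number(ip_number):
    """Integer back to dotted-quad text, for display."""
    return str(ipaddress.IPv4Address(ip_number))


ip_text = "192.168.1.10"
ip_number = ip_to_number(ip_text)
print(f"{ip_text!r} is {len(ip_text)} bytes as text, past the {FIXED_LENGTH_LIMIT_BYTES}-byte boundary")
print(f"{ip_number} fits in a 4-byte integer and converts back to {ip_from_number(ip_number)}")
```

One caveat under the integer-range rule from the Data Types slide: if the two highest unsigned INT values are unavailable, 255.255.255.254 and 255.255.255.255 would need a wider type.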
33. Data Insertion
• Order data as best you can before inserting
  • Helps extent elimination when min/max range for an extent is small
• Insert in large batches using cpimport or bulk write API
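Why load order matters can be shown with a toy experiment: the same values loaded in sorted order and in an interleaved order, with a per-extent min/max check for a range filter. The extent size of 10 rows and all names are illustrative only.

```python
EXTENT_ROWS = 10  # toy extent size, far smaller than a real extent


def extent_ranges(values, extent_rows=EXTENT_ROWS):
    """Min/max per extent, with extents filled in load order."""
    chunks = [values[i:i + extent_rows] for i in range(0, len(values), extent_rows)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def extents_scanned(ranges, low, high):
    """Count extents whose [min, max] overlaps the filter [low, high]."""
    return sum(1 for min_value, max_value in ranges if min_value <= high and max_value >= low)


ordered = list(range(100))
# Same values in an interleaved load order: 0, 10, 20, ..., 90, 1, 11, ...
interleaved = sorted(ordered, key=lambda v: (v % 10, v))

print(extents_scanned(extent_ranges(ordered), 20, 29))      # prints 1
print(extents_scanned(extent_ranges(interleaved), 20, 29))  # prints 10
```

With sorted input each extent covers a narrow range, so nine of the ten extents are skipped. With interleaved input every extent spans almost the whole range and none can be skipped.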
34. Improving Your Queries
• Avoid filtering on a >= 8 byte VARCHAR/CHAR column where possible
  • Two extents need to be read per column, no extent elimination
• Use extent map elimination where possible
• Don't use a function to filter
  • Extent elimination won't happen
• Only reference required columns, avoid "SELECT *"
• Use the smallest possible data type for your data
• Avoid large ORDER BY
• Read https://mariadb.com/kb/en/mariadb/columnstore-performance-tuning/
35. Tuning
• Generally self-tuning
  • Uses as much RAM as possible automatically
  • Uses all CPU cores
• More RAM in PMs = more LRU data cache
• More RAM in UMs = ability to process aggregates / joins on bigger data sets
  • Disk joins are possible
37. MariaDB ColumnStore 1.2 (later in 2018)
• MariaDB 10.3 base
• TIME datatype
• Microsecond support
• Improvements to LOAD DATA INFILE and INSERT...SELECT
• Phase 1 of MariaDB ColumnStore Storage Engine Convergence project
• Many other cool things