MODELLING DATA
FOR SCALABLE,
AD HOC ANALYSIS
ASSEN TOTIN
Senior Engineer R&D
MariaDB Corporation
WHY ARE WE HERE?
● If working with a transactional (row) storage is like driving a car (and almost as
ubiquitous)...
● … then working with an analytical storage is like driving with a trailer.
● Bottom line: change your driving attitude or you’re not going to make it even out
of the parking lot!
QUICK SUMMARY
● The analytical workload.
● MariaDB ColumnStore brief.
● ColumnStore data modelling: preparing data for loading, preparing appropriate
schema, optimizing the queries and finding your way around them.
● Moving data to ColumnStore: usage scenarios.
● Q & A
COLUMNSTORE AND
THE ANALYTICAL
WORKLOAD
THE ANALYTICAL WORKLOAD
● Relatively small set of functions needed, compared to general-purpose
scientific work.
● If needed, the business logic can be moved outside the data storage. Thus the
storage can be reduced to its most basic storing and retrieval functions.
● Data is mostly historic, hence time-sequenced, almost exclusively appended
and rarely – if at all – updated. Data is almost never deleted.
● Large sets of data are retrieved as batches, often full columns or contiguous
parts of such.
MARIADB
COLUMNSTORE
COLUMNSTORE STORAGE
● Dedicated columnar
storage
● Data is organised in
a hierarchical
structure (unlike flat
row-based storages)
COLUMNSTORE STORAGE
● Each database is a directory,
each table is a directory inside it,
each column is a file inside it
● Columns are split into multiple
files (extents) of equal size (8M
cells)
● Optional compression, defined
per-table
COLUMNSTORE STORAGE
● Data can be loaded (written) directly into extents.
● Completely bypasses the SQL layer, leaving it free
to process queries.
● Once writing completes, we then notify the
processing engine that new data is available.
COLUMNSTORE STORAGE
● For each extent, some metadata is calculated and stored in memory (MIN and
MAX values etc.).
● Divide-and-conquer strategy for queries: eliminate all unnecessary extents and
load only the ones needed.
COLUMNSTORE STORAGE
COLUMNSTORE CLUSTER
● In ColumnStore, a module is a set of running processes.
● Two types of modules (nodes), User Module (UM) and Performance Module
(PM).
● User Module: provides client connectivity (speaks SQL) and has local storage
engines (InnoDB, MyISAM...). More UM = more concurrent connections and
HA. UM can be replicated.
● Performance Module: stores actual data. More PM = more data stored.
● For dev purposes, one UM and one PM may live together in a single OS.
COLUMNSTORE TABLES
● ColumnStore is a storage engine in MariaDB.
● To create a ColumnStore table, use
CREATE TABLE… ENGINE=ColumnStore
● Just as with any other MariaDB server, you can mix-and-match different storage
engines in one database.
● Just as with any other MariaDB server, you can do a cross-engine JOIN
between ColumnStore tables and tables in local storage engines on the UM.
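A minimal sketch of mixing engines and a cross-engine JOIN (table and column names here are illustrative, not from an actual schema):

```sql
-- Analytical table, distributed across the PMs
CREATE TABLE fact_sales (
    sale_date DATE,
    store_id  INT,
    amount    DECIMAL(10,2)
) ENGINE=ColumnStore;

-- Local lookup table on the UM (ColumnStore tables have no indices,
-- but an InnoDB table can have a PRIMARY KEY as usual)
CREATE TABLE stores (
    store_id   INT PRIMARY KEY,
    store_name VARCHAR(100)
) ENGINE=InnoDB;

-- Cross-engine JOIN, resolved on the UM
SELECT s.store_name, SUM(f.amount) AS total
FROM fact_sales f
JOIN stores s ON s.store_id = f.store_id
GROUP BY s.store_name;
```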
COLUMNSTORE DATA DISTRIBUTION
● ColumnStore tables are always distributed (assuming more than one PM).
● ColumnStore distributes data across the PM nodes in round-robin fashion.
● When a new (empty) PM is added, it receives data until its size catches up with
the other PMs.
● Manual control over data distribution is possible when side-loading via the Bulk
Load API: cpimport modes 2 & 3.
COLUMNSTORE DATA DISTRIBUTION
COLUMNSTORE
DATA MODELLING
NO INDICES, PLEASE!
● ColumnStore has no indices: with big data indices do not fit into memory, so
they become useless.
● This helps to reduce I/O drastically; ColumnStore I/O requirements are
significantly lower than for row storage (works very well on spinning media).
● Reduced CPU load previously spent on maintaining indices.
● The filesystem is always in-sync: file-level backup in real-time is again possible
and natural.
● Direct injection of data into the storage (bypassing SQL layer) is now possible.
● Instead of indices, ColumnStore uses divide-and-conquer to only load what’s
needed to serve a query.
PREPARING DATA FOR LOAD
● ColumnStore will append the data in the order we send it, so it is up to us to
order it.
● For the divide-and-conquer approach to work best, data has to be arranged in
sequential fashion, because then the largest number of extents can be
eliminated before the actual read from disk begins.
● Examine your data and identify columns with incremental (or time-based)
ordering.
● Examine your queries and find which of these columns is most often used as
predicate.
● Order the data by this column prior to loading it.
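One way to produce a pre-ordered load file from an existing transactional table is SELECT … ORDER BY … INTO OUTFILE; the result can then be fed to LOAD DATA or the Bulk Load API. A sketch (table and column names are hypothetical):

```sql
-- Export ordered by the column most often used as a predicate
-- (here a hypothetical trans_date), so that the loaded extents
-- end up time-sequenced and extent elimination can do its job.
SELECT trans_date, customer_id, amount
FROM oltp_transactions
ORDER BY trans_date
INTO OUTFILE '/tmp/transactions.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
```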
CLUSTERING THE SCHEMA
● ColumnStore follows the map/reduce approach: each PM does the same work
on its part of the data (map), then all results are aggregated by a UM (reduce).
● To distribute a JOIN (push-down to all PM) one needs to ensure that either
– each node has one of the sides in full, or
– both sides are partitioned by the same key.
● With automated data distribution, ColumnStore finds the smaller side of the
JOIN and redistributes it on-the-fly to facilitate a distributed JOIN. If the smaller
side is bigger than a threshold, the JOIN is pushed up to the UM (which
requires more RAM).
CLUSTERING THE SCHEMA
● The optimal ColumnStore schema will thus consist of a small number of big
tables and a larger number of smaller tables, so that a JOIN between a big and
a small table can be distributed.
● This schema assumes a high degree of data normalisation, so that big tables
contain as many references as possible to small tables, from which actual
values are derived.
● This schema is usually referred to as star schema: one big table (in the centre)
linked to multiple small tables (around it).
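A star schema sketched in DDL (hypothetical names; note that ColumnStore does not enforce foreign keys, so the fact-to-dimension links are purely logical):

```sql
-- Fact table: one row per event, mostly keys plus measures
CREATE TABLE fact_sale (
    sale_date  DATE,
    store_id   INT,   -- logical reference to dim_store
    product_id INT,   -- logical reference to dim_product
    quantity   INT,
    amount     DECIMAL(10,2)
) ENGINE=ColumnStore;

-- Dimension tables: small, nearly-immutable lookup data
CREATE TABLE dim_store (
    store_id INT,
    city     VARCHAR(64),
    state    VARCHAR(32)
) ENGINE=ColumnStore;

CREATE TABLE dim_product (
    product_id   INT,
    product_name VARCHAR(128),
    category     VARCHAR(64)
) ENGINE=ColumnStore;
```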
CLUSTERING THE SCHEMA: STAR
Source: Wikipedia
CLUSTERING THE SCHEMA
● The big table in the centre is called fact table, because it contains data (rows),
related to events (facts) that occurred in different moments in time. These facts
are often associated with the technical or business activity that is represented
by the schema (e.g., each sale could be a fact, registered in one row; or, each
reading of a sensor value in an IoT system etc.).
● The fact table is amended in each new data load (new rows = new events).
● New rows are appended to the end of the fact table.
● Generally, older (time-wise) facts precede the newer ones.
CLUSTERING THE SCHEMA
● The small tables that are linked to the fact table are called dimension tables,
because they contain data that describes properties of the facts.
● Dimension tables consist of things like nomenclatures and other nearly-immutable
data: e.g., the list of states and cities, the list of points of sale etc.
● Dimension tables are rarely amended.
CLUSTERING THE SCHEMA
● Adding a second layer of links produces a more complicated design,
sometimes called a snowflake schema.
● In a multi-tier (snowflake) schema, a table may be a dimension to one level and
a fact to another, e.g. the list of telco subscribers may be a fact (linked to
dimensions like the subscription plan), but also a dimension (to which the list of
phone calls links).
CLUSTERING THE SCHEMA: SNOWFLAKE
Source: Wikipedia
PHONE
CALLS
USERS
PLANS
OPTIMIZING THE QUERIES
● An important prerequisite for properly designing a schema is to know how it is
going to be used.
● Ensure the queries and the star schema match each other.
● Always JOIN a fact table to a dimension table only. Never JOIN two fact tables!
● As each column is a separate set of files, the more columns are requested in
the result set, the more data has to be read from disk; always specify the exact
columns needed, and never do SELECT *.
OPTIMIZING THE QUERIES
● Filter on sequential columns as much as possible.
● Filter on actual values, not on functions, because functions prevent extent
elimination and lead to full column scan; make extra separate columns if
needed, e.g. have a separate column year instead of YEAR(date).
● ORDER BY and LIMIT are run last and always on the UM, so can be expensive
(depending on amount of data).
● A JOIN with a table from a local storage engine (InnoDB, MyISAM...) is done by
first fetching the local table from the UM. As this requires a loopback connection,
it is often relatively slow – so consider its usage carefully.
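The function-versus-value rule, sketched on a hypothetical fact table:

```sql
-- Slower: the function call prevents extent elimination,
-- forcing a scan of the whole trans_date column.
SELECT customer_id, amount
FROM fact_sale
WHERE YEAR(trans_date) = 2018;

-- Faster: filtering on stored values lets ColumnStore skip
-- every extent whose MIN/MAX range excludes 2018 before
-- any data is read from disk.
SELECT customer_id, amount
FROM fact_sale
WHERE trans_date BETWEEN '2018-01-01' AND '2018-12-31';
```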
OPTIMIZING DIMENSIONS
● Keep dimensions small (up to 1M rows) as they will be redistributed on-the-fly
for each JOIN.
● Increase the distributed JOIN threshold for bigger dimensions (but carefully).
This is a cluster-wide tunable from Columnstore.xml.
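As an illustration, the tunable in question lives in the HashJoin section of Columnstore.xml; the element name below (PmMaxMemorySmallSide) should be verified against the documentation for your ColumnStore version:

```xml
<!-- Columnstore.xml: per-PM memory allowed for the small side
     of a distributed JOIN. Raising it lets bigger dimensions
     stay distributed, at the cost of PM memory. -->
<HashJoin>
  <PmMaxMemorySmallSide>1G</PmMaxMemorySmallSide>
</HashJoin>
```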
EXTENDING COLUMNSTORE ENGINE
● The ColumnStore engine might not always be the best choice (e.g., data type
support, encoding support etc.).
● Local storage engines on UM may supplement the ColumnStore engine via
cross-engine JOIN.
● Usually, multiple UMs will be replicated, so tables from local storage engines are
also replicated… but in some special cases you may not want to replicate them
and effectively have different content for the same local table on different UMs;
in this case, make sure to configure jobs to run only on the UM they connect to
(access to the ExeMgr process).
TRACING YOUR STEPS: EXPLAIN
● EXPLAIN works for ColumnStore, but is less useful (no indices)
SELECT t.customer_id, t.discount, t.discounted_price
FROM transactions t
JOIN books b ON b.book_id=t.book_id
WHERE t.trans_date BETWEEN '2018-01-01' AND '2018-01-31';
MariaDB [bookstore]> EXPLAIN SELECT t.customer_id, t.discount, t.discounted_price FROM transactions t JOIN books b ON
b.book_id=t.book_id WHERE t.trans_date BETWEEN '2018-01-01' AND '2018-01-31';
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
| 1 | SIMPLE | t | ALL | NULL | NULL | NULL | NULL | 2000 | Using where with pushed condition |
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 2000 | Using where; |
| | | | | | | | | | Using join buffer (flat, BNL join) |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
TRACING YOUR STEPS: STATS
● Use SELECT calGetStats(), a
function which provides statistics
about resources used on the User
Module (UM) node, PM nodes, and
network by the last run query.
● 582979 rows in set (3.373 sec)
MariaDB [bookstore]> SELECT calGetStats();
+---------------------------------+
| Query Stats: |
| MaxMemPct-3; |
| NumTempFiles-0; |
| TempFileSpace-0B; |
| ApproxPhyI/O-71674; |
| CacheI/O-47150; |
| BlocksTouched-47128; |
| PartitionBlocksEliminated-1413; |
| MsgBytesIn-37MB; |
| MsgBytesOut-63KB; |
| Mode-Distributed |
+---------------------------------+
TRACING YOUR STEPS: TRACE
● To trace a query, first enable tracing with SELECT calSetTrace(1), then run
the query, then get the trace with SELECT calGetTrace().
MariaDB [bookstore]> SELECT calGetTrace();
+-------------------------------------------------------------------------------------------------------------------+
| Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows |
| BPS PM b 301760 (book_id) 0 7 0 0.002 5001 |
| BPS PM t 301805 (book_id,customer_id,discount,discounted_price,trans_date) 0 17280 1413 0.308 582979 |
| HJS PM t-b 301805 - - - - ----- - |
| TNS UM - - - - - - 2.476 582979 |
+-------------------------------------------------------------------------------------------------------------------+
COLUMNSTORE
DATA MOVING
MOVING DATA TO COLUMNSTORE
● Scenario A: Use the same schema as in the transactional database.
● Only use ColumnStore as long-term cold storage for large amounts of data.
● No OLAP as schema does not match requirements.
● No OLTP as data is too big.
● Copy selected parts of the data back to OLTP engine for processing.
MOVING DATA TO COLUMNSTORE
● Scenario B: Use dedicated star schema.
● Actively use ColumnStore as OLAP backend.
● Load the data from OLTP storage in batches: ETL with either LOAD DATA or
Bulk Load API (preferred: cpimport, shared library/JAR).
● Use any preferred front-end tool to drive the analytics (Tableau, Pentaho
Mondrian, Microsoft SSAS, Apache Zeppelin…).
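A sketch of the SQL-layer option (hypothetical file and table names); the Bulk Load API equivalent would be run on the cluster as cpimport bookstore fact_sale /tmp/transactions.csv, bypassing the SQL layer entirely:

```sql
-- Load a pre-ordered CSV batch through the SQL layer
LOAD DATA LOCAL INFILE '/tmp/transactions.csv'
INTO TABLE fact_sale
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
```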
HAVE YOUR SAY!
Q&A
● Ask questions now…
● … or ask later. We are here for you!
THANK YOU!


Modeling data for scalable, ad hoc analytics

  • 1. MODELLING DATA FOR SCALABLE, AD HOC ANALYSIS ASSEN TOTIN Senior Engineer R&D MariaDB Corporation
  • 2. WHY ARE WE HERE? ● If working with a transactional (row) storage is like driving a car (and almost as ubiquitous)... ● … then working with an analytical storage is driving a trailer. ● Bottom line: change your driving attitude or you’re not going to make it even out of the parking lot!
  • 3. QUICK SUMMARY ● The analytical workload. ● MariaDB ColumnStore brief. ● ColumnStore data modelling: preparing data for loading, preparing appropriate schema, optimizing the queries and finding your way around them. ● Moving data to ColumnStore: usage scenarios. ● Q & A
  • 5. THE ANALYTICAL WORKLOAD ● Relatively small set of functions needed, compared to general-purpose scientific work. ● If needed, the business logic can be moved outside the data storage. Thus the storage can be reduced to its most basic storing and retrieval functions. ● Data is mostly historic, hence time-sequenced, almost exclusively appended and rarely – if at all – updated. Data is almost never deleted. ● Large sets of data are retrieved as batches, often full columns or continuous parts of such.
  • 7. COLUMNSTORE STORAGE ● Dedicated columnar storage ● Data is organised in a hierarchical structure (unlike flat row-based storages)
  • 8. COLUMNSTORE STORAGE ● Each database is a directory, each table is a directory inside it, each column is a file inside it ● Columns are split into multiple files (extents) of equal size (8M cells) ● Optional compression, defined per-table
  • 9. COLUMNSTORE STORAGE ● Data can be loaded (written) directly into extents. ● Completely bypasses the SQL layer, leaving it free to process queries. ● Once writing completes, we then notify the processing engine that new data is available.
  • 10. COLUMNSTORE STORAGE ● For each extent some meta-data is calculated and stored in memory (MIN and MAX values etc.). ● Divide-and-conquer strategy to queries: eliminate all unnecessary extents and only load the one needed.
  • 12. COLUMNSTORE CLUSTER ● In ColumnStore, a module is a set of running processes. ● Two types of modules (nodes), User Module (UM) and Performance Module (PM). ● User Module: provides client connectivity (speaks SQL) and has local storage engines (InnoDB, MyISAM...). More UM = more concurrent connections and HA. UM can be replicated. ● Performance Module: stores actual data. More PM = more data stored. ● For dev purposes, one UM and one PM may live together in a single OS.
  • 13. COLUMNSTORE TABLES ● ColumnStore is a storage engine in MariaDB. ● To create a ColumnStore table, use CREATE TABLE… ENGINE=ColumnStore ● Just as with any other MariaDB server, you can mix-and-match different storage engines in one database. ● Just as with any other MariaDB server, you can do a cross-engine JOIN between ColumnStore tables and tables in local storage engines on the UM.
  • 14. COLUMNSTORE DATA DISTRIBUTION ● ColumnStore tables are always distributed (assuming more .than one PM). ● ColumnStore distributes data across the PM nodes in round-robin fashion. ● When a new (empty) PM is added, it receives data until its size catches up with other PM. ● Manual control over data distribution is possible when side-loading via the Bulk Load API: cpimport modes 2 & 3.
  • 17. NO INDICES, PLEASE! ● ColumnStore has no indices: with big data indices do not fit into memory, so they become useless. ● This helps to reduce I/O drastically; ColumnStore I/O requirements are significantly lower than for row storage (works very well on spinning media). ● Reduce CPU load previously spent on maintaining indices. ● The filesystem is always in-sync: file-level backup in real-time is again possible and natural. ● Direct injection of data into the storage (bypassing SQL layer) is now possible. ● Instead of indices, ColumnStore uses divide-and-conquer to only load what’s needed to serve a query.
  • 18. PREPARING DATA FOR LOAD ● ColumnStore will append the data in the order we send it, so it is up to us to order it. ● In order for the divide-and-conquer approach to work best, data has to be arranged in sequential fashion (because then the most extents can be eliminated before actual data read from disk begins). ● Examine your data and identify columns with incremental (or time-based) ordering. ● Examine your queries and find which of these columns is most often used as predicate. ● Order the data by this column prior to loading it.
  • 19. CLUSTERING THE SCHEMA ● ColumnStore follows the map/reduce approach: each PM does the same work on its part of the data (map), then all results are aggregated by a UM (reduce). ● To distribute a JOIN (push-down to all PM) one needs to ensure that either – each node has one of the sides in full, or – both sides are partitioned by the same key. ● With automated data distribution, ColumnStore finds the smaller side of the JOIN and redistributes it on-the-fly to facilitate a distributed JOIN. If the smaller side is bigger than a threshold, the JOIN is pushed up to the UM (which requires more RAM).
  • 20. CLUSTERING THE SCHEMA ● The optimal ColumnStore schema will thus consist of small number of big tables and larger number of smaller tables so that JOIN between a big and a small table can be distributed. ● This schema assumes high degree of data normalisation, so that big tables will contain as many as possible references to small tables, from which actual values are derived. ● This schema is usually referred to as star schema: one big table (in the centre) linked to multiple small tables (around it).
  • 21. CLUSTERING THE SCHEMA: STAR Source: Wikipedia
  • 22. CLUSTERING THE SCHEMA ● The big table in the centre is called fact table, because it contains data (rows), related to events (facts) that occurred in different moments in time. These facts are often associated with the technical or business activity that is represented by the schema (e.g., each sale could be a fact, registered in one row; or, each reading of a sensor value in an IoT system etc.). ● The fact table is amended in each new data load (new rows = new events). ● New rows are appended to the end of the fact table. ● Generally, older (time-wise) facts precede the newer ones.
  • 23. CLUSTERING THE SCHEMA ● The small tables linked to the fact table are called dimension tables, because they contain data that describes properties of the facts. ● Dimension tables consist of things like nomenclatures and other nearly-immutable data: e.g., the list of states and cities, the list of points of sale etc. ● Dimension tables are rarely amended.
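As a sketch, the bookstore star schema used later in the deck could look like this (column lists are illustrative; ColumnStore needs no indices or foreign keys):

```sql
-- Dimension: small, rarely changes.
CREATE TABLE books (
  book_id INT,
  title   VARCHAR(200),
  genre   VARCHAR(50)
) ENGINE=ColumnStore;

-- Fact: big, appended on every data load, ordered by trans_date.
CREATE TABLE transactions (
  trans_date       DATE,
  book_id          INT,           -- reference into the dimension
  customer_id      INT,
  discount         DECIMAL(4,2),
  discounted_price DECIMAL(8,2)
) ENGINE=ColumnStore;
```

A JOIN between transactions and books can then be distributed, with the small books table redistributed to the PMs on-the-fly.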
  • 24. CLUSTERING THE SCHEMA ● Adding a second layer of links produces a more elaborate design, sometimes called a snowflake schema. ● In a multi-tier (snowflake) schema, a table may be a dimension to one level and a fact to another: e.g., the list of telco subscribers may be a fact (linked to dimensions like the subscription plan), but also a dimension (to which the list of phone calls links).
  • 25. CLUSTERING THE SCHEMA: SNOWFLAKE (diagram: PHONE CALLS → USERS → PLANS) Source: Wikipedia
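The telco snowflake from the diagram could be sketched as follows (names are illustrative):

```sql
-- Dimension only.
CREATE TABLE plans (
  plan_id   INT,
  plan_name VARCHAR(50)
) ENGINE=ColumnStore;

-- Fact with respect to plans, dimension with respect to phone_calls.
CREATE TABLE users (
  user_id INT,
  plan_id INT
) ENGINE=ColumnStore;

-- The fact table proper: one row per call, appended on every load.
CREATE TABLE phone_calls (
  call_time    DATETIME,
  user_id      INT,
  duration_sec INT
) ENGINE=ColumnStore;
```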
  • 26. OPTIMIZING THE QUERIES ● An important prerequisite for properly designing a schema is to know how it is going to be used. ● Ensure the queries and the star schema match each other. ● Always JOIN a fact table to a dimension table only. Never JOIN two fact tables! ● As each column is a separate set of files, the more columns are requested in the result set, the more data has to be read from disk; always specify the exact columns needed, and only those; never do SELECT *
  • 27. OPTIMIZING THE QUERIES ● Filter on sequential columns as much as possible. ● Filter on actual values, not on functions: functions prevent extent elimination and lead to a full column scan. Make extra separate columns if needed, e.g. have a separate column year instead of YEAR(date). ● ORDER BY and LIMIT run last and always on the UM, so they can be expensive (depending on the amount of data). ● A JOIN with a table from a local storage engine (InnoDB, MyISAM...) is done by first fetching the local table from the UM. As this requires a loopback connection, it is often relatively slow – consider its usage carefully.
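A sketch of the function-vs-value rule, on the bookstore fact table (the trans_year column is a hypothetical extra column populated at load time):

```sql
-- Bad: the function call defeats extent elimination; every
-- trans_date extent must be scanned.
SELECT COUNT(*) FROM transactions WHERE YEAR(trans_date) = 2018;

-- Better: a range over stored values lets the in-memory MIN/MAX
-- metadata eliminate extents before any disk read.
SELECT COUNT(*) FROM transactions
WHERE trans_date BETWEEN '2018-01-01' AND '2018-12-31';

-- Alternative: maintain a separate trans_year column at load time
-- and filter on it directly.
SELECT COUNT(*) FROM transactions WHERE trans_year = 2018;
```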
  • 28. OPTIMIZING DIMENSIONS ● Keep dimensions small (up to 1M rows) as they will be redistributed on-the-fly for each JOIN. ● Increase the distributed JOIN threshold for bigger dimensions (but carefully). This is a cluster-wide tunable from Columnstore.xml.
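As a sketch, the relevant Columnstore.xml setting is commonly PmMaxMemorySmallSide, the per-PM memory allowed for the small side of a JOIN; verify the exact parameter name and editing tool for your ColumnStore version before changing it:

```xml
<!-- Columnstore.xml fragment (assumption: parameter name as in
     common ColumnStore releases); raise it so larger dimensions
     still get a distributed JOIN instead of a UM-side one. -->
<HashJoin>
  <PmMaxMemorySmallSide>1G</PmMaxMemorySmallSide>
</HashJoin>
```

Raise it cautiously: the small side is redistributed into PM memory on every JOIN, so an oversized threshold trades query latency for RAM pressure on every PM.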
  • 29. EXTENDING COLUMNSTORE ENGINE ● The ColumnStore engine might not always be the best choice (e.g., data type support, encoding support etc.). ● Local storage engines on the UM may supplement the ColumnStore engine via a cross-engine JOIN. ● Usually multiple UMs are replicated, so tables from local storage engines are replicated too… but in some special cases you may want not to replicate them and effectively have different content for the same local table on different UMs; in this case, make sure such jobs run only on the UM they are connected to (access to the ExeMgr process).
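A sketch of such a cross-engine JOIN (the currency_rates table and the currency column on transactions are hypothetical):

```sql
-- Small, frequently-updated lookup table kept in InnoDB on the UM...
CREATE TABLE currency_rates (
  currency CHAR(3),
  rate     DECIMAL(10,4)
) ENGINE=InnoDB;

-- ...joined against the ColumnStore fact table. The InnoDB side is
-- fetched on the UM first via a loopback connection, so keep it small.
SELECT t.customer_id, t.discounted_price * r.rate AS price_eur
FROM transactions t
JOIN currency_rates r ON r.currency = t.currency;
```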
  • 30. TRACING YOUR STEPS: EXPLAIN
  ● EXPLAIN works for ColumnStore, but is less useful (no indices):

MariaDB [bookstore]> EXPLAIN SELECT t.customer_id, t.discount, t.discounted_price
    -> FROM transactions t
    -> JOIN books b ON b.book_id = t.book_id
    -> WHERE t.trans_date BETWEEN '2018-01-01' AND '2018-01-31';
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                              |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
|    1 | SIMPLE      | t     | ALL  | NULL          | NULL | NULL    | NULL | 2000 | Using where with pushed condition  |
|    1 | SIMPLE      | b     | ALL  | NULL          | NULL | NULL    | NULL | 2000 | Using where;                       |
|      |             |       |      |               |      |         |      |      | Using join buffer (flat, BNL join) |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
  • 31. TRACING YOUR STEPS: STATS
  ● Use the calGetStats() function, which reports the resources used by the last run query on the User Module (UM) node, the PM nodes and the network.

582979 rows in set (3.373 sec)

MariaDB [bookstore]> SELECT calGetStats();
+---------------------------------+
| Query Stats:                    |
| MaxMemPct-3;                    |
| NumTempFiles-0;                 |
| TempFileSpace-0B;               |
| ApproxPhyI/O-71674;             |
| CacheI/O-47150;                 |
| BlocksTouched-47128;            |
| PartitionBlocksEliminated-1413; |
| MsgBytesIn-37MB;                |
| MsgBytesOut-63KB;               |
| Mode-Distributed                |
+---------------------------------+
  • 32. TRACING YOUR STEPS: TRACE
  ● To trace a query, first enable tracing with SELECT calSetTrace(1), then run the query, then get the trace with SELECT calGetTrace().

MariaDB [bookstore]> SELECT calGetTrace();
Desc  Mode  Table  TableOID  ReferencedColumns                                            PIO  LIO    PBE   Elapsed  Rows
BPS   PM    b      301760    (book_id)                                                    0    7      0     0.002    5001
BPS   PM    t      301805    (book_id,customer_id,discount,discounted_price,trans_date)   0    17280  1413  0.308    582979
HJS   PM    t-b    301805    -                                                            -    -      -     -----    -
TNS   UM    -      -         -                                                            -    -      -     2.476    582979
  • 34. MOVING DATA TO COLUMNSTORE ● Scenario A: Use the same schema as in transactional. ● Only use ColumnStore as long-term cold storage for large amounts of data. ● No OLAP as schema does not match requirements. ● No OLTP as data is too big. ● Copy selected parts of the data back to OLTP engine for processing.
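The copy-back step in Scenario A can be a plain INSERT ... SELECT (table names are illustrative; note the explicit column list and the range predicate on the sequential column, per the earlier query guidelines):

```sql
-- Pull one month of cold data back into an InnoDB working table
-- for OLTP-style processing.
INSERT INTO oltp_db.transactions_jan
SELECT customer_id, book_id, trans_date, discount, discounted_price
FROM bookstore.transactions
WHERE trans_date BETWEEN '2018-01-01' AND '2018-01-31';
```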
  • 35. MOVING DATA TO COLUMNSTORE ● Scenario B: Use a dedicated star schema. ● Actively use ColumnStore as an OLAP backend. ● Load the data from the OLTP storage in batches: ETL with either LOAD DATA or the Bulk Load API (preferred: cpimport, shared library/JAR). ● Use any preferred front-end tool to drive the analytics (Tableau, Pentaho Mondrian, Microsoft SSAS, Apache Zeppelin…).
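A sketch of the batch load itself; cpimport writes directly into extents, bypassing the SQL layer, while LOAD DATA goes through it (file path and delimiter are illustrative):

```sql
-- Via the SQL layer:
LOAD DATA INFILE '/tmp/transactions_sorted.csv'
INTO TABLE transactions
FIELDS TERMINATED BY '|';

-- Or, from the shell, via the bulk loader (-s sets the delimiter):
-- cpimport bookstore transactions /tmp/transactions_sorted.csv -s '|'
```

Either way, keep the input file pre-sorted by the sequential column so extent elimination keeps working after the load.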
  • 37. Q&A ● Ask questions now… ● … or ask later. We are here for you!