The document discusses key considerations for designing a data warehouse, including building a logical design, transitioning to a physical design, and monitoring and tuning the design. It recommends using a modeling tool to capture logical designs, manual partitioning in some cases, and letting database engines do the work. It also covers physical design decisions like SQL vs NoSQL, row vs column storage, partitioning, indexing and optimizing data loads. Regular monitoring of workloads, bottlenecks and ratios is advised to tune performance.
5. What is the key component for success? In other words, what you do with your MySQL Server – in terms of physical design, schema design, and performance design – will be the biggest factor in whether a BI system hits the mark… * Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
8. Simple reporting databases: just use the same design on a different box… [Diagram labels: End Users, Application Servers, OLTP Database, Read Shard One, Reporting Database, Replication, ETL]
17. SQL or NoSQL…? Row or Column database…? How to scale…? Should I worry about High availability…? Index or no…? How should I partition my data…? Is sharding a good idea…?
21. What technologies you should be looking at * Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
22. Row or column-based engine? Yes, column-based tables, if: medium to very large data; very dynamic, query patterns change; need very fast loads, little DML; only need a subset of columns for a query. Yes, row-based tables, if: small to medium data; know exactly what to index, and it won’t change; will be doing lots of single inserts/deletes; will need most columns in a table for a query.
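The checklist on this slide can be read as a simple vote between the two engine types. The sketch below is only an illustration of that reading: the `Workload` fields, the `suggest_orientation` function, and the vote thresholds are all hypothetical and do not come from the slides. It is a starting point for discussion, not a sizing rule.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical yes/no answers to the four questions on the slide."""
    large_data: bool             # medium to very large data volume
    dynamic_queries: bool        # query patterns change often
    bulk_loads_little_dml: bool  # fast bulk loads, few single-row changes
    reads_few_columns: bool      # queries touch only a subset of columns


def suggest_orientation(w: Workload) -> str:
    # Each True answer is one vote for a column-based engine.
    votes = sum([
        w.large_data,
        w.dynamic_queries,
        w.bulk_loads_little_dml,
        w.reads_few_columns,
    ])
    # The thresholds are an assumption made for this sketch.
    if votes >= 3:
        return "column"
    if votes <= 1:
        return "row"
    return "mixed: benchmark both"


if __name__ == "__main__":
    analytics = Workload(True, True, True, True)
    print(suggest_orientation(analytics))
```

A workload that splits the votes evenly is exactly the case where measuring both engines on real queries is more useful than any checklist.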
23. Column vs. row orientation A column-oriented architecture looks the same on the surface, but stores data differently than legacy/row-based databases…
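To make the storage difference concrete, here is a minimal in-memory sketch, assuming a made-up four-column table. The data, column names, and helper functions are invented for illustration; real engines add compression, paging, and on-disk formats that this toy ignores.

```python
# Hypothetical sample table: (id, name, country, amount).
COLUMNS = ("id", "name", "country", "amount")
rows = [
    (1, "alice", "US", 120.0),
    (2, "bob", "DE", 75.5),
    (3, "carol", "US", 42.0),
]

# Row orientation: all fields of one record are stored together.
row_store = list(rows)

# Column orientation: all values of one column are stored together.
column_store = {
    name: [record[i] for record in rows]
    for i, name in enumerate(COLUMNS)
}


def total_amount_row_store(store):
    # Every full record is visited even though only one field is needed.
    return sum(record[3] for record in store)


def total_amount_column_store(store):
    # Only the "amount" column is read; the other columns are never touched.
    return sum(store["amount"])


if __name__ == "__main__":
    print(total_amount_row_store(row_store))
    print(total_amount_column_store(column_store))
```

Both functions return the same total. The difference is how much data each one has to touch, which is why queries that need only a subset of columns tend to favor the column layout.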
24. Example: InfiniDB vs. “Leading” row DB. InfiniDB takes up 22% less space. InfiniDB loaded data 22% faster. InfiniDB total query times were 65% less. InfiniDB average query times were 59% less. Notice that the queries are not only faster but also more predictable. * Tests run on a standalone machine: 16 CPU, 16GB RAM, CentOS 5.4, with 2TB of raw data.