2. Agenda
1. Bring the data together and then transform
2. Data modeling for pragmatism
3. Let the datasets talk to each other
4. Beef up sandbox and backfill
5. Performance tuning
6. Examples, questions and comments
3. ELT and RIT
● ELT =
○ Extract from Source
○ Load to HDFS/MPP
○ Transform and Integrate inside Hadoop/MPP
● RIT =
○ Replicate message/log to Queue (JMS/Kafka/CDC)
○ Stream from Queue and Ingest to HDFS/MPP
○ Transform and Integrate inside Hadoop/MPP
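A minimal sketch of the ELT ordering defined above, using Python's sqlite3 module as a stand-in for the Hadoop/MPP target; the table and column names (raw_orders, orders_clean) are illustrative, not from any real system:

```python
# Minimal ELT sketch: extract from the source, land the data raw in the
# target, then transform/cleanse inside the target with SQL.
import sqlite3

# -- Extract: rows pulled from the source system (hard-coded here) --
source_rows = [
    ("1001", "2014-06-01", " 25.00"),
    ("1002", "2014-06-01", "105.50"),
    ("1003", None,         " 17.25"),   # dirty record: missing date
]

target = sqlite3.connect(":memory:")

# -- Load: land the data raw, untyped, into a queryable staging area --
target.execute("CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT)")
target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# -- Transform: cleanse and type the data inside the target, not the source --
target.execute("""
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER)   AS order_id,
           order_date,
           CAST(TRIM(amount) AS REAL)  AS amount
    FROM raw_orders
    WHERE order_date IS NOT NULL
""")

print(target.execute("SELECT * FROM orders_clean").fetchall())
```

The point is the order of operations: the data lands raw and queryable first, and all cleansing and typing happens inside the target system.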
4. Why ELT and RIT (instead of ETL)
● Store related raw data together so it can be leveraged more effectively downstream
● A big, queryable staging area is quite useful
● Reduce workload impact on source systems (see the RIT sketch after this list)
● Write data cleansing and business logic in the same languages/scripting used by BI
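The RIT half of the pattern is what helps most with the workload bullet: changes are replicated once to a queue instead of being repeatedly re-extracted from the source. Below is a minimal sketch with Python's queue.Queue standing in for JMS/Kafka/CDC and a plain list standing in for the HDFS/MPP staging area; the event fields are illustrative:

```python
# Minimal RIT sketch: replicate change events to a queue, stream/ingest them
# into a raw staging area, then transform/integrate downstream.
import json
import queue

events = queue.Queue()          # stand-in for the message queue
staging = []                    # stand-in for the raw staging area

# -- Replicate: the source (or a CDC agent) publishes changes as messages --
for change in [{"order_id": 1001, "status": "NEW"},
               {"order_id": 1001, "status": "SHIPPED"}]:
    events.put(json.dumps(change))

# -- Stream & Ingest: consume messages and land them untouched --
while not events.empty():
    staging.append(json.loads(events.get()))

# -- Transform & Integrate: derive the latest status per order downstream --
latest = {}
for row in staging:
    latest[row["order_id"]] = row["status"]
print(latest)   # {1001: 'SHIPPED'}
```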
5. Is Data Modeling still Important
● Do we still need to model the data in the era
of NoSQL and Big Data?
● Should we de-normalize/pre-join everything?
● Should we use hierarchical JSON/XML and/or key-value pairs for everything?
● It is a balance and trade-off: analytics, reusability, metadata-driven design, and data size vs. ease of querying (see the sketch after this list)
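A tiny illustration of the denormalize-vs-model trade-off raised on this slide, in plain Python; the customer and order fields are made up for the example:

```python
# The same orders held as a pre-joined, denormalized document vs. two
# normalized tables.

# Denormalized / document style: easy to read in one shot, no JOIN needed,
# but the customer attributes are duplicated on every order.
orders_doc = [
    {"order_id": 1, "amount": 25.0,
     "customer": {"id": 7, "name": "Acme", "segment": "Enterprise"}},
    {"order_id": 2, "amount": 99.0,
     "customer": {"id": 7, "name": "Acme", "segment": "Enterprise"}},
]

# Normalized / relational style: no duplication, one place to correct the
# segment, but a join (here a dict lookup) is required at query time.
customers = {7: {"name": "Acme", "segment": "Enterprise"}}
orders = [{"order_id": 1, "customer_id": 7, "amount": 25.0},
          {"order_id": 2, "customer_id": 7, "amount": 99.0}]

# "Join" at read time to rebuild the same view the document had.
joined = [{**o, **customers[o["customer_id"]]} for o in orders]
print(joined)
```

The pre-joined document is easy to query but repeats the customer on every order; the normalized form avoids the duplication at the cost of a join.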
6. Is Data Modeling still Important
● Cluster all attributes and child objects into a tree structure (thinking the NoSQL way)
● Can we live without the JOIN operator?
● Are mutable/updatable datasets still useful?
● Why not snapshot everything? Why SCD2? (see the SCD Type 2 sketch after this list)
● Is the relational model outdated?
● Model it at the source or fix it in the report?
● All or none: index, hash, or full scan?
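To make the SCD2-vs-snapshot question concrete, here is a minimal Slowly Changing Dimension Type 2 sketch in plain Python; the column names (valid_from, is_current, etc.) are illustrative:

```python
# Apply a change to a dimension as an SCD Type 2 update: close the current
# row and add a new one, instead of re-snapshotting the whole table.
from datetime import date

dim_customer = [
    # one current row per customer; history kept via effective dates
    {"customer_id": 7, "segment": "SMB",
     "valid_from": date(2013, 1, 1), "valid_to": None, "is_current": True},
]

def scd2_update(dim, customer_id, new_segment, as_of):
    """Close the current row and append a new current row (SCD Type 2)."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = as_of
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "segment": new_segment,
                "valid_from": as_of, "valid_to": None, "is_current": True})

scd2_update(dim_customer, 7, "Enterprise", date(2014, 6, 1))
for row in dim_customer:
    print(row)
```

History is kept as extra rows with effective dates rather than by snapshotting the entire table every day.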
7. Integration Brings the True Value
● Like the idea of SOA, but be careful with DQ
● Data producers are loosely coupled for the sake of scalability
● Integration and cross-referencing are deferred to the DW/BI layer, yet someone has to do it
● How can a unique identifier help here?
● Replicate dim/ref/lkp data and federate tx/event data
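A small sketch of the last bullet in plain Python: the lookup data is replicated to every system, the event data stays with its producer, and the federation joins on a shared unique identifier; all names are illustrative:

```python
# Replicate dim/ref/lkp, federate tx/event, cross-reference by a shared key.

# Replicated reference data: an identical copy lives in each system.
product_lkp = {"P1": "Widget", "P2": "Gadget"}

# Events kept in two loosely coupled producer systems.
web_events = [{"order_id": "A-1", "product_id": "P1", "amount": 25.0}]
erp_events = [{"order_id": "A-1", "product_id": "P1", "shipped": True}]

# Federate: join across systems on the shared unique identifier (order_id),
# enriching with the replicated lookup instead of calling back to the source.
by_id = {e["order_id"]: dict(e) for e in web_events}
for e in erp_events:
    by_id.setdefault(e["order_id"], {}).update(e)
for order in by_id.values():
    order["product_name"] = product_lkp.get(order.get("product_id"), "UNKNOWN")
print(by_id)
```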
8. Profile, Prototype, Deploy, Backfill
● Profile the data to understand it (see the profiling sketch after this list)
● Prototype with real data
● Enrich and harden the derived data in a sandbox before deploying to production
● Be ready to backfill the data because it will happen (easier to produce or to consume?)
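A minimal profiling sketch in plain Python, computing row count, null rate, and distinct-value count per column; the sample rows are made up for the example:

```python
# Profile a small dataset: per-column null rate and distinct count.
from collections import defaultdict

rows = [
    {"order_id": 1, "country": "US", "amount": 25.0},
    {"order_id": 2, "country": None, "amount": 99.0},
    {"order_id": 3, "country": "US", "amount": None},
]

def profile(rows):
    stats = defaultdict(lambda: {"nulls": 0, "values": set()})
    for row in rows:
        for col, val in row.items():
            if val is None:
                stats[col]["nulls"] += 1
            else:
                stats[col]["values"].add(val)
    for col, s in stats.items():
        print(f"{col}: rows={len(rows)} "
              f"null_rate={s['nulls'] / len(rows):.0%} "
              f"distinct={len(s['values'])}")

profile(rows)
```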
9. Self-service is Cool, but
● Strong automation tools must be built first
● Software can monitor and throttle workloads (see the admission-check sketch after this list)
● If a user’s job gets killed, there needs to be
enough info/clue to explain why and how to
(try to) fix it
● Education and knowledge sharing are essential; are wiki pages/runbooks good enough?
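One possible shape for the monitor-and-throttle idea, sketched in plain Python: an admission check that rejects a job with a message explaining why and what to try next. The limits and job fields are made up for the example:

```python
# Tiny admission check for self-service jobs: accept, or reject with enough
# context for the user to understand the reason and a next step to try.

LIMITS = {"max_runtime_min": 60, "max_scan_gb": 500}   # illustrative limits

def admit(job):
    """Return (accepted, message) for a submitted job description."""
    if job["estimated_runtime_min"] > LIMITS["max_runtime_min"]:
        return False, (f"Job '{job['name']}' rejected: estimated runtime "
                       f"{job['estimated_runtime_min']} min exceeds the "
                       f"{LIMITS['max_runtime_min']} min limit. "
                       "Try narrowing the date range or adding a partition filter.")
    if job["estimated_scan_gb"] > LIMITS["max_scan_gb"]:
        return False, (f"Job '{job['name']}' rejected: would scan "
                       f"{job['estimated_scan_gb']} GB (limit "
                       f"{LIMITS['max_scan_gb']} GB). "
                       "Try selecting fewer columns or a pre-aggregated table.")
    return True, f"Job '{job['name']}' accepted."

ok, msg = admit({"name": "adhoc_clicks", "estimated_runtime_min": 90,
                 "estimated_scan_gb": 120})
print(msg)
```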
10. Performance Matters
● Good instrumentation/logging will pay off big time in performance tuning (see the sketch after this list)
● Can we run complex OLAP reports on top of operational metadata?
● Where is tuning needed the most?
● Swapping (spill to disk) can be a big issue
● Detect and kill the bad jobs early
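A minimal instrumentation sketch in plain Python: a decorator records a timing row per pipeline step into an operational-metadata list, which tuning reports or an early bad-job detector could later query. The 1-second threshold and step name are illustrative:

```python
# Record per-step timings as operational metadata and warn on slow steps.
import time
from functools import wraps

op_metadata = []            # stand-in for an operational-metadata table

def instrumented(step_name):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            elapsed = time.time() - start
            op_metadata.append({"step": step_name, "seconds": elapsed})
            if elapsed > 1.0:                      # detect slow steps early
                print(f"WARNING: {step_name} took {elapsed:.2f}s")
            return result
        return wrapper
    return decorator

@instrumented("load_raw_orders")
def load_raw_orders():
    time.sleep(0.1)                                # pretend to do work

load_raw_orders()
print(op_metadata)
```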
11. Examples
1. Data exploration in MPP/Hadoop instead of in source systems (don't let the brain wait)
2. Web click stream and backend transactions
3. Replicate/synchronize a rollup hierarchy (mapping/lookup) to multiple data systems; then produce near-real-time aggregations in each system; finally federate the aggregates
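Example 3 sketched in plain Python: the rollup hierarchy is replicated, each system aggregates its own events to the category level, and only the small category-level results are federated; all names and numbers are made up:

```python
# Replicated rollup hierarchy, per-system aggregation, federated aggregates.
from collections import Counter

rollup = {"P1": "Hardware", "P2": "Hardware", "P3": "Software"}  # replicated

web_sales = [("P1", 25.0), ("P3", 10.0)]      # events owned by system A
store_sales = [("P2", 40.0), ("P1", 5.0)]     # events owned by system B

def aggregate(sales):
    """Roll product-level sales up to category level inside one system."""
    totals = Counter()
    for product_id, amount in sales:
        totals[rollup[product_id]] += amount
    return totals

# Each system produces its own near-real-time aggregate...
web_agg, store_agg = aggregate(web_sales), aggregate(store_sales)

# ...and the federation layer only has to combine small category-level results.
print(web_agg + store_agg)    # Counter({'Hardware': 70.0, 'Software': 10.0})
```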