Big Data has been around long enough that there are some common issues that occur whenever an organization tries to implement and integrate it into their ecosystem. This presentation covers some of those pitfalls, which also impact traditional data warehouses/business intelligence ecosystems
3. 3
So What is it?
●
Misnomer and marketing speak
●
“Unstructured” data
– Text heavy
– Without obvious/clear structure
●
Comes from many places, in many styles
11. 11
Hadoop to the Rescue
● Cross system analytics?
● Data quality confidence?
● Source of truth?
● Tool chain support?
● Giant yellow elephants?
12. 12
Hadoop to the Rescue
● Cross system analytics?
● Data quality confidence?
● Source of truth?
● Tool chain support?
● Giant yellow elephants?
If any are ignored...
18. 18
Data Warehouse Models
● Traditional models don't cover semi-
structured data
● Modern models are hybrids that cross the
structured semi-structured boundary
20. 20
Data Vault
● Developed by Dan Linstedt
● Tie technical keys across structured and semi-structured data sources
● Semi-structured data can me made more structured and loaded into relational data
vault
● Tools have to support crossing sources
● More details: http://www.tdan.com/view-articles/5054/
22. 22
Anchor
● Developed by Lars Rönnbäck
● 6th normal form data warehouse
● Have to transform semi-structured data to match the anchor model
● Provides flexible model that should be able to have marts built upon it
● More details: http://www.anchormodeling.com/
23. 23
Textual Disambiguation
● Developed by Bill Inmon
● Breaking semi-structured data down by context
● Converts the data into structured format, consumable by tools
● Store data within the data warehouse – 8th/9th normal form
● White papers and more details are on Bill's website:
http://www.forestrimtech.com/
25. 25
Working With “Unstructured” Data
● Most data tools require structure (Database schema, clear-cut data formatting)
● Business and technical knowledge required
– Business to provide the pattern “the grammar or syntax”
– Technical to provide the “how”
29. 29
How to Implement
● Map/Reduce code, Hive queries, data integration tools (Pentaho, Talend)
● Have to create the grammar/syntax rules for particular business
● MDM is _not_ the solution
● Best to have a data warehouse based on subject/relationships
– Data Vault
– Anchor
– Textual Disambiguation
30. 30
Data Symbiosis
● Data in data lake can't stand on it's own
– Ties back to rest of the structured data
– Requires firm understanding of business rules/logic
● Provides richer data sets
● Difficult to do before data lakes, after adding a data lake the problems magnify
– But so do the rewards!
31. 31
Data Quality
● Not just a problem for Data Warehouses!
● Measuring “fit for purpose”
● Same rules used for data warehouses
apply to big data
32. 32
Principles of Data Quality
● Consistency
● Correctness
● Timeliness
● Precision
● Unambiguous
● Completeness
● Reliability
● Accuracy
● Objectivity
● Conciseness
● Usefulness
● Usability
● Relevance
● Quantity
Source: Data Quality Fundamentals, The Data Warehouse Institute
33. 33
Why Data Quality?
● Main way to control/tame your data
problems
● Most hidden costs because it's hardest to
fix
● Target upstream for problem solutions
34. 34
How to Implement
● Data integration tools
● Custom coding (Map/Reduce, etc.)
● Data Profiling
● MDM (as central “dictionary”/”grammar”
handler)
36. 36
Does Your Tool Chain...
● Support Hadoop?
● Interface with non-traditional database solutions (i.e. not an RDBMS)?
● Allow for integration across disparate sources?
● Support data quality?
38. 38
Hadoop Ecosystem
● Bridges some of the gaps
– Hive – SQL to Hadoop interface (jdbc support)
● Provides even more power
https://hadoopecosystemtable.github.io/
Plus dozens of others... and growing