Big Data Pitfalls

April 8, 2015

So What is it?
● Misnomer and marketing speak
● “Unstructured” data
– Text heavy
– Without obvious/clear structure
● Comes from many places, in many styles

Hadoop to the Rescue
● Cross system analytics?
● Data quality confidence?
● Source of truth?
● Tool chain support?
● Giant yellow elephants?

If any are ignored...

Don't worry, even the Jedi had a Data Swamp...

Goal is to build a Data Reservoir

Reservoirs...
● Contain data that is...
– Managed
– Transformed
– Filtered
– Secured
– Portable
– Fit for purpose
Source: Gartner

Data Warehouse Models
● Traditional models don't cover semi-structured data
● Modern models are hybrids that cross the structured/semi-structured boundary

Data Vault
● Developed by Dan Linstedt
● Tie technical keys across structured and semi-structured data sources
● Semi-structured data can be made more structured and loaded into a relational data vault
● Tools have to support crossing sources
● More details: http://www.tdan.com/view-articles/5054/
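As a rough illustration of tying technical keys across sources, the sketch below derives the same hub key from a relational row and from a semi-structured JSON record. Hashing a normalized business key is a common Data Vault convention rather than something stated in these slides, and the field names (`customer_id`, `custId`) and the helper `hub_key` are hypothetical.

```python
import hashlib
import json


def hub_key(business_key: str) -> str:
    """Derive a deterministic technical key from a business key (assumed convention)."""
    normalized = business_key.strip().upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Structured source: a row as it might come from a relational table (hypothetical).
crm_row = {"customer_id": "c-1001", "name": "Acme Corp"}

# Semi-structured source: a JSON event whose key field is named and formatted differently.
raw_event = '{"custId": " C-1001 ", "text": "The car is hot."}'
event = json.loads(raw_event)

crm_key = hub_key(crm_row["customer_id"])
event_key = hub_key(event["custId"])

# Both records resolve to the same hub entry, so data from either source can attach to it.
print(crm_key == event_key)
```

The normalization step (trim and upper-case) is the part that usually needs business input, since it decides which source values count as the same key.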

Anchor
● Developed by Lars Rönnbäck
● 6th normal form data warehouse
● Have to transform semi-structured data to match the anchor model
● Provides a flexible model on which marts can be built
● More details: http://www.anchormodeling.com/
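To make the "transform to match the anchor model" step more concrete, here is a minimal sketch that splits semi-structured records into one identity table plus one table per attribute, in the spirit of 6th normal form. It is a simplification under stated assumptions: it ignores the historization, ties, and knots of a full Anchor model, and the function name `to_anchor_model` and the sample records are hypothetical.

```python
from collections import defaultdict


def to_anchor_model(records, anchor_name):
    """Split records into an anchor (identity) table and one table per attribute."""
    anchors = []
    attributes = defaultdict(list)
    for surrogate_id, record in enumerate(records, start=1):
        # The anchor row carries only the surrogate identity.
        anchors.append({"id": surrogate_id})
        for field, value in record.items():
            if value is None:
                continue  # missing values simply produce no attribute row
            attributes[f"{anchor_name}_{field}"].append(
                {"id": surrogate_id, "value": value}
            )
    return anchors, dict(attributes)


# Semi-structured input: records do not share the same set of fields (hypothetical data).
records = [
    {"name": "Acme", "phone": None, "tags": "fleet"},
    {"name": "Globex", "phone": "555-0100"},
]

anchors, attributes = to_anchor_model(records, "customer")
print(sorted(attributes))
```

Because each attribute lives in its own table, a record with a new or missing field adds or omits rows without forcing a schema change on the others.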

Textual Disambiguation
● Developed by Bill Inmon
● Breaking semi-structured data down by context
● Converts the data into structured format, consumable by tools
● Store data within the data warehouse – 8th/9th normal form
● White papers and more details are on Bill's website:
http://www.forestrimtech.com/

Source: http://www.slideshare.net/Roenbaeck/anchor-modeling-8140128

Working With “Unstructured” Data
● Most data tools require structure (database schema, clear-cut data formatting)
● Business and technical knowledge required
– Business to provide the pattern (the “grammar” or “syntax”)
– Technical to provide the “how”

Working With “Unstructured” Data
“The car is hot.”

Identifying Context
● It's a really nice car.
● Its internal temperature requires adjustment.
● It's hot to the touch.
● It's on fire.

How to Implement
● Map/Reduce code, Hive queries, data integration tools (Pentaho, Talend)
● Have to create the grammar/syntax rules for the particular business
● MDM is _not_ the solution
● Best to have a data warehouse based on subject/relationships
– Data Vault
– Anchor
– Textual Disambiguation
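The idea of business-supplied grammar rules can be sketched for the “The car is hot.” example. This is only an illustration, assuming the source system is the available context signal; the rule table `CONTEXT_RULES`, the source names, and the function `disambiguate` are all hypothetical, and a real implementation might run as Map/Reduce code or inside a data integration tool.

```python
import re

# Hypothetical business rules: the source system decides what "hot" means.
CONTEXT_RULES = {
    "dealer_review": "appealing",
    "climate_sensor": "cabin temperature too high",
    "service_ticket": "hot to the touch",
    "incident_report": "on fire",
}

# The "grammar": a very small pattern for "the <subject> is hot".
HOT_PATTERN = re.compile(r"\bthe (?P<subject>\w+) is hot\b", re.IGNORECASE)


def disambiguate(text, source):
    """Turn a free-text sentence into a structured row, or None if no rule matches."""
    match = HOT_PATTERN.search(text)
    if match is None:
        return None
    return {
        "subject": match.group("subject").lower(),
        "attribute": "hot",
        "meaning": CONTEXT_RULES.get(source, "unknown"),
        "source": source,
    }


row = disambiguate("The car is hot.", "incident_report")
print(row)
```

The business side owns the rule table; the technical side owns the pattern matching and the loading of the resulting rows into the warehouse.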

Data Symbiosis
● Data in a data lake can't stand on its own
– Ties back to rest of the structured data
– Requires firm understanding of business rules/logic
● Provides richer data sets
● Difficult to do before data lakes; after adding a data lake, the problems magnify
– But so do the rewards!

Data Quality
● Not just a problem for Data Warehouses!
● Measuring “fit for purpose”
● The same rules used for data warehouses apply to big data

Principles of Data Quality
● Consistency
● Correctness
● Timeliness
● Precision
● Unambiguous
● Completeness
● Reliability
● Accuracy
● Objectivity
● Conciseness
● Usefulness
● Usability
● Relevance
● Quantity
Source: Data Quality Fundamentals, The Data Warehouse Institute

Why Data Quality?
● Main way to control/tame your data problems
● Carries the most hidden costs because it's the hardest to fix
● Target upstream for problem solutions

How to Implement
● Data integration tools
● Custom coding (Map/Reduce, etc.)
● Data Profiling
● MDM (as central “dictionary”/“grammar” handler)
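Data profiling, one of the options above, can be sketched as a small function that measures two of the listed principles, completeness and consistency, for a single field. The metric definitions, the function name `profile`, and the sample ZIP code data are assumptions made for illustration, not taken from the slides.

```python
import re
from collections import Counter


def profile(records, field, pattern=None):
    """Summarize one field: how often it is filled in and how often it fits a format."""
    values = [record.get(field) for record in records]
    present = [value for value in values if value not in (None, "")]
    report = {
        "rows": len(values),
        "completeness": len(present) / len(values) if values else 0.0,
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(3),
    }
    if pattern is not None:
        regex = re.compile(pattern)
        matching = sum(1 for value in present if regex.fullmatch(str(value)))
        # Consistency here means: share of non-empty values matching the expected format.
        report["consistency"] = matching / len(present) if present else 0.0
    return report


# Hypothetical sample: one missing value and one malformed ZIP code.
records = [{"zip": "27601"}, {"zip": "2760"}, {"zip": None}, {"zip": "27601"}]
report = profile(records, "zip", pattern=r"\d{5}")
print(report)
```

Running the same profile against a warehouse table and a data lake extract gives a like-for-like view of whether each is fit for purpose.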

Does Your Tool Chain...
● Support Hadoop?
● Interface with non-traditional database solutions (i.e. not an RDBMS)?
● Allow for integration across disparate sources?
● Support data quality?

Hadoop Ecosystem
● Bridges some of the gaps
– Hive – SQL-to-Hadoop interface (JDBC support)
● Provides even more power
https://hadoopecosystemtable.github.io/
Plus dozens of others... and growing


