This document summarizes a presentation on self-service data analysis, data wrangling, data munging, and how they fit together with data modeling. It discusses how these techniques allow business stakeholders and data scientists to prepare and transform data for analysis without extensive technical expertise. While these tools increase flexibility, they can also decrease governance if not used properly. The document advocates finding a balance between managed data assets and exploratory analysis to maximize insights while maintaining data quality.
1. Self-Service Data Analysis, Data Wrangling, Data Munging,
and Data Modeling - How Do They Fit Together?
Donna Burbank
Global Data Strategy Ltd.
Lessons in Data Modeling DATAVERSITY Series
June 22nd, 2017
2. Global Data Strategy, Ltd. 2017
Donna Burbank
Donna is a recognised industry expert in
information management with over 20
years of experience in data strategy,
information management, data modeling,
metadata management, and enterprise
architecture. Her background is multi-
faceted across consulting, product
development, product management, brand
strategy, marketing, and business
leadership.
She is currently the Managing Director at
Global Data Strategy, Ltd., an international
information management consulting
company that specializes in the alignment
of business drivers with data-centric
technology. In past roles, she has served in
key brand strategy and product
management roles at CA Technologies and
Embarcadero Technologies for several of
the leading data management products in
the market.
As an active contributor to the data
management community, she is a long
time DAMA International member, Past
President and Advisor to the DAMA Rocky
Mountain chapter, and was recently
awarded the Excellence in Data
Management Award from DAMA
International in 2016. She was on the
review committee for the Object
Management Group’s (OMG) Information
Management Metamodel (IMM) and the
Business Process Modeling Notation
(BPMN). Donna is also an analyst at the
Boulder BI Brain Trust (BBBT) where she
provides advice and gains insight on the
latest BI and Analytics software in the
market.
She has worked with dozens of Fortune
500 companies worldwide in the Americas,
Europe, Asia, and Africa and speaks
regularly at industry conferences. She has
co-authored two books: Data Modeling for
the Business and Data Modeling Made
Simple with ERwin Data Modeler and is a
regular contributor to industry
publications. She can be reached at
donna.burbank@globaldatastrategy.com
Donna is based in Boulder, Colorado, USA.
Follow on Twitter @donnaburbank
Today’s hashtag: #LessonsDM
3. Global Data Strategy, Ltd. 2017
Lessons in Data Modeling Series
• January 26th How Data Modeling Fits Into an Overall Enterprise Architecture
• February 23rd Data Modeling and Business Intelligence
• March Conceptual Data Modeling – How to Get the Attention of Business Users
• April The Evolving Role of the Data Architect – What does it mean for your Career?
• May Data Modeling & Metadata Management
• June Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling
• July Data Modeling & Metadata for Graph Databases
• August Data Modeling & Data Integration
• September Data Modeling & MDM
• October Agile & Data Modeling – How Can They Work Together?
• December Data Modeling, Data Quality & Data Governance
This Year’s Line Up
Related topic – Self Service BI
4. Global Data Strategy, Ltd. 2017
Agenda
• What is Self Service Data Prep, “Data Munging” and “Data Wrangling”?
• The Good, the Bad, and the Ugly
• Integrating the Data Warehouse & Data Lake
• Data Governance & Organizational Considerations
What we’ll cover today
5. Global Data Strategy, Ltd. 2017
What is Data Wrangling, Munging & Self-Service Data Prep?
Data wrangling (sometimes referred to as Data munging) is the process of transforming and
mapping data from one "raw" data form into another format with the intent of making it more
appropriate and valuable for a variety of downstream purposes such as analytics.
- Wikipedia, June 2017
Data munging … is sometimes used for vague data transformation steps that are not yet clear to
the speaker.
- Wikipedia, June 2017
As their name implies, the key ingredient of data preparation platforms is their ability to
provide self-service capabilities that allow knowledgeable users (but who are not IT experts) to
combine, transform and cleanse relevant data prior to analysis: to "prepare" it. Most tools in this
category are targeted at business analysts but there are products aimed more at data
scientists.
- Philip Howard, Bloor Research
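As a minimal illustration of the definitions above, the following Python sketch maps a few "raw" records into a cleaner form for analysis. The field names (cust, signup, spend), the sample values, and the two date formats are hypothetical, and the standard library stands in here for a dedicated data prep tool.

```python
from datetime import datetime

# Hypothetical raw export: padded strings, inconsistent casing, mixed date formats.
raw_rows = [
    {"cust": "  ACME corp ", "signup": "2017-06-22", "spend": "1,200.50"},
    {"cust": "globex", "signup": "22/06/2017", "spend": "300"},
]

def parse_date(text):
    """Try each assumed date format in turn; return None if none match."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def wrangle(row):
    """Map one raw record to a typed, consistently formatted record."""
    return {
        "customer_name": row["cust"].strip().title(),
        "signup_date": parse_date(row["signup"]),
        "spend": float(row["spend"].replace(",", "")),
    }

clean_rows = [wrangle(r) for r in raw_rows]
```

Both records end up with the same date type and a numeric spend value, which is the kind of "prepare" step the quotes describe.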
6. Global Data Strategy, Ltd. 2017
Aimed at Business Stakeholders & Data Scientists
• According to a recent DATAVERSITY survey on Emerging Trends in Data Architecture, new and
disparate roles are often involved in developing a data architecture.
• Below is a “sneak peek” of the results (due to be published in October).
What role(s) are typically responsible for creating a Data Architecture? [Select all that apply]
Answer                          Response Percent
Data Architect                  90.0%
Data Modeler                    65.3%
Enterprise Architect            66.5%
Business Architect              51.2%
Systems Developer               17.1%
Programmer                      16.5%
Database Administrator (DBA)    37.6%
Data Scientist                  27.6%
ETL or Database Developer       36.5%
Business Stakeholder(s)         32.9%
Program Manager                 12.9%
Data Quality Administrator      30.0%
Data Governance Officer         50.0%
Don't know                      2.4%
Other (please specify)          8.2%
While Data Architects & related roles are still
responsible for the bulk of data architecture decisions,
often with traditional ETL techniques,
Business Stakeholders and Data Scientists also play a
significant role, often with self-service data prep tools.
7. Global Data Strategy, Ltd. 2017
Sample Tools in the Self Service Data Prep Category
• The following products and vendors are commonly considered part of the Self Service Data
Preparation category.
• This list is not exhaustive and is not an endorsement of any product, but is meant to indicate the
type of product we’re talking about today.
• Pure Play Vendors
• Alation
• Alteryx
• Paxata
• Tamr
• Trifacta
• Traditional data integration vendors
• Informatica
• Syncsort (Unify)
• Etc.
• BI vendors
• Pentaho
• Tableau
• Qlik
• Etc.
8. Global Data Strategy, Ltd. 2017
Good Wrangling and Bad Wrangling
Bad Wrangling
• Performed because a solid data architecture is lacking – i.e. workarounds & cleanup.
• Done to avoid data governance restrictions.
• Increases Confusion & Slows Time to Insight
Good Wrangling
• Part of data exploration & analysis
• Done within data governance restrictions.
• Leverages defined standards (e.g. Reference Data)
• Produces Faster Time to Insight
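To make "leverages defined standards (e.g. Reference Data)" concrete, here is a minimal Python sketch that standardizes free-text country values against a governed code set. The code set, the variant mapping, and the function name standardize_country are all hypothetical; in practice the reference data would come from a managed source rather than being hard-coded.

```python
# Hypothetical reference data: the governed set of country codes.
COUNTRY_CODES = {"US", "GB", "DE"}

# Hypothetical mapping from common free-text variants to the standard code.
KNOWN_VARIANTS = {"usa": "US", "united states": "US", "uk": "GB", "germany": "DE"}

def standardize_country(value):
    """Return (value, ok): a governed code when one can be resolved, else the input flagged."""
    cleaned = value.strip()
    if cleaned.upper() in COUNTRY_CODES:
        return cleaned.upper(), True
    code = KNOWN_VARIANTS.get(cleaned.lower())
    if code is not None:
        return code, True
    return cleaned, False  # unresolved values are surfaced, not silently guessed

results = [standardize_country(v) for v in ["usa", " gb ", "Atlantis"]]
```

Flagging unresolved values instead of guessing keeps the wrangling within governance rules: the exception can be reviewed and, if valid, added to the reference data.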
9. Global Data Strategy, Ltd. 2017
The Reluctant Wrangler
Raw data used in Self-Service Analytics and BI environments is
often so poor that many data scientists and BI professionals
spend an estimated 50 – 90% of their time cleaning and
reformatting data to make it fit for purpose.
Source: DataCenterJournal.com
Correcting poor data quality is a Data Scientist’s least favorite
task, consuming on average 80% of their working day
Source: Forbes 2016
11. Global Data Strategy, Ltd. 2017
Reporting is Only as Good as the Underlying Architecture & Definitions
• Modern tools make it easy to create visual reports & graphs from data.
• But without business context, or “metadata”, these reports are of little value.
What does ‘F2’ refer to?
Are there standard code sets?
Does this number represent a date?
Computing report…elapsed time
10 hours, 27 seconds…
Why does it take so long for the report to run?
• A robust data architecture provides data sets that have:
• Business context & definition
• Common structure & formatting
• Fast & easily-reportable data sets
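One way to picture the "business context" point is a small data dictionary that turns a cryptic field such as F2 into a named, typed value. This is only a sketch: the dictionary contents, the YYYYMMDD storage format, and the helper names to_date and apply_metadata are assumptions made for illustration.

```python
from datetime import datetime

def to_date(text):
    """Convert a YYYYMMDD string (an assumed storage format) to a date."""
    return datetime.strptime(text, "%Y%m%d").date()

# Hypothetical data dictionary: business name, definition, and converter per field.
DATA_DICTIONARY = {
    "F1": ("customer_id", "Unique customer identifier", str),
    "F2": ("order_date", "Date the order was placed, stored as YYYYMMDD", to_date),
}

def apply_metadata(row):
    """Rename cryptic fields to business names and convert values to documented types."""
    described = {}
    for field, value in row.items():
        if field not in DATA_DICTIONARY:
            raise KeyError(f"No definition recorded for field {field!r}")
        name, _definition, convert = DATA_DICTIONARY[field]
        described[name] = convert(value)
    return described

record = apply_metadata({"F1": "C-1001", "F2": "20170622"})
```

With the definition recorded once, a report builder no longer has to guess whether the number in F2 represents a date.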
12. Global Data Strategy, Ltd. 2017
Today’s Reporting Data Sets are Complex
• Reporting today goes beyond traditional relational databases, which adds to the
complexity of preparing data to create effective and intuitive reports and analytics.
[Figure: data sources for reporting, including COBOL, JCL, legacy systems, spreadsheets, media, social media, IoT, open data, databases, data models, documents, and data in motion]
13. Global Data Strategy, Ltd. 2017
Disparate Data Sources
• The 2016 DATAVERSITY Emerging Trends in Metadata survey revealed some interesting findings
about what types of data & metadata organizations will be managing now and in the future.
• Not all are easily managed in traditional data modeling tools (although many are…)
[Chart: survey results by type of data & metadata, Now vs. Future; a marker indicates types supported by most data modeling tools]
15. Global Data Strategy, Ltd. 2017
Paradigm Shift in the Way We Look at “Reporting”
Traditional
• Top-Down, Hierarchical
• Design, then Implement
• “Passive”, Push technology
• “Manageable” volumes of information
• “Stable” rate of change
• Business Intelligence
“Big Data” / Exploration
• Distributed, Democratic
• Discover and Analyze
• Collaborative, Interactive
• Massive volumes of information
• Rapid and Exponential rate of growth
• Data Science
Design Implement Discover Analyze
16. Global Data Strategy, Ltd. 2017
“Traditional” way of Looking at the World: Hierarchies
• Carolus Linnaeus in 1735 established a hierarchy/taxonomy for organizing and identifying
biological systems.
Kingdom
Phylum
Class
Order
Family
Genus
Species
17. Global Data Strategy, Ltd. 2017
“New” Way of Looking at the World - Emergence
In philosophy, systems theory, science, and art, emergence is
the way complex systems and patterns arise out of a
multiplicity of relatively simple interactions.
- Wikipedia
I love my new
Levis jeans.
Is Levi coming
to my party?
Sale #LEVIS
20% at Macys.
LOL. TTYL.
Leving soon.
18. Global Data Strategy, Ltd. 2017
Data Warehouse vs. Data Lake
A Data Warehouse is a storage repository that holds current
and historical data used for creating analytical reports. Data
structures & requirements are pre-defined, and data is
organized & stored according to these definitions.
A Data Lake is a storage repository that holds a vast
amount of raw data in its native format, including
structured, semi-structured, and unstructured data.
The data structure & requirements are not defined until
the data is needed.
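The contrast between the two definitions is often described in the industry as schema-on-write versus schema-on-read (terms not used on the slide). The Python sketch below illustrates the idea under simple assumptions: the two-column schema, the JSON records, and all function names are hypothetical, and in-memory lists stand in for real storage.

```python
import json

# Warehouse style: the structure is defined up front and enforced on load.
WAREHOUSE_SCHEMA = {"customer_id": str, "amount": float}

def load_into_warehouse(table, row):
    """Reject rows that do not match the predefined structure (schema-on-write)."""
    if set(row) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"Unexpected columns: {sorted(row)}")
    table.append({col: cast(row[col]) for col, cast in WAREHOUSE_SCHEMA.items()})

# Lake style: raw records are stored as-is; structure is applied when read.
def store_in_lake(lake, raw_text):
    lake.append(raw_text)

def read_amounts(lake):
    """Interpret raw JSON only at query time, skipping records with no amount."""
    amounts = []
    for raw_text in lake:
        record = json.loads(raw_text)
        if "amount" in record:
            amounts.append(float(record["amount"]))
    return amounts

warehouse, lake = [], []
load_into_warehouse(warehouse, {"customer_id": "C-1001", "amount": "19.99"})
store_in_lake(lake, '{"customer_id": "C-1001", "amount": "19.99"}')
store_in_lake(lake, '{"tweet": "I love my new jeans"}')
lake_amounts = read_amounts(lake)
```

The lake accepts the tweet without complaint, while the same record would be rejected by load_into_warehouse; the cost of that flexibility is that every reader must decide how to interpret the raw data.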
19. Global Data Strategy, Ltd. 2017
Integrating the Data Lake & Traditional Data Sources
• The Data Lake has a different architecture & purpose than traditional data sources
such as data warehouses.
• But the two environments can co-exist to share relevant information.
[Diagram: Data Analysis & Discovery (Data Lake) alongside Enterprise Systems of Record, with Data Governance & Collaboration and Security & Privacy spanning both. Labels shown: Sandbox, Lightly Modeled Data, Data Exploration; Master & Reference Data, Data Warehouse, Data Marts, Operational Data; Reporting & Analytics: Advanced Analytics, Self-Service BI, Standard BI Reports]
20. Global Data Strategy, Ltd. 2017
Combining DW & Big Data Can Provide Valuable Information
• There are numerous ways to gain value from data
• Relational Database and Data Warehouse systems are one key source of value
• Customer information
• Product information
• Big Data can offer new insights from data
• From new data sources (e.g. social media, IoT)
• By correlating multiple new and existing data sources (e.g. network patterns & customer data)
• Integrating DW and Big Data can provide valuable new insights.
• Examples include:
• Customer Experience Optimization
• Churn Management
• Products & Services Innovation
[Diagram labels: Data Warehouse, New Insights]
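As a rough sketch of "correlating multiple new and existing data sources", the Python example below combines governed customer data with raw network events to flag customers who may be at risk of churn. The sample records, the dropped_call event type, the threshold of 2, and the function name churn_watchlist are all hypothetical.

```python
from collections import Counter

# Hypothetical warehouse extract: governed customer master data.
customers = {
    "C-1001": {"name": "Acme Corp", "plan": "premium"},
    "C-1002": {"name": "Globex", "plan": "basic"},
}

# Hypothetical raw events from a newer source (e.g. network logs).
network_events = [
    {"customer_id": "C-1001", "event": "dropped_call"},
    {"customer_id": "C-1001", "event": "dropped_call"},
    {"customer_id": "C-1002", "event": "dropped_call"},
    {"customer_id": "C-9999", "event": "dropped_call"},  # no matching customer
]

def churn_watchlist(customers, events, threshold=2):
    """List known customers whose dropped-call count meets the threshold."""
    drops = Counter(e["customer_id"] for e in events if e["event"] == "dropped_call")
    return [
        {"customer_id": cid, "name": customers[cid]["name"], "dropped_calls": n}
        for cid, n in sorted(drops.items())
        if cid in customers and n >= threshold
    ]

watchlist = churn_watchlist(customers, network_events)
```

The join only works because both sources share a customer identifier, which is where managed master data pays off for exploratory analysis.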
21. Global Data Strategy, Ltd. 2017
Organizational Siloes
Data Lake & Data
Scientist
• Exploratory projects
• Quick wins
• Often Little documentation &
governance
Data Warehouse & Data
Architects
• Enterprise reporting
• Long-term projects
• Data Standards
• Metadata & Governance
Data
Warehouse
• Too often, there are organizational & cultural silos that limit the sharing between the
Data Lake and Data Warehouse
Data Lake
22. Global Data Strategy, Ltd. 2017
Organizational Siloes
Self-Service Data
Prep & BI Reporting
• Exploratory projects
• Quick wins
• Little documentation &
governance
Data Warehouse &
Traditional BI Reporting
• Enterprise reporting
• Long-term projects
• Data Standards
• Metadata & Governance
Data
Warehouse
• Unfortunately, these siloes often also exist between business users and traditional
data warehouse & BI architects
Report requirements thrown
‘over the wall’….and wait…
Departmental
Database
23. Global Data Strategy, Ltd. 2017
Reducing Time to Insight is a Key Driver for
Self Service Data Prep
• According to TDWI’s Best Practices Report on “Improving Data Preparation for
Business Analytics” from Q3 2016, the following are key drivers for Self-Service
Data Preparation:
• 81% Shorten time to business insight
• 76% Increase data-driven decision making
• 53% Improve reaction time to business conditions
• 49% Operational efficiency for frontline workers
• 43% Gain a single, complete view of relevant data
• The most popular sources include traditional ones:
• 87% Relational databases
• 83% Data warehouse
• 79% Spreadsheet or desktop database
Departmental
Database
24. Global Data Strategy, Ltd. 2017
Finding Balance – Model What Matters
• It’s important to find a balance between
• Managing & modeling “trusted data sets”
• Giving users the flexibility to explore.
• Most users will find these trusted data sets a welcome asset, but don’t want to be restricted from
doing data exploration when appropriate.
[Diagram: a spectrum from Structure to Flexibility & Exploration; labels shown: IoT, Log Files, Data Warehouse, Master Data, Reference Data]
25. Global Data Strategy, Ltd. 2017
Find a Balance in Implementing Data Architecture
• Find the Right Balance
• Data Architecture projects can have the reputation for being overly “academic”, long, expensive, etc.
• No architecture at all can cause chaos.
• When done correctly, Data Architecture helps improve efficiency and better align with business priorities
Focus on Business Value
Business Value
Too Academic, nothing
gets done
Too “Wild West”, nothing
gets done - chaos
26. Global Data Strategy, Ltd. 2017
Implement Fit-for-Purpose Data Modeling & Governance
• The data modeling & governance rigor depends on the usage and purpose of data
• As a general rule, the more the data is shared across & beyond the organization, the more formal governance needs to be
[Diagram: Core Enterprise Data, Functional & Operational Data, Exploratory Data, Reference & Master Data]
Core Enterprise Data
• Common data elements used by multiple stakeholders across BUs, LOBs, functional areas, applications, etc.
• Highly governed
• Highly published & shared
Examples
• Common Financial Metrics: for Financial & Regulatory Reporting
• Common Attributes: Core attributes reused across multiple areas (e.g. Customer name, Account ID, Address)
Functional & Operational Data
• Lightly modeled & prepared data for limited sharing & reuse
• Collaboration-based governance
• May be future candidates for core data
Examples
• Operational Reporting
• Non-productionized analytical model data
• Ad hoc reporting & discovery
Exploratory Data
• Raw or lightly prepped data for exploratory analysis
• Mainly ad hoc, one-off analysis
• Light touch governance
Examples
• Raw data sets for exploratory analytics
• External & Open data sources
Master & Reference Data
• Common data elements used by multiple stakeholders across functional areas, applications, etc.
• Highly governed
• Highly published & shared
Examples
• Reference Data: Procedure codes, Country Codes, etc.
• Master Data: Location, Customer, Product
27. Global Data Strategy, Ltd. 2017
Summary
• As more business stakeholders see the value of data, Self Service Data Preparation is on the rise
• Common users include data scientists and business stakeholders
• While the use cases for these two stakeholder categories are different, both are driven by the need for:
• Time to Value
• Freedom to Explore
• Create a Data Governance Framework that provides “just enough” governance
• Allowing flexibility where appropriate
• Applying rigor and structure where necessary
• Providing trusted data sets for all
• Data Modeling used correctly will:
• Reduce time to insight
• Increase collaboration
• Increase business value
• Happy Wrangling!
28. Global Data Strategy, Ltd. 2017
About Global Data Strategy, Ltd
• Global Data Strategy is an international information management consulting company that specializes
in the alignment of business drivers with data-centric technology.
• Our passion is data, and helping organizations enrich their business opportunities through data and
information.
• Our core values center around providing solutions that are:
• Business-Driven: We put the needs of your business first, before we look at any technology solution.
• Clear & Relevant: We provide clear explanations using real-world examples.
• Customized & Right-Sized: Our implementations are based on the unique needs of your organization’s
size, corporate culture, and geography.
• High Quality & Technically Precise: We pride ourselves in excellence of execution, with years of
technical expertise in the industry.
Data-Driven Business Transformation
Business Strategy
Aligned With
Data Strategy
Visit www.globaldatastrategy.com for more information
29. Global Data Strategy, Ltd. 2017
Contact Info
• Email: donna.burbank@globaldatastrategy.com
• Twitter: @donnaburbank
@GlobalDataStrat
• Website: www.globaldatastrategy.com
30. Global Data Strategy, Ltd. 2017
White Paper: Emerging Trends in Metadata Management
Free Download
• Download from www.dataversity.net
• Also available on www.globaldatastrategy.com