This document discusses the rise of big data and data science. It notes that while data volumes are growing exponentially, data alone is just an asset - it is data scientists that create value by building data products that provide insights. The document outlines the data science workflow and highlights both the tools used and challenges faced by data scientists in extracting value from big data.
1. Big Data and the Art of Data Science
Andrew B. Gardner, PhD
www.linkedin.com/in/andywocky/
agardner@momentics.com
www.momentics.com
2. Big Data is Not New
Big Data Challenge
1880 census – 50M people
The First Big Data Solution
• Hollerith Tabulating System
• Punched cards – 80 variables
• Used for 1890 census
• 6 weeks instead of 7+ years
Hollerith Tabulation System
{age, number of insanes, …}
7 years → 6 weeks
Image Credit – http://en.wikipedia.org/wiki/File:1880_census_Edison.gif
Image Credit – http://en.wikipedia.org/wiki/File:Hollerith_Punched_Card.jpg
Image Credit – http://en.wikipedia.org/wiki/File:HollerithMachine.CHM.jpg
3. Big Data Is More Than 3 Vs*
*2001 (META Group) / 2012 (Gartner) Definition of Big Data
Volume
• IDC Report 2011: 8 billion TB in 2015, 40 billion TB in 2020
• 90% of all data < 2 years old
• storage, transport, processing
Variety
• relational, graph, time series, sensor, audio, video, text, geo, scientific, …
• 80% unstructured
Velocity
• Facebook 500 TB/day
• Large Hadron 35 GB/sec
• Twitter 300K tweets/min
• real time stream
4. Big Data Opportunities
“… big data market will grow from $3.2B (2010) to $16.9B (2015)…”
“… gains of 5-6% productivity and profitability …”
“… business volume will double every 1.2 years …”
“… required for companies to stay innovative and competitive …”
“… retail 60% increase in net margin attainable …”
“… manufacturing production costs decrease 50% …”
“… $300B annual savings in healthcare …”
IBM | The Economist | McKinsey & Company | PWC | KPMG | Accenture
5. Big Data Successes
Walmart
• 10-15% online sales lift
• $1B incremental revenue
• Recommendations
• Engineered content
• 2012 Presidential Election
• Fleet telematics save fuel
7. 1: Growth of Data
Amount of data in the world…
2005: 100 EB
2012: 2800 EB
2013: 8000 EB
1 EB = 1 Exabyte = 1 billion GB
… doubles every 2 years
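The doubling claim can be turned into a rough projection. The sketch below is only an illustration: it assumes a steady doubling period of exactly 2 years and takes the 2012 figure (2800 EB) as its baseline. The function name and its parameters are hypothetical, not from the talk.

```python
def projected_exabytes(year, base_year=2012, base_eb=2800.0, doubling_years=2.0):
    """Project world data volume in EB, assuming it doubles every `doubling_years`."""
    return base_eb * 2 ** ((year - base_year) / doubling_years)


for year in (2012, 2014, 2016, 2018, 2020):
    eb = projected_exabytes(year)
    # 1 EB = 1 billion GB = 1 million TB, so 1000 EB = 1 billion TB
    print(f"{year}: {eb:,.0f} EB = {eb / 1000:,.1f} billion TB")
```

Under these assumptions the 2020 projection comes out in the same range as the IDC figure of 40 billion TB quoted earlier in the deck.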
8. 2: Connectedness & Sources
More non-human nodes online than people
50B+ non-human nodes online
The Internet of Things (IoT)
Source: Swan, M. Sensor Mania! The Internet of Things, Objective Metrics, and the Quantified Self 2.0. J Sens Actuator Netw (2012) 1(3), 217-253.
Data Sources: social, mobile, web, enriched data, science, IoT
10. 4: Economics
Attention economy not information economy!
• Data is bountiful
• Storage is cheap
• Computing is cheap
• Analysis is cheap
• Talent is expensive
• Time is expensive
11. Big Data Disruption
OLD: (few) well-calculated questions first
• define schema
• pour in data
• analyze
NEW: data first, then exploratory decision making
• collect data
• explore
• schema as needed
unknown unknowns = insight gold
Better Cycle Times and Better Questions Win!
12. Rumsfeld Analytics
Things we know we know: Facts – could be wrong.
Things we know we don’t know: Questions – do reporting.
Things we don’t know we know: Intuition – quantify to improve.
Things we don’t know we don’t know: Exploration – unfair advantages.
Goal: data discoveries = insights = game changers = unknown unknowns.
13. Data Alone is Just An Asset
• Depreciating
• Liability
• Useful lifetime
• Expense
Finished goods create value from raw materials
data
$$ data product $$
14. Enter the Data Scientist
• mathematical
• developer
• data talented
• problem solver
• insight whisperer
• product savvy
Source: FICO Infographic
data + data scientist
$$ data product $$
15. A Brief History of Data Science
BC - The Greeks
1974 Peter Naur @ UoC
2001 William S. Cleveland @ CSU
2003 Journal of Data Science
2009 Jeff Hammerbacher @ Facebook
2010 Hilary Mason & Chris Wiggins @ Dataists
2010 Mike Loukides @ O'Reilly
2011 DJ Patil @ LinkedIn
16. Famous Definitions – New Blend
Conway’s “Data Science” Venn Diagram (2010)
Image credit: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
new skill blend:
one stop rock star
19. Many Flavors of Data Scientist
Alternatively, Data Roles × Skill Sets
Harlan Harris, et al.
datacommunitydc.org/blog/wp-content/uploads/
amazon.com/dp
… from research to development to business-focused
Source / Image Credit: H. Harris, S. Murphy, M. Vaisman. “Analyzing the Analyzers.” O’Reilly Media, Jun 2013.
role
skill
2012-3 Survey
20. Universal Agreement: Scarcity
McKinsey Prediction for 2018:
• Huge shortage of analytic talent (140K+).
• Gap of 1.5M managers who can make decisions based on data analysis.
• Talent is the biggest resource
• There is a raging talent war
Source: J. Manyika et al., “Big data: The next frontier for innovation, competition, and productivity.” McKinsey Global Institute (2011).
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
21. The Data Scientist’s Craft
• Discover unknown unknowns in data
• Obtain predictive, actionable insight
• Communicate business data stories
• Build business decision confidence
• Create valuable Data Products
23. Building Data Products
Objectives: What outcome am I trying to achieve?
Levers: What inputs can we control?
Data: What data can we collect?
Models: How do the levers impact the data?
Source / Adapted From: J. Howard. “Designing Great Data Products.” O’Reilly Media, Mar 2012.
28. Data Science Workflow
Source: Josh Wills, Senior Director of Data Science, Cloudera. “From the Lab to the Factory: Building a Production Machine Learning Infrastructure.”
+ creative exploration
30. Challenges for Data Scientists
• Stakeholder naiveté
– 2-3 days, right?
• Red tape
– No access allowed
• Terminology
– What’s a wonkulator?
• Real world data
– Messy, noisy, missing, …
• Unknown need
– What’s the business goal?
• Stakeholder alignment
– CMO, CIO, Prod, DevOps
• Analysis distrust
– … but I don’t like that result
31. Some Practical Tips
Rapid Iteration
Implement Implement
Feedback
Visualize, Draw, Sketch, Share
Start Simple, Start Small
Goal, But Not Perfection
32. Big Data Science & Sensemaking
Source: HP “Monetizing Big Data” Perspective.
33. A Final Word of Caution
Figure: hype cycle of expectations over time (hype, hope, happy), marking big data and cloud computing; 2013 and 2018-2023.
Adapted from: Gartner’s 2013 Hype Cycle Special Report (Jul 2013).
34. Notable Quotes
Simple models and a lot of data trump more elaborate models based on less data.
- Peter Norvig
In God we trust, all others bring data.
- W.E. Deming
Big data is not about the data! The value in big data [is in] the analytics.
- Harvard Prof. Gary King
35. Conclusion
• Data is an asset, talent is a more valuable asset.
• Big data represents a disruptive shift.
• Data science is the magic enabler via Data Products.
• Better + faster explorations & questions win.
Andrew B. Gardner, PhD
http://linkd.in/1byADxC
agardner@momentics.com
www.momentics.com
Editor's Notes
Herman Hollerith
Obsolete
1880 – 50,189,209
1890 – 62,947,714
~ 15 mins via 10 Gbps LAN to transfer 1 TB
~ 220 hrs for 1 PB => move the servers?
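The transfer-time estimates in this note can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), a link running at the full 10 Gbps line rate, and no protocol overhead. The helper name is hypothetical, and real transfers would be somewhat slower.

```python
def transfer_seconds(num_bytes, link_bits_per_second):
    """Ideal time to push num_bytes over a link, ignoring all overhead."""
    return num_bytes * 8 / link_bits_per_second


TB = 10**12          # bytes, decimal terabyte (assumption)
PB = 10**15          # bytes, decimal petabyte (assumption)
LINK = 10 * 10**9    # 10 Gbps in bits per second

tb_minutes = transfer_seconds(TB, LINK) / 60
pb_hours = transfer_seconds(PB, LINK) / 3600

print(f"1 TB over 10 Gbps: about {tb_minutes:.1f} minutes")
print(f"1 PB over 10 Gbps: about {pb_hours:.0f} hours")
```

The ideal figures land slightly below the rounded numbers in the note, which is what one would expect once overhead is added. At petabyte scale the transfer takes more than nine days, which is the point of asking whether to move the servers instead.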
Harlan Harris
Data is the new currency of business.
Understand customer use, behavior, and interests. Targeted products and marketing offers.
Understand customer experience across network, services, and social conversation. Network optimization.
Connect with OTT players, advertisers, and verticals. New business models.