Big Data, Big Deal: For Future Big Data Scientists
Prepared By: Wei-Yen Lin
May, 2013
Outline
A Buzzword: Big Data
What Is Big Data: Big Data 101
What Makes It Happen: Drivers For Big Data
How To Deal With It: Existing Big Data Technologies
How To Improve: Challenges For Big Data
How To Be A Good Big Data Scientist
Big Data, Big Deal 2013 Page 2
Trying To Answer ....
Origin Of The Term
First ACM article to use the term
(Michael Cox and David Ellsworth, Ames Research Center, 1997)
“…data sets are generally quite large, taxing the capacities of main
memory, local disk, and even remote disk. We call this the problem of big
data.”
First definition
(Francis Diebold, University of Pennsylvania, 2000)
“Big Data refers to the explosion in the quantity (and sometimes, quality)
of available and potentially relevant data, largely the result of recent and
unprecedented advancements in data recording and storage technology.”
Commonly accepted 3 V's of Big Data
Doug Laney with the Meta Group, 2001
Volume, Velocity, Variety: Examples
Volume – terabytes, records, transactions, tables, files
– A jumbo jet creates 640 TB of data on one Atlantic crossing; multiply that by the 25,000 flights flown each day.
Velocity – batch, near-time, real-time, streams
– Today's online ad serving requires a decision within 40 ms.
– Financial services need about 1 ms to calculate customer scoring probabilities.
Variety – structured, unstructured, semi-structured, and all of the above in a mix
– Walmart processes 1M customer transactions per hour and feeds information to a database estimated at 2.5 PB (petabytes).
– There are old and new data sources such as RFID, sensors, mobile payments, in-vehicle tracking, etc.
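The volume figures above can be checked with a little arithmetic. The sketch below uses only the numbers quoted on the slide and assumes decimal units (1 EB = 1,000,000 TB). It also assumes, as the slide's multiplication implies, that every flight produces as much data as one Atlantic crossing, which is a simplification.

```python
# Back-of-the-envelope check of the quoted volume figures (decimal units assumed).
TB_PER_CROSSING = 640        # quoted: one jumbo jet, one Atlantic crossing
FLIGHTS_PER_DAY = 25_000     # quoted: flights flown each day
TB_PER_EB = 1_000_000        # 1 exabyte = 1,000,000 terabytes

# Assumption: every flight generates as much data as one Atlantic crossing.
total_tb_per_day = TB_PER_CROSSING * FLIGHTS_PER_DAY
total_eb_per_day = total_tb_per_day / TB_PER_EB
print(f"{total_tb_per_day:,} TB per day = {total_eb_per_day:g} EB per day")

WALMART_TX_PER_HOUR = 1_000_000   # quoted: customer transactions per hour
tx_per_second = WALMART_TX_PER_HOUR / 3600
print(f"about {tx_per_second:.0f} transactions per second")
```

Under these assumptions the flights alone would add up to 16 EB per day, and the Walmart figure works out to roughly 278 transactions per second.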
Three Top-Level Elements
Data Management
Data storage infrastructure, and resources to manipulate it
Data Analysis
Technologies and tools to analyze the data and glean insight from it
Data Use
Putting Big Data insights to work in Business Intelligence and end-user
applications
Source: Martin Hall, 2011
To Sum Up, Big Data Is …
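Big Data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
Characteristic: high-volume, high-velocity, and/or high-variety information assets
Solution: new forms of processing
Goal: enhanced decision making, insight discovery and process optimization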
Key Drivers of Big Data Technology Demand
Modern science in search of new knowledge
Scientific experiments and tools are becoming heavily based on data processing
Google and Facebook have driven many advances in Big Data efficiency
Technical Drivers (1)
Data collected and stored continues to grow exponentially
Google handles about 3 billion search queries per day
Twitter handles some 400 million tweets per day, accounting for 12 terabytes per day
The McKinsey Quarterly: the demand for storage has grown more than 50% annually in recent years
Data is increasingly everywhere and in many formats
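To make the growth figures on this slide more tangible, they can be converted into per-second rates and a doubling time. This is a rough sketch that uses only the quoted numbers. It assumes decimal units, treats "more than 50% annually" as exactly 50% compound growth, and attributes the whole 12 TB to the 400 million tweets, which is a simplification.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
GOOGLE_QUERIES_PER_DAY = 3_000_000_000
TWEETS_PER_DAY = 400_000_000
TWITTER_BYTES_PER_DAY = 12 * 10**12   # 12 TB, decimal units assumed
STORAGE_GROWTH_PER_YEAR = 0.50        # assumption: exactly 50% compound growth

queries_per_second = GOOGLE_QUERIES_PER_DAY / SECONDS_PER_DAY
# Assumption: all 12 TB are attributed to the tweets themselves.
bytes_per_tweet = TWITTER_BYTES_PER_DAY / TWEETS_PER_DAY
# Years until storage demand doubles at a constant annual growth rate.
doubling_time_years = math.log(2) / math.log(1 + STORAGE_GROWTH_PER_YEAR)

print(f"about {queries_per_second:,.0f} queries per second")
print(f"about {bytes_per_tweet / 1000:.0f} KB per tweet")
print(f"storage demand doubles roughly every {doubling_time_years:.1f} years")
```

At 50% annual growth, storage demand doubles in a little under two years, which is one way to read "grows exponentially".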
Key Drivers of Big Data Technology Demand
Technical Drivers (2)
Traditional data-intensive industries
Genomic research, drug development, healthcare
High-tech industry, CAD/CAM, weather/climate, etc.
Business (retail) uses Big Data technologies “to search” for customers
Delivering directly to customers requires prediction of customer behavior
Key Drivers of Big Data Technology Demand
Business Drivers (1)
Consumer products and services delivery
Captures user preferences and makes recommendations based on previous records
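As an illustration of recommending from previous records, here is a minimal sketch of a co-occurrence recommender: it suggests items bought by other users who share purchases with the target user. The function name `recommend`, the scoring rule, and the purchase history are all made up for this example; real recommendation systems are considerably more elaborate.

```python
from collections import Counter

def recommend(history, user, top_n=2):
    """Suggest items bought by users who share at least one item with `user`."""
    owned = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(owned & items)   # number of purchases in common
        if overlap == 0:
            continue                   # no shared taste, ignore this user
        for item in items - owned:     # only items the user does not have yet
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]

# Hypothetical purchase records, for illustration only.
history = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "cat": {"book", "pen"},
    "dan": {"mug"},
}
print(recommend(history, "ann"))
```

Here "desk" ranks above "pen" for ann because bob shares two of her purchases while cat shares only one.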
The rise of public opinion stored in platforms
Key Drivers of Big Data Technology Demand
Business Drivers (2)
Social media
Managing public campaigns, e.g. elections, integrated public relations
Big Data Techniques
A Few Examples
Supervised Learning – Support Vector Machine
Unsupervised Learning – Cluster Analysis
Data Fusion – Signal Processing, Natural Language Processing
Optimization – Genetic Algorithm, Neural Networks
Predictive Modeling – Regression, Time Series Analysis
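As one small instance of the techniques listed above, predictive modeling by regression can be sketched as an ordinary least-squares line fit. The helper `fit_line` and the data points are made up for illustration; in practice a statistics library would be used.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: month number vs. stored data volume in TB.
months = [1, 2, 3, 4, 5]
volume_tb = [2.1, 4.0, 6.2, 7.9, 10.1]

slope, intercept = fit_line(months, volume_tb)
forecast = slope * 6 + intercept   # extrapolate one month ahead
print(f"slope={slope:.2f}, intercept={intercept:.2f}, month 6 forecast={forecast:.2f} TB")
```

The fitted line is then used to extrapolate one step ahead, which is the simplest form of prediction from past records.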
Big Data Technologies
Where is processing hosted?
– Distributed servers/cloud (e.g. Amazon EC2)
Where is data stored?
– Distributed storage (e.g. Hadoop Distributed File System, HDFS)
What is the programming model?
– Distributed processing (e.g. MapReduce)
How is data stored and indexed?
– High-performance schema-free database (e.g. Cassandra)
What operations are performed?
– Data analytics, semantic processing (e.g. R)
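The MapReduce programming model mentioned above can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` are chosen for this sketch, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

docs = ["big data big deal", "big data scientists"]
counts = word_count(docs)
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over distributed servers, which is what makes the model suitable for large data sets.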
Big Data Landscape
Source: Forbes, 2012
From Data Mining To Big Data Mining
Source: Robert J. Abate, 2012
The Life Cycle Of Big Data Method Should Be ...
Source: Robert J. Abate, 2012
Challenges For Big Data
Data quality
How to find high-quality data in the vast collections of data? How good is the data? How broad is the coverage?
Data comprehensiveness
Data
Are there areas without coverage? What are the implications?
Data Reliability and Validity
How to determine the quality of data sets and their relevance to particular issues?
Challenges For Big Data
Data mining/data intelligence algorithms
To handle/discover new data structures and multi-type data relations
To respond to specific use cases and operations over data
Processing
Data interpretation
Understand the output and model it through some form of simulation.
Domain experts must continue to play a role. Must be wary of becoming
too beholden to the numbers.
Challenges For Big Data
Infrastructure support for storing and moving data, and for on-demand processing
Is cloud computing the right technology? Are there alternatives?
High-speed network infrastructure, on-demand provisioning
To respond to specific use cases and operations over data
Management
Security, trustworthiness and data-centric security
Much of this information is about people. How to extract enough
information to help people without extracting so much as to compromise
their privacy?
Big Data Talent
Three Groups: Deep Analytical, Big Data Savvy, Supporting Tech.
Source: U.S. Bureau Of Labor Statistics, McKinsey
Qualities Of Data Scientists
Technical expertise
Deep expertise in some scientific discipline.
Curiosity
A desire to go beneath the surface.
Storytelling
The ability to use data to tell a story and to communicate it effectively.
Cleverness
The ability to look at a problem in different, creative ways.
Advice from DJ Patil, "The World's 7 Most Powerful Data Scientists" (Forbes)