2. Big Data
• When data has too much volume, variety, and velocity (the three V's) to manage with a traditional RDBMS, you have entered BIG DATA!
• Data Storage and Manipulation, at Scale
– MapReduce, Hadoop, relationship to databases (Framework)
– Key-value stores and NoSQL; tradeoffs of SQL and NoSQL (Database type)
– Entity resolution, record linkage, data cleaning (Data integration)
• Analytics (Machine Learning)
– Basic statistical modeling, experiment design, overfitting
– Supervised learning: overview, simple nearest neighbor, decision trees/forests, regression
– Unsupervised learning: k-means, multi-dimensional scaling
– Graph Analytics: PageRank, community detection, recursive queries, iterative processing
– Text Analytics: latent semantic analysis
– Collaborative Filtering: slope-one
• Communicating Results
– Visualization, data products, visual data analytics
3. Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage, Processing
– Machine Learning – Analytics
– Visualization
4. Big Data Everywhere!
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
• Unknown hidden relationships lie within this data!
6. How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
• “640K ought to be enough for anybody.”
7. Type of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Unstructured Text Data
– Log data, Comments, User generated text
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF)
• Real-time Data
– You can only scan the data once and need to do analytics quickly
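The one-scan constraint means you keep only constant-size running state rather than storing the stream. A minimal Python sketch of a one-pass aggregate (illustrative only, not tied to any particular streaming system):

```python
# One-pass (streaming) analytics sketch: each record is seen exactly once,
# so we maintain a small running summary instead of buffering the stream.
def running_mean(stream):
    count, total = 0, 0.0
    for value in stream:   # single scan; a second pass is not possible
        count += 1
        total += value
    return total / count if count else 0.0

print(running_mean([3.0, 5.0, 10.0]))  # 6.0
```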
8. What does Big Data Give You?
• Without Big Data
– Many separate data warehouses on non-distributed architectures
– Merging databases required modifying data structures and writing one-off programs
– Scaling database size is a continual problem
– Any large-scale analytics took days or weeks and a large coordination effort within IT to get database access
– Data analysis is a large effort, so much data tends to remain unanalyzed or, even worse, not stored
• With Big Data
– Hadoop provides a single view over all databases, which can be distributed
– Database size is a non-issue
– Ability to perform advanced statistical analysis on very large datasets very quickly
– Data analysis is the competitive edge for many companies, since barriers to entry keep dropping as platforms mature
9. Examples
• Norwegian Food Safety Authority
– accumulates data on all farm animals
– birth, death, movements, medication, samples, ...
• Hafslund
– time series from hydroelectric dams, power prices, meters of individual
customers, ...
• Social Security Administration
– data on individual cases, actions taken, outcomes...
• Statoil
– massive amounts of data from oil exploration, operations, logistics,
engineering, ...
• Retailers
– see Target example above
– also, connection between what people buy, weather forecast, logistics, ...
12. Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage, Processing
– Machine Learning – Analytics
– Visualization
13. Hadoop
• A framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model (i.e., MapReduce)
– Distributed data processing
– Works with structured and unstructured data
– Open source
– Master-slave architecture
– Fault tolerant using commodity hardware
14. MapReduce
• Programming model on top of Hadoop
• Basic concept is to provide a programming model that directly supports parallel processing (SQL, on the other hand, does not natively encourage parallel processing)
• Pig is a framework and programming language for developing MapReduce jobs
• Note – MapReduce is great for extremely large data sets with simple relationships; SQL is great for medium-sized data sets with complex relationships
– i.e., you have to choose the right technology for your problem space
15. A Simple Example
• Counting words in a large set of documents
map(String key, String value)
//key: document name
//value: document contents
for each word w in value
EmitIntermediate(w, “1”);

reduce(String key, Iterator values)
//key: a word
//values: a list of counts
int result = 0;
for each v in values
result += ParseInt(v);
Emit(AsString(result));
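The pseudocode above can be mirrored in plain Python. This single-process sketch (illustrative names, no actual Hadoop involved) shows the map, shuffle, and reduce phases that the framework runs for you across the cluster:

```python
from collections import defaultdict

# Single-process sketch of the word-count MapReduce above;
# in a real Hadoop job the framework distributes these phases.

def map_phase(doc_name, contents):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in contents.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the per-word counts.
    return key, sum(values)

docs = {"d1": "big data big ideas", "d2": "big clusters"}
pairs = [p for name, text in docs.items() for p in map_phase(name, text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```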
17. Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage architecture
– Machine Learning – Analytics
– Visualization
18. Machine Learning
• Essentially, ways to analyze data to extract valuable information, with or without training data
– Prediction
• predicting a variable from data
– Classification
• assigning records to predefined groups
– Clustering
• splitting records into groups based on similarity
– Association learning
• seeing what often appears together with what
– And many others….
20. Now you have an optimization metric by which you can automate the exploration of all possible hypotheses!
• Problems with this approach??
21. Two kinds of learning
• Supervised
– we have training data with correct answers
– use training data to prepare the algorithm
– then apply it to data without a correct answer
• Unsupervised
– no training data
– throw data into the algorithm and hope it makes some kind of sense out of the data
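As a concrete instance of the supervised case, here is a minimal nearest-neighbor classifier in plain Python (the points and labels are made up for illustration): training data carries the correct answers, and the algorithm labels a point it has never seen.

```python
# Tiny supervised-learning sketch: 1-nearest-neighbor on toy 2-D points.
def nearest_neighbor(train, point):
    # train: list of ((x, y), label); return label of the closest point.
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(train, key=lambda item: dist2(item[0], point))[1]

# Training data with "correct answers" (the labels).
train = [((0, 0), "small"), ((0, 1), "small"),
         ((9, 9), "large"), ((10, 8), "large")]

print(nearest_neighbor(train, (1, 1)))  # small
print(nearest_neighbor(train, (8, 9)))  # large
```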
22. Example: Collaborative Filtering
• Goal: predict what movies/books/… a person may be interested in, on the basis of
– Past preferences of the person
– Other people with similar past preferences
– The preferences of such people for a new movie/book/…
• One approach is based on repeated clustering
– Cluster people on the basis of preferences for movies
– Then cluster movies on the basis of being liked by the same clusters of people
– Again cluster people based on their preferences for (the newly created clusters of) movies
– Repeat the above until equilibrium
• The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
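The topic list at the start names slope-one as one collaborative-filtering scheme. A minimal Python sketch on made-up ratings (not the repeated-clustering approach above): predict a user's rating for an item from the average rating differences ("deviations") between item pairs across all users.

```python
# Slope-one collaborative filtering on a toy, made-up ratings matrix.
ratings = {
    "alice": {"matrix": 5, "titanic": 3},
    "bob":   {"matrix": 4, "titanic": 2, "inception": 4},
    "carol": {"titanic": 4, "inception": 5},
}

def slope_one_predict(ratings, user, target):
    num, den = 0.0, 0
    for other, r_other in ratings[user].items():
        # Average deviation target-vs-other over users who rated both.
        devs = [r[target] - r[other]
                for r in ratings.values()
                if target in r and other in r]
        if devs:  # weight each co-rated item by its support
            num += (sum(devs) / len(devs) + r_other) * len(devs)
            den += len(devs)
    return num / den if den else None

print(round(slope_one_predict(ratings, "alice", "inception"), 2))  # 4.67
```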
23. Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage architecture
– Machine Learning – Analytics
– Visualization
33. Conclusion
• Big Data is a huge field that combines expertise from different domains in order to find interesting information in data
• Extracting interesting information from data is the next competitive edge for many companies, as information becomes available instantly, anywhere