2. Big Data
• When data has too much volume, variety, and velocity (the three V's) to manage with a traditional RDBMS, you have entered BIG DATA!
• Data Storage and Manipulation, at Scale
– MapReduce, Hadoop, relationship to databases (Framework)
– Key-value stores and NoSQL; tradeoffs of SQL and NoSQL (Database type)
– Entity resolution, record linkage, data cleaning (Data integration)
• Analytics (Machine Learning)
– Basic statistical modeling, experiment design, overfitting
– Supervised learning: overview, simple nearest neighbor, decision trees/forests, regression
– Unsupervised learning: k-means, multi-dimensional scaling
– Graph Analytics: PageRank, community detection, recursive queries, iterative processing
– Text Analytics: latent semantic analysis
– Collaborative Filtering: slope-one
• Communicating Results
– Visualization, data products, visual data analytics
3. Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage, Processing
– Machine Learning – Analytics
– Visualization
4. Big Data Everywhere!
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
• Unknown hidden relationships lie within this data!
6. How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
• “640K ought to be enough for anybody.”
7. Type of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Unstructured Text Data
– Log data, Comments, User generated text
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF)
• Real-time Data
– You can only scan the data once and need to do analytics quickly
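The one-scan constraint means you keep only constant-size running state rather than storing the stream. A minimal Python sketch of a one-pass aggregate (illustrative only, not tied to any particular streaming system):

```python
# One-pass (streaming) analytics sketch: each record is seen exactly once,
# so we maintain a small running summary instead of buffering the stream.
def running_mean(stream):
    count, total = 0, 0.0
    for value in stream:   # single scan; a second pass is not possible
        count += 1
        total += value
    return total / count if count else 0.0

print(running_mean([3.0, 5.0, 10.0]))  # 6.0
```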
8. What does Big Data Give You?
• Without Big Data
– Many separate data warehouses on non-distributed architectures
– Merging databases required modifying data structures and writing one-off programs
– Scaling database size is a continual problem
– Any large-scale analytics took days or weeks and a large coordination effort within IT to get database access
– Data analysis is a large effort, so much data tends to remain unanalyzed or, even worse, not stored
• With Big Data
– Hadoop provides a single view over all databases, which can be distributed
– Database size is a non-issue
– Ability to perform advanced statistical analysis on very large datasets very quickly
– Data analysis is the competitive edge for many companies, since barriers to entry keep dropping as platforms mature
9. Examples
• Norwegian Food Safety Authority
– accumulates data on all farm animals
– birth, death, movements, medication, samples, ...
• Hafslund
– time series from hydroelectric dams, power prices, meters of individual
customers, ...
• Social Security Administration
– data on individual cases, actions taken, outcomes...
• Statoil
– massive amounts of data from oil exploration, operations, logistics,
engineering, ...
• Retailers
– see Target example above
– also, connection between what people buy, weather forecast, logistics, ...
12. Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage, Processing
– Machine Learning – Analytics
– Visualization
13. Hadoop
• A framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model (i.e., MapReduce)
– Distributed data processing
– Works with structured and unstructured data
– Open source
– Master-slave architecture
– Fault tolerant using commodity hardware
14. MapReduce
• Programming model on top of Hadoop
• Basic concept is to provide a programming model that directly supports parallel processing (SQL, on the other hand, does not natively encourage parallel processing)
• Pig is a framework and programming language for developing MapReduce jobs
• Note – MapReduce is great for extremely large data sets with simple relationships; SQL is great for medium-sized data sets with complex relationships
– i.e., you have to choose the right technology for your problem space
15. A Simple Example
• Counting words in a large set of documents
map(String key, String value)
//key: document name
//value: document contents
for each word w in value
EmitIntermediate(w, “1”);

reduce(String key, Iterator values)
//key: a word
//values: a list of counts
int result = 0;
for each v in values
result += ParseInt(v);
Emit(AsString(result));
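The pseudocode above can be mirrored in plain Python. This single-process sketch (illustrative names, no actual Hadoop involved) shows the map, shuffle, and reduce phases that the framework runs for you across the cluster:

```python
from collections import defaultdict

# Single-process sketch of the word-count MapReduce above;
# in a real Hadoop job the framework distributes these phases.

def map_phase(doc_name, contents):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in contents.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the per-word counts.
    return key, sum(values)

docs = {"d1": "big data big ideas", "d2": "big clusters"}
pairs = [p for name, text in docs.items() for p in map_phase(name, text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```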
17. Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage architecture
– Machine Learning – Analytics
– Visualization
18. Machine Learning
• Essentially, ways to analyze data to extract valuable information, with or without training data
– Prediction
• predicting a variable from data
– Classification
• assigning records to predefined groups
– Clustering
• splitting records into groups based on similarity
– Association learning
• seeing what often appears together with what
– And many others….
20. Now you have an optimization metric by which you can automate the exploration of all possible hypotheses!
• Problems with this approach??
21. Two kinds of learning
• Supervised
– we have training data with correct answers
– use training data to prepare the algorithm
– then apply it to data without a correct answer
• Unsupervised
– no training data
– throw data into the algorithm and hope it makes some kind of sense out of the data
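As a concrete instance of the supervised case, here is a minimal nearest-neighbor classifier in plain Python (the points and labels are made up for illustration): training data carries the correct answers, and the algorithm labels a point it has never seen.

```python
# Tiny supervised-learning sketch: 1-nearest-neighbor on toy 2-D points.
def nearest_neighbor(train, point):
    # train: list of ((x, y), label); return label of the closest point.
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(train, key=lambda item: dist2(item[0], point))[1]

# Training data with "correct answers" (the labels).
train = [((0, 0), "small"), ((0, 1), "small"),
         ((9, 9), "large"), ((10, 8), "large")]

print(nearest_neighbor(train, (1, 1)))  # small
print(nearest_neighbor(train, (8, 9)))  # large
```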
22. Example: Collaborative Filtering
• Goal: predict what movies/books/… a person may be interested in, on the basis of
– Past preferences of the person
– Other people with similar past preferences
– The preferences of such people for a new movie/book/…
• One approach is based on repeated clustering
– Cluster people on the basis of preferences for movies
– Then cluster movies on the basis of being liked by the same clusters of people
– Again cluster people based on their preferences for (the newly created clusters of) movies
– Repeat the above until equilibrium
• The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
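The topic list at the start names slope-one as one collaborative-filtering scheme. A minimal Python sketch on made-up ratings (not the repeated-clustering approach above): predict a user's rating for an item from the average rating differences ("deviations") between item pairs across all users.

```python
# Slope-one collaborative filtering on a toy, made-up ratings matrix.
ratings = {
    "alice": {"matrix": 5, "titanic": 3},
    "bob":   {"matrix": 4, "titanic": 2, "inception": 4},
    "carol": {"titanic": 4, "inception": 5},
}

def slope_one_predict(ratings, user, target):
    num, den = 0.0, 0
    for other, r_other in ratings[user].items():
        # Average deviation target-vs-other over users who rated both.
        devs = [r[target] - r[other]
                for r in ratings.values()
                if target in r and other in r]
        if devs:  # weight each co-rated item by its support
            num += (sum(devs) / len(devs) + r_other) * len(devs)
            den += len(devs)
    return num / den if den else None

print(round(slope_one_predict(ratings, "alice", "inception"), 2))  # 4.67
```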
23. Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage architecture
– Machine Learning – Analytics
– Visualization
33. Conclusion
• Big Data is a huge field that combines expertise from different domains in order to find interesting information in data
• Extracting interesting information from data is the next competitive edge for many companies, as information becomes available instantly, anywhere