1. Big Data Concepts & Practice
Vladimir Suvorov
vladimir.suvorov@emc.com
EMC &
DataScienceSquad.com
Non-commercial education only. All referenced information belongs to its respective owners, including EMC, IBM, Microsoft, Oracle, Gartner, etc.
2. About myself
4. In 2005 there were 1.3 billion RFID tags in circulation…
…by the end of 2011, this was about 30 billion and growing even faster
5. An increasingly sensor-enabled and instrumented
business environment generates HUGE volumes of
data with MACHINE SPEED characteristics…
1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!
6. 350B
Transactions/Year
Meter Reads
every 15 min.
120M – meter reads/month 3.65B – meter reads/day
7. In August of 2010, Adam Savage, of “MythBusters,” took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account, including the phrase “Off to work.”
Since the photo was taken with his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken.
By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.
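As a rough illustration of why embedded location metadata is so revealing: photo metadata (EXIF) typically stores a GPS position as degrees, minutes and seconds plus a hemisphere letter. The sketch below converts such a reading into decimal coordinates that any map service accepts. The function name and the sample coordinates are hypothetical and are not taken from the photo described above.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative in decimal notation.
    return -value if ref in ("S", "W") else value

# Hypothetical sample values, not real data.
lat = dms_to_decimal(37, 30, 0, "N")
lon = dms_to_decimal(122, 15, 0, "W")
print(lat, lon)
```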
8. The Social Layer in an Instrumented Interconnected World
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones world wide
• 12+ TBs of tweet data every day
• 100s of millions of GPS enabled devices sold annually
• ? TBs of data every day
• 25+ TBs of log data every day
• 2+ billion people on the Web by end 2011
• 76 million smart meters in 2009… 200M by 2014
9. Twitter Tweets per Second Record Breakers of 2011
10. Extract Intent, Life Events, Micro Segmentation Attributes
[Figure: sample social media posts from users Pauline, Tom Sit, Tina Mu and Jo Jobs, annotated with the labels: Name, Birthday, Family; Location; Relocation; Wishful Thinking; Monetizable Intent; Not Relevant - Noise; SPAMbots]
11. Big Data Includes Any of the following Characteristics
Extracting insight from an immense volume, variety and velocity of data, in
context, beyond what was previously possible
Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
Velocity: Streaming data and large volume data movement
Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs)
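The unit ladder on this slide can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) units, where each named step is a power of 1,000; the constant and function names are illustrative.

```python
# Decimal (SI) storage units, expressed in bytes.
TB = 10**12   # terabyte
PB = 10**15   # petabyte
ZB = 10**21   # zettabyte

def in_terabytes(n_bytes):
    """Express a byte count in whole terabytes."""
    return n_bytes // TB

print(in_terabytes(PB))  # terabytes in one petabyte
print(in_terabytes(ZB))  # terabytes in one zettabyte
```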
12. Bigger and Bigger Volumes of Data
• Retailers collect click-stream data from Web site interactions and loyalty card data
– This traditional POS information is used by retailers for shopping basket analysis, inventory replenishment, and more
– But data is also being provided to suppliers for customer buying analysis
• Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized
• Science is increasingly dominated by big science initiatives
– Large-scale experiments generate over 15 PB of data a year, which can’t be stored within the data center and is sent to laboratories
• Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading
• Improved instrument and sensory technology
– The Large Synoptic Survey Telescope’s GPixel camera generates 6PB+ of image data per year; or consider the Oil and Gas industry
13. The Big Data Conundrum
• The percentage of available data an enterprise can analyze is decreasing as the volume of data available to it grows
Quite simply, this means as enterprises, we are getting
“more naive” about our business over time
We don’t know what we could already know….
Data AVAILABLE to
an organization
Data an organization
can PROCESS
14. Why Not All of Big Data Before: Didn’t have the Tools?
15. Applications for Big Data Analytics
• Smarter Healthcare
• Multi-channel sales
• Finance
• Log Analysis
• Homeland Security
• Traffic Control
• Telecom
• Search Quality
• Manufacturing
• Trading Analytics
• Fraud and Risk
• Retail: Churn, NBO
16. Most Requested Uses of Big Data
• Log Analytics & Storage
• Smart Grid / Smarter Utilities
• RFID Tracking & Analytics
• Fraud / Risk Management & Modeling
• 360° View of the Customer
• Warehouse Extension
• Email / Call Center Transcript Analysis
• Call Detail Record Analysis
17. What companies & analysts think of Big Data
18. Gartner & McKinsey
19. Hype Cycle of Big Data
20. Priority matrix
21. Key vision
• Predictive modeling is gaining momentum with property and casualty (P&C) companies, which are using it to support claims analysis, CRM, risk management, pricing and actuarial workflows, quoting, and underwriting.
• Social content is the fastest growing category of new content in the enterprise and will eventually attain 20% market penetration.
• Gartner reports that 45% of sales management teams identify sales analytics as a priority to help them understand sales performance, market conditions and opportunities.
• Over 80% of Web Analytics solutions are delivered via Software-as-a-Service (SaaS).
22. Big Data deliverables by McKinsey
24. Intel
25. Intel Big Data Cluster Example
Application | Big Data | Algorithms | Compute Style
Scientific study (e.g. earthquake study) | Ground model | Earthquake simulation, thermal conduction, … | HPC
Internet library search | Historic web snapshots | Data mining | MapReduce
Virtual world analysis | Virtual world database | Data mining | TBD
Language translation | Text corpuses, audio archives, … | Speech recognition, machine translation, text-to-speech, … | MapReduce & HPC
Video search | Video data | Object/gesture identification, face recognition, … | MapReduce
There has been more video uploaded to YouTube in the last 2 months than if ABC,
NBC, and CBS had been airing content 24/7/365 continuously since 1948. - Gartner
26. Example Motivating Application:
Online Processing of Archival Video
• Research project: Develop a context recognition system that is 90% accurate over
90% of your day
• Leverage a combination of low- and high-rate sensing for perception
• Federate many sensors for improved perception
• Big Data: Terabytes of archived video from many egocentric cameras
• Example query 1: “Where did I leave my briefcase?”
• Sequential search through all video streams [Parallel Camera]
• Example query 2: “Now that I’ve found my briefcase, track it”
• Cross-cutting search among related video streams [Parallel Time]
Big Data Cluster
27. Oracle
28. Big Data Use Cases
Sector | Today’s Challenge | New Data | What’s Possible
Healthcare | Expensive office visits | Remote patient monitoring | Preventive care, reduced hospitalization
Manufacturing | In-person support | Product sensors | Automated diagnosis, support
Location-Based Services | Based on home zip code | Real time location data | Geo-advertising, traffic, local search
Public Sector | Standardized services | Citizen surveys | Tailored services, cost reductions
Retail | One size fits all marketing | Social media | Sentiment analysis, segmentation
29. What’s in Big Data for Public Sector
• Operational efficiency and productivity
• Fraud detection and prevention
• Close tax gaps
• Value for money for citizens
• Prevent crime waves
• Customize actions based on population segments
• Public utilities to reduce consumption
• Produce safety from farm to fork
31. New opportunities
• Massive Volumes: Increases ad revenue by processing 3.5 billion events per day. Processes 464 billion rows per quarter, with average query time under 10 secs.
• Cloud Connectivity: Measures and ranks online user influence by processing 3 billion signals per day. Connects across 15 social networks via the cloud for data and API access.
• Real-Time Insight: Improving investigation time by analyzing large volume & variety of data. Cut investigation time from 2 years to 15 days.
32. Microsoft’s Approach to Big Data
33. A Holistic Big Data Solution from Microsoft
34. Data Scientist Job
35. Sexy Job of Data Scientist
Tom Davenport, who is teaching an executive
program in Big Data and analytics at Harvard
University, said some data scientists are
earning annual salaries as high as $300,000,
which is “pretty good for somebody that
doesn't have anyone else working for them.”
Davenport also said such workers are
motivated by the problems and opportunities
data provides.
36. What EMC Thinks of Data Scientists
37. Job evolution
38. What Forbes Thinks of Data Scientists
39. Data Science Courses
40. Course Modules and Navigation Icons
Data Science and Big Data Analytics
1. Introduction to Big Data Analytics
2. Data Analytics Lifecycle + Lab
3. Review of Basic Data Analytics Methods Using R +
Labs
4. Advanced Analytics - Theory & Methods + Labs
5. Advanced Analytics - Technology & Tools + Labs
6. The Endgame, or Putting it All Together + Final Lab
41. Topics: Data Science and Big Data Analytics
Introduction to Big Data Analytics + Data Analytics Lifecycle
• Big Data Overview
• State of the Practice in Analytics
• The Data Scientist
• Big Data Analytics in Industry Verticals
• Data Analytics Lifecycle
Review of Basic Data Analytic Methods Using R
• Using R to Look at Data - Introduction to R
• Analyzing and Exploring the Data
• Statistics for Model Building and Evaluation
Advanced Analytics – Theory and Methods
• K-means Clustering
• Association Rules
• Linear Regression
• Logistic Regression
• Naive Bayesian Classifier
• Decision Trees
• Time Series Analysis
• Text Analysis
Advanced Analytics - Technology and Tools
• Analytics for Unstructured Data (MapReduce and Hadoop)
• The Hadoop Ecosystem
• In-database Analytics – SQL Essentials
• Advanced SQL and MADlib for In-database Analytics
The Endgame, or Putting it All Together + Final Lab on Big Data Analytics
• Operationalizing an Analytics Project
• Creating the Final Deliverables
• Data Visualization Techniques
• + Final Lab – Application of the Data Analytics Lifecycle to a Big Data Analytics Challenge
42. Hadoop
43. Top companies need Hadoop
44. What is Hadoop and Where did it start?
• Created by Doug Cutting, formerly of Yahoo!, now at Cloudera
– HDFS (storage) & MapReduce (compute)
– Inspired by Google’s MapReduce and Google File System (GFS) papers
• Much of the initial work on Hadoop was done by Yahoo!
• It is now a top-level Apache project backed by a large open source development community
45. What is Hadoop?
Two Core Components
• HDFS: Storage in the Hadoop Distributed File System
• MapReduce: Compute via the MapReduce distributed processing platform
• Storage & Compute in 1 Framework
• Open Source Project of the Apache Software Foundation
• Written in Java
46. Hadoop cluster architecture
47. MapReduce example
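The slide's original figure is not reproduced here. As a stand-in, the following is a minimal single-process sketch of the MapReduce idea using the classic word count. It is not Hadoop code: the function names are illustrative, and on a real cluster the map and reduce tasks would run in parallel on many nodes over data stored in HDFS.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "big clusters"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines, which is what makes the model suitable for large data sets.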
48. Hadoop versions
49. Hadoop Wave Report
“EMC Greenplum is the first mover in Hadoop
appliances. EMC Greenplum [is] the first EDW vendor to
provide a full-featured enterprise-grade Hadoop
appliance and roll out an appliance family that integrates
its Hadoop, EDW, and data integration in a single rack. It
provides its own open source Hadoop distribution
software, integrates EMC’s strong storage product
portfolio in its appliances, and has an extensive
professional services force of EMC technical consultants
and data scientists with Hadoop expertise.”
50. Hadoop Players Today
51. Get Started With Hadoop Today
Data Scientists & Hadoop Architecture teams deliver customer success
Hadoop Architecture Services
– POC planning and deployment
– Installation and best practices
– Educate the team
Greenplum Analytics Labs
– Leverage the expertise of Greenplum’s
Data Scientists
– Packaged solutions that produce business
value and actionable results
– Accelerate Hadoop capabilities on your
data with your analysts
Establish a strategic vision
– Roadmap for Hadoop and unified analytics
52. The Greenplum Unified Analytics Platform
53. NoSQL
54. Definition
from nosql-database.org
• Next Generation Databases mostly addressing
some of the points: being non-relational,
distributed, open-source and horizontally
scalable. The original intention has been modern
web-scale databases. The movement began
early 2009 and is growing rapidly. Often more
characteristics apply as: schema-free, easy
replication support, simple API, eventually
consistent /BASE (not ACID), a huge data
amount, and more. So the misleading term "nosql"
(the community now translates it mostly with "not
only sql") should be seen as an alias to
something like the definition above.
55. NoSQL
http://nosql-database.org/
• Non relational
• Scalability
– Vertically
• Add more resources (CPU, memory, storage) to a single machine
– Horizontally
• Add more machines and spread the data across them
• Collection of structures
– Hashtables, maps, dictionaries
• No pre-defined schema
• No join operations
• CAP not ACID
– CAP: Consistency, Availability and Partition tolerance (but not all three at once!)
– ACID: Atomicity, Consistency, Isolation and Durability
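Horizontal scaling is usually achieved by partitioning keys across machines. The sketch below is one simple way to do that, assuming hash-based partitioning over three in-memory dictionaries that stand in for storage nodes. All names are hypothetical and no particular NoSQL product works exactly this way.

```python
import hashlib

def shard_for(key, num_nodes):
    """Pick a node index for a key using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

nodes = [dict() for _ in range(3)]  # three hypothetical storage nodes

def put(key, value):
    nodes[shard_for(key, len(nodes))][key] = value

def get(key):
    return nodes[shard_for(key, len(nodes))].get(key)

put("user:1", "alice")
put("user:2", "bob")
print(get("user:1"), get("user:2"))
```

With plain modulo placement, changing the number of nodes remaps most keys; real systems often use consistent hashing to limit that data movement.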
56. Advantages of NoSQL
• Cheap, easy to implement
• Data are replicated and can be partitioned
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Quickly process large amounts of data
• Relax the data consistency requirement (CAP)
• Can handle web-scale data, whereas Relational
DBs cannot
57. Disadvantages of NoSQL
• New and sometimes buggy
• Data is generally duplicated, potential for
inconsistency
• No standardized schema
• No standard format for queries
• No standard language
• Difficult to impose complicated structures
• Depend on the application layer to enforce data
integrity
• No guarantee of support
• Too many options: which one, or ones, to pick?
58. NoSQL Options
Key-Value Stores
• This technology you know and love and use all the
time
– Hashmap for example
• Put(key,value)
• value = Get(key)
• Examples
– Redis (my favorite!!) – in memory store
– Memcached
– and 100s more
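The Put/Get interface above maps directly onto a hash map. A minimal in-memory sketch, using a Python dict as the underlying store (the class name is illustrative and this is not the API of Redis or Memcached):

```python
class KeyValueStore:
    """In-memory key-value store backed by a Python dict (a hash map)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("session:42", "logged-in")
value = store.get("session:42")
print(value)
```

The store treats the value as opaque: it can only be fetched by its key, never searched by its contents.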
59. Column Stores
• Not to be confused with the relational-db version
of this
– Sybase-IQ etc.
• Multi-dimensional map
• Not all entries are relevant each time
– Column families
• Examples
– Cassandra
– HBase
– Amazon SimpleDB
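The "multi-dimensional map" idea can be pictured as nested dictionaries: row key, then column family, then column. The following is a toy sketch under that assumption. The helper names and sample data are made up, and it does not reflect the actual Cassandra or HBase APIs.

```python
from collections import defaultdict

# table[row_key][column_family][column] -> value
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    table[row_key][family][column] = value

def get_family(row_key, family):
    """Read one column family of a row; other families are not touched."""
    return dict(table[row_key][family])

put("user1", "profile", "name", "Ann")
put("user1", "profile", "city", "Oslo")
put("user1", "activity", "last_login", "2012-01-01")
put("user2", "profile", "name", "Bob")  # rows need not share the same columns

print(get_family("user1", "profile"))
```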
60. Document Stores
• Key-document stores
– However the document can be seen as a value, so
you can consider this a superset of key-value
• Big difference is that in document stores one can
query also on the document, i.e. the document
portion is structured (not just a blob of data)
• Examples
– MongoDB
– CouchDB
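To make the difference from a plain key-value store concrete, here is a small sketch in which documents are Python dicts and a query can match on fields inside them. The `find` helper and the sample documents are hypothetical; this is not the MongoDB or CouchDB query API.

```python
documents = {
    "doc1": {"type": "user", "name": "Ann", "city": "Oslo"},
    "doc2": {"type": "user", "name": "Bob", "city": "Rome"},
    "doc3": {"type": "order", "user": "Ann", "total": 30},
}

def get(key):
    """Key-value style access: fetch the whole document by key."""
    return documents.get(key)

def find(**criteria):
    """Document-store style access: match on fields inside the documents."""
    return sorted(
        key for key, doc in documents.items()
        if all(doc.get(field) == wanted for field, wanted in criteria.items())
    )

print(find(type="user", city="Oslo"))
```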
61. Graph Stores
• Use a graph structure
– Labeled, directed, attributed multi-graph
• Label for each edge
• Directed edges
• Multiple attributes per node
• Multiple edges between nodes
– Relational DBs can model graphs, but an edge
requires a join which is expensive
• Example Neo4j
– http://www.infoq.com/articles/graph-nosql-neo4j
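The graph structure described on this slide can be sketched with plain Python containers: nodes carry attributes, and each node keeps a list of its outgoing labeled edges, so following a relationship is a direct lookup rather than a join. The class and sample data are illustrative and unrelated to the Neo4j API.

```python
from collections import defaultdict

class Graph:
    """Labeled, directed, attributed multigraph."""

    def __init__(self):
        self.nodes = {}                     # node id -> dict of attributes
        self.out_edges = defaultdict(list)  # source -> [(label, target), ...]

    def add_node(self, node_id, **attributes):
        self.nodes[node_id] = attributes

    def add_edge(self, source, label, target):
        # Appending allows several edges between the same pair of nodes.
        self.out_edges[source].append((label, target))

    def neighbors(self, source, label):
        """Follow outgoing edges of one node that carry the given label."""
        return [t for l, t in self.out_edges[source] if l == label]

g = Graph()
g.add_node("ann", age=30, city="Oslo")
g.add_node("bob", age=25)
g.add_edge("ann", "knows", "bob")
g.add_edge("ann", "works_with", "bob")  # second edge between the same nodes
print(g.neighbors("ann", "knows"))
```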