Copyright © 2012 EMC Corporation. All Rights Reserved.
EMC² Proven Professional
Big Data Analytics, Data Science & Fast Data
Kunal Joshi
joshik@vmware.com
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
BIG DATA
DATA SCIENCE
FAST DATA
Big Data Pioneers
1,000,000,000 Queries A Day
250,000,000 New Photos / Day
290,000,000 Updates / Day
Other Companies using Big Data
4,000,000 Claims / Day
2,800,000,000 Trades / Day
31,000,000,000 Interactions / Day
Moore’s Law
Gordon Moore (Founder of Intel)
The number of transistors that can be placed on a processor DOUBLES approximately every TWO years.
Introduction to Big Data Analytics
What is Big Data?
What makes data, “Big” Data?
Your Thoughts?
Big Data Defined
• “Big Data” is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
 Requires new data architectures and analytic sandboxes
 New tools
 New analytical methods
 Integration of multiple skills into the new role of the data scientist
• Organizations are deriving business benefit from analyzing ever-larger and more complex data sets that increasingly require real-time or near-real-time capabilities.
Source: McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity
Key Characteristics of Big Data
1. Data Volume
 44x increase from 2010 to 2020 (1.2 zettabytes to 35.2 ZB)
2. Processing Complexity
 Changing data structures
 Use cases warranting additional transformations and analytical techniques
3. Data Structure
 Greater variety of data structures to mine and analyze
Module 1: Introduction to BDA
Big Data Characteristics: Data Structures
Data Growth is Increasingly Unstructured
From more structured to less structured:
• Structured: Data containing a defined data type, format, and structure. Example: transaction data and OLAP.
• Semi-Structured: Textual data files with a discernible pattern, enabling parsing. Example: XML data files that are self-describing and defined by an XML schema.
• “Quasi”-Structured: Textual data with erratic data formats that can be formatted with effort, tools, and time. Example: web clickstream data that may contain some inconsistencies in data values and formats.
• Unstructured: Data that has no inherent structure and is usually stored as different types of files. Example: text documents, PDFs, images, and video.
Four Main Types of Data Structures
• Structured Data
• Semi-Structured Data: e.g., a web page viewed as its source markup (View → Source)
• Quasi-Structured Data: e.g., a clickstream URL such as
http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
• Unstructured Data: e.g., the text of “The Red Wheelbarrow” by William Carlos Williams
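Quasi-structured data like the clickstream URL above can be coaxed into structure with some effort. A minimal sketch using Python's standard urllib.parse (the URL here is a simplified stand-in for the one on the slide):

```python
from urllib.parse import urlparse, parse_qs

# Quasi-structured data: a raw clickstream URL carries no declared schema,
# but with a little work its query string yields structured fields.
url = ("http://www.google.com/search?hl=en&q=data+scientist"
       "&pq=big+data&source=hp&biw=1382&bih=651")

parsed = urlparse(url)
params = parse_qs(parsed.query)

record = {
    "host": parsed.netloc,
    "query": params.get("q", [""])[0],            # current search term
    "previous_query": params.get("pq", [""])[0],  # prior search term
}
print(record)
```

Each raw URL in a clickstream log can be turned into a row like this, after which it is ordinary structured data.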
Business Drivers for Big Data Analytics
1. Desire to optimize business operations: sales, pricing, profitability, efficiency
2. Desire to identify business risk: customer churn, fraud, default
3. Predict new business opportunities: upsell, cross-sell, best new customer prospects
4. Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending, Basel II
Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data-Driven
Challenges with a Traditional Data Warehouse
• Siloed analytics and “spread marts” around departmental warehouses
• Non-agile models: static schemas accrete over time
• Non-prioritized data provisioning: prioritized operational processes come first
• Errant data & marts
Implications of a Traditional Data Warehouse
• High-value data is hard to reach and leverage
• Predictive analytics & data mining activities are last
in line for data
 Queued after prioritized operational processes
• Data is moving in batches from EDW to local
analytical tools
 In-memory analytics (such as R, SAS, SPSS, Excel)
 Sampling can skew model accuracy
• Isolated, ad hoc analytic projects, rather than
centrally-managed harnessing of analytics
 Non-standardized initiatives
 Frequently, not aligned with corporate business goals
Slow
“time-to-insight”
&
reduced
business impact
Opportunities for a New Approach to Analytics
New Applications Driving Data Volume
• 1990s (RDBMS & Data Warehouse): volume measured in terabytes (1 TB = 1,000 GB)
• 2000s (Content & Digital Asset Management): measured in petabytes (1 PB = 1,000 TB)
• 2010s (NoSQL & Key/Value): will be measured in exabytes (1 EB = 1,000 PB)
Considerations for Big Data Analytics
Criteria for Big Data Projects:
1. Speed of decision making
2. Throughput
3. Analysis flexibility
New Analytic Architecture: the Analytic Sandbox
• Data assets gathered from multiple sources and technologies for analysis
• Enables high-performance analytics using in-database processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA-owned”
State of the Practice in Analytics: Mini-Case Study
Big Data Enabled Loan Processing at XYZ Bank: comparing assessed underwriting risk levels when only traditional data is leveraged vs. when Big Data is leveraged.
Your Thoughts?
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Big Data Analytics: Industry Examples
1. Health Care: reducing cost of care
2. Public Services: preventing pandemics
3. Life Sciences: genomic mapping
4. IT Infrastructure: unstructured data analysis
5. Online Services: social media for professionals
Data collectors span retail, phone/TV, government, Internet, medical, and financial sources.
Big Data Analytics: Healthcare
Situation
• Poor police response and problems with medical care, triggered by the shooting of a Rutgers student
• The event drove a local doctor to map crime data and examine local health care
Use of Big Data
• Dr. Jeffrey Brenner generated his own crime maps from the medical billing records of 3 hospitals
• City hospitals & ERs provided expensive, low-quality care
Key Outcomes
• Reduced hospital costs by 56% by realizing that 80% of the city's medical costs came from 13% of its residents, mainly low-income or elderly
• Now offers preventative care over the phone or through home visits
Big Data Analytics: Public Services
Situation
• The threat of global pandemics has increased exponentially
• Pandemics spread at faster rates and are more resistant to antibiotics
Use of Big Data
• Created a network of viral listening posts
• Combines data from viral discovery in the field, research in disease hotspots, and social media trends
• Uses Big Data to make accurate predictions on the spread of new pandemics
Key Outcomes
• Identified a fifth form of human malaria, including its origin
• Identified why efforts failed to control swine flu
• Proposing more proactive approaches to preventing outbreaks
Big Data Analytics: Life Sciences
Situation
• Broad Institute (MIT & Harvard) mapping the Human Genome
Use of Big Data
• In 13 years, mapped 3 billion genetic base pairs; 8 petabytes of data
• Developed 30+ software packages, now shared publicly along with the genomic data
Key Outcomes
• Using genetic mappings to identify cellular mutations causing cancer and other serious diseases
• Innovating how genomic research informs new pharmaceutical drugs
Big Data Analytics: IT Infrastructure
Situation
• The explosion of unstructured data required new technology to analyze it quickly and efficiently
Use of Big Data
• Doug Cutting created Hadoop to divide large processing tasks into smaller tasks across many computers
• Analyzes social media data generated by hundreds of thousands of users
Key Outcomes
• The New York Times used Hadoop to transform its entire public archive, from 1851 to 1922, into 11 million PDF files in 24 hours
• Applications range from social media and sentiment analysis to wartime chatter and natural language processing
Big Data Analytics: Online Services
Situation
• Opportunity to create a social media space for professionals
Use of Big Data
• Collects and analyzes data from over 100 million users
• Adding 1 million new users per week
Key Outcomes
• LinkedIn Skills, InMaps, Job Recommendations, Recruiting
• Established a diverse data scientist group, as the founder believes this is the start of the Big Data revolution
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Greenplum Unified Analytic Platform
Partner Tools & Services
GREENPLUM CHORUS – Analytic Productivity Layer
Data Science Team: Data Scientist, Data Engineer, Data Analyst, BI Analyst, LOB User, Data Platform Admin
Greenplum gNet connects the GREENPLUM DATABASE and GREENPLUM HD
Runs on cloud, x86 infrastructure, or an appliance
• Unify your team
• Drive collaboration
• Keep your options open
• The power of data co-processing
Greenplum Hadoop
STRUCTURED ↔ UNSTRUCTURED
• MapReduce (Java), Hive, Pig
• Flat files, SequenceFiles, XML, JSON, …
• Schema on load; directories; no ETL
Greenplum Database
STRUCTURED ↔ UNSTRUCTURED
• SQL, RDBMS, tables and schemas
• Greenplum MapReduce
• Indexing, partitioning, BI tools
What do we Mean by Hadoop
• A framework for handling big data
 An implementation of the MapReduce paradigm
 Hadoop glues the storage and analytics together and provides reliability, scalability, and management
Two Main Components
• Storage (Big Data): HDFS, the Hadoop Distributed File System – a reliable, redundant, distributed file system optimized for large files
• MapReduce (Analytics): a programming model for processing sets of data, mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s)
Module 5: Advanced Analytics - Technology and Tools
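The map/shuffle/reduce flow can be sketched in a few lines of plain Python — a single-process toy illustrating the programming model, not Hadoop itself:

```python
from itertools import groupby
from operator import itemgetter

# Toy word count in the MapReduce style: map inputs to (key, value)
# pairs, shuffle/sort by key, then reduce each key's values.

def map_phase(lines):
    # Mapper: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group intermediate pairs by key, as the framework
    # does between the map and reduce stages.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(count for _, count in values)
            for word, values in grouped}

lines = ["big data needs big tools", "hadoop stores big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # "big" appears three times
```

On a cluster, the mappers and reducers run on many machines over HDFS blocks; the logic per record is the same.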
Hadoop Distributed File System
MapReduce and HDFS
A client submits a job over a large data set (log files, sensor data); the Job Tracker coordinates multiple Task Trackers, each running Map and Reduce jobs over data stored in the Hadoop Distributed File System (HDFS).
Components of Hadoop
• Pig: data flow language & execution environment (more of the mechanics of Hadoop visible)
• Hive: SQL-based language
• HBase: queries against defined tables (less Hadoop visible; closer to a DBMS view)
• As you move from Pig to Hive to HBase, you move increasingly away from the mechanics of Hadoop toward an RDBMS view of the Big Data world
Greenplum Database
Extreme Performance for Analytics
• Optimized for BI and analytics
 Deep integration with statistical packages
 High performance parallel implementations
• Simple and automatic parallelization
 Just load and query like any database
 Tables are automatically distributed
across nodes
 No need for manual partitioning or tuning
• Extremely scalable
 MPP* shared-nothing architecture
 All nodes can scan and process in parallel
 Linear scalability by adding nodes where each
node adds storage, query & load performance
*MPP – Massively Parallel Processing
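A toy sketch of the shared-nothing idea: rows are hash-distributed across segments by a distribution key, each segment aggregates only its own slice, and a master combines the partial results. Segment count and data here are illustrative, not Greenplum internals:

```python
from collections import defaultdict
from zlib import crc32

# Shared-nothing MPP in miniature: hash a distribution key to pick a
# segment, let each segment aggregate locally, then merge partials.
SEGMENTS = 4

def segment_for(key):
    # Stable hash so a given key always lands on the same segment.
    return crc32(key.encode()) % SEGMENTS

def distribute(rows):
    slices = defaultdict(list)
    for customer, amount in rows:
        slices[segment_for(customer)].append(amount)
    return slices

def parallel_sum(rows):
    # Each segment computes a local sum; the master adds the partials.
    partials = [sum(amounts) for amounts in distribute(rows).values()]
    return sum(partials)

rows = [("alice", 40), ("bob", 25), ("alice", 10), ("carol", 5)]
print(parallel_sum(rows))  # 80
```

Because no segment ever needs another segment's slice for its local scan, adding nodes adds storage, query, and load capacity roughly linearly.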
Greenplum DB & HD
Massively Parallel Access and Movement
• Maximize solution flexibility; minimize data duplication
• Access Hadoop data in real time from Greenplum DB via external tables
• Import and export in text, binary, and compressed formats
• Custom formats via user-written MapReduce Java programs and GPDB format classes
• gNet over 10Gb Ethernet connects the Greenplum DB side (master host, segments 1–3) with the Hadoop side (nodes 1–3)
Analytical Software: Exploiting Parallelism
In-Database Analytics
• Math & statistical functions run next to the data, returning analytic results directly
• A master processor coordinates independent segment processors over an interconnect switch
• Each segment has independent memory and an independent direct storage connection
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Big Data Requires Data Science
• Business Intelligence: standard reporting – "What happened?" (looks at the past)
• Data Science: predictive analysis – "What if…?" (looks to the future, with higher business value)
Data science and business intelligence
• "Traditional BI": GBs to 10s of TBs; operational data; structured; repetitive
• "Big Data Analytics": 10s of TBs to PBs; external + operational data; mostly semi-structured; experimental, ad hoc
Profile of a Data Scientist
Data Science as a Process
Context: People and Ecosystem; Domain. Stages: Data Prep → Variable Selection → Model Building → Model Execution → Communication & Operationalization → Evaluate
People and Ecosystem
• People
• Scientists / Analysts
• Business Analysts
• Consumers of analysis
• Stakeholders
• EMC sales and services
• Ecosystem
• Sector (Telecom, banking, security agency etc.)
• Modeling software and other tools used by analysts
(MADlib, SAS, R etc.)
• Database (Greenplum) & Data Sources
Domain
Discovery & prioritized identification of
opportunities
• Customer Retention
• Fraud detection
• Pricing
• Marketing effectiveness and optimization
• Product Recommendation
• Others……
Data Prep
• What are the data sources?
• Do we have access to them?
• How big are they?
• How often are they updated?
• How far back do they go?
• Which of these data sources are being used for
analysis? Can we use a data source which is currently
unused? What problems would that help us solve?
Variable Selection
• Selection of raw variables which are
potentially relevant to problem being
solved
• Transformations to create a set of
candidate variables
• Clustering and other types of
categorization which could provide
insights
Model Building
Pick suitable statistics, or suitable model form and algorithm
and build model
Model Execution
The model needs to be executable in database on big data
with reasonable execution time
Communication & Operationalization
The model results need to be communicated &
operationalized to have a measurable impact on the
business
Evaluate
• Accuracy of results and forecasts
• Analysis of real-world experiments
• A/B testing on target samples
• End-user and LOB feedback
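The A/B-testing step can be sketched with a two-proportion z-test comparing control and treatment conversion rates. The counts below are illustrative, not real campaign data, and the normal approximation is assumed:

```python
from math import sqrt, erf

# Two-proportion z-test: did the treatment group convert at a
# significantly different rate than the control group?

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Control: 200 conversions in 10,000; treatment: 260 in 10,000.
z, p = two_proportion_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(round(z, 2), round(p, 4))
```

A small p-value is evidence the model-driven treatment changed behavior; the same test applies to churn, fraud-alert, or offer-acceptance rates.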
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Use Case 1 - Trip Modeling
Problem: Analyze behaviour of
visitors to MakeMyTrip.com
Particularly interested in
unregistered visitors
– About 99% of total visitor traffic
Applications of model
• Tailor promotions for popular types of trips
 Most popular types probably already well-known; potential in
next tier down
• ... and for different types of customers
• Present customised promotions to visitors based on clicks
• Ad optimization: present ads based on modelled behavior
Hypertargeting
• Serving content to customers based on individual
characteristics and preferences, rather than broad
generalizations
Available data
• Data available from server:
 Date/time
 IP address
 Parts of site visited
• Geographic location can be obtained via geo lookup on IP
• Personal information available for registered visitors only
Approach
• Use clustering to identify trip/visitor types
 Sport (IPL, F1, football, etc.)
 Festivals
 Other seasonal movements
• Decision trees to predict which type of trip a visitor is likely
to make
 Based on successively more information as they move
through the site
• Use registered visitor info to augment models
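The clustering step can be sketched with a tiny 1-D k-means on a single "days booked ahead of travel" feature, separating last-minute trips from planned holidays. The feature and the numbers are illustrative; a real model would cluster on many clickstream variables:

```python
# Minimal 1-D k-means with two clusters (deterministic initialization
# at the extremes, so no random seed is needed).

def kmeans_1d(values, iters=20):
    centers = [min(values), max(values)]
    for _ in range(iters):
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Move each center to the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

days_ahead = [1, 2, 2, 3, 1, 45, 50, 60, 40, 55]
centers = kmeans_1d(days_ahead)
print(centers)  # one "last-minute" center, one "planned trip" center
```

The cluster a visitor falls into then becomes a feature for the decision-tree stage, which refines the prediction as more clicks arrive.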
Use Case 2 - Municipal Traffic Analysis
• Client domain: Municipal city government
• Available data:
Cross-city loop detectors measuring traffic volume
Detailed city bus movement information from Bluetooth devices
Video detection of traffic volume, velocity
• Goal: Exploit available data for unrealized business insights and
values
Data loading and manipulation
• Parallel data loading
– Data loaded from local file system and distributed across Greenplum
servers in parallel.
– Loading 9 months of traffic volume data (16 GB, 464 million rows) in 69.4
seconds.
• SQL data manipulation
– Standard SQL permits city personnel to use existing skillsets.
– Greenplum SQL extensions offer the control over data distribution.
– Open source packages (e.g. in Python, R) can be conveniently deployed
within Greenplum for visualization and analytics purposes.
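The quoted load figures are easy to sanity-check; a quick back-of-the-envelope calculation, assuming decimal GB:

```python
# Quoted parallel-load result: 16 GB, 464 million rows, 69.4 seconds.
gigabytes, rows, seconds = 16, 464_000_000, 69.4

mb_per_second = gigabytes * 1000 / seconds
rows_per_second = rows / seconds
print(round(mb_per_second), "MB/s,",
      round(rows_per_second / 1e6, 1), "million rows/s")
```

That is roughly a couple hundred MB and several million rows loaded per second, which is the point of scattering the load across all segments in parallel rather than funneling it through a single head node.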
Basic reporting on traffic volume
• Easy generation of reports via straightforward user-defined functions
• Standard graphing utilities called from within Greenplum to create figures
• Detector downtimes can be clearly spotted in the figure, or via an SQL query, thus
mitigating maintenance challenges
Basic reporting on city buses
• Data from Bluetooth devices has a wealth of information on city
buses that we can report on:
 Travel route of each bus
 Deviations of arrival times compared to provided timetable
 Occurrences of driver errors (e.g. taking a wrong turn) and possible
causes
 Occurrences where the same bus service arrives at the same stop
within seconds of each other
 Whether new bus services translate into lower traffic volume on the roads where they are introduced
Result visualizations (Google Earth)
Applications for traffic network modelling
• Compute the fastest path between any two locations at a
future time point
• Identify potential bottlenecks in the traffic
• Identify phase transition points for massive traffic congestion
using simulation techniques
• Study the likely impact of new roads and traffic policies,
without having to observe real disruptive events to
determine the impact
Parallel traffic network modelling
• Greenplum's parallel architecture permits traffic network analysis on a city scale
• Travel time can be predicted via model learning, involving hundreds of thousands of optimizations in parallel, across the entire traffic network
• Variables that can be considered include: distance between two locations, concurrent traffic volume, time of day, weather, construction work
• Computationally prohibitive for traditional non-parallel database environments
Use Case 3 - Product Recommendation Analysis
• Eight banks became one
 Branches across the US
• Consolidation of products and customers
 Employees faced with new products and customers
 Visibility into churn and retention was challenged
• Analytics focus was historically reporting-centric
 Descriptive "hindsight"
Customer Segmentation
Customer segments
– First, define a measurement of
customer value
– Then create clusters of
customers based on customer
value, and then product
profiles.
Association Rules
Product associations
– Now find products that are
common in the segment, but
not owned by the given
household.
Product Recommendations
Next best offer
– Now, filter down to products
associated with high-value
customers in the same segment.
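The association step can be sketched as simple pair co-occurrence counting over holdings in a segment, then recommending products associated with what a household already owns. The product names and holdings are illustrative, not the bank's data:

```python
from collections import defaultdict
from itertools import combinations

def pair_counts(holdings):
    # Count how often each product pair co-occurs in one household.
    counts = defaultdict(int)
    for products in holdings:
        for a, b in combinations(sorted(products), 2):
            counts[(a, b)] += 1
    return counts

def recommend(owned, holdings, top_n=2):
    # Score unowned products by co-occurrence with owned ones.
    scores = defaultdict(int)
    for (a, b), n in pair_counts(holdings).items():
        if a in owned and b not in owned:
            scores[b] += n
        elif b in owned and a not in owned:
            scores[a] += n
    return [p for p, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:top_n]

holdings = [
    {"checking", "savings", "credit card"},
    {"checking", "savings", "mortgage"},
    {"checking", "credit card"},
    {"savings", "credit card"},
]
print(recommend({"checking"}, holdings))
```

A production recommender would add support/confidence thresholds and restrict the counts to high-value customers in the same segment, as the slides describe.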
Product Recommender
Increased customer value
Customer Comments
– "The Greenplum Solution has scaled from 6 to 11 TB of data."
– Processing moved from 7 hours for one month of data to 7.5 hours for 2.5 years of data
Agenda
1. Introduction to Big Data Analytics
2. Big Data Analytics - Use Cases
3. Technologies for Big Data Analytics
4. Introduction to Data Science
5. Data Science - Use Cases
6. Introduction to Fast Data
7. Fast Data - Use Cases
8. Fast Data meets Big Data
Ferrari vs. Freight Train
• 0–100 km/h: 2.3 seconds vs. 100 seconds
• Top speed: 360 km/h vs. 140 km/h
• Stops per hour: 1,000 vs. 5
• Horsepower: 660 bhp vs. 16,000 bhp
• Throughput: 220 kg in 27 minutes vs. 55,000,000 kg in 60 minutes
Fast Data vs. Big Data
• Transactions per second: 100,000+ vs. N/A
• Concurrent hits: 10,000+ per second vs. 10 per second
• Update patterns: read/write vs. appends
• Data complexity: simple joins on a few tables vs. can be highly complex
• Data volumes: GBs to TBs vs. PB to ZB
• Access tools: GemFire / SQLFire vs. GP DB, GP Hadoop
Not a fast OLTP DB!
Fast Data is
• More than just an OLTP DB
• Super-fast access to data
• Server-side flexibility
• Data is highly available (HA)
• Supports transactions
• Setup is fault tolerant
• Can handle thousands of concurrent hits
• Distributed, hence horizontally scalable
• Runs on cheap x86 hardware
CAP Theorem
A distributed system can only achieve TWO out of the three qualities of Consistency, Availability, and Partition Tolerance.
Fast Data =
Database
• Storage
• Persistence
• Transactions
• Queries
• High availability, load balancing, data replication, L1 caching
+ Service Bus
• Service loose coupling
• Data transformation
• System integration
+ Messaging System
• Guaranteed delivery
• Event propagation
• Data distribution
+ Complex Event Processor
• Event-driven architectures
• Real-time analysis
• Business event detection
+ Grid Controller
• Map-Reduce, Scatter-Gather
• Distributed task assignment
• Task decomposition
• Result summarization
Fast Data combines select features from all of these products into a low-latency, linearly scalable, memory-based data fabric.
A Typical Fast Data Setup
• Load balancer → web tier → application tier → database tier → storage tier
• Add/remove web, application, and data servers; add/remove storage
• Disks may be direct or network attached
• Optional reliable, asynchronous feed to a Big Data store
Memory-based Performance
Perform
Fast Data uses memory on a peer machine to make data updates durable, allowing the updating thread to return 10x to 100x faster than updates that must be written through to disk, without risking any data loss. Typical latencies are in the few hundreds of microseconds instead of the tens to hundreds of milliseconds.
One can optionally write updates to a disk, data warehouse, or big data store asynchronously and reliably.
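The asynchronous write path described above is the classic write-behind pattern. A minimal sketch, with a dict standing in for the in-memory copy and a list standing in for the slow store (not GemFire's actual API):

```python
import queue
import threading

class WriteBehindCache:
    """Acknowledge updates from memory; persist in the background."""

    def __init__(self):
        self.memory = {}          # fast, in-memory copy
        self.persisted = []       # stand-in for disk / big data store
        self._q = queue.Queue()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def put(self, key, value):
        self.memory[key] = value  # caller returns immediately
        self._q.put((key, value))  # persistence happens later

    def _drain(self):
        while True:
            item = self._q.get()
            self.persisted.append(item)  # the slow, asynchronous write
            self._q.task_done()

    def flush(self):
        self._q.join()            # wait for pending writes to land

cache = WriteBehindCache()
cache.put("order:1", {"qty": 2})
cache.put("order:2", {"qty": 5})
cache.flush()
print(len(cache.persisted))  # 2
```

In a real data fabric the durability guarantee before the flush comes from replicating the update to a peer's memory, not from the local queue alone.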
WAN Distribution
Distribute
Fast Data can keep clusters that are distributed around the world synchronized in real time, and can operate reliably in disconnected, intermittent, and low-bandwidth network environments.
Distributed Events
Targeted, guaranteed-delivery event notification and Continuous Queries
Notify
Parallel Queries
Compute: a Batch Controller or Client issues Scatter-Gather (Map-Reduce) queries across the cluster.
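A scatter-gather query fans the same predicate out to every data partition in parallel and merges the partial results. A minimal sketch, with plain Python threads standing in for cluster members and an invented order table:

```python
from concurrent.futures import ThreadPoolExecutor

# Three partitions of an order table, as they might be spread across members.
partitions = [
    [{"id": 1, "total": 40}, {"id": 2, "total": 250}],
    [{"id": 3, "total": 120}],
    [{"id": 4, "total": 75}, {"id": 5, "total": 310}],
]

def scatter_gather(predicate):
    """Run the predicate against every partition concurrently, then merge."""
    def query(part):
        return [row for row in part if predicate(row)]
    with ThreadPoolExecutor() as pool:
        results = pool.map(query, partitions)               # scatter
    return [row for partial in results for row in partial]  # gather

big_orders = scatter_gather(lambda row: row["total"] > 100)
print(sorted(r["id"] for r in big_orders))  # → [2, 3, 5]
```

Each partition scans only its own slice, so the query time is governed by the largest partition rather than the whole data set.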
Data-Aware Routing
Execute: from a Batch Controller or Client, Fast Data provides "data-aware function routing", moving the behavior to the correct data instead of moving the data to the behavior.
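Data-aware routing can be shown with a deterministic partitioning function: the client hashes the key, finds the member that owns it, and executes the function there instead of pulling the record across the network. The node names, customer keys, and CRC-based routing scheme below are all invented for illustration:

```python
import zlib

NODES = ["node-a", "node-b", "node-c"]

def owner(key):
    """Stable routing: hash the key's bytes (not Python's per-process hash())."""
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

# Each node's local store, populated through the same routing function.
data = {node: {} for node in NODES}
for key, balance in [("cust-17", 900), ("cust-42", 1200), ("cust-88", 50)]:
    data[owner(key)][key] = {"balance": balance}

def execute_on(key, fn):
    """Ship the function to the node that owns the key; run it against local data."""
    node = owner(key)
    return node, fn(data[node][key])

node, result = execute_on("cust-42", lambda rec: rec["balance"] * 1.05)
print(node, result)
```

Because reads and writes for a key always land on the owning node, the function runs next to its data and only the small result travels back.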
Accessing Fast Data
GemFire
• Stores objects (Java, C++, C#, .NET) or unstructured data
• Key-value store with OQL queries
• Spring-GemFire integration
• L2 Cache plugin for Hibernate
• HTTP Session replication module
SQLFire
• Stores relational data with a SQL interface
• Supports JDBC, ODBC, Java and .NET interfaces
• Uses existing relational tools
(Example schema: an Order contains Order Line Items with Quantity and Discount; each line item references a Product with SKU and Unit Price.)
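The two access styles can be contrasted with stand-ins from the Python standard library: a key-value store holds the whole Order object and is queried by navigating it, while the SQL interface exposes the same line items to existing relational tools. This is a local dict/sqlite3 sketch of the contrast only, not the GemFire or SQLFire API; the order data is invented.

```python
import sqlite3

# Key-value style (GemFire-like): the whole Order object lives under one key.
kv = {}
kv["order:1001"] = {"line_items": [{"sku": "A-100", "qty": 3, "unit_price": 9.5},
                                   {"sku": "B-200", "qty": 1, "unit_price": 42.0}]}
order_total = sum(li["qty"] * li["unit_price"] for li in kv["order:1001"]["line_items"])

# SQL style (SQLFire-like): the same data as rows, reachable by relational tooling.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE line_item (order_id INT, sku TEXT, qty INT, unit_price REAL)")
db.executemany("INSERT INTO line_item VALUES (1001, ?, ?, ?)",
               [("A-100", 3, 9.5), ("B-200", 1, 42.0)])
(sql_total,) = db.execute(
    "SELECT SUM(qty * unit_price) FROM line_item WHERE order_id = 1001").fetchone()

print(order_total, sql_total)  # both 70.5
```

Same data, two interfaces: the object form suits application code, the SQL form suits reporting and existing JDBC/ODBC tools.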
Use Cases
Applying the technology: a few examples of Fast Data technology applied to real business cases.
Mainframe Migration
A mainframe-based, nightly customer account reconciliation batch run took two hours on a 0-120 minute scale, with the mainframe at 15% CPU busy, 9% I/O wait, and 76% CPU unavailable. On a COTS cluster, the batch now runs in 60 seconds, 93% of which is network wait; the time could have been reduced further with higher network bandwidth.
Mainframe Migration
So what? So the batch runs faster – who cares?
1. It ran on cheaper, modern, scalable hardware
2. If something goes wrong with the batch, you only wait 60 seconds to find out
3. The hardware and the data are now available to do other things in the remaining 119 minutes:
• Fraud detection
• Regulatory compliance
• Re-run risk calculations with 119 different scenarios
• Upsell customers
4. You can move from batch to real-time processing!
Online Betting
A popular online gambling site attracts new players through customized banner ads on affiliate sites; each request flows from the affiliate's web server to a banner ad server and back. In a fraction of a second, the banner ad server must:
1. Generate a tracking id specific to the request
2. Apply temporal, sequential, regional, contractual and other policies in order to decide which banner to deliver
3. Customize the banner
4. Record that the banner ad was delivered
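The four steps can be sketched as a single request handler. The banner names, the one-rule policy, and the delivery log below are all invented for illustration; a real system would evaluate a far richer rule set under the same sub-millisecond budget.

```python
import uuid

delivery_log = []                          # hypothetical record of delivered ads

def serve_banner(affiliate, region):
    tracking_id = uuid.uuid4().hex                              # 1. request-specific tracking id
    # 2. decide which banner to deliver; one toy regional rule stands in
    #    for the temporal/sequential/regional/contractual policy set
    banner = "poker-uk" if region == "UK" else "casino-generic"
    creative = f"{banner}?aff={affiliate}&tid={tracking_id}"    # 3. customize the banner
    delivery_log.append((tracking_id, affiliate, banner))       # 4. record the delivery
    return creative

ad = serve_banner("affiliate-7", "UK")
print(ad)
```

Every delivered ad carries its tracking id, so later sign-ups can be attributed back to the affiliate that earned them.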
Online Betting (Contd.)
Their initial RDBMS-based system:
• Limited their ability to sign up new affiliates
• Limited their ability to add new products on their site
• Limited the delivery performance experienced by their affiliates and their customers
• Limited their ability to add additional internal applications and policies to the process
Their new Fast Data based system:
• Responded with sub-millisecond latency
• Met their target of 2,500 banner ad deliveries per second
• Provides for future scalability
• Improved performance to the browser by 4x
• Cost less
Asset/Position Monitoring
Needed a real-time situational awareness system to track assets that could be used by the war fighters in theatre:
• Centralized data storage was not possible
• Multi-agency, multi-force integration
• Numerous applications needed simultaneous access to multiple data sources
• Networks constantly changing, unreliable; mobile deployments
• Upwards of 60,000 object updates each minute
• Over 70 data feeds
Northrop Grumman (the integrator) investigated the following technologies before deciding on GemFire:
• RDBMS: Oracle, Sybase, Postgres, TimesTen, MySQL
• ODBMS: Objectivity
• jCache: GemFire, Oracle Coherence
• JMS: SonicMQ, BEA WebLogic, IBM, JBoss
• TIBCO Rendezvous
• Web Services
Asset/Position Monitoring
• 655 sites, 11,000 users
• Real-time, 3-dimensional NASA World Wind user interface
• 60,000 position updates per minute
• Real-time info available on the desks of:
  - The President of the United States
  - The US Secretary of Defense
  - Each of the Joint Chiefs of Staff
  - Every commander in the US Military
Global Foreign Exchange Trading System
The project achieved:
• Low-latency trade insertion
• Permanent archival of every trade
• Kept pace with fast-ticking market data
• Rapid, event-based position calculation
• Distribution of position updates globally
• Consistent global views of positions
• Pass the Book
• Regional close-of-day
• High Availability
• Disaster Recovery
• Regional Autonomy
Global Foreign Exchange Trading System
In that same application, Fast Data replaced:
• A Sybase database in every region (one instance is still needed for archival purposes)
• TIBCO Rendezvous for local area messaging
• IBM MQ Series for WAN distribution
• Veritas N+1 clustering for H/A (in fact, the physical +1 node itself is saved)
• 3DNS or Wide IP
Admin personnel were reduced from 1.5 to 0.5.
Application High Level Overview
A single database can't handle both OLTP and OLAP workloads.
Big Data Setup
How to get the best of Fast and Big Data: applications serve concurrent hits from the Fast Data setup, and fall back to the Big Data store in case a record isn't available there.
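The setup described above amounts to a read-through, write-behind arrangement: applications hit the in-memory fast-data tier, fall back to the big-data store on a miss, and changes flow back asynchronously. A minimal sketch of that flow, with dicts standing in for both tiers and invented user records:

```python
import queue
import threading

big_store = {"user:9": {"name": "Asha"}}   # authoritative Big Data store
fast = {}                                  # in-memory Fast Data tier
outbox = queue.Queue()                     # reliable async feed back to the big store

def read(key):
    if key not in fast:                    # record isn't available in the fast tier...
        fast[key] = big_store[key]         # ...so read it through from the big store
    return fast[key]

def write(key, value):
    fast[key] = value                      # concurrent hits are served from memory
    outbox.put((key, value))               # the big store is updated asynchronously

def drain():
    while True:
        key, value = outbox.get()          # replay queued changes into the big store
        big_store[key] = value
        outbox.task_done()

threading.Thread(target=drain, daemon=True).start()

first = read("user:9")                     # first read faults the record into memory
write("user:9", {"name": "Asha K."})
outbox.join()                              # demo only: wait for the async update
```

Hot records live in memory where concurrent hits are cheap, while the big store remains the system of record and absorbs updates at its own pace.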
Big data, data science & fast data
  • 8. Big Data Defined: "Big Data" is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. It requires new data architectures and analytic sandboxes, new tools, new analytical methods, and the integration of multiple skills into the new role of the data scientist. Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real-time capabilities. (Source: McKinsey, May 2011, "Big Data: The next frontier for innovation, competition, and productivity")
  • 9. Key Characteristics of Big Data: (1) Data Volume: a 44x increase from 2009 to 2020 (0.8 zettabytes to 35.2 ZB); (2) Processing Complexity: changing data structures, and use cases warranting additional transformations and analytical techniques; (3) Data Structure: a greater variety of data structures to mine and analyze.
  • 10. Big Data Characteristics: Data Structures. Data growth is increasingly unstructured. From more to less structured: Structured: data with a defined data type, format, and structure (example: transaction data and OLAP). Semi-structured: textual data files with a discernible pattern, enabling parsing (example: XML data files that are self-describing and defined by an XML schema). "Quasi"-structured: textual data with erratic formats that can be formatted with effort, tools, and time (example: web clickstream data that may contain some inconsistencies in data values and formats). Unstructured: data with no inherent structure, usually stored as different types of files (example: text documents, PDFs, images and video).
  • 11. Four Main Types of Data Structures, illustrated: structured data; semi-structured data; quasi-structured data (example: a Google search URL such as http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651); unstructured data (example: "The Red Wheelbarrow" by William Carlos Williams, inspected via View → Source).
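The distinction matters operationally: semi-structured data can be parsed directly because the structure travels with the data, whereas a poem or a PDF would first need text mining. A short sketch with a standard parser; the order fragment is invented:

```python
import xml.etree.ElementTree as ET

# A self-describing, semi-structured fragment: field names travel with the data.
doc = """<order id="1001">
  <item sku="A-100" qty="3"/>
  <item sku="B-200" qty="1"/>
</order>"""

root = ET.fromstring(doc)                             # structure is discoverable from the data
skus = [item.get("sku") for item in root.iter("item")]
print(root.get("id"), skus)   # → 1001 ['A-100', 'B-200']
```

No schema had to be agreed on in advance; the tags themselves say where each field lives.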
  • 12. Business Drivers for Big Data Analytics
    Current business problems provide opportunities for organizations to become more analytical and data-driven:
    1. Desire to optimize business operations (sales, pricing, profitability, efficiency)
    2. Desire to identify business risk (customer churn, fraud, default)
    3. Predict new business opportunities (upsell, cross-sell, best new customer prospects)
    4. Comply with laws or regulatory requirements (Anti-Money Laundering, Fair Lending, Basel II)
  • 13. Challenges with a Traditional Data Warehouse
    1. Non-prioritized data provisioning: prioritized operational processes come first
    2. Non-agile models: static schemas accrete over time
    3. Siloed analytics: departmental warehouses
    4. "Spread marts": errant data and marts
  • 14. Implications of a Traditional Data Warehouse
    • High-value data is hard to reach and leverage
    • Predictive analytics and data mining activities are last in line for data, queued after prioritized operational processes
    • Data moves in batches from the EDW to local analytical tools (in-memory analytics such as R, SAS, SPSS, Excel); sampling can skew model accuracy
    • Isolated, ad hoc analytic projects rather than centrally managed harnessing of analytics: non-standardized initiatives, frequently not aligned with corporate business goals
    Net result: slow "time-to-insight" and reduced business impact
  • 15. Opportunities for a New Approach to Analytics: New Applications Driving Data Volume
    1990s (RDBMS & data warehouse): volumes measured in terabytes (1 TB = 1,000 GB)
    2000s (content & digital asset management): volumes measured in petabytes (1 PB = 1,000 TB)
    2010s (NoSQL & key/value): volumes will be measured in exabytes (1 EB = 1,000 PB)
  • 16. Considerations for Big Data Analytics
    Criteria for Big Data projects: 1. speed of decision making, 2. throughput, 3. analysis flexibility
    New analytic architecture: the analytic sandbox, with data assets gathered from multiple sources and technologies for analysis
    • Enables high-performance analytics using in-database processing
    • Reduces costs associated with data replication into "shadow" file systems
    • "Analyst-owned" rather than "DBA-owned"
  • 17. State of the Practice in Analytics: Mini-Case Study
    Big Data enabled loan processing at XYZ bank: traditional underwriting risk level (traditional data leveraged) vs. Big Data enabled underwriting risk level (big data leveraged)
    Your Thoughts?
  • 18. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Agenda 1. Introduction to Big Data Analytics 2. Big Data Analytics - Use Cases 3. Technologies for Big Data Analytics 4. Introduction to Data Science 5. Data Science - Use Cases 6. Introduction to Fast Data 7. Fast Data - Use Cases 8. Fast Data meets Big Data
  • 19. Big Data Analytics: Industry Examples
    1. Health care: reducing cost of care
    2. Public services: preventing pandemics
    3. Life sciences: genomic mapping
    4. IT infrastructure: unstructured data analysis
    5. Online services: social media for professionals
    Data collectors: retail, phone/TV, government, Internet, medical, financial
  • 20. Big Data Analytics: Healthcare
    Situation: poor police response and problems with medical care, triggered by the shooting of a Rutgers student; the event drove a local doctor to map crime data and examine local health care
    Use of Big Data: Dr. Jeffrey Brenner generated his own crime maps from the medical billing records of 3 hospitals; city hospitals and ERs provided expensive, low-quality care
    Key outcomes: reduced hospital costs by 56% by realizing that 80% of the city's medical costs came from 13% of its residents, mainly low-income or elderly; now offers preventative care over the phone or through home visits
  • 21. Big Data Analytics: Public Services
    Situation: the threat of global pandemics has increased exponentially; pandemics spread at faster rates and are more resistant to antibiotics
    Use of Big Data: created a network of viral listening posts; combines data from viral discovery in the field, research in disease hotspots, and social media trends; uses Big Data to make accurate predictions on the spread of new pandemics
    Key outcomes: identified a fifth form of human malaria, including its origin; identified why efforts failed to control swine flu; proposing more proactive approaches to preventing outbreaks
  • 22. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Big Data Analytics: Life Sciences Use of Big Data Key Outcomes Situation • Broad Institute (MIT & Harvard) mapping the Human Genome • In 13 yrs, mapped 3 billion genetic base pairs; 8 petabytes • Developed 30+ software packages, now shared publicly, along with the genomic data • Using genetic mappings to identify cellular mutations causing cancer and other serious diseases • Innovating how genomic research informs new pharmaceutical drugs 3 Module 1: Introduction to BDA 22
  • 23. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Big Data Analytics: IT Infrastructure Use of Big Data Key Outcomes Situation • Explosion of unstructured data required new technology to analyze quickly, and efficiently • Doug Cutting created Hadoop to divide large processing tasks into smaller tasks across many computers • Analyzes social media data generated by hundreds of thousands of users • New York Times used Hadoop to transform its entire public archive, from 1851 to 1922, into 11 million PDF files in 24 hrs • Applications range from social media, sentiment analysis, wartime chatter, natural language processing 4 Module 1: Introduction to BDA 23
  • 24. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Big Data Analytics: Online Services Use of Big Data Key Outcomes Situation • Opportunity to create social media space for professionals • Collects and analyzes data from over 100 million users • Adding 1 million new users per week • LinkedIn Skills, InMaps, Job Recommendations, Recruiting • Established a diverse data scientist group, as founder believes this is the start of Big Data revolution 5 Module 1: Introduction to BDA 24
  • 25. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Agenda 1. Introduction to Big Data Analytics 2. Big Data Analytics - Use Cases 3. Technologies for Big Data Analytics 4. Introduction to Data Science 5. Data Science - Use Cases 6. Introduction to Fast Data 7. Fast Data - Use Cases 8. Fast Data meets Big Data
  • 26. Greenplum Unified Analytic Platform
    Partner tools & services; GREENPLUM CHORUS (analytic productivity layer); Greenplum gNet; GREENPLUM DATABASE and GREENPLUM HD
    Data science team roles: data scientist, data engineer, data analyst, BI analyst, LOB user, data platform admin
    Runs on cloud, x86 infrastructure, or appliance
    Unify your team · Drive collaboration · Keep your options open · The power of data co-processing
  • 27. Greenplum Hadoop: unstructured side of the spectrum
    Flat files; XML, JSON, …; schema on load; directories; no ETL; Java; SequenceFile; Pig; MapReduce; Hive
  • 28. Greenplum Database: structured side of the spectrum
    SQL; RDBMS; tables and schemas; indexing; partitioning; Greenplum MapReduce; BI tools
  • 29. What Do We Mean by Hadoop?
    A framework for handling big data: an implementation of the MapReduce paradigm. Hadoop glues the storage and analytics together and provides reliability, scalability, and management.
    Two main components:
     Storage (big data): HDFS, the Hadoop Distributed File System; a reliable, redundant, distributed file system optimized for large files
     MapReduce (analytics): a programming model for processing sets of data; mapping inputs to outputs and reducing the output of multiple mappers to one (or a few) answers
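The two components above can be illustrated in miniature. The sketch below is plain Python, not Hadoop code: it shows the MapReduce programming model (map, shuffle/group, reduce) on the classic word-count job, which is an illustration chosen here rather than anything from this deck.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Mapper: emit a (word, 1) pair for every word in one input line.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: fold the per-key values down to a single answer.
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(word_count(["big data big analytics", "fast data"]))
# prints {'big': 2, 'data': 2, 'analytics': 1, 'fast': 1}
```

A framework like Hadoop runs these same phases, but distributes the mappers and reducers across many machines and handles failures and data locality for you.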
  • 30. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Hadoop Distributed File System 31Module 5: Advanced Analytics - Technology and Tools
  • 31. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL MapReduce and HDFS Task Tracker Task Tracker Task Tracker Job Tracker Hadoop Distributed File System (HDFS) Client/Dev Large Data Set (Log files, Sensor Data) Map Job Reduce Job Map Job Reduce Job Map Job Reduce Job Map Job Reduce Job Map Job Reduce Job Map Job Reduce Job 2 1 3 4 32Module 5: Advanced Analytics - Technology and Tools
  • 32. Components of Hadoop
    As you move from Pig to Hive to HBase, you move away from the mechanics of Hadoop and toward an RDBMS view of the Big Data world:
     Pig: data flow language and execution environment (more of Hadoop visible)
     Hive: SQL-based language
     HBase: queries against defined tables (less of Hadoop visible; DBMS view)
  • 33. Greenplum Database: Extreme Performance for Analytics
    • Optimized for BI and analytics: deep integration with statistical packages; high-performance parallel implementations
    • Simple and automatic parallelization: just load and query like any database; tables are automatically distributed across nodes; no need for manual partitioning or tuning
    • Extremely scalable: MPP (massively parallel processing) shared-nothing architecture; all nodes can scan and process in parallel; linear scalability by adding nodes, where each node adds storage, query, and load performance
  • 34. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Greenplum DB & HD Massively Parallel Access and Movement Maximize Solution Flexibility Minimize Data Duplication Access Hadoop Data in Real Time From Greenplum DB Import and export in Text, Binary and Compressed Formats Custom formats via user-written MapReduce Java program And GPDB Format classes gNet 10Gb Ethernet Greenplum DB Hadoop Node 1 Node 2 Node 3 Segment 1 Segment 2 Segment 3 GP DB Master Host Map Reduce User- Defined Binary TextExternal Tables
  • 35. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Analytical Software Exploiting Parallelism In-Database Analytics Analytic Results Interconnect Storage Independent Segment Processors Independent Memory Independent Direct Storage Connection Master Segment Processor Interconnect Switch Math & Statistical Functions
  • 36. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Agenda 1. Introduction to Big Data Analytics 2. Big Data Analytics - Use Cases 3. Technologies for Big Data Analytics 4. Introduction to Data Science 5. Data Science - Use Cases 6. Introduction to Fast Data 7. Fast Data - Use Cases 8. Fast Data meets Big Data
  • 37. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Big Data Requires Data Science Data Science • Predictive analysis • What if…..? Business Intelligence • Standard reporting • What happened? High FuturePast TIME BUSINESS VALUE Business Intelligence Data Science Low
  • 38. Data Science and Business Intelligence
    "Traditional BI": GBs to 10s of TBs; operational data; structured; repetitive analysis
    "Big Data analytics": 10s of TBs to PBs; external + operational data; mostly semi-structured; experimental, ad hoc analysis
  • 39. Profile of a Data Scientist
  • 40. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL People and Ecosystem Domain Data Science as a Process Data Prep Variable Selection Model Building Model Execution Communication & Operationalization Evaluate • People • Scientists / Analysts • Business Analysts • Consumers of analysis • Stakeholders • EMC sales and services • Ecosystem • Sector (Telecom, banking, security agency etc.) • Modeling software and other tools used by analysts (MADlib, SAS, R etc.) • Database (Greenplum) & Data Sources
  • 41. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL People and Ecosystem Domain Data Science as a Process Data Prep Variable Selection Model Building Model Execution Communication & Operationalization Evaluate Discovery & prioritized identification of opportunities • Customer Retention • Fraud detection • Pricing • Marketing effectiveness and optimization • Product Recommendation • Others……
  • 42. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL People and Ecosystem Domain Data Science as a Process Data Prep Variable Selection Model Building Model Execution Communication & Operationalization Evaluate • What are the data sources? • Do we have access to them? • How big are they? • How often are they updated? • How far back do they go? • Which of these data sources are being used for analysis? Can we use a data source which is currently unused? What problems would that help us solve?
  • 43. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL People and Ecosystem Domain Data Science as a Process Data Prep Variable Selection Model Building Model Execution Communication & Operationalization Evaluate • Selection of raw variables which are potentially relevant to problem being solved • Transformations to create a set of candidate variables • Clustering and other types of categorization which could provide insights
  • 44. People and Ecosystem Domain Data Science as a Process Data Prep Variable Selection Model Building Model Execution Communication & Operationalization Evaluate Pick suitable statistics, or a suitable model form and algorithm, and build the model
  • 45. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL People and Ecosystem Domain Data Science as a Process Data Prep Variable Selection Model Building Model Execution Communication & Operationalization Evaluate The model needs to be executable in database on big data with reasonable execution time
  • 46. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL People and Ecosystem Domain Data Science as a Process Data Prep Variable Selection Model Building Model Execution Communication & Operationalization Evaluate The model results need to be communicated & operationalized to have a measurable impact on the business
  • 47. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL People and Ecosystem Domain Data Science as a Process Data Prep Variable Selection Model Building Model Execution Communication & Operationalization Evaluate • Accuracy of results and forecasts • Analysis of real-world experiments • A/B testing on target samples • End-user and LOB feedback
  • 48. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Agenda 1. Introduction to Big Data Analytics 2. Big Data Analytics - Use Cases 3. Technologies for Big Data Analytics 4. Introduction to Data Science 5. Data Science - Use Cases 6. Introduction to Fast Data 7. Fast Data - Use Cases 8. Fast Data meets Big Data
  • 49. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Use Case 1 Trip modeling Problem: Analyze behaviour of visitors to MakeMyTrip.com Particularly interested in unregistered visitors – About 99% of total visitor traffic
  • 50. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Applications of model • Tailor promotions for popular types of trips  Most popular types probably already well-known; potential in next tier down • ... and for different types of customers • Present customised promotions to visitors based on clicks • Ad optimization: present ads based on modelled behavior
  • 51. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Hypertargeting • Serving content to customers based on individual characteristics and preferences, rather than broad generalizations
  • 52. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Available data • Data available from server:  Date/time  IP address  Parts of site visited • Geographic location can be obtained via geo lookup on IP • Personal information available for registered visitors only
  • 53. Approach
    • Use clustering to identify trip/visitor types:  sport (IPL, F1, football, etc.)  festivals  other seasonal movements
    • Decision trees to predict which type of trip a visitor is likely to make, based on successively more information as they move through the site
    • Use registered visitor info to augment models
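The clustering step above can be sketched with a tiny stdlib-only k-means. The two clickstream features (pages viewed, session minutes), the visitor data, and the initial centroids are invented for illustration; they are not MakeMyTrip's actual variables.

```python
def kmeans(points, centroids, iterations=10):
    # Minimal k-means: alternate assignment and centroid-update steps.
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical visitors: quick browsers vs. engaged trip planners.
visitors = [(2, 3), (3, 4), (2, 5), (20, 30), (22, 28), (19, 35)]
centroids, clusters = kmeans(visitors, centroids=[(0, 0), (30, 30)])
print(len(clusters[0]), len(clusters[1]))  # prints 3 3
```

Each recovered cluster would then be inspected and labeled (sport trip, festival, seasonal), and cluster membership becomes the target for the decision-tree step.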
  • 54. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Use Case 2 Municipal traffic analysis • Client domain: Municipal city government • Available data: Cross-city loop detectors measuring traffic volume Detailed city bus movement information from Bluetooth devices Video detection of traffic volume, velocity • Goal: Exploit available data for unrealized business insights and values
  • 55. Data Loading and Manipulation
    • Parallel data loading: data loaded from the local file system and distributed across Greenplum servers in parallel; 9 months of traffic volume data (16 GB, 464 million rows) loaded in 69.4 seconds
    • SQL data manipulation: standard SQL lets city personnel use existing skill sets; Greenplum SQL extensions offer control over data distribution; open-source packages (e.g. in Python, R) can be conveniently deployed within Greenplum for visualization and analytics
  • 56. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Basic reporting on traffic volume • Easy generation of reports via straightforward user-defined functions • Standard graphing utilities called from within Greenplum to create figures • Detector downtimes can be clearly spotted in the figure, or via an SQL query, thus mitigating maintenance challenges
  • 57. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Basic reporting on city buses • Data from Bluetooth devices has a wealth of information on city buses that we can report on:  Travel route of each bus  Deviations of arrival times compared to provided timetable  Occurrences of driver errors (e.g. taking a wrong turn) and possible causes  Occurrences where the same bus service arrives at the same stop within seconds of each other  Whether new bus services translates into lower traffic volume on introduced roads
  • 58. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Result visualizations (Google Earth)
  • 59. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Applications for traffic network modelling • Compute the fastest path between any two locations at a future time point • Identify potential bottlenecks in the traffic • Identify phase transition points for massive traffic congestion using simulation techniques • Study the likely impact of new roads and traffic policies, without having to observe real disruptive events to determine the impact
  • 60. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. • Greenplum‟s parallel architecture permits traffic network analysis on a city scale • Travel time can be predicted via model learning, involving hundreds of thousands of optimizations in parallel, across the entire traffic network • Variables that can be considered include Distance between two locations Concurrent traffic volume Time of day Weather Construction work • Computationally prohibitive for traditional non-parallel database environments Parallel traffic network modelling
  • 61. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Use Case 3 - Product Recommendation Analysis • Eight banks became one  Branches across the US • Consolidation of products and customers  Employees faced with new products and customers  Visibility into churn and retention was challenged • Analytics focus was historically reporting- centric  Descriptive “hindsight”`
  • 62. Customer Segmentation
    Customer segments: first, define a measurement of customer value; then create clusters of customers based on customer value, and then product profiles.
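As a hedged sketch of the two steps above, here is a toy value metric and a rank-based tiering function. The value formula, the numbers, and the tier count are all assumptions for illustration; the bank's actual segmentation would use richer features.

```python
def customer_value(balances, monthly_fees):
    # Assumed toy value metric: total balances plus a year of fee revenue.
    return sum(balances) + 12 * monthly_fees

def tier(values, n_tiers=3):
    # Rank-based bucketing: equal-sized tiers from lowest to highest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = -(-len(values) // n_tiers)  # ceiling division
    tiers = [0] * len(values)
    for rank, idx in enumerate(order):
        tiers[idx] = rank // size
    return tiers

# Hypothetical households: (account balances, monthly fees).
households = [customer_value(b, f) for b, f in
              [([1000], 5), ([250000, 40000], 20), ([80, 20], 0),
               ([90000], 10), ([12000, 3000], 8), ([600], 2)]]
print(tier(households))  # prints [1, 2, 0, 2, 1, 0]
```

The tier index then becomes one input to the clustering and product-profile steps that follow.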
  • 63. Association Rules
    Product associations: now find products that are common in the segment, but not owned by the given household.
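A minimal, illustrative version of the association step: score pairwise rules of the form "owns A, so likely owns B" by support and confidence over per-household product sets. Product names and baskets are invented; production systems would use an Apriori- or FP-growth-style miner over the full catalog.

```python
from itertools import combinations
from collections import Counter

def pair_rules(baskets, min_support=0.3):
    # support(A, B) = fraction of baskets containing both;
    # confidence(A -> B) = P(owns B | owns A).
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(pair for b in baskets
                          for pair in combinations(sorted(b), 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        if count / n < min_support:
            continue  # prune rare pairs
        rules[(a, b)] = count / item_counts[a]
        rules[(b, a)] = count / item_counts[b]
    return rules

# Hypothetical household product sets.
baskets = [{"checking", "savings"}, {"checking", "savings", "cd"},
           {"checking", "mortgage"}, {"savings", "cd"}]
rules = pair_rules(baskets)
print(round(rules[("checking", "savings")], 2))  # prints 0.67
```

A recommendation for a household would then be the highest-confidence consequent the household does not already own.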
  • 64. Product Recommendations
    Next best offer: now filter down to products associated with high-value customers in the same segment.
  • 65. Product Recommender: Increased Customer Value
    Customer comments: "The Greenplum solution has scaled from 6 to 11 TB of data."
    Moved from processing one month of data in 7 hours to processing 2.5 years of data in 7.5 hours
  • 66. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Agenda 1. Introduction to Big Data Analytics 2. Big Data Analytics - Use Cases 3. Technologies for Big Data Analytics 4. Introduction to Data Science 5. Data Science - Use Cases 6. Introduction to Fast Data 7. Fast Data - Use Cases 8. Fast Data meets Big Data
  • 67. Ferrari vs. Freight Train
                      Ferrari               Freight train
    0-100 KMPH        2.3 seconds           100 seconds
    Top speed         360 KMPH              140 KMPH
    Stops / hr        1,000                 5
    Horsepower        660 bhp               16,000 bhp
    Throughput        220 kg in 27 mins     55,000,000 kg in 60 mins
  • 68. Fast Data vs. Big Data
                            Fast Data                       Big Data
    Transactions / second   100,000+                        n/a
    Concurrent hits         10,000+ per second              10 per second
    Update patterns         Read / write                    Appends
    Data complexity         Simple joins on a few tables    Can be highly complex
    Data volumes            GBs / TB                        PB to ZB
    Access tools            GemFire / SQLFire               GP DB, GP Hadoop
  • 69. Not a fast OLTP DB!
  • 70. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Fast Data is • More than just an OLTP DB • Super Fast access to Data • Server side flexibility • Data is HA • Supports transactions • Setup is fault tolerant • Can handle thousands of concurrent hits • Distributed hence horizontally scalable • Runs on cheap x86 hardware
  • 71. CAP Theorem
    A distributed system can only achieve TWO out of the three qualities of Consistency, Availability, and Partition Tolerance.
  • 72. Fast Data =
    Service Bus (service loose coupling, data transformation, system integration)
    + Messaging System (guaranteed delivery, event propagation, data distribution)
    + Complex Event Processor (event-driven architectures, real-time analysis, business event detection)
    + Database (storage, persistence, transactions, queries, high availability, load balancing, data replication, L1 caching)
    + Grid Controller (map-reduce / scatter-gather, distributed task assignment, task decomposition, result summarization)
    Fast Data combines select features from all of these products into a low-latency, linearly scalable, memory-based data fabric.
  • 73. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL A Typical Fast Data Setup Web Tier Application Tier Load Balancer Add/remove web/application/data servers Add/remove storage Database Tier Storage Tier Disks may be direct or network attached Optional reliable, asynchronous feed to a Big Data Store
  • 74. Memory-based Performance (Perform)
    Fast Data uses memory on a peer machine to make data updates durable, allowing the updating thread to return 10x to 100x faster than updates that must be written through to disk, without risking any data loss. Typical latencies are in the few hundreds of microseconds instead of tens to hundreds of milliseconds. Updates can optionally be written to a disk / data warehouse / big data store asynchronously and reliably.
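The write-behind idea described above can be sketched deterministically. This is a single-threaded stand-in for what a fabric like GemFire does with replication and asynchronous flushing; the class and names are hypothetical, and the dict `disk` stands in for the slow durable store.

```python
from collections import deque

class WriteBehindCache:
    def __init__(self, backing_store):
        self.memory = {}          # fast, authoritative in-memory copy
        self.pending = deque()    # queue of updates awaiting durable write
        self.backing_store = backing_store

    def put(self, key, value):
        # The caller returns as soon as the in-memory write and enqueue finish.
        self.memory[key] = value
        self.pending.append((key, value))

    def flush(self):
        # In a real system this drains on a background thread; here it is
        # called explicitly so the behavior stays deterministic.
        while self.pending:
            key, value = self.pending.popleft()
            self.backing_store[key] = value

disk = {}
cache = WriteBehindCache(disk)
cache.put("position:EURUSD", 1_000_000)
assert disk == {}     # disk write deferred...
cache.flush()
print(disk)           # ...but nothing is lost once the queue drains
```

In the real fabric, durability before the flush comes from the synchronous copy held in a peer's memory, which is what makes the deferred disk write safe.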
  • 75. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL WAN Distribution Distribute Fast Data can keep clusters that are distributed around the world synchronized in real- time and can operate reliably in Disconnected, Intermittent and Low-Bandwidth network environments.
  • 76. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Distributed Events Targeted, guaranteed delivery, event notification and Continuous Queries Notify
  • 77. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Parallel Queries Batch Controller or Client Scatter-Gather (Map-Reduce) Queries Compute
  • 78. Data-Aware Routing (Execute)
    Fast Data provides "data-aware function routing": moving the behavior to the correct data instead of moving the data to the behavior. A batch controller or client dispatches a data-aware function to the node that owns the data.
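A sketch of the routing idea, under the assumption of a simple hash-partitioned cluster; the node names and partition scheme are invented for illustration, and a real fabric would serialize the function and execute it remotely rather than via a local lookup.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner(key):
    # Stable hash so every member of the cluster computes the same owner.
    digest = hashlib.md5(key.encode()).digest()
    return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]

def route_and_execute(key, fn, data_by_node):
    # Ship the function to the owning node's partition, not the data to us.
    node = owner(key)
    return node, fn(data_by_node[node].get(key))

# Each node holds only its own partition of the data.
data_by_node = {n: {} for n in NODES}
data_by_node[owner("order:42")]["order:42"] = {"qty": 7}
node, result = route_and_execute("order:42", lambda rec: rec["qty"] * 2,
                                 data_by_node)
print(result)  # prints 14
```

The payoff is that only the small function and its small result cross the network, while the (potentially large) record never leaves the node that stores it.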
  • 79. Accessing Fast Data
    GemFire: stores objects (Java, C++, C#, .NET) or unstructured data; key-value store with OQL queries; Spring-GemFire; L2 cache plugin for Hibernate; HTTP session replication module
    SQLFire: stores relational data with a SQL interface (e.g. Order, Line Item, and Product tables with quantity, discount, SKU, unit price); supports JDBC, ODBC, Java, and .NET interfaces; uses existing relational tools
  • 80. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Agenda 1. Introduction to Big Data Analytics 2. Big Data Analytics - Use Cases 3. Technologies for Big Data Analytics 4. Introduction to Data Science 5. Data Science - Use Cases 6. Introduction to Fast Data 7. Fast Data - Use Cases 8. Fast Data meets Big Data
  • 81. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Use Cases Applying the technology A few examples of Fast Data technology applied to real business cases
  • 82. Mainframe Migration
    A mainframe-based, nightly customer account reconciliation batch run took 120 minutes (I/O wait 9%, CPU busy 15%, mainframe CPU unavailable 76%). On a COTS cluster, the batch now runs in 60 seconds, of which 93% is network wait; time could have been reduced further with higher network bandwidth.
  • 83. Mainframe Migration: So What?
    So the batch runs faster. Who cares?
    1. It ran on cheaper, modern, scalable hardware
    2. If something goes wrong with the batch, you only wait 60 seconds to find out
    3. The hardware and the data are now available to do other things in the remaining 119 minutes: fraud detection, regulatory compliance, re-running risk calculations with 119 different scenarios, upselling customers
    4. You can move from batch to real-time processing!
  • 84. Online Betting
    A popular online gambling site attracts new players through customized banner ads on affiliate sites, served by a banner ad server.
    In a fraction of a second, the banner ad server must:
    1. Generate a tracking id specific to the request
    2. Apply temporal, sequential, regional, contractual, and other policies to decide which banner to deliver
    3. Customize the banner
    4. Record that the banner ad was delivered
  • 85. Online Betting (Contd.)
    Their initial RDBMS-based system limited: their ability to sign up new affiliates; their ability to add new products on their site; the delivery performance experienced by their affiliates and their customers; their ability to add additional internal applications and policies to the process
    Their new Fast Data based system: responds with sub-millisecond latency; met their target of 2,500 banner ad deliveries per second; provides for future scalability; improved performance to the browser by 4x; cost less
  • 86. Asset/Position Monitoring
    Needed a real-time situational awareness system to track assets for war fighters in theatre:
    • Centralized data storage was not possible
    • Multi-agency, multi-force integration
    • Numerous applications needed access to multiple data sources simultaneously
    • Networks constantly changing, unreliable, mobile deployments
    • Upwards of 60,000 object updates each minute; over 70 data feeds
    Northrop Grumman (integrator) investigated the following technologies before deciding on GemFire: RDBMS (Oracle, Sybase, Postgres, TimesTen, MySQL); ODBMS (Objectivity); jCache (GemFire, Oracle Coherence); JMS (SonicMQ, BEA WebLogic, IBM, JBoss); TIBCO Rendezvous; Web Services
  • 87. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Asset/Position Monitoring 655 sites, 11 thousand users Real-time, 3 dimensional, NASA World Wind User Interface 60,000 Position updates per minute Real time info available on the desk of President of the United States US Secretary of Defense Each of the Joint Chiefs of Staff Every commander in the US Military
  • 88. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Low-latency trade insertion Permanent Archival of every trade Kept pace with fast ticking market data Rapid, Event Based Position Calculation Distribution of Position Updates Globally Consistent Global Views of Positions Pass the Book Regional Close-of-day High Availability Disaster Recovery Regional Autonomy The project achieved: Global Foreign Exchange Trading System
  • 89. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Global Foreign Exchange Trading System In that same application, Fast Data replaced: Sybase Database In Every Region Still need 1 instance for archival purposes TIBCO Rendezvous for Local Area Messaging IBM MQ Series for WAN Distribution Veritas N+1 Clustering for H/A In fact, we save the physical +1 node itself 3DNS or Wide IP Admin personnel reduced from 1.5 to 0.5
  • 90. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Agenda 1. Introduction to Big Data Analytics 2. Big Data Analytics - Use Cases 3. Technologies for Big Data Analytics 4. Introduction to Data Science 5. Data Science - Use Cases 6. Introduction to Fast Data 7. Fast Data - Use Cases 8. Fast Data meets Big Data
  • 91. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Application High Level Overview APPLICATION(S) A single DB can't handle both OLTP and OLAP workloads
  • 92. Copyright © 2012 EMC Corporation. All Rights Reserved. EMC 2 PROVEN PROFESSIONAL Copyright © 2011 EMC Corporation. All Rights Reserved. Big Data Setup APPLICATION(S) How to get the best of Fast & Big Data Fast Data Setup In case record isn't available Concurrent hits
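The "fast data in front of big data" pattern on this slide is essentially a read-through cache: the application hits the fast tier first and falls back to the big data store when the record isn't available. A minimal Python sketch, with plain dicts standing in for the in-memory tier and the deep store (the key names here are invented for the example):

```python
def read_record(key, fast_store, big_store):
    """Read-through: serve from the fast (in-memory) tier, fall back to big data."""
    if key in fast_store:
        return fast_store[key]      # fast path: in-memory hit
    value = big_store.get(key)      # miss: fetch from the deep store
    if value is not None:
        fast_store[key] = value     # populate the fast tier for next time
    return value

fast = {}
big = {"user:1": {"name": "Ada"}}
first = read_record("user:1", fast, big)   # first read comes from the big store
```

After the first read the record lives in the fast tier, so concurrent hits on the same key no longer touch the big data system, which is the point of combining the two setups.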

Editor's Notes

  1. Let’s start by looking at some of the pioneers in the big data space. These well known, and highly valuable, enterprises have built their business on Big Data. The numbers they support are staggering.
  2. But Big Data is for more than just internet companies. This slide shows some Greenplum customer examples who are leveraging big data to transform the business and drive new revenue streams. We will talk about these in more detail today.
  3. Think about what Big Data is for a moment. Share your thoughts with the group and write your notes in the space below. Is there a size threshold over which data becomes Big Data? How much does the complexity of its structure influence the designation as Big Data? How new are the analytical techniques?
  4. There are multiple characteristics of big data, but three stand out as defining characteristics: huge volume of data (for instance, tools that can manage billions of rows and billions of columns); complexity of data types and structures, with an increasing volume of unstructured data (80-90% of the data in existence is unstructured), part of the "Digital Shadow" or "Data Exhaust"; and the speed or velocity of new data creation. In addition, the data, due to its size or level of structure, cannot be efficiently analyzed using only traditional databases or methods. There are many examples of emerging big data opportunities and solutions. Here are a few: Netflix suggesting your next movie rental, dynamic monitoring of embedded sensors in bridges to detect real-time stresses and longer-term erosion, and retailers analyzing digital video streams to optimize product and display layouts and promotional spaces on a store-by-store basis. These kinds of big data problems require new tools and technologies to store, manage and realize the business benefit. The new architectures they necessitate are supported by new tools, processes and procedures that enable organizations to create, manipulate and manage these very large data sets and the storage environments that house them.
  5. Big data can come in multiple forms: everything from highly structured financial data, to text files, to multimedia files and genetic mappings. The high volume of the data is a consistent characteristic of big data. As a corollary, because of the complexity of the data itself, the preferred approach for processing big data is in parallel computing environments with Massively Parallel Processing (MPP), which enable simultaneous, parallel ingest, data loading and analysis. As we will see in the next slide, most big data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze. Let us examine the most prominent characteristic: its structure.
  6. The graphic shows different types of data structures, with 80-90% of future data growth coming from non-structured data types (semi-, quasi- and unstructured). Although the image shows four different, separate types of data, in reality these can be mixed together at times. For instance, you may have a classic RDBMS storing call logs for a software support call center. In this case, you may have typical structured data such as date/time stamps, machine types, problem type, and operating system, which were probably entered by the support desk person from a pull-down menu GUI. In addition, you will likely have unstructured or semi-structured data, such as free-form call log information taken from an email ticket of the problem or an actual phone call description of a technical problem and a solution. The most salient information is often hidden in there. Another possibility would be voice logs or audio transcripts of the actual call that might be associated with the structured data. Until recently, most analysts would NOT be able to analyze the most common and highly unstructured data in this call log history RDBMS, since the mining of the textual information is very labor intensive and could not be easily automated.
  7. Here are examples of what each of the 4 main different types of data structures may look like. People tend to be most familiar with analyzing structured data, while semi-structured data (shown as XML here), quasi-structured data (shown as a clickstream string), and unstructured data present different challenges and require different techniques to analyze. For each data type shown, answer these questions: What type of analytics are performed on these data? Who analyzes this kind of data? What types of data repositories are suited for each, or what requirements might you have for storing and cataloguing this kind of data? Who consumes the data? Who manages and owns the data?
  8. Here are 4 examples of common business problems that organizations contend with today, where they have an opportunity to leverage advanced analytics to create competitive advantage. Rather than doing standard reporting on these areas, organizations can apply advanced analytical techniques to optimize processes and derive more value from these typical tasks. The first 3 examples listed above are not new problems – companies have been trying to reduce customer churn, increase sales, and cross-sell customers for many years. What’s new is the opportunity to fuse advanced analytical techniques with big data to produce more impactful analyses for these old problems. Example 4 listed above portrays emerging regulatory requirements. Many compliance and regulatory laws have been in existence for decades, but additional requirements are added every year, which mean additional complexity and data requirements for organizations. These laws, such as anti-money laundering and fraud prevention, require advanced analytical techniques to manage well.
  9. The graphic shows a typical data warehouse and some of the challenges that it presents. For source data (1) to be loaded into the EDW, data needs to be well understood, structured and normalized with the appropriate data type definitions. While this kind of centralization enables organizations to enjoy the benefits of security, backup and failover of highly critical data, it also means that data must go through significant pre-processing and checkpoints before it can enter this sort of controlled environment, which does not lend itself to data exploration and iterative analytics. (2) As a result of this level of control on the EDW, shadow systems emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis. These local data marts do not have the same constraints for security and structure as the EDW does, and allow users across the enterprise to do some level of analysis. However, these one-off systems reside in isolation, often are not networked or connected to other data stores, and are generally not backed up. (3) Once in the data warehouse, data is fed to enterprise applications for business intelligence and reporting purposes. These are high-priority operational processes getting critical data feeds from the EDW. (4) At the end of this work flow, analysts get data provisioned for their downstream analytics. Since users cannot run custom or intensive analytics on production databases, analysts create data extracts from the EDW to analyze offline in R or other local analytical tools. Many times these tools are limited to in-memory analytics on desktops, analyzing samples of data rather than the entire population of a data set. Because these analyses are based on data extracts, they live in a separate location, and the results of the analysis, along with any insights on the quality of the data or anomalies, are rarely fed back into the main EDW repository.
Lastly, because data slowly accumulates in the EDW due to the rigorous validation and data structuring process, data is slow to move into the EDW and the schema is slow to change. An EDW may have been originally designed for a specific purpose and set of business needs, but over time evolves to house more and more data and enables business intelligence and the creation of OLAP cubes for analysis and reporting. The EDW provides limited means to accomplish these goals, achieving the objective of reporting, and sometimes the creation of dashboards, but generally limiting the ability of analysts to iterate on the data in a separate environment from the production environment, where they could conduct in-depth analytics or perform analysis on unstructured data.
  10. Today’s typical data architectures were designed for storing mission critical data, supporting enterprise applications, and enabling enterprise level reporting. These functions are still critical for organizations, although these architectures inhibit data exploration and more sophisticated analysis.
  11. …..describe or refer to NoSQL and KVP. Everyone and everything is leaving a digital footprint. The graphic above provides a perspective on sources of big data generated by new applications and the scale and growth rate of the data. These applications provide opportunities for new analytics and driving value for organizations. These data come from multiple sources, including: medical information, such as genomic sequencing and MRIs; increased use of broadband on the Web, including the 2 billion photos each month that Facebook users currently upload as well as the innumerable videos uploaded to YouTube and other multimedia sites; video surveillance; increased global use of mobile devices (the torrent of texting is not likely to cease); smart devices, with sensor-based collection of information from smart electric grids, smart buildings and much other public and industry infrastructure; and non-traditional IT devices, including the use of RFID readers, GPS navigation systems, and seismic processing. The Big Data trend is generating an enormous amount of information that requires advanced analytics and new market players to take advantage of it.
  12. Big data projects carry with them several considerations that you need to keep in mind to ensure this approach fits with what you are trying to achieve. Due to the characteristics of big data, these projects lend themselves to decision support for high-value, strategic decision making with high processing complexity. The analytic techniques being used in this context need to be iterative and flexible (analysis flexibility), due to the high volume of data and its complexity. These conditions give rise to complex analytical projects (such as predicting customer churn rates) that can be performed with some latency (consider the speed of decision making needed), or by operationalizing these analytical techniques using a combination of advanced analytical methods, big data and machine learning algorithms to provide real-time (requires high throughput) or near real-time analysis, such as recommendation engines that look at your recent web history and purchasing behavior. In addition, to be successful you will need a different approach to the data architecture than seen in today's typical EDWs. Analysts need to partner with IT and DBAs to get the data they need within an analytic sandbox, which contains raw data, aggregated data, and data with multiple kinds of structure. The sandbox requires a more savvy user to take advantage of it and leverage it for exploring data in a more robust way.
  13. The loan process has been honed to a science over the past several decades. Unfortunately, today's realities require that lenders take more care to make better decisions with fewer resources than they've had in the past. The typical loan process uses a set of data on which pre-approval and underwriting approval is based, including: income data, such as pay and income tax records; employment history, to establish the ability to meet loan obligations; credit history, including credit scores and outstanding debt; and appraisal data associated with the asset for which the loan is made (such as a home, boat, or car). This model works but it's not perfect; in fact, the loan crisis in the US is proof that using only these data points may not be enough to gauge the risk associated with making sound lending decisions and pricing loans properly. Case Study Exercise. Objectives: using additional data sources, dramatically improve the quality of the loan underwriting process; streamline the process to yield results in less time. Directions: suggest kinds of publicly available data (big data) that you can leverage to supplement the traditional lending process; suggest types of analysis you would perform with the data to reduce the bank's risk and expedite the lending process.
  14. This is the standard format we will use for each representative example.
  15. Check http://wiki.apache.org/hadoop/PoweredBy for examples of how people are using Hadoop. Check this article on the large-scale image conversion: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/. And check this for an ad for a 'computer' from 1892: http://query.nytimes.com/mem/archive-free/pdf?res=9F07E0D81438E233A25751C0A9639C94639ED7CF
  16. Use the space here to record your answers to these questions:
  17. Greenplum is driving the future of Big Data analytics with the industry's first Unified Analytics Platform (UAP), which delivers: our award-winning Greenplum Database for structured data; our enterprise Hadoop offering, Greenplum HD, for the analysis and processing of unstructured data; and Greenplum Chorus, which acts as the productivity layer for the data science team. Greenplum UAP is more than just integrated software working together; it is a single, unified platform enabling powerful and agile analytics that can transform how your organization uses data. What sets this diagram apart from a typical vendor example is the inclusion of people. That is not a mistake. We have introduced the Unified Analytics Platform, but there is more to the story than technology, and I will talk more about that in a few minutes. UAP is about enabling this emerging group of talent, the new practitioners, that we refer to as the Data Science team. This team can include the data platform administrator, data scientist, analysts, engineers, BI teams, and most importantly the line-of-business user, and how they participate on this data science team. We develop, package, and support this as a unified software platform available over your favorite commodity hardware, cloud infrastructure, or from our modular Data Computing Appliance. MOORE'S LAW (named after Gordon Moore, the co-founder of Intel) states that the number of transistors that can be placed in a processor will double approximately every two years, for half the cost. But trends in chip design are changing to face new realities. While we can still double the number of transistors per unit area at this pace, this does not necessarily result in faster single-threaded performance. New processors such as the Intel Core 2 and Itanium 2 architectures now focus on embedding many smaller CPUs or "cores" onto the same physical device. This allows multiple threads to process twice as much data in parallel, but at the same speed at which they operated previously.
  18. Greenplum database's strengths are in the structured side of the house. The functionality is based around the fact that the data is structured. With GP MapReduce and large text objects, Greenplum database is able to do some things that are considered unstructured data analysis.
  20. Unfortunately, people may use the word "Hadoop" to mean multiple things. They may use it to describe the MapReduce paradigm, or they may use it to describe massive unstructured data storage using commodity hardware (although commodity doesn't mean inexpensive). On the other hand, they may be referring to the Java classes provided by Hadoop that support HDFS file types or provide MapReduce job management. Or they may be referring to HDFS, the Hadoop distributed file system. And they might mean both HDFS and MapReduce. The point is that Hadoop enables the Data Scientist to create MapReduce jobs quickly and efficiently. As we shall see, one can utilize Hadoop at multiple levels: writing MapReduce modules in Java, leveraging streaming mode to write such functions in one of several scripting languages, or utilizing a higher-level interface such as Pig or Hive. The Web site http://hadoop.apache.org/ provides a solid foundation for unstructured data mining and management. So what exactly is Hadoop anyway? The quick answer is that Hadoop is a framework for performing Big Data Analytics, and as such is an implementation of the MapReduce programming model. Hadoop is comprised of two main components: HDFS for storing big data and MapReduce for big data analytics. The storage function consists of HDFS (Hadoop Distributed File System), which provides a reliable, redundant, distributed file system optimized for large files. The analytics functions are provided by MapReduce, which consists of a Java API as well as software to implement the services that Hadoop needs to function. Hadoop glues the storage and analytics together in a framework that provides reliability, scalability, and management of the data.
  21. Let's look a little deeper at HDFS. Between MapReduce and HDFS, Hadoop supports four different node types (a node is a particular machine on the network). The NameNode and the DataNode are part of the HDFS implementation. Apache Hadoop has one NameNode and multiple DataNodes (there may be a secondary NameNode as well, but we won't consider that here). The NameNode service in Hadoop acts as a regulator/resolver between a client and the various DataNode servers. The NameNode manages the name space by determining which DataNode contains the data requested by the client and redirecting the client to that particular DataNode. DataNodes in HDFS are (oddly enough) where the data is actually stored. Hadoop is "rack aware": that is, the NameNode and the JobTracker node utilize a data structure that determines which DataNode is preferred based on the "network distance" between them. Nodes that are "closer" are preferred (same rack, different rack, same data center). The data itself is replicated across racks: this means that a failure in one rack will not halt data access, at the expense of possibly slower response. Since HDFS isn't suitable for near real-time access, this is acceptable in the majority of cases.
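The "network distance" preference can be illustrated with a toy Python metric modeled on HDFS's same-node / same-rack / same-data-center tiers. The distance values and the (datacenter, rack, host) tuple layout are assumptions made for this sketch, not Hadoop's actual implementation:

```python
def network_distance(client_loc, node_loc):
    """0 = same node, 2 = same rack, 4 = same data center, 6 = remote (toy metric)."""
    dc_c, rack_c, host_c = client_loc
    dc_n, rack_n, host_n = node_loc
    if (dc_c, rack_c, host_c) == (dc_n, rack_n, host_n):
        return 0
    if (dc_c, rack_c) == (dc_n, rack_n):
        return 2
    if dc_c == dc_n:
        return 4
    return 6

def pick_replica(client_loc, replicas):
    """Prefer the replica 'closest' to the client, as the NameNode's policy does."""
    return min(replicas, key=lambda loc: network_distance(client_loc, loc))

# A client on dc1/rack1 prefers the replica in its own rack
client = ("dc1", "r1", "h1")
replicas = [("dc1", "r2", "h9"), ("dc1", "r1", "h2"), ("dc2", "r1", "h1")]
chosen = pick_replica(client, replicas)
```

Because replicas also live in other racks, losing rack r1 entirely would still leave the dc1/r2 copy reachable, just at a higher distance, which is the trade-off the notes describe.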
  22. The MapReduce function within Hadoop depends on two different node types: the JobTracker and the TaskTracker. A JobTracker node exists for each MapReduce implementation. JobTracker nodes are responsible for distributing the Mapper and Reducer functions to available TaskTrackers and monitoring the results, while TaskTracker nodes actually run the jobs and communicate results back to the JobTracker. That communication between nodes is often through files and directories in HDFS, so internode (network) communication is minimized. Let's consider the above example. Initially (1), we have a very large data set containing log files, sensor data or whatnot. HDFS stores replicas of that data (represented here by the blue, yellow and beige icons) across DataNodes. In Step 2, the client defines and executes a map job and a reduce job on a particular data set, and sends them both to the JobTracker, where in Step 3 the jobs are in turn distributed to the TaskTracker nodes. The TaskTracker runs the mapper, and the mapper produces output that is itself stored in the HDFS file system. Lastly, in Step 4, the reduce job runs across the mapped data in order to produce the result. We've deliberately skipped much of the complexity involved in the MapReduce implementation, specifically the steps that provide the "sorted by key" guarantee the MapReduce framework offers to its reducers. Hadoop provides a Web-based GUI for the NameNode, JobTracker and TaskTracker nodes: we'll see more of this in the lab associated with this lesson.
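The map → sort-by-key → reduce contract described above can be emulated in a few lines of plain Python. This is a single-process word count sketch of the programming model, not a real Hadoop job (no HDFS, no JobTracker):

```python
from itertools import groupby

def mapper(line):
    """Map phase: emit (word, 1) for every word in a line of input."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: sum the counts for one key."""
    yield (word, sum(counts))

def run_mapreduce(lines):
    # Map every input record
    mapped = [kv for line in lines for kv in mapper(line)]
    # The framework's "sorted by key" guarantee, done locally with a sort
    mapped.sort(key=lambda kv: kv[0])
    # Feed each key's group of values to the reducer
    out = {}
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        for k, v in reducer(key, (count for _, count in group)):
            out[k] = v
    return out

counts = run_mapreduce(["big data big analytics", "fast data"])
```

In real Hadoop the mapped records are partitioned, shuffled across the network, and merge-sorted before reaching the reducers; the local sort here stands in for that whole machinery.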
  23. In Pig and Hive, the presence of HDFS is very noticeable. Pig, for example, directly supports most of the Hadoop file system commands. Likewise, Hive can access data whether it's local or stored in HDFS. In either case, data can usually be specified via an HDFS URL (hdfs://<namenode>/<path>). In the case of HBase, however, Hadoop is mostly hidden in the HBase framework, and HBase provides data to the client via a programmatic interface (usually Java). Via these interfaces, a Data Scientist can focus on manipulating large datasets without concerning themselves with the inner workings of Hadoop. Of course, a Data Scientist must be aware of the constraints associated with using Hadoop for data storage, but doesn't need to know the exact Hadoop command to check the file system.
  24. Pig is a data flow language and an execution environment to access the MapReduce functionality of Hadoop (as well as HDFS). Pig consists of two main elements: a data flow language called Pig Latin (ig-pay atin-lay), and an execution environment, either as a standalone system or one using HDFS for data storage. A word of caution is in order: if you only want to touch a small portion of a given dataset, then Pig is not for you, since it only knows how to read all the data presented to it. Pig only supports batch processing of data, so if you need an interactive environment, Pig isn't for you.
  25. The Hive system is aimed at the Data Scientist with strong SQL skills. Think of Hive as occupying a space between Pig and a DBMS (although that DBMS doesn't have to be a Relational DBMS [RDBMS]). In Hive, all data is stored in tables. The schema for each table is managed by Hive itself. Tables can be populated via the Hive interface, or a Hive schema can be applied to existing data stored in HDFS.
  26. HBase represents a further layer of abstraction on Hadoop. HBase has been described as "a distributed column-oriented database [data storage system]" built on top of HDFS. Note that HBase is described as managing structured data. Each record in the table can be described as a key (treated as a byte stream) and a set of variables, each of which may be versioned. It's not structured in the same sense that an RDBMS is structured. HBase is a more complex system than what we have seen previously. HBase uses additional Apache Foundation open source frameworks: ZooKeeper is used as a coordination system to maintain consistency, Hadoop for MapReduce and HDFS, and Oozie for workflow management. As a Data Scientist, you probably won't be concerned overmuch with implementation, but it is useful to at least know the names of all the moving parts. HBase can be run from the command line, but also supports REST (Representational State Transfer – think HTTP) and Thrift and Avro interfaces via the Siteserver daemon. Thrift and Avro both provide an interface to send and receive serialized data (objects where the data is "flattened" into a byte stream).
  27. Although HBase may look like a traditional DBMS, it isn't. HBase is a "distributed, column-oriented data storage system that can scale tall (billions of rows), wide (billions of columns), and can be horizontally partitioned and replicated across thousands of commodity servers automatically." The HBase table schemas mirror physical storage for efficiency; an RDBMS schema doesn't (the RDBMS schema is a logical description of the data, and implies no specific physical structuring). Most RDBMS systems require that data be consistent after each transaction (the ACID properties). Data storage systems like HBase don't suffer from these constraints, and instead implement eventual consistency. This means that for some systems you cannot write a value into the database and immediately read it back. Strange, but true. Another of HBase's strengths is its wide-open view of data: HBase will accept almost anything it can cram into an HBase table.
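The "key plus versioned columns" data model can be sketched as a toy Python class. Real HBase stamps versions with timestamps, groups columns into column families, and spreads rows across region servers; this single-process sketch (with an invented counter-based version stamp) only shows the shape of the data model:

```python
import itertools

class VersionedTable:
    """Toy HBase-style table: row key -> column -> list of (version, value)."""
    def __init__(self):
        self.rows = {}
        self._clock = itertools.count(1)   # monotonically increasing version stamps

    def put(self, row_key, column, value):
        versions = self.rows.setdefault(row_key, {}).setdefault(column, [])
        versions.append((next(self._clock), value))

    def get(self, row_key, column, version=None):
        """Newest value by default, or the newest value at or before `version`."""
        versions = self.rows.get(row_key, {}).get(column, [])
        if version is None:
            return versions[-1][1] if versions else None
        for v, value in reversed(versions):
            if v <= version:
                return value
        return None

t = VersionedTable()
t.put("row1", "cf:city", "Pune")     # stored as version 1
t.put("row1", "cf:city", "Mumbai")   # stored as version 2
```

Note how unlike an RDBMS there is no fixed schema: any row can carry any columns, and old versions remain readable, which mirrors HBase's wide-open view of data.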
  28. Mahout is a set of machine learning algorithms that leverages Hadoop to provide both data storage and the MapReduce implementation. The mahout command is itself a script that wraps the Hadoop command and executes a requested algorithm from the Mahout job jar file (jar files are Java ARchives, and are very similar to Linux tar files [tape archives]). Parameters are passed from the command line to the class instance. Mahout mainly supports four use cases. Recommendation mining takes users' behavior and tries to find items users might like; an example of this is LinkedIn's "People You Might Know" (PYMK). Classification learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category. Clustering takes documents and groups them into collections of topically related documents based on word occurrences. Frequent itemset mining takes a set of item groups (for example, terms in a query session, or shopping cart contents) and identifies which individual items usually appear together. If you plan on using Mahout, remember that these distributions (Hadoop and Mahout) anticipate running on a *nix machine, although a Cygwin environment on Windows will work as well (or rewriting the command scripts in another language, say as a batch file on Windows). It goes without saying that a compatible working version of Hadoop is required. Lastly, Mahout requires that you program in Java: no other interface outside of the command line is supported.
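Of the four use cases, frequent itemset mining is the easiest to illustrate without Mahout itself. Here is a single-machine Python sketch of pairwise co-occurrence counting over shopping baskets; Mahout's distributed algorithms are far more scalable, and the basket data below is invented for the example:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(baskets, min_support):
    """Count item pairs that co-occur in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) dedupes items and makes each pair's key canonical
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

baskets = [["milk", "bread", "eggs"], ["milk", "bread"], ["bread", "eggs"]]
pairs = frequent_pairs(baskets, min_support=2)
```

A retailer would read the result as "bread and milk (and bread and eggs) usually appear together," exactly the kind of signal frequent itemset mining surfaces, just at vastly larger scale.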
  29. Greenplum Database utilizes a shared-nothing, massively parallel processing (MPP) architecture that has been designed for complex business intelligence (BI) and analytical processing. Most of today's general-purpose relational database management systems are designed for Online Transaction Processing (OLTP) applications. The reality is that BI and analytical workloads are fundamentally different from OLTP transaction workloads and require a profoundly different architecture. The Greenplum Database is fully parallel and highly optimized for executing both SQL and MapReduce queries. Additionally, the system offers a new level of parallel analysis capabilities for data scientists, with support for SAS, R, linear algebra, and machine learning primitives, and includes extensibility for functions written in Java, C, Perl, or Python. Because of the shared-nothing MPP architecture, the system is linearly scalable: simply add additional nodes and the database's performance and capacity improve. Expansions are online, keeping the database available for production workloads.
  30. Logical depiction (top portion): logically, gNet enables data in multiple formats, residing in the Hadoop HDFS file system, to be used as though it were a table in Greenplum Database. This is the essence of co-processing: we can select, filter, join, modify, and aggregate, essentially all normal SQL operations, on the combination of RDBMS data in Greenplum Database and data stored in Hadoop, as though all data were in the database. The results are: Real-Time: fast access to new data as it arrives, with no waiting for reformatting and periodic movement processes to copy data into the database. Space-Efficiency: no duplication of data; big data makes any plan to duplicate data very expensive, even on so-called "cheap storage". Query Efficiency: movement of frequently-accessed data, where moving it for local access in the database results in a desirable reduction in gNet traffic. Archival: information lifecycles where data arrives in one platform but, as it ages, is moved to another platform to achieve lower cost of retention. Consider the cost of HDFS storage: it's low, so some customers will generate and manipulate data in the database for simplicity, but archive the data in Hadoop. With co-processing over gNet, the data remains available even after it's been archived in HDFS files.
  31. The Greenplum Database was conceived, designed and engineered to allow customers to take advantage of large clusters of increasingly powerful and economical general purpose servers, storage and ethernet switches. With this approach, EMC Greenplum customers can gain immediate benefit from the industry’s latest computing innovations.Greenplum’s MPP shared-nothing architecture delivers industry-leading performance in big data. You can compare the impact to finding a specific card—let’s say the Ace of Spades—in a deck. If you do it yourself, it could take you up to 52 tries to find the Ace of Spades. If you distribute it to 26 people, it will only take up to 2 tries. Likewise, Greenplum distributes processing across nodes—and these nodes work independently and in parallel to quickly deliver answers.
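The Ace of Spades analogy can be mimicked with a toy Python partition-and-scan, where each worker thread plays the role of a segment scanning its own shard of the deck. Threads on one machine are not a real shared-nothing cluster, and the deck contents below are invented, but the divide-the-work idea is the same:

```python
from concurrent.futures import ThreadPoolExecutor

def find_card(deck, target, workers=4):
    """Partition the deck across workers; each scans its own shard independently."""
    shard = len(deck) // workers + 1
    shards = [deck[i:i + shard] for i in range(0, len(deck), shard)]

    def scan(args):
        offset, cards = args               # one worker's shard and its global offset
        for i, card in enumerate(cards):
            if card == target:
                return offset + i          # global position of the found card
        return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [(i * shard, s) for i, s in enumerate(shards)]
        for result in pool.map(scan, jobs):
            if result is not None:
                return result
    return None

deck = [f"card{i}" for i in range(52)]
deck[30] = "Ace of Spades"
position = find_card(deck, "Ace of Spades")
```

With four workers, each shard is at most 14 cards, so the worst-case scan per worker drops from 52 comparisons to 14, which is the 52-tries-versus-2-tries intuition from the slide.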
  32. Please take a moment to answer these questions. Record your answers here.
  33. As background, it is important to understand that Business Intelligence is different than data science and analytics. BI deals with reporting on history. What happened last quarter? How many did we sell, etc.Data science is about predicting the future and understanding why things happen. What is the optimal solution? What will happen next?For many companies data science is a new approach to understanding the business yet an important one to undertake today.
  34. Here are 5 main competency and behavioral characteristics for Data Scientists. Quantitative skills, such as mathematics or statistics. Technical aptitude, such as software engineering, machine learning, and programming skills. Skeptical: this may be a counterintuitive trait, although it is important that data scientists can examine their work critically rather than in a one-sided way. Curious & Creative: data scientists must be passionate about data and finding creative ways to solve problems and portray information. Communicative & Collaborative: it is not enough to have strong quantitative skills or engineering skills. To make a project resonate, you must be able to articulate the business value in a clear way, and work collaboratively with project sponsors and key stakeholders.
35. In using Greenplum as the foundation for lab work, we've started to converge on a standard set of tools for the various stages of our analyses. For data cleansing and transformation, we do most of our work in SQL; MapReduce is also useful, especially for unstructured data. For data exploration we also use SQL, as well as R, which is particularly useful for generating summary statistics, analyzing significance, and plotting data visualizations such as frequency distributions, densities, and scatter plots. For model building, we typically use R. It operates very well on file extracts, but these can be cumbersome and slow down the modeling process, so it is also useful to read data directly from the database into dataframes via RPostgreSQL (which uses the RDbi interface and is therefore considerably faster than RODBC). For very large data sets, it is often best to use Greenplum's built-in SQL analytics and the Analytics Library. Models built in R can be executed on file extracts, but in most cases it's desirable to run them on a complete set of records in the database. In that case, they can run in-database as PL/R after a simple conversion, and for optimal performance they can be converted to SQL. In many cases we work with legacy models that were built in SAS. We are developing methods to convert these to SQL or PL/Java, and we are also working with SAS Engineering to co-develop 'Accelerator' functions.
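The note above favors computing summary statistics inside the database rather than on exported file extracts. A minimal runnable sketch of that pattern, using Python's built-in sqlite3 purely as a stand-in (Greenplum runs this kind of SQL in parallel across nodes; the table and column names here are invented for illustration):

```python
import sqlite3

# In-memory SQLite stands in for the analytic database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO txn VALUES (?, ?)", [
    ("alice", 10.0), ("alice", 30.0), ("bob", 20.0),
    ("bob", 20.0), ("carol", 5.0),
])

# Summary statistics per customer, computed inside the database rather
# than on a file extract pulled into the modeling environment.
rows = conn.execute("""
    SELECT customer, COUNT(*) AS n, AVG(amount) AS mean_amount
    FROM txn
    GROUP BY customer
    ORDER BY customer
""").fetchall()
print(rows)  # → [('alice', 2, 20.0), ('bob', 2, 20.0), ('carol', 1, 5.0)]
```

The same GROUP BY aggregation is what RPostgreSQL or PL/R would push down to Greenplum, so only the small result set, not the raw records, crosses into R.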
43. We can show an animated view of color-coded traffic volumes on Google Earth over a user-specified period. The file that produces the animation is created within Greenplum. The Google Maps display is similar to this, but it only provides traffic volume at a specific time.
44. Eight banks become one, with branches across the US and a consolidation of products and customers. Employees are faced with new products and new customers, and old does not necessarily equal new. What should we recommend to customers? A recommendation needs to make the bank money and it needs to make the customer money. Overlap with existing products is challenging, and the cost of acquiring a new customer is significantly higher than the cost of selling additional products to existing customers.
45. Here's an example in which we used clustering techniques (grouping similar objects together) and a form of "market basket analysis" (if you bought one set of products, you might be interested in another) to create a simple product recommendation engine. First, we defined a measurement of customer value. (This particular customer already had a way of computing that, but it took 20 hours to run in a separate database. It now runs in Greenplum in less than an hour, so they run it regularly as part of their ETL process.) Next, we created groups of customers based on product usage. We did this by defining a "distance" between customers so that those who owned a similar assortment of products would be measured as being close. We then used this notion of distance to identify clusters of customers.
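The note does not specify which distance or clustering algorithm was used. As one plausible sketch, Jaccard distance over each customer's product set gives exactly the property described (customers owning similar assortments of products measure as close), and even a greedy single-pass grouping then separates the segments; the customer names and products below are invented:

```python
def jaccard_distance(a, b):
    """Distance between two customers' product sets:
    0 = identical holdings, 1 = no products in common."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def cluster(customers, threshold=0.5):
    """Greedy single-pass clustering: attach each customer to the first
    cluster whose representative is within `threshold`, else start a new one."""
    clusters = []  # list of (representative product set, member names)
    for name, products in customers.items():
        for rep, members in clusters:
            if jaccard_distance(products, rep) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((set(products), [name]))
    return [members for _, members in clusters]

customers = {
    "alice": {"checking", "savings", "credit card"},
    "bob":   {"checking", "savings"},
    "carol": {"mortgage", "brokerage"},
}
print(cluster(customers))  # → [['alice', 'bob'], ['carol']]
```

At scale, the same pairwise distances can be computed in-database with SQL joins over a (customer, product) table, which is closer to how this would run inside Greenplum.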
46. Then we used various methods, including "association rules" (the technique behind market basket analysis on sites such as Amazon), to identify common product associations. In other words, by looking at product usage across millions of customers, we found that certain groups of products tended to occur together. By restricting our analysis to a certain segment of the population (in this case, based on customer value), we were more likely to find product groupings that made sense for that customer segment.
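Association-rule mining boils down to two statistics: support (how often two products occur together) and confidence (given one, how likely the other). A minimal pairwise sketch, with invented example baskets standing in for the millions of real customer records:

```python
from itertools import combinations
from collections import Counter

def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Mine pairwise association rules A -> B.

    support(A, B)    = fraction of baskets containing both A and B
    confidence(A->B) = count(A and B) / count(A)
    """
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(pair for b in baskets
                          for pair in combinations(sorted(set(b)), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # product pair is too rare to act on
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, round(support, 2), round(confidence, 2)))
    return sorted(rules)

baskets = [
    {"checking", "savings", "credit card"},
    {"checking", "savings"},
    {"checking", "credit card"},
    {"savings", "credit card"},
    {"checking", "savings", "mortgage"},
]
for rule in pair_rules(baskets):
    print(rule)
```

Restricting `baskets` to one customer-value segment before mining, as the note describes, is what keeps the surviving rules relevant to that segment.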
47. We used these results to make product recommendations. For a given customer, we used the product associations to determine which new products made sense. Then we filtered out products that were disproportionately associated with customers of lower value. The remaining products were then more likely to move the customer into a higher value segment. The client referred to this as "filling incomplete baskets". Verticals: this applies to any organization that advertises to a sufficiently large number of customers.
48. Modern applications need to respond faster and capture more information so the business can perform the analysis needed to make the best business decisions. By combining the best online transaction processing (OLTP) product and the best online analytical processing (OLAP) product, we can create a platform that enables businesses to make the best use of both historical and real-time data. By utilizing the strengths of both OLTP and OLAP systems, we can create a platform in which each covers the other's weaknesses. Traditionally, OLAP databases excel at handling petabytes of information but are not geared for fine-grained, low-latency access. Similarly, OLTP databases excel at fine-grained, low-latency access but may fall short on large-scale data sets with ad-hoc queries. To solve the OLTP aspect of this problem we have chosen vFabric SQLFire, a memory-optimized, shared-nothing, distributed SQL database delivering dynamic scalability and high performance for data-intensive modern applications. SQLFire's memory-optimized architecture minimizes time spent waiting for disk access, the main performance bottleneck in traditional databases. SQLFire achieves dramatic scaling by pooling memory, CPU, and network bandwidth across a cluster of machines, and it can manage data across geographies. For the OLAP aspect we will be looking at EMC Greenplum. Built to support big data analytics, Greenplum Database manages, stores, and analyzes terabytes to petabytes of data. Users experience 10 to 100 times better performance over traditional RDBMS products, a result of Greenplum's shared-nothing massively parallel processing architecture, high-performance parallel dataflow engine, and advanced gNet software interconnect technology.
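The OLTP-plus-OLAP split described above can be sketched as a facade that routes each kind of work to the tier suited for it. This toy uses two in-memory SQLite databases purely as stand-ins (SQLFire and Greenplum are distributed systems, not embedded files, and a real pipeline would feed the OLAP tier asynchronously rather than dual-writing); all table and class names are invented:

```python
import sqlite3

class HybridDataPlatform:
    """Toy routing facade for the OLTP + OLAP pattern described above."""

    def __init__(self):
        self.oltp = sqlite3.connect(":memory:")  # stand-in for SQLFire
        self.olap = sqlite3.connect(":memory:")  # stand-in for Greenplum
        for db in (self.oltp, self.olap):
            db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

    def record_order(self, order_id, amount):
        # Writes hit the low-latency OLTP tier first ...
        self.oltp.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        # ... and flow to the OLAP tier for analysis (here a naive dual
        # write; in production this would be an asynchronous feed).
        self.olap.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))

    def lookup(self, order_id):
        # Fine-grained point read: served by the OLTP tier.
        row = self.oltp.execute(
            "SELECT amount FROM orders WHERE id = ?", (order_id,)).fetchone()
        return row[0] if row else None

    def total_revenue(self):
        # Ad-hoc aggregate over all history: served by the OLAP tier.
        return self.olap.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

platform = HybridDataPlatform()
platform.record_order(1, 99.0)
platform.record_order(2, 1.0)
print(platform.lookup(1), platform.total_revenue())  # → 99.0 100.0
```

The point of the sketch is the routing decision, point operations go to the memory-optimized tier while scans and aggregates go to the analytical tier, which is the complementary-strengths argument the slide makes.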