Let’s start by looking at some of the pioneers in the big data space. These well-known, highly valuable enterprises have built their businesses on Big Data. The numbers they support are staggering.
But Big Data is for more than just internet companies. This slide shows some Greenplum customers who are leveraging big data to transform their businesses and drive new revenue streams. We will talk about these in more detail today.
Think about what Big Data is for a moment. Share your thoughts with the group and write your notes in the space below.
- Is there a size threshold over which data becomes Big Data?
- How much does the complexity of its structure influence the designation as Big Data?
- How new are the analytical techniques?
There are multiple characteristics of big data, but three stand out as defining characteristics:
- Huge volume of data (for instance, tools that can manage billions of rows and billions of columns)
- Complexity of data types and structures, with an increasing volume of unstructured data (80-90% of the data in existence is unstructured), part of the Digital Shadow or “data exhaust”
- Speed, or velocity, of new data creation
In addition, the data, due to its size or level of structure, cannot be efficiently analyzed using only traditional databases or methods.
There are many examples of emerging big data opportunities and solutions. Netflix suggesting your next movie rental, dynamic monitoring of embedded sensors in bridges to detect real-time stresses and longer-term erosion, and retailers analyzing digital video streams to optimize product and display layouts and promotional spaces on a store-by-store basis are a few real examples of how big data is involved in our lives today. These kinds of big data problems require new tools and technologies to store, manage, and realize the business benefit. The new architectures they necessitate are supported by new tools, processes, and procedures that enable organizations to create, manipulate, and manage these very large data sets and the storage environments that house them.
Big data can come in multiple forms, from highly structured financial data, to text files, to multimedia files and genetic mappings. High volume is a consistent characteristic of big data. As a corollary, because of the complexity of the data itself, the preferred approach for processing big data is parallel computing environments and Massively Parallel Processing (MPP), which enable simultaneous, parallel ingest, data loading, and analysis. As we will see in the next slide, most big data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze. Let us examine the most prominent characteristic: its structure.
The graphic shows different types of data structures, with 80-90% of future data growth coming from unstructured data types (semi-, quasi-, and unstructured). Although the image shows four different, separate types of data, in reality these can be mixed together at times. For instance, you may have a classic RDBMS storing call logs for a software support call center. In this case, you may have typical structured data such as date/time stamps, machine types, problem type, and operating system, probably entered by the support desk person from a pull-down menu GUI. In addition, you will likely have unstructured or semi-structured data, such as free-form call log information taken from an email ticket of the problem or an actual phone call description of a technical problem and a solution. The most salient information is often hidden in there. Another possibility would be voice logs or audio transcripts of the actual call that might be associated with the structured data. Until recently, most analysts would NOT be able to analyze this unstructured textual information in the call log history, since mining it is very labor intensive and could not be easily automated.
Here are examples of what each of the four main types of data structures may look like. People tend to be most familiar with analyzing structured data, while semi-structured data (shown as XML here), quasi-structured data (shown as a clickstream string), and unstructured data present different challenges and require different techniques to analyze.
For each data type shown, answer these questions:
- What type of analytics are performed on these data?
- Who analyzes this kind of data?
- What types of data repositories are suited for each, or what requirements may you have for storing and cataloguing this kind of data?
- Who consumes the data?
- Who manages and owns the data?
Here are four examples of common business problems that organizations contend with today, where they have an opportunity to leverage advanced analytics to create competitive advantage. Rather than doing standard reporting on these areas, organizations can apply advanced analytical techniques to optimize processes and derive more value from these typical tasks. The first three examples are not new problems: companies have been trying to reduce customer churn, increase sales, and cross-sell customers for many years. What’s new is the opportunity to fuse advanced analytical techniques with big data to produce more impactful analyses for these old problems. The fourth example portrays emerging regulatory requirements. Many compliance and regulatory laws have been in existence for decades, but additional requirements are added every year, which means additional complexity and data requirements for organizations. These laws, such as anti-money laundering and fraud prevention, require advanced analytical techniques to manage well.
The graphic shows a typical data warehouse and some of the challenges that it presents.
(1) For source data to be loaded into the EDW, data needs to be well understood, structured, and normalized with the appropriate data type definitions. While this kind of centralization enables organizations to enjoy the benefits of security, backup, and failover of highly critical data, it also means that data must go through significant pre-processing and checkpoints before it can enter this sort of controlled environment, which does not lend itself to data exploration and iterative analytics.
(2) As a result of this level of control on the EDW, shadow systems emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis. These local data marts do not have the same constraints for security and structure as the EDW does, and allow users across the enterprise to do some level of analysis. However, these one-off systems reside in isolation, often are not networked or connected to other data stores, and are generally not backed up.
(3) Once in the data warehouse, data is fed to enterprise applications for business intelligence and reporting purposes. These are high-priority operational processes getting critical data feeds from the EDW.
(4) At the end of this workflow, analysts get data provisioned for their downstream analytics. Since users cannot run custom or intensive analytics on production databases, analysts create data extracts from the EDW to analyze offline in R or other local analytical tools. Many times these tools are limited to in-memory analytics on desktops, analyzing samples of data rather than the entire population of a data set. Because these analyses are based on data extracts, they live in a separate location, and the results of the analysis, along with any insights on the quality of the data or anomalies, are rarely fed back into the main EDW repository.
Lastly, because of the rigorous validation and data structuring process, data is slow to move into the EDW and the schema is slow to change. An EDW may have been originally designed for a specific purpose and set of business needs, but over time it evolves to house more and more data and to enable business intelligence and the creation of OLAP cubes for analysis and reporting. The EDW provides limited means to accomplish these goals, achieving the objective of reporting, and sometimes the creation of dashboards, but generally limiting the ability of analysts to iterate on the data in a separate environment from the production environment, where they could conduct in-depth analytics or perform analysis on unstructured data.
Today’s typical data architectures were designed for storing mission critical data, supporting enterprise applications, and enabling enterprise level reporting. These functions are still critical for organizations, although these architectures inhibit data exploration and more sophisticated analysis.
(Describe or refer to NoSQL and key-value pair [KVP] stores here.)
Everyone and everything is leaving a digital footprint. The graphic above provides a perspective on sources of big data generated by new applications and the scale and growth rate of the data. These applications provide opportunities for new analytics and for driving value for organizations. These data come from multiple sources, including:
- Medical information, such as genomic sequencing and MRIs
- Increased use of broadband on the Web, including the 2 billion photos each month that Facebook users currently upload as well as the innumerable videos uploaded to YouTube and other multimedia sites
- Video surveillance
- Increased global use of mobile devices; the torrent of texting is not likely to cease
- Smart devices: sensor-based collection of information from smart electric grids, smart buildings, and much other public and industrial infrastructure
- Non-traditional IT devices, including RFID readers, GPS navigation systems, and seismic processing
The Big Data trend is generating an enormous amount of information that requires advanced analytics, and new market players are emerging to take advantage of it.
Big data projects carry with them several considerations that you need to keep in mind to ensure this approach fits with what you are trying to achieve. Because of the characteristics of big data, these projects lend themselves to decision support for high-value, strategic decision making with high processing complexity. The analytic techniques used in this context need to be iterative and flexible (analysis flexibility), due to the high volume of data and its complexity. These conditions give rise to complex analytical projects (such as predicting customer churn rates) that can be performed with some latency (consider the speed of decision making needed), or to operationalizing these analytical techniques using a combination of advanced analytical methods, big data, and machine learning algorithms to provide real-time (requiring high throughput) or near real-time analysis, such as recommendation engines that look at your recent web history and purchasing behavior.
In addition, to be successful you will need a different approach to the data architecture than is seen in today’s typical EDWs. Analysts need to partner with IT and DBAs to get the data they need within an analytic sandbox, which contains raw data, aggregated data, and data with multiple kinds of structure. The sandbox requires a more savvy user to take advantage of it and leverage it for exploring data in a more robust way.
The loan process has been honed to a science over the past several decades. Unfortunately, today’s realities require that lenders take more care to make better decisions with fewer resources than they’ve had in the past. The typical loan process uses a set of data on which pre-approval and underwriting approval is based, including:
- Income data, such as pay and income tax records
- Employment history, to establish the ability to meet loan obligations
- Credit history, including credit scores and outstanding debt
- Appraisal data associated with the asset for which the loan is made (such as a home, boat, or car)
This model works, but it’s not perfect. In fact, the loan crisis in the US is proof that using only these data points may not be enough to gauge the risk associated with making sound lending decisions and pricing loans properly.
Case Study Exercise:
Objectives
- Using additional data sources, dramatically improve the quality of the loan underwriting process
- Streamline the process to yield results in less time
Directions
- Suggest kinds of publicly available data (big data) that you can leverage to supplement the traditional lending process
- Suggest types of analysis you would perform with the data to reduce the bank’s risk and expedite the lending process
This is the standard format we will use for each representative example.
Check http://wiki.apache.org/hadoop/PoweredBy for examples of how people are using Hadoop.
Check this article on large-scale image conversion: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
Check this for an ad for a “computer” from 1892: http://query.nytimes.com/mem/archive-free/pdf?res=9F07E0D81438E233A25751C0A9639C94639ED7CF
Use the space here to record your answers to these questions:
Greenplum is driving the future of Big Data analytics with the industry’s first Unified Analytics Platform (UAP), which delivers:
- Our award-winning Greenplum Database for structured data
- Our enterprise Hadoop offering, Greenplum HD, for the analysis and processing of unstructured data
- Greenplum Chorus, which acts as the productivity layer for the data science team
Greenplum UAP is more than just integrated software working together; it is a single, unified platform enabling powerful and agile analytics that can transform how your organization uses data.
What sets this diagram apart from a typical vendor example is the inclusion of people. That is not a mistake. We have introduced the Unified Analytics Platform, but there is more to the story than technology, and I will talk more about that in a few minutes. UAP is about enabling an emerging group of talent, the new practitioners that we refer to as the Data Science team. This team can include the data platform administrator, data scientist, analysts, engineers, BI teams, and, most importantly, the line-of-business user and how they all participate on the data science team.
We develop, package, and support this as a unified software platform available over your favorite commodity hardware, cloud infrastructure, or from our modular Data Computing Appliance.
Moore’s Law (named after Gordon Moore, the co-founder of Intel) states that the number of transistors that can be placed in a processor will double approximately every two years, for half the cost. But trends in chip design are changing to face new realities. While we can still double the number of transistors per unit area at this pace, this does not necessarily result in faster single-threaded performance. Newer processors, such as the Intel Core 2 and Itanium 2 architectures, instead focus on embedding many smaller CPUs or "cores" onto the same physical device. This allows multiple threads to process twice as much data in parallel, but at the same speed at which they operated previously.
Greenplum Database’s strengths are on the structured side of the house; its functionality is built around the fact that the data is structured. With GP MapReduce and large text objects, Greenplum Database is also able to do some things that are considered unstructured data analysis.
Unfortunately, people may use the word “Hadoop” to mean multiple things. They may use it to describe the MapReduce paradigm, or they may use it to describe massive unstructured data storage using commodity hardware (although commodity doesn’t mean inexpensive). They may be referring to the Java classes provided by Hadoop that support HDFS file types or provide MapReduce job management, or to HDFS itself, the Hadoop Distributed File System. And they might mean both HDFS and MapReduce.
The point is that Hadoop enables the Data Scientist to create MapReduce jobs quickly and efficiently. As we shall see, one can utilize Hadoop at multiple levels: writing MapReduce modules in Java, leveraging streaming mode to write such functions in one of several scripting languages, or utilizing a higher-level interface such as Pig or Hive. The Web site http://hadoop.apache.org/ provides a solid foundation for unstructured data mining and management.
So what exactly is Hadoop, anyway? The quick answer is that Hadoop is a framework for performing Big Data analytics, and as such is an implementation of the MapReduce programming model. Hadoop comprises two main components: HDFS (the Hadoop Distributed File System), which provides a reliable, redundant, distributed file system optimized for large files, and MapReduce, which provides the analytics functions and consists of a Java API as well as software to implement the services that Hadoop needs to function. Hadoop glues the storage and analytics together in a framework that provides reliability, scalability, and management of the data.
Let’s look a little deeper at HDFS. Between MapReduce and HDFS, Hadoop supports four different node types (a node is a particular machine on the network). The NameNode and the DataNode are part of the HDFS implementation. Apache Hadoop has one NameNode and multiple DataNodes (there may be a secondary NameNode as well, but we won’t consider that here). The NameNode service acts as a regulator/resolver between a client and the various DataNode servers: it manages the namespace by determining which DataNode contains the data requested by the client and redirecting the client to that particular DataNode. DataNodes in HDFS are (oddly enough) where the data is actually stored.
Hadoop is “rack aware”: that is, the NameNode and the JobTracker node utilize a data structure that determines which DataNode is preferred based on the “network distance” between them. Nodes that are “closer” are preferred (same rack, then different rack, then different datacenter). The data itself is replicated across racks, which means that a failure in one rack will not halt data access, at the expense of possibly slower response. Since HDFS isn’t suitable for near real-time access, this is acceptable in the majority of cases.
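The “network distance” preference described above can be sketched in a few lines. This is a simplified, hypothetical illustration in Python (the real logic lives in Hadoop’s Java networking code); the helper names and node paths are made up for the example:

```python
# Simplified sketch of Hadoop-style "rack awareness". Each node is
# identified by a path like "/datacenter/rack/node"; distance is the
# number of hops up to the closest common ancestor and back down.

def network_distance(node_a: str, node_b: str) -> int:
    """Count hops between two nodes in the topology tree."""
    a_parts = node_a.strip("/").split("/")
    b_parts = node_b.strip("/").split("/")
    common = 0
    for x, y in zip(a_parts, b_parts):
        if x != y:
            break
        common += 1
    return (len(a_parts) - common) + (len(b_parts) - common)

def preferred_datanode(client: str, replicas: list) -> str:
    """Pick the replica 'closest' to the client by network distance."""
    return min(replicas, key=lambda node: network_distance(client, node))

client = "/dc1/rack1/node1"
replicas = ["/dc1/rack1/node2", "/dc1/rack2/node5", "/dc2/rack9/node3"]
print(preferred_datanode(client, replicas))  # the same-rack replica
```

Same-rack nodes end up at distance 2, same-datacenter nodes at 4, and cross-datacenter nodes at 6, which reproduces the same-rack, different-rack, different-datacenter preference order.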
MapReduce within Hadoop depends on two further node types: the JobTracker and the TaskTracker. One JobTracker node exists for each MapReduce implementation. The JobTracker is responsible for distributing the Mapper and Reducer functions to available TaskTrackers and monitoring the results, while TaskTracker nodes actually run the jobs and communicate results back to the JobTracker. Communication between nodes is often through files and directories in HDFS, so internode (network) communication is minimized.
Let’s consider the above example. Initially (1), we have a very large data set containing log files, sensor data, or whatnot. HDFS stores replicas of that data (represented here by the blue, yellow, and beige icons) across DataNodes. In Step 2, the client defines a map job and a reduce job on a particular data set and sends them both to the JobTracker, where, in Step 3, the jobs are in turn distributed to the TaskTracker nodes. Each TaskTracker runs the mapper, and the mapper produces output that is itself stored in the HDFS file system. Lastly, in Step 4, the reduce job runs across the mapped data to produce the result.
We’ve deliberately skipped much of the complexity involved in the MapReduce implementation, specifically the steps that provide the “sorted by key” guarantee the MapReduce framework offers to its reducers. Hadoop provides a Web-based GUI for the NameNode, JobTracker, and TaskTracker nodes; we’ll see more of this in the lab associated with this lesson.
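The map, sort-by-key, and reduce phases just described can be imitated in a single process. This toy sketch (a token count over log records) is purely illustrative; in real Hadoop the JobTracker distributes these phases across TaskTrackers rather than running them in one loop:

```python
# Toy, single-process simulation of the MapReduce flow:
# map -> shuffle/sort by key -> reduce.
from itertools import groupby
from operator import itemgetter

def mapper(record: str):
    # Emit a (token, 1) pair for every token in the input record.
    for token in record.split():
        yield (token.lower(), 1)

def reducer(key: str, values):
    # Sum the counts for one key.
    return (key, sum(values))

def run_job(records):
    # Map phase: apply the mapper to every record.
    intermediate = [pair for rec in records for pair in mapper(rec)]
    # Shuffle/sort phase: the framework guarantees reducers see sorted keys.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(k, (v for _, v in group))
            for k, group in groupby(intermediate, key=itemgetter(0))]

logs = ["error warn error", "info error warn"]
print(run_job(logs))  # [('error', 3), ('info', 1), ('warn', 2)]
```

The explicit sort before `groupby` is the stand-in for the shuffle step that gives reducers their “sorted by key” guarantee.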
In Pig and Hive, the presence of HDFS is very noticeable. Pig, for example, directly supports most of the Hadoop file system commands. Likewise, Hive can access data whether it’s local or stored in HDFS. In either case, data can usually be specified via an HDFS URL (hdfs://<namenode>/<path>). In the case of HBase, however, Hadoop is mostly hidden in the HBase framework, and HBase provides data to the client via a programmatic interface (usually Java). Via these interfaces, a Data Scientist can focus on manipulating large datasets without concerning themselves with the inner workings of Hadoop. Of course, a Data Scientist must be aware of the constraints associated with using Hadoop for data storage, but doesn’t need to know the exact Hadoop command to check the file system.
Pig is a data flow language and an execution environment for accessing the MapReduce functionality of Hadoop (as well as HDFS). Pig consists of two main elements:
- A data flow language called Pig Latin (ig-pay atin-lay), and
- An execution environment, either as a standalone system or one using HDFS for data storage.
A word of caution is in order: if you only want to touch a small portion of a given dataset, then Pig is not for you, since it only knows how to read all the data presented to it. Pig also only supports batch processing of data, so if you need an interactive environment, Pig isn’t for you either.
The Hive system is aimed at the Data Scientist with strong SQL skills. Think of Hive as occupying a space between Pig and a DBMS (although that DBMS doesn’t have to be a relational DBMS [RDBMS]). In Hive, all data is stored in tables, and the schema for each table is managed by Hive itself. Tables can be populated via the Hive interface, or a Hive schema can be applied to existing data stored in HDFS.
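The idea of applying a schema to data that already exists in HDFS (often called “schema on read”) can be sketched as follows. This is an illustrative toy in Python, not Hive itself; the field names, types, and tab delimiter are made-up assumptions:

```python
# Sketch of Hive-style "schema on read": a declared schema is applied to
# raw delimited text at query time rather than at load time.
# The schema below (field names and types) is hypothetical.
schema = [("ts", int), ("level", str), ("message", str)]

def apply_schema(line: str, schema):
    """Parse one raw tab-delimited line into a typed row dict."""
    fields = line.split("\t")
    return {name: cast(raw) for (name, cast), raw in zip(schema, fields)}

raw_rows = ["1700000000\tERROR\tdisk full", "1700000005\tINFO\tok"]
table = [apply_schema(row, schema) for row in raw_rows]
print(table[0]["level"])  # ERROR
```

The raw file is never rewritten; only the reader’s declared schema gives it tabular structure, which is the property that lets Hive layer tables over data already sitting in HDFS.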
HBase represents a further layer of abstraction on Hadoop. HBase has been described as “a distributed column-oriented database [data storage system]” built on top of HDFS. Note that HBase is described as managing structured data: each record in a table can be described as a key (treated as a byte stream) and a set of variables, each of which may be versioned. It’s not structured in the same sense as an RDBMS is structured.
HBase is a more complex system than what we have seen previously. It uses additional Apache Foundation open source frameworks: ZooKeeper as a coordination system to maintain consistency, Hadoop for MapReduce and HDFS, and Oozie for workflow management. As a Data Scientist, you probably won’t be concerned overmuch with the implementation, but it is useful to at least know the names of all the moving parts. HBase can be run from the command line, but it also supports REST (Representational State Transfer; think HTTP) as well as Thrift and Avro interfaces via the Siteserver daemon. Thrift and Avro both provide an interface to send and receive serialized data (objects whose data is “flattened” into a byte stream).
Although HBase may look like a traditional DBMS, it isn’t. HBase is a “distributed, column-oriented data storage system that can scale tall (billions of rows), wide (billions of columns), and can be horizontally partitioned and replicated across thousands of commodity servers automatically.” The HBase table schemas mirror physical storage for efficiency; an RDBMS schema doesn’t (it is a logical description of the data and implies no specific physical structuring). Most RDBMS systems require that data be consistent after each transaction (the ACID properties). Systems like HBase don’t suffer from these constraints and instead implement eventual consistency, which means that on some systems you cannot write a value into the database and immediately read it back. Strange, but true. Another of HBase’s strengths is its wide-open view of data: HBase will accept almost anything it can cram into an HBase table.
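The record shape described above (a row key plus a set of versioned variables) can be sketched as a small in-memory structure. This is a hypothetical stand-in for illustration only, not the real HBase client API:

```python
# Minimal sketch of the HBase data model: each row key maps to
# column -> list of (timestamp, value) versions.
import time

class VersionedTable:
    def __init__(self):
        # row key (a bytes-like key) -> {column: [(ts, value), ...]}
        self.rows = {}

    def put(self, row, column, value, ts=None):
        # Append a new version rather than overwriting the old one.
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts if ts is not None else time.time(), value))

    def get(self, row, column):
        # Return the most recent version, like a default HBase Get.
        versions = self.rows[row][column]
        return max(versions, key=lambda tv: tv[0])[1]

t = VersionedTable()
t.put(b"user42", "email", "old@example.com", ts=1)
t.put(b"user42", "email", "new@example.com", ts=2)
print(t.get(b"user42", "email"))  # new@example.com
```

Note how both versions of the cell remain stored; reads simply pick the newest timestamp, which is quite unlike the update-in-place model of a typical RDBMS.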
Mahout is a set of machine learning algorithms that leverages Hadoop for both data storage and the MapReduce implementation. The mahout command is itself a script that wraps the Hadoop command and executes a requested algorithm from the Mahout job JAR file (JAR files are Java ARchives, and are very similar to Linux tar files [tape archives]). Parameters are passed from the command line to the class instance. Mahout mainly supports four use cases:
- Recommendation mining takes users’ behavior and tries to find items users might like. An example of this is LinkedIn’s “People You May Know” (PYMK).
- Classification learns, from existing categorized documents, what documents of a specific category look like, and is able to assign unlabelled documents to the (hopefully) correct category.
- Clustering takes documents and groups them into collections of topically related documents based on word occurrences.
- Frequent itemset mining takes a set of item groups (for example, terms in a query session or shopping cart contents) and identifies which individual items usually appear together.
If you plan on using Mahout, remember that these distributions (Hadoop and Mahout) anticipate running on a *nix machine, although a Cygwin environment on Windows will work as well (as will rewriting the command scripts in another language, say as a batch file on Windows). It goes without saying that a compatible working version of Hadoop is required. Lastly, Mahout requires that you program in Java: no interface other than the command line is supported.
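To get a feel for the recommendation-mining use case, here is a toy item co-occurrence recommender. Mahout runs this kind of computation over Hadoop at scale; this in-memory Python version, with made-up basket data, is purely illustrative:

```python
# Toy co-occurrence recommender: recommend items that frequently
# appear in the same basket as items the user already has.
from collections import Counter
from itertools import combinations

baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
]

def recommend(user_items, baskets, top_n=2):
    # Count how often each pair of items co-occurs in a basket.
    cooccur = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            cooccur[(a, b)] += 1
            cooccur[(b, a)] += 1
    # Score candidate items by co-occurrence with the user's items.
    scores = Counter()
    for item in user_items:
        for (a, b), count in cooccur.items():
            if a == item and b not in user_items:
                scores[b] += count
    return [item for item, _ in scores.most_common(top_n)]

print(recommend({"milk"}, baskets))  # ['bread', 'eggs']
```

A user who has only “milk” gets “bread” first (co-occurs twice) and “eggs” second (co-occurs once); Mahout’s distributed recommenders follow the same basic idea with far more sophisticated similarity measures.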
Greenplum Database utilizes a shared-nothing, massively parallel processing (MPP) architecture that has been designed for complex business intelligence (BI) and analytical processing. Most of today’s general-purpose relational database management systems are designed for Online Transaction Processing (OLTP) applications. The reality is that BI and analytical workloads are fundamentally different from OLTP transaction workloads and require a profoundly different architecture.
The Greenplum Database is fully parallel and highly optimized for executing both SQL and MapReduce queries. Additionally, the system offers a new level of parallel analysis capabilities for data scientists, with support for SAS, R, linear algebra, and machine learning primitives, and includes extensibility for functions written in Java, C, Perl, or Python.
Because of the shared-nothing MPP architecture, the system is linearly scalable: simply add additional nodes and the database’s performance and capacity improve. Expansions are online, keeping the database available for production workloads.
Logical depiction (top portion): Logically, gNet enables data in multiple formats that resides in the Hadoop HDFS file system to be used as though it were a table in Greenplum Database. This is the essence of co-processing: we can select, filter, join, modify, and aggregate (essentially, perform all normal SQL operations) on the combination of RDBMS data in Greenplum Database and data stored in Hadoop, as though all the data were in the database. The results are:
- Real-time: fast access to new data as it arrives; no waiting for reformatting and periodic movement processes to copy data into the database.
- Space efficiency: no duplication of data; big data makes any plan to duplicate data very expensive, even on so-called “cheap storage.”
- Query efficiency: frequently accessed data can be moved for local access in the database, resulting in a desirable reduction in gNet traffic.
- Archival: information lifecycles where data arrives in one platform but, as it ages, is moved to another platform to achieve a lower cost of retention. Consider the cost of HDFS storage: it’s low, so some customers will generate and manipulate data in the database for simplicity, but archive the data in Hadoop. With co-processing over gNet, the data remains available even after it’s been archived in HDFS files.
The Greenplum Database was conceived, designed, and engineered to allow customers to take advantage of large clusters of increasingly powerful and economical general-purpose servers, storage, and Ethernet switches. With this approach, EMC Greenplum customers can gain immediate benefit from the industry’s latest computing innovations.
Greenplum’s MPP shared-nothing architecture delivers industry-leading performance on big data. You can compare the impact to finding a specific card, say the Ace of Spades, in a deck. If you search by yourself, it could take you up to 52 tries to find the Ace of Spades. If you distribute the deck among 26 people, it will take at most 2 tries each. Likewise, Greenplum distributes processing across nodes, and these nodes work independently and in parallel to quickly deliver answers.
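The arithmetic behind the card analogy can be stated directly. This toy snippet (illustrative only, not Greenplum code) computes the worst-case number of tries when a scan is split evenly across independent workers, which is the essence of a shared-nothing parallel scan:

```python
# Worst-case sequential search of a deck vs. splitting the deck among
# parallel workers, each scanning only its own slice.
import math

def worst_case_tries(deck_size: int, workers: int) -> int:
    """Worst-case cards examined by any one worker."""
    return math.ceil(deck_size / workers)

print(worst_case_tries(52, 1))   # 52  (one person searches the whole deck)
print(worst_case_tries(52, 26))  # 2   (26 people, 2 cards each)
```

Doubling the workers halves the worst-case scan per worker, which is the linear scalability claimed for the MPP architecture, provided the data is spread evenly across nodes.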
Please take a moment to answer these questions. Record your answers here.
As background, it is important to understand that Business Intelligence is different from data science and analytics. BI deals with reporting on history: What happened last quarter? How many did we sell?
Data science is about predicting the future and understanding why things happen: What is the optimal solution? What will happen next?
For many companies data science is a new approach to understanding the business, yet an important one to undertake today.
Here are five main competency and behavioral characteristics of Data Scientists:
- Quantitative skills, such as mathematics or statistics.
- Technical aptitude, such as software engineering, machine learning, and programming skills.
- Skeptical: this may seem a counterintuitive trait, but it is important that data scientists can examine their work critically rather than in a one-sided way.
- Curious and creative: data scientists must be passionate about data and about finding creative ways to solve problems and portray information.
- Communicative and collaborative: it is not enough to have strong quantitative or engineering skills. To make a project resonate, you must be able to articulate its business value clearly and work collaboratively with project sponsors and key stakeholders.
In using Greenplum as the foundation for lab work, we’ve started to converge on a standard set of tools for the various stages of our analyses.
For data cleansing and transformation, we do most of our work in SQL. MapReduce is also useful, especially for unstructured data. For data exploration we also use SQL, as well as R, which is particularly useful for generating summary statistics, analyzing significance, and plotting data visualizations such as frequency distributions, densities, scatter plots, and so on.
For model building, we typically use R. It operates very well on file extracts, but these may be cumbersome and may slow down the modeling process, so it is also useful to read data directly from the database into dataframes via RPostgreSQL (which uses the RDbi interface and is therefore considerably faster than RODBC). For very large data sets, it is often best to use Greenplum’s built-in SQL analytics and the Analytics Library.
Models built in R can be executed on file extracts, but in most cases it’s desirable to run them on a complete set of records in the database. In this case, they can run in the database as PL/R after a simple conversion, and for optimal performance they can be converted to SQL.
In many cases we work with legacy models that were built in SAS. We are developing methods to convert these to SQL or PL/Java. We are also working with SAS Engineering to co-develop ‘Accelerator’ functions.
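As a rough illustration of the data exploration step (the notes describe doing this in R; the sketch below uses Python with invented sample data, purely to show the kind of summary statistics and frequency distribution involved):

```python
import statistics
from collections import Counter

# Hypothetical extract: transaction amounts pulled from the database
amounts = [12.0, 15.5, 12.0, 48.0, 15.5, 12.0, 99.9, 15.5, 12.0, 48.0]

# Summary statistics of the kind we'd generate with R's summary() / sd()
mean = statistics.mean(amounts)
median = statistics.median(amounts)
stdev = statistics.stdev(amounts)

# A frequency distribution: counts per distinct value
freq = Counter(amounts)

print(f"mean={mean:.2f} median={median:.2f} stdev={stdev:.2f}")
print(freq.most_common(3))
```

In practice the extract would come straight from the database into a dataframe rather than being typed in, as described above.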
We can produce an *animated* view of color-coded traffic volumes on Google Earth over a user-specified period. The file that drives the animation is created within Greenplum. The Google Maps display is similar to this, but it only shows traffic volume at a specific point in time.
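Google Earth animates by reading a KML file of time-stamped, styled placemarks. The sketch below (Python, with made-up coordinates and volume thresholds; the real file is generated inside Greenplum) shows the general shape of such a file:

```python
# Hypothetical rows: (ISO timestamp, longitude, latitude, traffic volume)
rows = [
    ("2011-06-01T08:00:00Z", -122.27, 37.80, 120),
    ("2011-06-01T08:00:00Z", -122.29, 37.81, 640),
]

def volume_color(volume):
    """Map a traffic volume to a KML color (aabbggrr): green/yellow/red."""
    if volume < 200:
        return "ff00ff00"   # green: light traffic
    if volume < 500:
        return "ff00ffff"   # yellow: moderate traffic
    return "ff0000ff"       # red: heavy traffic

def to_kml(rows):
    """Build a minimal KML document with one time-stamped placemark per row."""
    placemarks = []
    for when, lon, lat, vol in rows:
        placemarks.append(
            "<Placemark>"
            f"<TimeStamp><when>{when}</when></TimeStamp>"
            f"<Style><IconStyle><color>{volume_color(vol)}</color></IconStyle></Style>"
            f"<Point><coordinates>{lon},{lat}</coordinates></Point>"
            "</Placemark>"
        )
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
            + "".join(placemarks) + "</Document></kml>")

print(to_kml(rows))
```

Google Earth steps through the `TimeStamp` values with its time slider, which is what produces the animation over the user-specified period.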
- Eight banks become one
- Branches across the US
- Consolidation of products and customers
- Employees faced with new products and customers
- Old does not necessarily equal new
- What to recommend to customers?
- Needs to make the bank money
- Needs to make the customer money
- Overlap with existing products is challenging
- Cost of acquiring a new customer is significantly higher than selling additional products to existing customers
Here’s an example in which we used clustering techniques (grouping similar objects together) and a form of “market basket analysis” (if you bought one set of products, you might be interested in another) to create a simple product recommendation engine.

First, we defined a measurement of customer value. (This particular client already had a way of computing that, but it took 20 hours to run in a separate database. Now it runs in Greenplum in less than an hour, so they run it regularly as part of their ETL process.)

Next, we created groups of customers based on product usage. We did this by defining a “distance” between customers so that those who owned a similar assortment of products would be measured as being close. We then used this notion of distance to identify clusters of customers.
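One common way to define such a distance over sets of products is the Jaccard distance; the notes don’t say which metric was used, so take this Python sketch (with invented customers and products) as illustrative only:

```python
# Hypothetical product holdings per customer
holdings = {
    "cust_a": {"checking", "savings", "credit_card"},
    "cust_b": {"checking", "savings", "mortgage"},
    "cust_c": {"brokerage", "ira"},
}

def jaccard_distance(a, b):
    """Distance = 1 - |intersection| / |union|; 0 means an identical product mix."""
    return 1.0 - len(a & b) / len(a | b)

d_ab = jaccard_distance(holdings["cust_a"], holdings["cust_b"])
d_ac = jaccard_distance(holdings["cust_a"], holdings["cust_c"])
print(d_ab, d_ac)  # 0.5 1.0 : cust_a is closer to cust_b than to cust_c
```

Any clustering algorithm that accepts a pairwise distance (e.g. hierarchical clustering) can then group customers using this measure.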
Then we used various methods, including “association rules” (the technique used in market basket analysis on sites such as Amazon), to identify common product associations. In other words, by looking at product usage across millions of customers, we found that certain groups of products tended to occur together. By restricting our analysis to a certain segment of the population (in this case, based on customer value), we were more likely to find product groupings that made sense for that customer segment.
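The core of association-rule mining is counting co-occurrences and computing a rule’s confidence. A minimal Python sketch, with made-up baskets (in practice this runs in-database over millions of customers):

```python
from itertools import combinations
from collections import Counter

# Hypothetical baskets: the set of products each customer holds
baskets = [
    {"checking", "savings"},
    {"checking", "savings", "credit_card"},
    {"checking", "credit_card"},
    {"savings", "cd"},
]

# Count how often each product, and each pair of products, appears
pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    for item in basket:
        item_counts[item] += 1
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Confidence of the rule "checking -> savings":
# among customers holding checking, the fraction who also hold savings
conf = pair_counts[("checking", "savings")] / item_counts["checking"]
print(conf)
```

Rules with high confidence (and enough support, i.e. raw co-occurrence count) become candidate product associations for the segment being analyzed.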
We used these results to make product recommendations. For a given customer, we used the product associations to determine which new products made sense. Then we filtered out products that were disproportionately associated with customers of lower value. The remaining products were then more likely to move the customer into a higher value segment. The client referred to this as “filling incomplete baskets.”

Verticals: This applies to any organization that advertises to a sufficiently large number of customers.
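The value-based filtering step can be sketched as follows (Python, with an invented threshold and invented per-product statistics; the actual cutoffs would come from the customer-value analysis described above):

```python
# Hypothetical candidates from the association rules, with the share of
# each product's holders who fall into the low-value segment
low_value_share = {"credit_card": 0.30, "payday_advance": 0.85, "cd": 0.25}
candidates = ["credit_card", "payday_advance", "cd"]

# Drop products disproportionately held by lower-value customers
# (0.5 is an assumed threshold, chosen only for illustration)
recommendations = [p for p in candidates if low_value_share[p] < 0.5]
print(recommendations)  # ['credit_card', 'cd']
```

What survives the filter is the set of products offered to “fill the incomplete basket.”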
Modern applications need to respond faster and capture more information so the business can perform the analysis needed to make the best business decisions. By combining the best online transaction processing (OLTP) product with the best online analytical processing (OLAP) product, we can create a platform that enables businesses to make the most of both historical and real-time data, with each system covering the other’s weaknesses. Traditionally, OLAP databases excel at handling petabytes of information but are not geared for fine-grained, low-latency access; OLTP databases excel at fine-grained, low-latency access but may fall short when handling large-scale data sets with ad hoc queries.

To address the OLTP side of this problem we have chosen vFabric SQLFire. SQLFire is a memory-optimized, shared-nothing, distributed SQL database delivering dynamic scalability and high performance for data-intensive modern applications. SQLFire’s memory-optimized architecture minimizes time spent waiting for disk access, the main performance bottleneck in traditional databases. SQLFire achieves dramatic scaling by pooling memory, CPU, and network bandwidth across a cluster of machines, and can manage data across geographies.

For the OLAP side we will be looking at EMC Greenplum. Built to support Big Data analytics, Greenplum Database manages, stores, and analyzes terabytes to petabytes of data. Users experience 10 to 100 times better performance over traditional RDBMS products, a result of Greenplum’s shared-nothing massively parallel processing architecture, high-performance parallel dataflow engine, and advanced gNet software interconnect technology.
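The division of labor between the two tiers can be caricatured as a routing decision. The Python sketch below is illustrative only: neither SQLFire nor Greenplum routes queries this way, and the keyword heuristic is an assumption made up for this example.

```python
def route(query):
    """Toy router: send aggregations/scans to the OLAP warehouse,
    key lookups to the in-memory OLTP store."""
    q = query.upper()
    # Heuristic: presence of aggregation constructs suggests analytics
    if "GROUP BY" in q or "SUM(" in q or "AVG(" in q:
        return "olap"
    return "oltp"

print(route("SELECT balance FROM accounts WHERE id = 42"))             # oltp
print(route("SELECT region, SUM(sales) FROM orders GROUP BY region"))  # olap
```

In a real deployment the split is made at design time (which workloads hit which system), not by inspecting query text, but the sketch captures the complementary roles described above.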