1
Big Data Lessons from the Cloud
Jack Norris, MapR Technologies
2
The Challenge of Big Data
Business Analytics Requires a New Approach
Data Volume Growing 44x: 1.2 zettabytes in 2010, 35.2 zettabytes in 2020
Data is Growing Faster than Moore’s Law
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
3
What are the Requirements for Big Data?
 Process it quickly
 Combine multiple data sources
 Expand analysis
4
Big Data in the Cloud
 Distributed, scalable computing platform
– Data/Compute framework
– Commodity hardware
 Pioneered at Google
 Commercially available as Hadoop
5
Important Drivers for Hadoop
 Data on compute
 You don’t need to know what questions to ask beforehand
 Simple algorithms on Big Data
 Analysis of unstructured data
6
Hadoop Growth
7
Apache Hadoop Distribution
 Combination of various packages
 Integrated, tested, and hardened
8
Hadoop in the Cloud
9
Amazon Example: Elastic MapReduce (EMR)
EMR provides Hadoop as a Service in the Cloud
10
How does it work?
[Diagram: S3, EMR, and an EMR cluster]
You can store the data in S3 and/or on the cluster (HDFS).
You decide which Hadoop distribution to run, how many nodes, and what types of nodes.
11
How does it work?
[Diagram: S3, EMR, and an EMR cluster]
You can easily add additional nodes.
12
How does it work?
[Diagram: S3 and an EMR cluster]
When processing is complete, you can shut down the cluster (and stop paying).
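The choices described on these slides (which release to run, how many nodes, what node types, where the data lives) can be sketched as a request description. This is a minimal sketch, not the presenter's method: the field names follow the keyword arguments of boto3's EMR `run_job_flow` call, while the cluster name, bucket, release label, and instance type are placeholder assumptions. Nothing in the block contacts AWS.

```python
# Sketch only: builds the keyword arguments one might pass to
# boto3.client("emr").run_job_flow(**request). All values are placeholders.

def build_cluster_request(name, release_label, instance_type, core_nodes, log_uri):
    """Return a dict describing a small Hadoop cluster on EMR."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,      # which EMR/Hadoop release to run
        "LogUri": log_uri,                  # logs are written to S3
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # False means the cluster terminates once its steps finish,
            # so billing stops when processing is complete.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",   # default EMR role names
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_cluster_request(
    name="example-cluster",                 # hypothetical name
    release_label="emr-6.15.0",             # assumed release label
    instance_type="m5.xlarge",              # assumed node type
    core_nodes=4,
    log_uri="s3://example-bucket/logs/",    # hypothetical bucket
)
total_nodes = sum(g["InstanceCount"]
                  for g in request["Instances"]["InstanceGroups"])
print(total_nodes)  # 1 master + 4 core nodes
```

With credentials configured, the same dictionary could be passed to `run_job_flow`; the EMR client also offers `modify_instance_groups` for resizing and `terminate_job_flows` for shutting a cluster down.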
13
Launching a Cluster
14
Thousands of customers, 2 million+ clusters
16
Hadoop in the Cloud is a Flexible Infrastructure for Big Data
17
Cloud Example of Scalability
 MinuteSort: the amount of data that can be sorted in 60 seconds
– Benchmark is technology agnostic
 Previous record was 1.4 TB, set by Microsoft Research using specially designed software across physical hardware
 Previous Hadoop MinuteSort record was 578 GB
18
A New MinuteSort World Record
New World Record: 1.5 TB in 60 seconds
About 2.6x the data processed in the previous Hadoop record (578 GB)
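The comparison with the earlier records is simple arithmetic on the figures quoted in the deck. The sketch below assumes decimal units (1 TB = 1000 GB); the variable names are illustrative.

```python
# Figures quoted on the MinuteSort slides; decimal units are an assumption.
NEW_RECORD_GB = 1500.0          # 1.5 TB sorted in 60 seconds
PREVIOUS_OVERALL_GB = 1400.0    # 1.4 TB, Microsoft Research
PREVIOUS_HADOOP_GB = 578.0      # previous Hadoop MinuteSort record
WINDOW_SECONDS = 60.0

vs_hadoop = NEW_RECORD_GB / PREVIOUS_HADOOP_GB
vs_overall = NEW_RECORD_GB / PREVIOUS_OVERALL_GB
aggregate_gb_per_s = NEW_RECORD_GB / WINDOW_SECONDS

print(f"{vs_hadoop:.2f}x the previous Hadoop record")      # 2.60x
print(f"{vs_overall:.2f}x the previous overall record")    # 1.07x
print(f"{aggregate_gb_per_s:.1f} GB/s aggregate sort rate")  # 25.0 GB/s
```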
19
Cloud Deployment Comparison
Previous Record: 3452 physical servers; prepare datacenter, rack and stack servers, maintain hardware; Months
New Record: 2103 instances; invoke gcutil command; Minutes
20
Cost Comparison
Previous Record: 3452 1U servers x $4K/server = $13,808,000
New Record: 2103 n1-standard-4-d instances x $0.58/instance hour x 60 seconds = $20.33
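Both totals on this slide can be reproduced directly. The sketch below prorates the hourly instance price to 60 seconds, as the slide does; actual cloud billing granularity may differ, and the function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

def on_premises_cost(servers, price_per_server):
    """Hardware purchase cost: servers x unit price."""
    return Decimal(servers) * Decimal(price_per_server)

def cloud_cost(instances, price_per_instance_hour, seconds):
    """Instance cost prorated to the seconds used (as on the slide)."""
    raw = (Decimal(instances) * Decimal(price_per_instance_hour)
           * Decimal(seconds) / Decimal(3600))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

hardware = on_premises_cost(3452, "4000")
cloud = cloud_cost(2103, "0.58", 60)

print(hardware)  # 13808000
print(cloud)     # 20.33
```

The hardware figure covers purchase price only; the slide does not include datacenter, power, or staffing costs.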
21
Use Case 1:
Expand Data for Analysis
22
Comparing an EDW to Hadoop
 Major telecom vendor
 Key step in billing pipeline handled by data warehouse (EDW)
 EDW at maximum capacity
 Multiple rounds of software optimization already done
 Revenue-limiting (= career-limiting) bottleneck
23
Original Flow
[Diagram: CDR billing records; Extract and Load; Transformation; Data Warehouse; Billing reports; Customer bills]
24
Problem Analysis
 70% of EDW load is related to call detail record (CDR) normalization
– < 10% of total lines of code
– CDR normalization difficult within the EDW
– Binary extraction and conversion
 Data rates are too high for upstream transform
– Requires high volume joins
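The kind of work being offloaded (binary extraction, conversion, and aggregation of CDRs) maps naturally onto a map step and a reduce step. The sketch below is illustrative only: the binary record layout is HYPOTHETICAL and not taken from the deck, and it runs in a single process rather than on a Hadoop cluster.

```python
import struct
from collections import defaultdict

# HYPOTHETICAL record layout, for illustration only: big-endian
# caller (uint64), callee (uint64), start time (uint32), duration in seconds (uint32).
CDR_FORMAT = ">QQII"
CDR_SIZE = struct.calcsize(CDR_FORMAT)  # 24 bytes

def map_cdr(raw):
    """Map step: decode one binary record into a (caller, duration) pair."""
    caller, _callee, _start, duration = struct.unpack(CDR_FORMAT, raw)
    return caller, duration

def reduce_durations(pairs):
    """Reduce step: total seconds per caller."""
    totals = defaultdict(int)
    for caller, duration in pairs:
        totals[caller] += duration
    return dict(totals)

def normalize(blob):
    """Split a byte string into fixed-size records, then map and reduce."""
    records = (blob[i:i + CDR_SIZE] for i in range(0, len(blob), CDR_SIZE))
    return reduce_durations(map_cdr(r) for r in records)

# Made-up sample records, not real data.
sample = b"".join([
    struct.pack(CDR_FORMAT, 15550001, 15550002, 1_300_000_000, 120),
    struct.pack(CDR_FORMAT, 15550001, 15550003, 1_300_000_500, 30),
    struct.pack(CDR_FORMAT, 15550004, 15550001, 1_300_001_000, 45),
])
totals = normalize(sample)
print(totals)  # {15550001: 150, 15550004: 45}
```

On a real cluster the map and reduce functions would run in parallel across many nodes, which is what makes the high data rates tractable.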
25
With ETL Offload
[Diagram: CDR billing records; ETL; Hadoop Cluster; Data Warehouse; Billing reports; Customer billing]
26
ETL Offload
Hadoop Distribution
27
Simplified Analysis
 70% of EDW consumed by ETL processing; offload frees capacity
 EDW direct hardware cost is approximately $30 million vs. Hadoop cluster at 1/50 the cost
 Additional EDW only increases capacity by 50% due to poor division of labor
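One way to read these figures, offered as an assumption rather than a calculation taken from the deck: if ETL consumes 70% of the warehouse, moving it elsewhere lets the remaining workload grow by 1 / (1 - 0.7) before the warehouse is full again, which is in line with the roughly 3x figure reported in the results.

```python
def capacity_multiplier_after_offload(etl_share):
    """Headroom for the remaining workload once the ETL share is removed."""
    if not 0 <= etl_share < 1:
        raise ValueError("etl_share must be in [0, 1)")
    return 1.0 / (1.0 - etl_share)

EDW_COST = 30_000_000            # approximate EDW hardware cost from the slide
hadoop_cost = EDW_COST / 50      # "1/50 the cost"
offload_gain = capacity_multiplier_after_offload(0.70)

print(f"Hadoop cluster: ${hadoop_cost:,.0f}")                    # $600,000
print(f"Headroom for remaining EDW work: {offload_gain:.2f}x")   # 3.33x
```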
28
The Results
 EDW strategy
– 1.5x performance
– $30 million
 Hadoop strategy
– 3x faster
– 20x cost/performance advantage for Hadoop strategy
– With High Availability and data protection
29
Use Case 2:
Combine Many Different Data Sources
30
Combining different feeds on one platform
[Diagram: Real-time data feed from social network; Historical Purchase Information; Billing Data; …; stored in Hadoop; Hadoop and HBase for Storage and Processing]
Predictive Analytics from Historical data combined with NoSQL querying on real-time social networking data
31
Results
 New service rolled out in 1 quarter
 Processing time cut from 20 hours per day to 3 hours
 Recommendation engine load time decreased from 8 hours to 3 minutes
 Includes data versioning support for easier development and updating of models
32
Collect Data from Dispersed Data Sources
33
Leading Veterinary Equipment Manufacturer
 Aggregates data across 6000 veterinary clinics
 Nightly extracts from each clinic
 One job runs once a week for a few hours
 Expanding applications to include vaccination analysis for 300M vaccinations
 Predictive analytics for disease prevalence and prevention
34
Use Case 3:
New Application from New Data Source
35
Ancestry.com – Family Tree
36
Overview and Requirements
 Collect and collate information from disparate sources (text files, images, etc.)
 Leverage new data source: Spit
 Machine learning techniques and DNA matching algorithms
37
The Results
 Storage infrastructure for billions of small and large files
 Blob store for large images through NoSQL solutions
 Multi-tenant capability for data-mining and machine-learning algorithm development
38
Use Case 4:
New Analytics on Existing Data
39
Analytic Flexibility
 MapReduce-enabled machine learning algorithms
 Enhanced search
 Real-time event processing
 No need to sample the data
Applications: Fraud Detection, Target Marketing, Consumer Behavior Analysis, …
40
Hadoop Expands Analytics
“Simple algorithms and lots of data trump complex models”
Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems
41
Advanced Simple Analytics
 Fraud detection:
– Detect small frauds using transaction patterns across the entire portfolio
– Identify compromise signature to prevent further exploits and provide solid case explanations
 Google Flu Trends vs. traditional flu surveillance systems and modeling
 Netflix recommendation engine
– Complex models vs. adding IMDB data
42
Combine Them All
43
Clickstream Analysis
 Big Box Retailer came to Razorfish
– 3.5 billion records
– 71 million unique cookies
– 1.7 million targeted ads required per day
Problem: Improve Return on Ad Spend (ROAS)
44
Clickstream Analysis
Targeted Ad (1.7 million per day)
Example: user recently purchased a sports movie and is searching for video games
45
Clickstream Analysis
Processing time dropped from 2+ days to 8 hours (with lots more data)
46
Clickstream Analysis
Increased Return On Ad Spend by 500%
47
Hadoop in the Cloud/EMR applications
 Targeted advertising / Clickstream analysis
 Security: anti-virus, fraud detection, image recognition
 Pattern matching / Recommendations
 Data warehousing / BI
 Bio-informatics (Genome analysis)
 Financial simulation (Monte Carlo simulation)
 File processing (resize jpegs, video encoding)
 Web indexing
48
Big Data Processing
Workloads: MapReduce, File-Based Applications, SQL, Database, Search, Stream Processing, …
Platform: 99.999% HA, Data Protection, Disaster Recovery, Scalability & Performance, Enterprise Integration, Multi-tenancy
Batch Orientation: Enterprise Logfile Analysis, ETL Offload, Object Archive, Fraud Detection, Clickstream Analytics
Real-Time Orientation: Sensor Analysis, “Twitterscraping”, Telematics, Process Optimization
Interactive Orientation: Forensic Analysis, Analytic Modeling, BI User Focus
49
Big Data Lessons from the Cloud
1. Big Data requires a new approach
2. Hadoop is a paradigm shift
3. Easy to get started with Hadoop in the Cloud
4. Scale clusters up and down in the Cloud
5. Only pay for what you use
6. Expand data for analysis
7. Combine data sources
8. New application from new data source
9. New analytics
10. Wide variety of applications appropriate for Hadoop

Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Big Data Lessons from the Cloud

  • 1. Big Data Lessons from the Cloud. Jack Norris, MapR Technologies
  • 2. The Challenge of Big Data: Business Analytics Requires a New Approach. Data volume growing 44x. 2010: 1.2 zettabytes. 2020: 35.2 zettabytes. Data is growing faster than Moore’s Law. Source: IDC Digital Universe Study, sponsored by EMC, May 2010
  • 3. What are the Requirements for Big Data? Process it quickly; combine multiple data sources; expand analysis
  • 4. Big Data in the Cloud: Distributed, scalable computing platform (data/compute framework, commodity hardware); pioneered at Google; commercially available as Hadoop
  • 5. Important Drivers for Hadoop: Data on compute; you don’t need to know what questions to ask beforehand; simple algorithms on Big Data; analysis of unstructured data
  • 7. Apache Hadoop Distribution: Combination of various packages; integrated, tested and hardened
  • 9. Amazon Example: Elastic MapReduce (EMR). EMR provides Hadoop as a Service in the Cloud
  • 10. How does it work? (Diagram: S3, EMR, EMR Cluster.) You can store the data in S3 and/or on the cluster (HDFS). You decide which Hadoop distribution to run, how many nodes, and what types of nodes
  • 11. How does it work? (Diagram: S3, EMR, EMR Cluster.) You can easily add additional nodes
  • 12. How does it work? (Diagram: S3, EMR Cluster.) When processing is complete, you can shut down the cluster (and stop paying)
  • 14. Thousands of customers, 2 million+ clusters
  • 15. Hadoop in the Cloud is a Flexible Infrastructure for Big Data
  • 16. Cloud Example of Scalability: MinuteSort is the amount of data that can be sorted in 60.00 seconds (the benchmark is technology agnostic); the previous record was 1.4 TB, set by Microsoft Research using specially designed software across physical hardware; the previous Hadoop MinuteSort record was 578 GB
  • 17. A New MinuteSort World Record: 1.5 TB in 60 seconds. 3X more data processed than the previous Hadoop record
  • 18. Cloud Deployment Comparison. Previous record: 3452 physical servers (prepare datacenter, rack and stack servers, maintain hardware); months. Cloud: 2103 instances (invoke gcutil command); minutes
  • 19. Cost Comparison. Previous record: 3452 1U servers x $4K/server = $13,808,000. Cloud: 2103 n1-standard-4-d instances x $0.58/instance hour x 60 seconds = $20.33
  • 20. Use Case 1: Expand Data for Analysis
  • 21. Comparing an EDW to Hadoop: Major telecom vendor; key step in billing pipeline handled by data warehouse (EDW); EDW at maximum capacity; multiple rounds of software optimization already done; revenue limiting (= career limiting) bottleneck
  • 22. Original Flow (diagram labels): Transformation; Extract and Load; CDR billing records; Billing reports; Data Warehouse; Customer bills
  • 23. Problem Analysis: 70% of EDW load is related to call detail record (CDR) normalization (< 10% of total lines of code; CDR normalization difficult within the EDW; binary extraction and conversion); data rates are too high for upstream transform (requires high volume joins)
  • 26. Simplified Analysis: 70% of EDW consumed by ETL processing (offload frees capacity); EDW direct hardware cost is approximately $30 million vs. Hadoop cluster at 1/50 the cost; additional EDW only increases capacity by 50% due to poor division of labor
  • 27. The Results. EDW strategy: 1.5x performance; $30 million. Hadoop strategy: 3x faster; 20x cost/performance advantage for Hadoop strategy; with High Availability and data protection
  • 28. Use Case 2: Combine Many Different Data Sources
  • 29. Combining different feeds on one platform (diagram labels): Hadoop and HBase storage and processing; real-time data feed from social network, stored in Hadoop; historical purchase information; billing data; predictive analytics from historical data combined with NoSQL querying on real-time social networking data
  • 30. Results: New service rolled out in 1 quarter; processing time cut from 20 hours per day to 3; recommendation engine load time decreased from 8 hours to 3 minutes; includes data versioning support for easier development and updating of models
  • 31. Collect Data from Dispersed Data Sources
  • 32. Leading Veterinary Equipment Manufacturer: Aggregates data across 6000 veterinary clinics; nightly extracts from each clinic; one job runs once a week for a few hours; expanding applications to include vaccination analysis for 300M vaccinations; predictive analytics for disease prevalence and prevention
  • 33. Use Case 3: New Application from New Data Source
  • 35. Overview and Requirements: Collect and collate information from disparate sources (text files, images, etc.); leverage new data source: spit; machine learning techniques and DNA matching algorithms
  • 36. The Results: Storage infrastructure for billions of small and large files; blob store for large images through NoSQL solutions; multi-tenant capability for data-mining and machine-learning algorithm development
  • 37. Use Case 4: New Analytics on Existing Data
  • 38. Analytic Flexibility: MapReduce-enabled machine learning algorithms; enhanced search; real-time event processing; no need to sample the data. Fraud detection; target marketing; consumer behavior analysis; …
  • 39. Hadoop Expands Analytics: “Simple algorithms and lots of data trump complex models” (Halevy, Norvig, and Pereira, Google; IEEE Intelligent Systems)
  • 40. Advanced Simple Analytics: Fraud detection (detect small frauds using transaction patterns across the entire portfolio; identify compromise signature to prevent further exploits and provide solid case explanations); Google Flu Trends vs. traditional flu surveillance systems and modeling; Netflix recommendation engine (complex models vs. adding IMDB data)
  • 42. Clickstream Analysis. Problem: Improve Return on Ad Spend (ROAS). Big box retailer came to Razorfish: 3.5 billion records; 71 million unique cookies; 1.7 million targeted ads required per day
  • 43. Clickstream Analysis: Targeted Ad. User recently purchased a sports movie and is searching for video games (1.7 million per day)
  • 44. Clickstream Analysis: Processing time dropped from 2+ days to 8 hours (with lots more data)
  • 45. Clickstream Analysis: Increased Return on Ad Spend by 500%
  • 46. Hadoop in the Cloud/EMR applications: Targeted advertising / clickstream analysis; security (anti-virus, fraud detection, image recognition); pattern matching / recommendations; data warehousing / BI; bio-informatics (genome analysis); financial simulation (Monte Carlo simulation); file processing (resize JPEGs, video encoding); web indexing
  • 47. Big Data Processing … (diagram labels): 99.999% HA; Data Protection; Disaster Recovery; Scalability & Performance; Enterprise Integration; Multi-tenancy; Map Reduce; File-Based Applications; SQL; Database; Search; Stream Processing. Batch Orientation: Enterprise Logfile Analysis, ETL Offload, Object Archive, Fraud Detection, Clickstream Analytics. Real-Time Orientation: Sensor Analysis, “Twitterscraping”, Telematics, Process Optimization. Interactive Orientation: Forensic Analysis, Analytic Modeling, BI User Focus
  • 48. Big Data Lessons from the Cloud: 1. Big Data requires a new approach; 2. Hadoop is a paradigm shift; 3. Easy to get started with Hadoop in the Cloud; 4. Scale clusters up and down in the Cloud; 5. Only pay for what you use; 6. Expand data for analysis; 7. Combine data sources; 8. New application from new data source; 9. New analytics; 10. Wide variety of applications appropriate for Hadoop
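
The cost comparison slide above multiplies out two very different bills. As a rough check of that arithmetic, here is a minimal Python sketch. The input figures come from the slide; the function names are hypothetical, and the cloud figure assumes (as the slide's own calculation does) that the hourly price is prorated down to the 60 seconds of the run, which may not match how any provider actually bills.

```python
# Figures from the Cost Comparison slide.
PHYSICAL_SERVERS = 3452
COST_PER_SERVER_USD = 4_000            # "$4K/server"
CLOUD_INSTANCES = 2103
PRICE_PER_INSTANCE_HOUR_USD = 0.58
RUN_SECONDS = 60


def hardware_cost(servers: int, unit_cost_usd: float) -> float:
    """Up-front purchase cost of a physical cluster."""
    return servers * unit_cost_usd


def cloud_run_cost(instances: int, hourly_price_usd: float, seconds: float) -> float:
    """Cost of one run, assuming the hourly price is prorated per second."""
    return instances * hourly_price_usd * seconds / 3600


if __name__ == "__main__":
    physical = hardware_cost(PHYSICAL_SERVERS, COST_PER_SERVER_USD)
    cloud = cloud_run_cost(CLOUD_INSTANCES, PRICE_PER_INSTANCE_HOUR_USD, RUN_SECONDS)
    print(f"Physical hardware: ${physical:,.0f}")   # $13,808,000
    print(f"Cloud run:         ${cloud:,.2f}")      # $20.33
```

The two numbers are not like for like: one is a hardware purchase and the other is the rental for a single 60-second run, so the sketch only reproduces the slide's arithmetic rather than a full cost model.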

Editor's Notes

  1. MapReduce is a paradigm shift. Google poster child. What exactly does Hadoop look like?
  2. There are many drivers for Hadoop adoption…
  3. Let’s start with this chart. To reinforce that you’re in the right room and picked the right session: Hadoop. Not only is it the fastest growing Big Data technology, it is one of the fastest growing technologies, period. Hadoop adoption is happening across industries and across a wide range of application areas. What’s driving this adoption?
  4. This is a Hadoop distribution. It includes a series of open source packages that are tested, hardened and combined into a complete suite. With MapR we’ve combined this with our own innovations at the data platform level to make it highly available, dependable and easier to access and integrate through industry standards like NFS, ODBC, etc.
  5. How do you benefit? I mentioned that Hadoop is used in a wide variety of use cases. I’ve generalized these into 4 groups. The first…
  6. …is expanding data: going from sampled data to all of the transactions. Netflix recommends 5 movies to you, and it’s because they look at everybody’s movie watching and ratings and identify clusters of individuals like you. Risk triangles for insurance companies go from zip code level down to the neighborhood street. Trading information goes from the last 3 months to 7 years.
  7. Let’s look at a specific example…
  8. Load CDRs (call detail records) into the data warehouse and transform the data into the proper format for processing and analysis…
  9. The problem with this process is that 70% of the EDW load is related to the CDR normalization process. AI: Why is this the case? CDR normalization is difficult within the EDW. Binary extraction and conversion to SQL is difficult.
  10. IDEXX (current client, M3 on EMR). IDEXX is the leader in veterinary equipment and also makes software for clinics, etc. They aggregate some data from veterinary clinics that have IDEXX software. They had a MapR cluster internally with 4-5 servers at the time and used it successfully for a few months. Terry went to the AWS conference in November and learned about EMR. They tried it out and liked the flexibility, especially in their use case where there aren't jobs all the time. Example: one job runs once a week for a few hours. There are 6000 veterinary practices. Each night they receive a data extract from each one (a pipe-delimited file) that includes all the products that were sold that day. Hadoop is used for aggregations, then Sqoop is used to load the results into another Oracle database for the analysts. Now they have another project. This project is compiled using Java 7 and uses some Java 7 features (it's part of a much larger project that uses Java 7). AI Itay: Send them the exact instructions for using Java 7 with MapR/EMR. The new project processes similar data to the first project. In this case, they are creating a list of vaccinations for each animal, to provide a portal to end-users with all the medical details.
  11. The first is “simple algorithms and lots of data trump complex models”. This comes from an IEEE article written by 3 research directors at Google. The article was titled “The Unreasonable Effectiveness of Data”; it was a reaction to an article called “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”. That paper made the point that simple formulas can explain the complex natural world, the most famous example being E=mc² in physics. Their paper talked about how economists were jealous since they lacked similar models to neatly explain human behavior. But they found that in the area of Natural Language Processing, an area notoriously complex that has been studied for years with many AI attempts at addressing it, relatively simple approaches on massive data produced stunning results. They cited an example of scene completion. An algorithm is used to eliminate something in a picture, a car for instance, and based on a corpus of thousands of pictures fill in the missing background. This algorithm did rather poorly until they increased the corpus to millions of photos, and with this amount of data the same algorithm performed extremely well. While not a direct example from financial services, I think it’s a great analogy. After all, aren’t you looking for an approach that can fill in the missing pieces of a picture or pattern?
  12. Okay, interesting graphs; how does this translate to the real world? Here are some broad examples.
  13. Start with the right platform: power to address your needs and the flexibility to grow with your expansion. Meeting Notes (4/3/13 14:27): examples of functionality that makes applications better; custom code; integrate; time to market; production grade; RSA; security event management; NFS; pull data easily. 1. Why Hadoop is game-changing: paradigm shift. 2. How can you benefit: use case categories; saved 10 million dollars; predictive analytics. Need money. Who is MapR; what do we do to make that a reality; end point: what you can do with it to bring value today.
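
Note 10 describes nightly pipe-delimited extracts from each clinic being aggregated before loading into a database for analysts. The sketch below illustrates that kind of aggregation on a single machine in plain Python. It is not the pipeline described in the notes, which runs on Hadoop and loads results with Sqoop. The column names (`clinic_id`, `product`, `quantity`) and the function name are assumptions made for illustration; the notes only say the file is pipe-delimited and lists the products sold that day.

```python
import csv
from collections import Counter
from typing import Iterable


def aggregate_units_sold(extract_lines: Iterable[str]) -> Counter:
    """Sum units sold per product across pipe-delimited extract lines.

    Assumes a header row with hypothetical columns:
    clinic_id|product|quantity
    """
    totals: Counter = Counter()
    reader = csv.DictReader(extract_lines, delimiter="|")
    for row in reader:
        totals[row["product"]] += int(row["quantity"])
    return totals


if __name__ == "__main__":
    # Made-up sample rows standing in for one night's extract.
    sample = [
        "clinic_id|product|quantity",
        "C001|vaccine_a|3",
        "C002|vaccine_a|2",
        "C001|test_kit|1",
    ]
    for product, units in sorted(aggregate_units_sold(sample).items()):
        print(product, units)
```

In a MapReduce setting the same logic would typically be split into a map step that emits (product, quantity) pairs and a reduce step that sums them per product.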