BIG DATA:
IT’S MORE THAN VOLUME
Nachum Shacham
PayPal
Big Data Innovation Summit
April 2013
IT’S BIG-DATA TIME!
Volume → big platforms
Variety → multiple data types
Velocity → fast response
Value → a treasure of patterns
TECHNOLOGY HYPE CYCLE
3 DM Tech Forum
BIG DATA
MIXED SIGNALS FROM THE PUNDITS
• Data Lake
• “Needle in a hay stack”
• “All hay no needles”
• “Yet another fad”
• “Noth’n new: we’ve been analyzing data for 30 years”
• “Store’em and they’ll come”
• “Don’t ever discard data”
• “$524.752MM ROI in 3 years”
• “Smart” …
• “Hadoop is free”
• “Just…”
USE YOUR OWN FILTER
• Sift facts from MBS
• Seek factual 1-liners
• See through metaphors
• Discount “Smart” (data, algorithms, systems)
• Be skeptical
UNLOCK THE VALUE IN BIG DATA
• Data Trumps Algorithms
• Sufficient data further down the long tail
• Wisdom of the crowd → effective recommendations
• Combine signals from different media
BUSINESS VALUE IN BIG DATA
RISK ANALYSIS
IDENTIFY INFLUENCERS IN SOCIAL GRAPH
ONLINE ADS
REVENUE OPTIMIZATION
FRAUD DETECTION AND PREVENTION
LET’S DIG INTO BIG DATA
• Define KPIs
• Explore
• Model & Measure
• Visualize signals
• Test
• Question test results
• Rinse and Repeat
BIG-DATA ANALYTICS
FROM SEMI-STRUCTURED DATA TO BUSINESS SIGNALS
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS"
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)
MPP PLATFORMS AS WORKBENCHES FOR BIG DATA AND THEIR TOOLS
RDBMS Data Warehouse
Hadoop
Cloud
CLASSES OF ANALYTICS JOBS
Big Data
• Data organization for BI
• A few large models
• Many small models
DATA MANIPULATION
GRAPHICS
MODEL BUILDING
CROSS VALIDATION
PROBLEM FORMULATION
MR
MATCH THE JOB TO THE PLATFORM
R: THE TOOL FOR ALL ANALYTICS STEPS
• Data Sourcing
• Data Preparation
• Exploratory Data Analysis
• Predictive Models
• Visualization
• Reporting
BI-LINGUAL HADOOP STREAMING:
LARGE SCALE PARALLEL PREDICTIVE MODELING
data files
Mapper.py: Text processing
• process lines
• set sorting key and value
• output <key, value>
Shuffle sort
Reducer.R: Model per segment
• Collect segment data marked by key
• Process segment data
• Output processed segment data
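The mapper side of this pattern can be sketched in a few lines of Python. This is a minimal illustration and not the presenter's Mapper.py: the comma-separated input layout (segment id, timestamp, value) and the function names are assumptions made for the example.

```python
import sys
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Process lines, set the sorting key and value, emit 'key<TAB>value'."""
    for line in lines:
        # Hypothetical layout: segment_id,timestamp,value
        fields = line.rstrip("\n").split(",")
        if len(fields) < 3:
            continue  # skip malformed lines instead of failing the task
        segment, value = fields[0], fields[2]
        # Hadoop Streaming treats the text before the first tab as the key
        yield f"{segment}\t{value}"


def main() -> None:
    # When used as a streaming mapper, records arrive on stdin
    for record in map_lines(sys.stdin):
        print(record)
```

A script like this would call `main()` and be handed to Hadoop Streaming through its `-mapper` option; the framework's shuffle sort then delivers all records with the same key to one reducer.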
SEMI-STRUCTURED DATA → TABULAR DATA
Meta VERSION="1" .
Job JOBID="job_201212150932_52151" JOBNAME="DataFilter" USER="user1234" SUBMIT_TIME="1355822133394"
JOBCONF="hdfs://tmp/hadoop-hadoop/mapred/staging/user1234/.staging/job_201212150932_52151/job.xml"
VIEW_JOB=" " MODIFY_JOB=" " JOB_QUEUE="B" .
Job JOBID="job_201212150932_52151" JOB_PRIORITY="NORMAL" .
Job JOBID="job_201212150932_52151"
LAUNCH_TIME="1355822223576" TOTAL_MAPS="50" TOTAL_REDUCES="0" JOB_STATUS="PREP" .
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" START_TIME="1355822133148" SPLITS="" .
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051"
TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0"
START_TIME="1355822133545"
TRACKER_NAME="tracker_dn0492.ebay.com:localhost.localdomain/127.0.0.1:33613" HTTP_PORT="50060" .
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051"
TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS"
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS"
FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)
(FILE_BYTES_WRITTEN)(27089)]}{(org.apache.hadoop.mapred.Task$Counter)
(Map-Reduce Framework)[(SPILLED_RECORDS)(Spilled Records)(0)]}" .
Job JOBID="job_201212150932_52151" JOB_STATUS="RUNNING" .
Task TASKID="task_201212150932_52151_m_000001" TASK_TYPE="MAP" START_TIME="1355822133163"
attempt,201212171719,248176,m,000013,0,1355499674337,1355499903213,MAP,SUCCESS,default,rack3,lvsaishdc3dn0109,0109
attempt,2012121771719,248176,m,000464,0,1355501042650,1355501253259,MAP,SUCCESS,default,rack5,lvsaishdc3dn0217,0217
attempt,2012121771719,248176,m,000626,0,1355501212902,1355501366476,MAP,SUCCESS,default,rack17,lvsaishdc3dn0776,0776
attempt,2012121771719,248176,m,001193,0,1355499673762,1355499887662,MAP,SUCCESS,default,rack8,lvsaishdc3dn0366,036
attempt,2012121771719,248176,m,001355,0,1355499673545,1355499908182,MAP,SUCCESS,default,rack9,lvsaishdc3dn0386,0386
attempt,2012121771719,248176,m,001517,0,1355501266524,1355501470527,MAP,SUCCESS,default,rack5,lvsaishdc3dn0236,0236
attempt,2012121771719,248176,m,001850,0,1355501303142,1355501486691,MAP,SUCCESS,default,rack5,lvsaishdc3dn0235,0235
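The step from semi-structured records to comma-separated rows can be sketched as follows. This is an illustrative parser, not the code used in the talk: the function names and the choice of output columns are assumptions, and the queue, rack, and host columns of the rows above are left out because the slide does not show where they come from.

```python
import re
from typing import Dict, Tuple

# KEY="value" pairs as they appear in the job history records
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_record(line: str) -> Tuple[str, Dict[str, str]]:
    """Split one history line into its record type and attribute dict."""
    # Normalize typographic quotes that sometimes creep into copied logs
    line = line.replace("\u201c", '"').replace("\u201d", '"')
    tokens = line.split(None, 1)
    record_type = tokens[0] if tokens else ""
    return record_type, dict(ATTR.findall(line))


def attempt_row(attrs: Dict[str, str]) -> str:
    """Flatten a MapAttempt record into one comma-separated row."""
    # attempt_<tracker>_<job>_<m|r>_<task>_<try> becomes six columns
    id_parts = attrs.get("TASK_ATTEMPT_ID", "").split("_")
    rest = [attrs.get(name, "") for name in
            ("START_TIME", "FINISH_TIME", "TASK_TYPE", "TASK_STATUS")]
    return ",".join(id_parts + rest)
```

Attributes missing from a record come out as empty columns, so start and finish records for the same attempt can be merged in a later step.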
FROM TABULAR DATA TO BI
PARALLEL SEGMENTED MODELING
[Figure: MAPPERS feeding REDUCERS, with multiple R instances running in parallel]
MODELS BUILT ON LARGE DATASETS
Meta VERSION="1" .
Job JOBID="job_201112150932_52151" JOBNAME="DataFilter" USER="user1234" LAUNCH_TIME="1324801865576"
TIME INTERVAL DATA
CONCURRENCY
PERCENTILES
TIME SERIES
WORD COUNT
REPRESENTATION
AVOID RAM LIMITATIONS
R STAT PROCESSING
R LEVERAGING RDBMS POWER
teradataR
SciDB-R
Cloud
TERADATAR FUNCTIONS (SAMPLE)
Function Name    What it does
td.zscore        Z-score Transformation
td.t.paired      T Test Paired
td.cor           Correlation Matrix
td.f.oneway      One-way F Test
td.factanal      Factor Analysis
td.freq          Frequency Analysis
td.hist          Histograms
td.kmeans        K-Means Clustering
td.ks            Kolmogorov-Smirnov Test
td.mode          Mode Value of Column
td.tapply        Apply a function over a database column
td.summary       Like R summary()
td.quantiles     Quantile Values
td.rank          Rank
ANALYSIS OF A TABLE WITH > 1B ROWS
> library(RJDBC)
> library(teradataR)
> # Connect to the warehouse; tdlogin and tdpwd hold the credentials
> tdConnect("TD_WH", uid = tdlogin, pwd = tdpwd, database = "myVDM")
> # Create a reference to the remote table (the rows stay in the database)
> system.time(myTbldf <- td.data.frame("myTbl"))
   user  system elapsed
  0.092   0.054 140.071
> dim(myTbldf)
[1] 1,131,670,269 9
> # Correlation matrix of columns 3 to 9, computed on the Teradata side
> system.time(cor <- td.cor(myTbldf[3:9]))
   user  system elapsed
  0.021   0.003   6.722
          C         D          E          F          G         H          I
C 1.0000000 0.7096425 0.22154483 0.24186862 0.13354501 0.4954111 0.19577803
D 0.7096425 1.0000000 0.24272691 0.27590234 0.13358632 0.4279517 0.14634683
E 0.2215448 0.2427269 1.00000000 0.08940507 0.03734827 0.1631614 0.04401034
F 0.2418686 0.2759023 0.08940507 1.00000000 0.07664496 0.1686094 0.04744032
G 0.1335450 0.1335863 0.03734827 0.07664496 1.00000000 0.1247046 0.05837435
H 0.4954111 0.4279517 0.16316144 0.16860940 0.12470460 1.0000000 0.35395733
I 0.1957780 0.1463468 0.04401034 0.04744032 0.05837435 0.3539573 1.00000000
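One reason a correlation over a billion rows can be delegated to the cluster is that Pearson's r needs only a handful of running sums, so no node has to hold the rows in RAM. The sketch below illustrates that idea in plain Python. It is not a description of how td.cor is implemented, which the slides do not state, and the function name is hypothetical.

```python
import math
from typing import Iterable, Tuple


def streaming_pearson(pairs: Iterable[Tuple[float, float]]) -> float:
    """Pearson correlation from one pass over (x, y) pairs."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        # Five running sums are the only state kept between rows
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if n < 2 or var_x <= 0 or var_y <= 0:
        raise ValueError("correlation is undefined for constant or empty input")
    return (n * sxy - sx * sy) / math.sqrt(var_x * var_y)
```

Because the sums are additive, partial sums from different nodes can be combined before the final division. The raw sums-of-squares form can lose precision on data with a large mean, so a production version would likely use a more stable update.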
CONCLUSION
• Big data is here. See through the hype
• Analyze big data to extract value
• Multiple technologies & analytics tools are out there
• Match platform, tools and approach
• Delegate massive processing to big clusters
QUESTIONS?
BIG DATA EMPOWERS ALGORITHMS
Banko & Brill, “Scaling to Very Very Large Corpora for Natural Language Disambiguation”

Editor's Notes

  1. Big data is here, and corporations leverage MPP platforms such as Hadoop and Teradata for cost-effective storage and processing of vast amounts of data. However, mining the business benefits of big data requires new approaches to deep analytics, including predictive modeling and statistical analysis. Modeling big data requires a comprehensive process that handles noisy data of different structures and runs in parallel on a large number of processors, and the analytics tasks still need to be performed cost-effectively. We describe our experience in running statistical analysis and modeling of big data. We review and compare the platforms we use to store and process the data, then describe integrating processing with R, Python and SQL on Hadoop and Teradata for a range of analytics tasks.
  2. The large volumes of data need to be stored and processed on data platforms, which are clusters of computers with vast storage and processing power. The data are a combination of structured, semi-structured and unstructured data, which need special processing to cleanse and reshape them for modeling, and a large set of algorithms to extract their value. Big data contain enough information to analyze market segments that would otherwise be too small. The sheer number of combinations of those segments can yield a wealth of patterns that can be mined for the corporation. The more people get to view and explore the data, the more patterns will be identified, increasing the value to the corporation. Thus, making big data analysis feasible for large groups of people, beyond a few developers, will lead to more interaction with the data and hence to more benefits.
  3. Big data offer corporations many opportunities to extract signals that guide profitable decisions. A large portion of the new big data comes from the wild in unstructured and semi-structured formats. These data need to be cleansed and structured to enable the computation of statistical metrics and the construction of predictive models. The volume and format of the data and the wealth of analysis tasks require different tools and environments to store and process the data. The patterns and signals in the data are more likely to be extracted when a large number of analysts are given access and can construct their own models. Thus, make the tools available and accessible to the many.
  4. The most common architecture for big data is MPP, and RDBMS and Hadoop are the most common MPP platforms. They are similar in employing a large number of processors and disks and in distributing the processing to where the data are. RDBMS and Hadoop offer different programming environments and performance characteristics. Companies are increasingly deploying both platforms to accommodate a wide spectrum of business analytics needs. When supporting multiple concurrent user jobs, they have to deliver not only data and computation but also quality of service that matches users’ expectations. How to allocate workloads to platforms to maximize value is an area of active research. A large number of programming languages and tools have been developed for these platforms. Java, Pig, Hive, and Scala are powerful tools that many organizations have adopted. We have found Python and R to be particularly attractive for the analytics tasks that we perform. They are well-established languages that many analysts have been using for years on smaller datasets. When combined in the Streaming framework, R and Python can be used to create models quickly and in code that is clear and concise. Their packages provide many models and processing tasks out of the box. Teradata offers a strong SQL implementation with many UDF extensions designed for processing semi-structured data in textual format. An R package was recently published that enables using the processing power of the cluster to run many statistical functions on massive datasets.
  5. This table compares the platforms based on the types of processing tasks. For example, scanning large tables of text is most suitable for Hadoop, whereas jobs that modify tables or search based on a primary index are performed more efficiently in Teradata (TD). Special functions can be written more easily for Hadoop, whereas a join of two large tables is more easily done on TD. When data are replicated across multiple platforms, such tables are used to decide on the best platform to run particular jobs.
  6. We now turn to the topic of creating and running the actual analytics tasks. R is a powerful language that was designed for data analysis and statistical modeling. It has functions and packages for processing data at every step of the data analysis cycle: from sourcing the data from an RDBMS, flat files, or the web, through data preparation, exploratory data analysis, model creation for virtually any statistical test or algorithm, DOE, model validation, and variable selection, all the way to the creation of charts and graphs for presenting the results. R is gaining in popularity and has been placed among the top 20 programming languages. However, in our experience Python is more effective for text processing, which calls for using both languages in Hadoop tasks.
  7. On Hadoop, the Streaming framework enables us to run the mapper and the reducer in different languages. In this environment, the mapper is written in Python and the reducer in R. The cleansed and filtered map data are sent to the framework with proper keys, which deliver the data to the reducer in logical chunks, each of which is treated as a statistical dataset in the form of a data.frame. The model is built on these data frames in the same way it has traditionally been done in R, except that in Hadoop all reducers perform these tasks in parallel.
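
The reducer side of this arrangement can be sketched as follows. The talk's reducer is written in R; this Python stand-in only illustrates the "one model per key" logic. The tab-separated input layout (key, x, y), the straight-line fit used as the per-segment model, and the function names are all assumptions made for the example.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept for one segment."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0  # flat line when x does not vary
    return slope, mean_y - slope * mean_x


def reduce_segments(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Collect the records of each key, fit a model, emit one line per key."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    # The shuffle sort guarantees that records with equal keys are adjacent
    for key, group in groupby(parsed, key=lambda fields: fields[0]):
        points = [(float(x), float(y)) for _, x, y in group]
        slope, intercept = fit_line(points)
        yield f"{key}\t{slope:.4f}\t{intercept:.4f}"
```

Each reducer sees only the segments routed to it, so the models for different segments are fitted in parallel across the cluster.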