The Rise of Big Data Science
GILAD BARKAN
Big Data Science
[Figure: Big Data + Data Science = Big Data Science]
Big Data
• Why?
• What?
• How?
Why Big Data?
• We live in an era flooded with information
• In a world where data is power, big data is big power
• Web 2.0
Why should we care about Big Data?
• The big business opportunities
  - A competitive, fast-moving marketplace
  - Capitalize on business opportunities before everyone else
  - Existing channels to every person on the planet
• Maximizing revenue from customers
  - Segment-of-1: more personal customer experiences
What is Big Data?
• The 3 V’s: Volume, Variety, Velocity
Big Data - Volume
Big Users: more users, all the time
• 2+ billion: global online population
• 35 billion hours: time spent online
• 1 billion: smartphone users
More users + more data = Big Data
Big Data - Variety
• Trillions of gigabytes (zettabytes)
• Heterogeneous sources of data
  - Structured: tables
  - Unstructured / semi-structured: text, log files, click streams, blogs, tweets, audio, images, video, etc.
• Typical sizes
  - 700 MB / movie
  - 5000 KB / song
  - 1000 KB / image
  - 5 KB / record: traditional, structured (SQL)
  - 50 KB / record: unstructured (NoSQL)
Big Data - Velocity
• How the hell does Google return an answer in 0.28 seconds while looking at 4 billion pages?
• Online advertisement: Real Time Bidding (RTB)
• Recommendations
How is Big Data Handled?
• The challenge is huge
  - Store, analyze and serve a huge volume and variety of data at high velocity
• We can’t achieve this using a single machine, no matter how powerful it is. Why?
  - Expensive (stay tuned)
  - Load balancing requests
    - Outbrain serves 3,000 impressions per second
    - DG (MediaMind) serves 500K impressions per second!!!
  - Not fault tolerant
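The per-second rates quoted above are rounded averages. According to the speaker notes they come from 8 billion impressions a month (Outbrain) and 50 billion a day (DG). A rough sketch of that arithmetic, assuming a 30-day month; the helper name `per_second` is illustrative only:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(total_events, days):
    """Average events per second for `total_events` spread evenly over `days` days."""
    return total_events / (days * SECONDS_PER_DAY)

# Volumes as given in the speaker notes; the 30-day month is an assumption.
outbrain_rate = per_second(8e9, days=30)
dg_rate = per_second(50e9, days=1)

print(f"Outbrain: ~{outbrain_rate:,.0f} impressions per second")
print(f"DG (MediaMind): ~{dg_rate:,.0f} impressions per second")
```

Both results land in the same ballpark as the rounded slide figures. These are averages, so peak load on a real system would be higher.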
The Big Data Paradigm Shifts
Volume: Distributing the Data
• Scale Up (Vertical): SQL Server
• Scale Out (Horizontal): Hadoop cluster of nodes, HDFS (GFS)
Big Data - Reducing Costs
• Hadoop infrastructure is about 5 times cheaper!!!
• TCO (purchase + maintenance) over 3 years per 300 TB:
  - DBMS server = $5M
  - 75-node cluster = $1M
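To make the comparison concrete, the two TCO figures can be normalized per terabyte. This is only a sketch of the arithmetic using the numbers on the slide; the variable names are illustrative:

```python
CAPACITY_TB = 300  # both TCO figures are quoted per 300 TB over 3 years

tco_usd = {
    "DBMS server": 5_000_000,
    "75-node Hadoop cluster": 1_000_000,
}

# Cost of each option per terabyte over the 3-year period.
cost_per_tb = {name: total / CAPACITY_TB for name, total in tco_usd.items()}
ratio = tco_usd["DBMS server"] / tco_usd["75-node Hadoop cluster"]

for name, cost in cost_per_tb.items():
    print(f"{name}: ${cost:,.0f} per TB over 3 years")
print(f"Cost ratio: {ratio:.0f}x")
```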
Big Data Paradigm Shift - Computing
MapReduce Computing Paradigm
• Exploiting the distributed architecture for large-scale computations in parallel
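MapReduce
• “Hello MapReduce”: counting words
[Figure: word count on a Hadoop cluster. In the Map step, each mapper emits a table of words (W) and counts (C) for "the", "Cow" and "quick"; the first mapper, for example, emits the 7, Cow 1, quick 0. The diagram also shows a master node. In the Reduce step, the reducer adds up the per-mapper counts into the totals: the 21, Cow 2, quick 5.]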
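The word-count example can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop code. The function names and sample documents are hypothetical, and in a real cluster the map calls would run in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Hypothetical input; each string stands in for a document handled by one mapper.
documents = ["the quick cow", "the cow jumps", "the quick quick fox"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each mapper only sees its own document and the reducer only sees grouped counts, the same functions could be spread across machines without changing their logic.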
Big Data Paradigm Shift - NoSQL
Variety
• Schema-less databases to support the variety of data
• Complex SQL queries (joins, etc.) are extremely inefficient in a distributed data framework
  → Key-value store
NoSQL key-value model:
• Key: user_id, url, image_id, video_id, or any other key (not a single primary key as in SQL)
• Value: tables, text, images, video, anything
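A toy illustration of the key-value idea, assuming an in-memory dictionary stands in for the database. The class `KeyValueStore`, the key naming scheme and the sample values are all hypothetical; real NoSQL stores add persistence, partitioning and replication.

```python
class KeyValueStore:
    """In-memory toy store: any hashable key maps to a value of any shape."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()

# No schema: each value can have a different type and structure.
store.put("user_id:42", {"name": "Dana", "plan": "prepaid"})  # table-like record
store.put("url:http://example.com/post", "free text of a blog post")  # text
store.put("image_id:7", b"\x89PNG")  # raw bytes standing in for an image

print(store.get("user_id:42"))
print(store.get("video_id:1", default="not found"))
```

Lookups go straight to a single key, so no joins are needed, which is part of why this model distributes well.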
Big Data Paradigm Shift - Velocity
• RAM-based DBs instead of traditional disk-based DBs
  - Store critical data in memory (much more expensive)
• If the data won't come to the algorithm (Alg), the algorithm will come to the data
[Figure: "traditional" vs. "today", each showing Alg and Data connected by Read and Write arrows]
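The "algorithm comes to the data" idea can be illustrated by counting how many values have to cross the network in each approach. This is a simplified sketch: the node names and partition contents are made up, and the "network" is just a counter.

```python
# Hypothetical cluster: each "node" holds one partition of the data.
partitions = {
    "node-1": [3, 5, 8],
    "node-2": [13, 21],
    "node-3": [34, 55, 89, 144],
}

def traditional_sum(partitions):
    """Pull every record to one place, then compute there."""
    moved = [x for rows in partitions.values() for x in rows]
    return sum(moved), len(moved)  # (result, values moved over the network)

def data_local_sum(partitions):
    """Run the computation on each node; only one partial result per node moves."""
    partials = [sum(rows) for rows in partitions.values()]
    return sum(partials), len(partials)

total_a, moved_a = traditional_sum(partitions)
total_b, moved_b = data_local_sum(partitions)
print(f"traditional: total={total_a}, values moved={moved_a}")
print(f"data-local:  total={total_b}, values moved={moved_b}")
```

Both approaches give the same answer. The data-local version moves one number per node regardless of how large each partition grows.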
Big Data Technological Paradigm Shifts
[Recap figure: Volume (scale up vs. scale out; map and reduce with mappers, master and reducer), Variety (NoSQL key-value), Velocity (Alg and Data)]
Flood of New Big Data Technologies
• Open source
Big Buzz?
Big Data - Summary
• BIG business opportunities
• The 3 V’s: Volume, Variety, Velocity
• Computing and DB paradigm shifts
• Flood of new (open source) technologies
• It’s definitely not just a buzz
  - It’s a real response to the world's hectic pace of evolution
  - It reduces costs by an order of magnitude
• Still, that doesn’t mean every business today will, or should, transform its technology stack to support big data
Data Science
• Why?
• What?
• How?
Why Data Science?
[Figure: data scientists]
Data is real value
• Facebook acquired Onavo for ~$150M
Welcome to the Intelligent World
• Data Analysis
• Data Mining
• Data Analytics
• Data Science
• Automatic Decisioning
• Machine Learning
• Predictive Analytics
Data Miners are the New Gold Miners
• Search
• Online advertisement: Real Time Bidding (RTB)
• Recommendations
• Text analysis
• CRM: customer churn prediction
• Time series analysis
Machine Learning
• Classification
• Clustering
• Regression
• Recommendation
Classification
• Amdocs Insight™: why is the customer calling the call center?
  - Pay Bill
  - Third Party Charges
  - Bill too high
  - Overage
  - Abnormal fee
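A minimal sketch of what classifying call reasons could look like: train offline on labeled examples, then predict online for a new one. It uses a nearest-centroid rule as a simple stand-in. This is not the Amdocs method, and the two features and all training values are invented for illustration.

```python
from collections import defaultdict
from math import dist

# Hypothetical labeled calls: (bill change in $, third-party charges in $) -> reason.
training = [
    ((2.0, 0.0), "Pay Bill"),
    ((1.0, 1.0), "Pay Bill"),
    ((40.0, 2.0), "Bill too high"),
    ((55.0, 0.0), "Bill too high"),
    ((5.0, 30.0), "Third Party Charges"),
    ((3.0, 25.0), "Third Party Charges"),
]

def train(examples):
    """Offline step: average the feature vectors of each class into a centroid."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Online step: pick the class whose centroid is closest to the new example."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = train(training)
prediction = predict(centroids, (48.0, 1.0))
print(prediction)
```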
Clustering
• Market segmentation
• Social network analysis
Regression
• Housing price prediction
[Figure: price ($, in 1000’s) plotted against size in m²]
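Housing price prediction is commonly introduced with a straight-line fit of price against size. The sketch below uses ordinary least squares on made-up listings; the data points are hypothetical and are not read from the slide's chart.

```python
# Hypothetical listings: (size in m^2, price in $1000s). Not taken from the slide.
listings = [(50, 110), (100, 190), (150, 260), (200, 350), (250, 420)]

def fit_line(points):
    """Ordinary least squares for price = slope * size + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(listings)
predicted = slope * 130 + intercept  # predicted price for a 130 m^2 home
print(f"price = {slope:.2f} * size + {intercept:.2f}")
print(f"predicted price for 130 m^2: {predicted:.1f} (in $1000s)")
```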
The Data Scientist
Data Scientist Skillset
• Hands-on experience with tools, languages and technologies
• MSc / PhD in Math, CS, Stats or Physics
• Hands-on experience in the specific problem domain
Data Science ≠ BI
• Apply advanced statistical machine learning algorithms to:
  - dig deeper and find patterns that traditional BI tools may not reveal
  - cover a much wider spectrum of domains and applications
• Predictive Analytics ≠ Exploratory Analytics
Predictive Analytics (Data Science, Big Data Science) vs. Exploratory Analytics (Business Intelligence, traditional BI)
Academia's Response to Data Science
The Art of Data Science
• It would take at least a one-semester course to cover
• Still…
Data Science Life Cycle
[Figure: life-cycle diagram split into Offline Data Analysis and Run Time, with the stages Business Goal, Understand Data, Prepare Data, Model, Evaluate, Deploy and Monitor]
Closing the Loop
• Technically speaking, what do you think?
• Is Big Data good or bad for Data Science?
[Figure: Big Data + Data Science = Big Data Science]
The Bad - Finding a Needle in a Haystack
• It’s the same treasure hiding in there; the problem is that the pile is now huge
• Big Data → Big Noise
The Good - The Statistical View
• Statistics is the fuel of predictive analytics!
• The more data you have (Big Data), the better your predictive models will perform
Law of Large Numbers
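The law of large numbers says that the average of many independent samples tends toward the expected value as the sample grows. A small simulation can illustrate this, assuming a fair coin whose expected share of heads is 0.5. The function name and the fixed seed are illustrative choices.

```python
import random

def share_of_heads(n_flips, seed=0):
    """Flip a simulated fair coin n_flips times and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same flips
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

for n in (10, 1_000, 100_000):
    print(f"{n:>7} flips: share of heads = {share_of_heads(n):.4f}")
```

With more flips the share of heads typically settles closer to 0.5, which is the statistical argument for why more data can help predictive models.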
Combining the Good & Bad
• Data is a function of quality and quantity
[Figure: chart with Quality (low to high) on one axis and Quantity (small to big) on the other]
Big Data Science - Summary
• Big Data
  - Big Data → Big Numbers → Big Opportunities
  - Big Data is the buzziest technology nowadays
• Data Scientists
  - are the ones who coax the treasures out of the big data for their companies
  - are skilled in multiple disciplines
  - are the new industry rock stars
Thank you for your attention


Editor's Notes

  1. This is an introductory lecture on today's buzziest technology domain. The domain encompasses many new concepts, keywords and theories, which keep the full academic rainbow, from computer science to business departments, busy digesting these fast-paced developments. Academic institutions should, and do, offer new tracks to support these developments.
  2. This trivial equation tells the whole story. The subject of this lecture comprises two parts, Big Data and Data Science, and the lecture is divided accordingly. Of course, we’ll see how they are connected and related to each other.
  3. The Big Data tour will be divided into 3 parts (as everything is in… big data, as you’ll see shortly).
  4. The Big Data tour will be divided into 3 parts (as everything is in… big data, as you’ll see shortly).
  5. We’ll start with the why, and then the what will be better understood. Big Data is a business / technological aspect of a wider social phenomenon we currently live in. Past social revolutions all started with a technological revolution, e.g. the French Revolution was a side effect of the industrial revolution. The same applies here: the Internet created a social revolution in which everyone is connected to everyone.
  6. Big Data as a phenomenon actually started with the rise of Web 2.0. Unlike the older Web 1.0, where only site owners created the online data, in Web 2.0 the users create the content.
  7. The Big Data tour will be divided into 3 parts (as everything is in… big data, as you’ll see shortly).
  8. Big Data -> big numbers. Taken from http://visual.ly/what-big-data
  9. Big Users is an equally big trend driving developers to use NoSQL databases. Most new applications are made available over the internet so people can easily access them. This has caused the number of simultaneous users of many applications to explode. More than 2B people are connected to the internet, and the number is growing rapidly. The number of hours that the average user spends on the internet is growing too, further increasing the number of simultaneous users. And, with the proliferation of smartphones, people use their applications more and more frequently, further increasing the number of simultaneous users. All these simultaneous users lead to a rapidly growing number of database operations and the need for a far easier way to scale your database to meet these demands. Taken from the Couchbase deck @ IGTCloud summit 2013. http://www.go-gulf.com/blog/online-time http://business.time.com/2012/02/14/one-billion-smartphones-by-2016-here-comes-the-mobile-arms-race/
  10. To summarize, the technology implications of the Big Data, Big Users and Cloud Computing megatrends are causing people to seriously rethink which database they use for their applications, and they are increasingly concluding that NoSQL databases are a better fit than relational databases.
  11. Finally, the move to cloud computing and SaaS business models is also driving developers to consider NoSQL databases. 15 years ago most applications were developed with a client/server architecture and a packaged software business model that supported the needs of users on a company-by-company basis. Today, applications are increasingly developed using a 3-tier internet architecture, are cloud-based, and use a Software-as-a-Service business model that needs to support the collective needs of thousands of customers. This approach increasingly requires a horizontally scalable architecture that easily scales with the number of users and the amount of data your application has.
  12. The Big Data tour will be divided into 3 parts (as everything is in… big data, as you’ll see shortly).
  13. Outbrain serves 8 billion impressions a month = 3,000 impressions / sec; DG (MediaMind) serves 50 billion a day = 500K / sec. http://readwrite.com/2013/05/29/the-real-reason-hadoop-is-such-a-big-deal-in-big-data http://www.computerworlduk.com/in-depth/applications/1779/oracles-database-machine-how-much-will-it-really-cost/
  14. http://readwrite.com/2013/05/29/the-real-reason-hadoop-is-such-a-big-deal-in-big-data
  15. MapReduce provides: user-defined functions; automatic parallelization and distribution; fault tolerance; I/O scheduling; status and monitoring.
  16. MapReduce provides: user-defined functions; automatic parallelization and distribution; fault tolerance; I/O scheduling; status and monitoring.
  17. Taken from http://db-engines.com/en/ranking
  18. This trivial equation tells the whole story. The subject of this lecture comprises two parts, Big Data and Data Science, and the lecture is divided accordingly. Of course, we’ll see how they are connected and related to each other.
  19. OK, we have the big data. Now, what do we do with it? Big data is important if you want to be successful in analytic processing. But why is that important? The answer is that success in a highly competitive, fast-moving marketplace is determined by who can capitalize on business opportunities before everyone else seizes them. In this section we’ll meet the data scientists / data miners who coax treasures out of the huge volume of data.
  20. Although Onavo started as a service that optimizes device and app performance, along the way it collected logs from these apps and devices and became one of the leading mobile analytics aggregators in the world.
  21. Notation first. It has many names that mean more or less the same thing: the art of inferring insights from data.
  22. In this section we’ll meet the data scientists / data miners who coax treasures out of the huge volume of data. The domains applying data science / data mining vary:
  23. Learning comprises three steps. First, we build our probabilistic model of the real world. Then, we train the model with labeled (supervised) examples, e.g. this is a car, this is not a car; this takes place offline. Last, online, we feed the model a totally new example and expect it to make the correct prediction.
  24. Drew Conway, http://www.dataists.com/2010/09/the-data-science-venn-diagram