The Rise of Big Data Science

GILAD BARKAN

Big Data Science
[Figure: Big Data and Data Science combining into Big Data Science]
Big Data
 Why ?

 What ?
 How ?

Why Big Data ?
 We live in an era flooded with information
 In a world where data is power, big data is big power

Why should we care about Big Data ?
 The big business opportunities
 A competitive, fast-moving marketplace
 Capitalize on business opportunities before everyone else
 Existing channels to every person on the planet
 Maximizing revenues from customers
 Segment-of-1: more personal customer experiences

What is Big Data ?
 The 3 V’s

Volume

Variety

Velocity

Big Data - Volume

Big Users: More Users, All the Time
 2 billion: global online population
 35 billion hours: hours spent online
 1 billion: smartphone users
 More Users + More Data → Big Data

Big Data - Variety

Trillions of Gigabytes (Zettabytes)
 Heterogeneous sources of data
 Structured data: tables
 Unstructured / semi-structured data: text, log files, click streams, blogs, tweets, audio, images, video, etc.

Typical sizes:
 700 MB / movie
 5000 KB / song
 1000 KB / image
 5 KB / record (traditional structured SQL)
 50 KB / record (unstructured NoSQL)

Big Data - Velocity
 How the hell does Google return an answer in 0.28 seconds by looking at 4 billion pages?

Big Data - Velocity
 Online Advertisement - Real Time Bidding (RTB)

Big Data - Velocity
 Recommendations

How is Big Data Handled ?
 The challenge is huge
 Store, analyze, and serve a huge volume of varied data at high velocity
 We can’t achieve this using a single machine, no matter how powerful it is. Why?
   Expensive (stay tuned)
   Load balancing requests
     Outbrain serves 3,000 requests per second
     DG (MediaMind) serves 500K requests per second!!!
   Not fault tolerant

The Big Data Paradigm Shifts - Volume
Distributing the Data
 Scale Up (Vertical): a single, stronger server, e.g. SQL Server
 Scale Out (Horizontal): a Hadoop cluster of nodes storing data in HDFS (modeled on GFS)
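Distributing the data across nodes can be sketched with simple hash partitioning. This is only an illustration of the scale-out idea, not how HDFS actually places blocks; the node names and the `node_for` and `partition` helpers are hypothetical.

```python
import hashlib

# Hypothetical node names, for illustration only.
NODES = ["node-0", "node-1", "node-2", "node-3"]


def node_for(key: str, nodes=NODES) -> str:
    """Pick a node for a key by hashing the key (stable across runs)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def partition(keys, nodes=NODES):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


if __name__ == "__main__":
    for node, keys in partition(f"record-{i}" for i in range(12)).items():
        print(node, keys)
```

Adding capacity then means adding nodes rather than buying a bigger machine, although this naive modulo scheme would reshuffle most keys when the node list changes.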

Big Data - Reducing Costs
 Hadoop infrastructure is 5 times cheaper!!!
 TCO (purchase + maintenance) for 3 years per 300 TB:
   DBMS server = $5M
   75-node cluster = $1M
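The arithmetic behind the comparison can be checked in a few lines. The figures are the ones quoted above; the variable and function names are just for illustration.

```python
# Figures quoted on the slide: 3-year TCO for 300 TB.
CAPACITY_TB = 300
DBMS_TCO_USD = 5_000_000
HADOOP_TCO_USD = 1_000_000


def cost_per_tb(tco_usd: float, capacity_tb: float) -> float:
    """Total cost of ownership divided by stored capacity."""
    return tco_usd / capacity_tb


dbms = cost_per_tb(DBMS_TCO_USD, CAPACITY_TB)
hadoop = cost_per_tb(HADOOP_TCO_USD, CAPACITY_TB)
ratio = DBMS_TCO_USD / HADOOP_TCO_USD

print(f"DBMS server:     ${dbms:,.0f} per TB")
print(f"75-node cluster: ${hadoop:,.0f} per TB")
print(f"Cost ratio:      {ratio:.0f}x")
```

That works out to roughly $16,700 per TB for the DBMS server against roughly $3,300 per TB for the cluster.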

Big Data Paradigm Shift - Computing
MapReduce Computing Paradigm
 Exploiting the distributed architecture to run large-scale computations in parallel

MapReduce
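 “Hello MapReduce”: counting words
[Figure: a Hadoop cluster in which a master coordinates several mappers and a reducer. Each mapper emits a table of word (W) and count (C) for its own input, e.g. the: 7, Cow: 1, quick: 0. The reducer sums the per-mapper counts into the totals below.]

W      C
the    21
Cow    2
quick  5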
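The word count above can be sketched in a single Python process. This is a minimal illustration of the map, shuffle, and reduce steps, not Hadoop's API; the function names and the sample documents are made up, and a real cluster would run the mappers on different nodes.

```python
from collections import Counter, defaultdict


def map_phase(document: str) -> Counter:
    """Mapper: count the words in one document."""
    return Counter(document.lower().split())


def shuffle(mapper_outputs):
    """Group the per-mapper counts by word."""
    grouped = defaultdict(list)
    for counts in mapper_outputs:
        for word, count in counts.items():
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    return reduce_phase(shuffle(map_phase(doc) for doc in documents))


if __name__ == "__main__":
    docs = ["the quick cow", "the cow", "the quick quick"]  # made-up input
    print(word_count(docs))
```

Because each mapper only needs its own document, the map step parallelizes trivially; only the much smaller per-word counts have to travel to the reducer.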

Big Data Paradigm Shift - NoSQL (Variety)
 Schema-less databases to support the variety of data
 Complex SQL queries (joins, etc.) in a distributed data framework are extremely inefficient
 Key-Value Store (NoSQL): the key can be anything, not a single primary key as in SQL

Key        Value
user_id    tables
url        text
image_id   images
video_id   video
any        any
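A key-value store can be sketched as a thin wrapper around a dictionary. This is an in-memory toy, not any particular NoSQL product; the `KeyValueStore` class and the sample keys and values are hypothetical.

```python
from typing import Any, Dict, Hashable


class KeyValueStore:
    """In-memory key-value store: any hashable key, any value, no schema."""

    def __init__(self) -> None:
        self._data: Dict[Hashable, Any] = {}

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value

    def get(self, key: Hashable, default: Any = None) -> Any:
        return self._data.get(key, default)

    def delete(self, key: Hashable) -> None:
        self._data.pop(key, None)


# Values of different shapes live side by side; nothing enforces a schema.
store = KeyValueStore()
store.put("user_id:42", {"name": "Alice", "plan": "prepaid"})
store.put("url:http://example.com", "page text ...")
store.put("image_id:7", b"\x89PNG")

print(store.get("user_id:42"))
```

Lookups go straight from key to value, which is why this model distributes well: there are no joins to coordinate across machines.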

Big Data Paradigm Shift - Velocity
 RAM-based DBs instead of traditional disk-based DBs
 Store critical data in memory (much more expensive)
 If the data won't come to the algorithm, the algorithm will come to the data
[Figure: traditionally, the algorithm (Alg) reads data from and writes data to a separate store; today, the algorithm runs where the data lives]
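The difference between the two approaches can be sketched by counting how many values cross the network. This is a toy model under stated assumptions: the `PARTITIONS` layout and both functions are hypothetical, and each dictionary entry stands in for a separate node.

```python
# Hypothetical cluster: each node holds one partition of numeric records.
PARTITIONS = {
    "node-0": [3, 8, 1],
    "node-1": [7, 2],
    "node-2": [9, 4, 6, 5],
}


def traditional_sum(partitions):
    """Move the data to the algorithm: copy every record, then compute."""
    moved = [x for records in partitions.values() for x in records]
    return sum(moved), len(moved)  # (result, values transferred)


def data_local_sum(partitions):
    """Move the algorithm to the data: each node returns one partial sum."""
    partials = [sum(records) for records in partitions.values()]
    return sum(partials), len(partials)  # (result, values transferred)


if __name__ == "__main__":
    print("traditional:", traditional_sum(PARTITIONS))
    print("data-local: ", data_local_sum(PARTITIONS))
```

Both functions return the same total, but the data-local version transfers one value per node instead of one value per record.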

Big Data - Summary
 BIG business opportunities

 The 3 V’s: Volume, Variety, Velocity
 Technological paradigm shifts

Big Data Technological Paradigm Shifts
 Volume: Scale Up vs. Scale Out; MapReduce (Master, Mappers, Reducer)
 Variety: NoSQL (Key-Value)
 Velocity: the algorithm (Alg) moves to the data

Big Data - Summary

 Computing and DB paradigm shifts
 Flood of new (open source) technologies

Flood of New Big Data Technologies
 Open Source

Big Data - Summary
 It’s definitely not just a buzz
 It’s a real response to the hectic pace at which the world is evolving
 Reducing costs by an order of magnitude
 Still, it doesn’t mean every business today will or should transform its technology stack to support big data

Data Science
 Why ?
 What ?
 How ?

Why Data Science ?

[Figure: data scientists]

Data has real value
 Facebook acquires Onavo for ~$150M

Welcome to the Intelligent world

 Data Analysis
 Data Mining
 Data Analytics
 Data Science
 Automatic Decisioning
 Machine Learning
 Predictive Analytics

Data Miners are the New Gold Miners

Online Advertisement - Real Time Bidding (RTB)

Recommendations

CRM – Customers Churn Prediction

Machine Learning
 Classification

 Clustering
 Regression
 Recommendation

Classification

Amdocs Insight™: why is the customer calling the Call Center ?
 Pay Bill
 Third Party Charges
 Bill too high
 Overage
 Abnormal fee

Clustering

Market Segmentation
Social Network Analysis
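Market segmentation by clustering can be sketched with a plain k-means loop. This is a minimal version under stated assumptions: the customer data (monthly spend, visits) is invented, the `kmeans` helper is hypothetical, and initializing from the first k points is a simplification that real libraries avoid.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; starts from the first k points."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters


if __name__ == "__main__":
    # Made-up customers: (monthly spend, visits per month).
    customers = [(10, 1), (200, 9), (12, 2), (11, 1), (210, 10), (190, 8)]
    centroids, segments = kmeans(customers, k=2)
    for centroid, segment in zip(centroids, segments):
        print(centroid, segment)
```

On this toy data the loop separates the low-spend customers from the high-spend ones without being told which is which, which is the sense in which clustering is unsupervised.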

Regression
 Housing price prediction
[Figure: plot of price ($, in 1000’s) against size in m²]
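Housing price prediction can be sketched as a least-squares line fit. The data points below are invented for illustration and are not taken from the slide's plot; `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(size_m2, slope, intercept):
    """Predicted price (in $1000's) for a house of the given size."""
    return slope * size_m2 + intercept


if __name__ == "__main__":
    # Made-up training data: size in m2, price in $1000's.
    sizes = [50, 100, 150, 200, 250]
    prices = [110, 190, 300, 410, 490]
    slope, intercept = fit_line(sizes, prices)
    print(f"price = {slope:.2f} * size + {intercept:.2f}")
    print(f"predicted price for 130 m2: {predict(130, slope, intercept):.1f}")
```

Unlike classification, the output here is a continuous number, so the model can price a size it has never seen.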

Data Scientist Skillset

 Hands-on with tools, languages, technologies
 MSc / PhD in Math, CS, Stats, Physics
 Hands-on with the specific problem domain

Data Science ≠ BI
 Apply advanced statistical machine learning algorithms to:
   dig deeper and find patterns that traditional BI tools may not reveal
   cover a much wider spectrum of domains and applications
 Predictive Analytics ≠ Exploratory Analytics
   Predictive Analytics: Data Science, Big Data Science
   Exploratory Analytics: Business Intelligence, Traditional BI

Academia Response to Data Science

The Art of Data Science
 We need at least one semester course for it
 Still…

Data Science Life Cycle
 Offline Data Analysis: Business Goal, Understand Data, Prepare Data, Model, Evaluate
 Run Time: Deploy, Monitor

Closing the Loop
 Technically speaking, what do you think?
 Is Big Data good or bad for Data Science ?


The Bad - Finding a Needle in a Haystack
 It’s the same treasure that hides; the problem is that the pile is now huge
 Big Data → Big Noise

The Good - The Statistical View
 Statistics is the fuel of predictive analytics!
 The more data you have (Big Data), the better your predictive models will perform
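One standard way to make this concrete is the standard error of a sample mean, which shrinks as the sample grows. This illustration is an addition, not part of the slide; it assumes independent samples with a known standard deviation, and it says nothing about noisy or biased data.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(f"n = {n:>9,}: standard error = {standard_error(1.0, n):.4f}")
```

Each 100-fold increase in data cuts the uncertainty by a factor of 10, so the gains are real but come with diminishing returns.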

Combining the Good & Bad
 The value of data is a function of its quality and quantity
[Figure: chart with Quality (Low to High) on one axis and Quantity (Small to Big) on the other]

Big Data Science - Summary
 Big Data
   → Big Numbers → Big Opportunities
   Big Data is the buzziest technology nowadays
 Data Scientists
   The ones who coax the treasures out of the big data for their companies
   Are multi-discipline skilled
   The new industry rock stars
