This is an introductory lecture on today's buzziest technology domain.
The domain encapsulates many new concepts, keywords, theories and paradigm shifts, from computer science to business.
7. Why should we care about Big Data?
The big business opportunities
Competitive, fast-moving marketplace
Capitalize on business opportunities before everyone else
Existing channels to every person on the planet
Maximizing revenues from customers
Segment-of-1 - more personal customer experiences
12. Big Data - Volume
Big Users
More Users, All the Time
2+ Billion: Global Online Population
35 Billion Hours: Hours Spent Online
1 Billion: Smartphone Users
14. What is Big Data?
The 3 V’s
Volume
Variety
Velocity
15. Big Data - Variety
Trillions of Gigabytes (Zettabytes)
Heterogeneous sources of data
Structured data (traditional, SQL): tables, 5 KB / record
Unstructured and semi-structured data (NoSQL): text, log files, click streams, blogs, tweets, audio, images, video, etc., 50 KB / record
Typical media sizes: 700 MB / movie, 5000 KB / song, 1000 KB / image
17. Big Data - Velocity
How the hell does Google return an answer in 0.28 seconds by looking at 4 billion pages?
18. Big Data - Velocity
Online Advertisement - Real Time Bidding (RTB)
21. How is Big Data Handled?
The challenge is huge
Store, analyze and serve a huge volume and variety of data at high velocity
We can’t achieve this using a single machine, no matter how powerful it is. Why?
Expensive – stay tuned
Load balancing requests
Outbrain serves 3,000 impressions per second
DG (MediaMind) serves 500K impressions per second!
Not fault tolerant
22. The Big Data Paradigms Shifts
Volume
Distributing the Data
Scale Up (Vertical): SQL Server
Scale Out (Horizontal): Hadoop cluster of nodes, HDFS (GFS)
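To make "distributing the data" concrete, here is a minimal sketch of hash partitioning, one common way to spread records across the nodes of a scale-out cluster. The cluster size, the record keys and the helpers `node_for` and `distribute` are hypothetical and chosen only for illustration; this is not a description of how HDFS places its blocks.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a record key to a node index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def distribute(keys, num_nodes: int = NUM_NODES):
    """Group record keys by the node that would store them."""
    placement = defaultdict(list)
    for key in keys:
        placement[node_for(key, num_nodes)].append(key)
    return dict(placement)


keys = [f"user_{i}" for i in range(1000)]  # made-up record keys
placement = distribute(keys)
for node in sorted(placement):
    print(f"node {node}: {len(placement[node])} records")
```

Because the hash is stable, the same key always lands on the same node, so a lookup only needs to contact one machine. Adding capacity means adding nodes rather than buying a bigger server.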
23. Big Data – Reducing Costs
Hadoop infrastructure is 5 times cheaper!
TCO (purchase + maintenance) over 3 years for 300 TB:
DBMS server = 5 M$
75-node cluster = 1 M$
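The cost comparison above can be checked with a few lines of arithmetic. The sketch below uses only the figures from the slide (300 TB, 3 years, 5 M$ versus 1 M$); the per-TB-per-year breakdown and the helper name `cost_per_tb_year` are added here for illustration.

```python
CAPACITY_TB = 300  # from the slide
YEARS = 3          # from the slide
TCO_USD = {
    "DBMS server": 5_000_000,
    "75-node Hadoop cluster": 1_000_000,
}


def cost_per_tb_year(total_usd, capacity_tb=CAPACITY_TB, years=YEARS):
    """Spread a total cost of ownership over capacity and time."""
    return total_usd / (capacity_tb * years)


for name, total in TCO_USD.items():
    print(f"{name}: ${cost_per_tb_year(total):,.0f} per TB per year")

ratio = TCO_USD["DBMS server"] / TCO_USD["75-node Hadoop cluster"]
print(f"cost ratio: {ratio:.0f}x")
```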
24. Big Data Paradigm Shift - Computing
MapReduce Computing Paradigm
Exploiting the distributed architecture to run large-scale computations in parallel
25. MapReduce
“Hello MapReduce” – counting words
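Figure: counting words on a Hadoop cluster (Master, Mappers, Reducer)
Map: each Mapper emits a table of words (W) and counts (C) for its part of the input, e.g. the 7, Cow 1, quick 0
Reduce: the Reducer adds (+) the Mappers' tables into the final counts: the 21, Cow 2, quick 5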
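The word-count example can be sketched in plain Python. This is a single-process illustration of the map and reduce steps, not Hadoop code: the input chunks are made up, the function names `map_count` and `reduce_counts` are hypothetical, and on a real cluster the mappers would run in parallel on different nodes.

```python
from collections import Counter
from functools import reduce


def map_count(chunk: str) -> Counter:
    """Mapper: count the words in one chunk of the input."""
    return Counter(chunk.lower().split())


def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reducer: merge two partial count tables by summing the counts."""
    return a + b


# Made-up input, already split into one chunk per mapper.
chunks = [
    "the quick cow",
    "the cow and the quick quick fox",
    "the end",
]

partials = [map_count(chunk) for chunk in chunks]   # map phase
totals = reduce(reduce_counts, partials, Counter())  # reduce phase
print(totals.most_common(3))
```

Each mapper only ever sees its own chunk, and the reducer only sees the small count tables. That separation is what lets the framework spread the work over many machines.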
26. Big Data Paradigm Shift – NoSQL
Variety
Schema-less databases to support the variety of data
Complex SQL queries (joins, etc.) in a distributed data framework are extremely inefficient
Key-Value Store (NoSQL)
Key: user_id, url, image_id, video_id, or any other key (not a single primary key as in SQL tables)
Value: text, images, video, or any other value
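A key-value store can be illustrated with a thin wrapper around a dictionary. This is a minimal in-memory sketch under stated assumptions: the class `KeyValueStore`, its `put` / `get` methods, and the sample keys and values are all hypothetical and do not correspond to the API of any particular NoSQL product.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """In-memory, schema-less key-value store (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Store any kind of value under a key; no schema is enforced."""
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """Fetch a value by key, or the default if the key is unknown."""
        return self._data.get(key, default)


store = KeyValueStore()
store.put("user_id:42", {"name": "Dana", "tags": ["sports", "news"]})
store.put("url:http://example.com/a", "<html>page text</html>")
store.put("image_id:7", b"\x89PNG...")  # raw bytes; placeholder content
print(store.get("user_id:42")["name"])
```

Every access goes through a single key, so there are no joins to coordinate across machines, and values of different shapes can sit side by side.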
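27. Big Data Paradigm Shift – Velocity
RAM-based DBs instead of traditional disk-based DBs
Store critical data in memory (much more expensive)
If the data doesn't come to the algorithm, the algorithm will come to the data
Traditional: the data is read out to the algorithm (Alg) and the results are written back
Today: the algorithm (Alg) is sent to where the data lives and reads and writes it there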
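The difference between moving the data and moving the algorithm can be shown by counting how many values cross the network in each case. The sketch below is a toy model: the three "nodes" are plain Python lists with made-up numbers, and the functions `traditional_sum` and `local_sum` are hypothetical names.

```python
# Each "node" holds its own partition of the data (made-up numbers).
nodes = {
    "node_a": list(range(0, 1000)),
    "node_b": list(range(1000, 2000)),
    "node_c": list(range(2000, 3000)),
}


def traditional_sum(nodes):
    """Pull every record to one place, then compute."""
    moved = [x for part in nodes.values() for x in part]
    return sum(moved), len(moved)  # (result, records moved)


def local_sum(nodes):
    """Compute next to each partition; move only the partial results."""
    partials = [sum(part) for part in nodes.values()]
    return sum(partials), len(partials)  # (result, values moved)


total_t, moved_t = traditional_sum(nodes)
total_l, moved_l = local_sum(nodes)
print(total_t == total_l, moved_t, moved_l)
```

Both approaches return the same total, but the second moves three partial sums instead of three thousand records.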
29. Big Data - Summary
BIG business opportunities
The 3 V’s: Volume, Variety, Velocity
Technological paradigm shifts
30. Big Data Technological Paradigm Shifts
Volume: from Scale Up to Scale Out
Variety: NoSQL (Key, Value)
Velocity: Alg moves to the Data
Computing: MapReduce (Master, Mappers, Reducer)
31. Big Data - Summary
BIG business opportunities
The 3 V’s: Volume, Variety, Velocity
Computing and DB paradigm shifts
Flood of new (open source) technologies
32. Flood of New Big Data Technologies
Open Source
33. Big Data - Summary
BIG business opportunities
The 3 V’s: Volume, Variety, Velocity
Computing and DB paradigm shifts
Flood of new (open source) technologies
It’s definitely not just a buzz
35. Big Data - Summary
BIG business opportunities
The 3 V’s: Volume, Variety, Velocity
Computing and DB paradigm shifts
Flood of new (open source) technologies
It’s definitely not just a buzz
It’s a real response to the world’s hectic pace of evolution
Reducing costs by an order of magnitude
Still it doesn’t mean every business today will / should
transform its technology stack to support big data
55. Data Scientist Skillset
Hands-on experience with tools, languages, technologies
MSc / PhD in Math, CS, Stats, Physics
Hands-on experience with the specific problem domain
56. Data Science ≠ BI
Apply advanced statistical machine learning
algorithms to:
dig deeper to find patterns that traditional BI tools may not reveal
cover a much wider spectrum of domains / applications
Predictive Analytics ≠ Exploratory Analytics
60. The Art of Data Science
We would need at least a one-semester course for it
Still…
61. Data Science Life Cycle
Offline Data Analysis: Business Goal, Understand Data, Prepare Data, Model, Evaluate
Run Time: Deploy, Monitor
62. Closing the Loop
Technically speaking, what do you think?
Is Big Data good or bad for Data Science?
Figure: Big Data, Data Science, Big Data Science
63. The Bad - Finding a Needle in a Haystack
It’s the same treasure that hides – the problem is
that the pile is now huge
Big Data Big Noise
65. The Good - The Statistical View
Statistics is the fuel of predictive analytics!
The more data you have (Big Data), the better your predictive models will perform
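A small simulation illustrates the statistical view for the simplest possible "model", an estimate of a mean. The sketch assumes clean, normally distributed data with made-up parameters, and the helper name `mean_abs_error` is hypothetical. It shows only that estimation error shrinks as the sample grows. It says nothing about noisy or low-quality data.

```python
import random
import statistics


def mean_abs_error(sample_size, trials=200, true_mean=0.0, sigma=1.0, seed=0):
    """Average distance between the sample mean and the true mean."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(sample_size)]
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)


results = {n: mean_abs_error(n) for n in (10, 100, 1000)}
for n, err in results.items():
    print(f"n={n:>5}: average error {err:.4f}")
```

The error falls roughly with the square root of the sample size, so each tenfold increase in data buys about a threefold improvement in this estimate.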
72. Combining the Good & Bad
Data is a function of quality and quantity
Figure axes: Quality (Low to High) against Quantity (Small to Big)
73. Big Data Science - Summary
Big Data
Big Numbers, Big Opportunities
Big Data is the buzziest technology nowadays
Data Scientists
The ones who coax the treasures out of the big data for their companies
Are skilled in multiple disciplines
The new industry rock stars
This is an introductory lecture on today's buzziest technology domain. The domain encapsulates many new concepts, keywords and theories, which keep the full academic rainbow, from computer science to business departments, very busy digesting these fast-paced new concepts. Academic institutions should, and do, offer new tracks to support these developments.
This trivial equation tells the whole story. The subject of this lecture comprises two parts, Big Data and Data Science, and the lecture is divided accordingly. Of course, we’ll see how they are connected and related to each other.
The Big Data tour will be divided into 3 parts (as everything is in…big data, and you’ll see shortly)
We’ll start with the why, and then the what will be better understood. Big Data is the business / technological aspect of a wider social phenomenon we’re currently living in. Like all past social revolutions, it started with a technological revolution, e.g. the French Revolution was a side effect of the Industrial Revolution. The same is true here: the Internet created a social revolution in which everyone is connected to everyone.
Big Data as a phenomenon actually started with the rise of Web 2.0. In the older Web 1.0 only site owners created the online data; in Web 2.0 the users create the content.
Big Data -> big numbers. Taken from http://visual.ly/what-big-data
Big Users is an equally big trend driving developers to use NoSQL databases. Most new applications are made available over the internet so people can easily access them. This has caused the number of simultaneous users for many applications to explode. The number of people connected to the internet is more than 2B and growing rapidly. The number of hours that the average user spends on the internet is growing too, further increasing the number of simultaneous users. And, with the proliferation of smartphones, people use their applications more and more frequently, which increases the number of simultaneous users even further. All these simultaneous users lead to a rapidly growing number of database operations and the need for a far easier way to scale your database to meet these demands.
Taken from the Couchbase deck @ IGTCloud summit 2013
http://www.go-gulf.com/blog/online-time
http://business.time.com/2012/02/14/one-billion-smartphones-by-2016-here-comes-the-mobile-arms-race/
To summarize, the technology implications of the Big Data, Big Users, and Cloud Computing mega trends are causing people to seriously rethink what database they use for their applications, and they are increasingly concluding that NoSQL databases are a better fit than relational databases.
Finally, the move to cloud computing and SaaS business models is also driving developers to consider NoSQL databases. 15 years ago most applications were developed with a client/server architecture and a packaged software business model that supported the needs of users on a company-by-company basis. Today, applications are increasingly developed using a 3-tier internet architecture, are cloud-based, and use a Software-as-a-Service business model that needs to support the collective needs of thousands of customers. This approach increasingly requires a horizontally scalable architecture that easily scales with the number of users and amount of data your application has.
Outbrain serves 8 billion impressions a month, roughly 3,000 impressions / sec; DG (MediaMind) serves 50 billion a day, on the order of 500K / sec
http://readwrite.com/2013/05/29/the-real-reason-hadoop-is-such-a-big-deal-in-big-data
http://www.computerworlduk.com/in-depth/applications/1779/oracles-database-machine-how-much-will-it-really-cost/
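The per-second rates quoted for Outbrain and DG (MediaMind) follow from dividing the monthly and daily totals by the number of seconds. The sketch below assumes a 30-day month, which the notes do not state, and the helper name `per_second` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def per_second(count, days):
    """Convert a total over a number of days into an average rate per second."""
    return count / (days * SECONDS_PER_DAY)


outbrain = per_second(8e9, DAYS_PER_MONTH)  # 8 billion impressions a month
mediamind = per_second(50e9, 1)             # 50 billion impressions a day
print(f"Outbrain:  {outbrain:,.0f} impressions/sec")
print(f"MediaMind: {mediamind:,.0f} impressions/sec")
```

Under these assumptions Outbrain comes out slightly above 3,000 per second, and MediaMind nearer 580K per second, which the slide rounds to 500K.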
MapReduce provides:
User-defined functions
Automatic parallelization and distribution
Fault tolerance
I/O scheduling
Status and monitoring
Taken from http://db-engines.com/en/ranking
OK, we have the big data. Now, what are we doing with it? Big data is important if you want to be successful in analytic processing. But why is that important? The answer is that success in a highly competitive, fast-moving marketplace is determined by who can capitalize on business opportunities before everyone else seizes the same opportunity. In this section we’ll meet the data scientists / data miners who coax treasures out of the huge volume of data.
Although Onavo started as a service that optimizes the performance of devices and apps, along the way it collected logs from these apps and devices and became one of the leading mobile analytics aggregators in the world.
Notations first. It has many names that mean more or less the same: the art of inferring insights from data
In this section we’ll meet the data scientists / data miners who coax treasures out of the huge volume of data. The domains applying data science / data mining vary:
Learning comprises three steps. First, we build our probabilistic model of the real world. Then, we train the model with labeled (supervised) examples, i.e. this is a car, this is not a car; this takes place offline. Last, online, we feed the model a totally new example and expect it to make the correct prediction.
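The three learning steps can be sketched with a tiny naive Bayes classifier. Naive Bayes is chosen here only as a simple example of a probabilistic model; the notes do not name a specific algorithm. The feature names, the training examples and the functions `train` and `predict` are all hypothetical.

```python
import math
from collections import defaultdict


def train(examples):
    """Offline step: count labels and feature occurrences in labeled data."""
    label_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for features, label in examples:
        label_counts[label] += 1
        for name, present in features.items():
            if present:
                feature_counts[label][name] += 1
    feature_names = {name for features, _ in examples for name in features}
    return label_counts, feature_counts, feature_names


def predict(model, features):
    """Online step: return the most probable label for an unseen example."""
    label_counts, feature_counts, feature_names = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior P(label)
        for name in feature_names:
            # Laplace smoothing avoids zero probabilities.
            p = (feature_counts[label][name] + 1) / (count + 2)
            score += math.log(p if features.get(name, False) else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up labeled examples: "this is a car, this is not a car".
training = [
    ({"has_wheels": True, "has_engine": True, "has_wings": False}, "car"),
    ({"has_wheels": True, "has_engine": True, "has_wings": False}, "car"),
    ({"has_wheels": True, "has_engine": False, "has_wings": False}, "not a car"),
    ({"has_wheels": False, "has_engine": True, "has_wings": True}, "not a car"),
]
model = train(training)  # offline
new_example = {"has_wheels": True, "has_engine": True, "has_wings": False}
print(predict(model, new_example))  # online
```

Training happens once, offline, on the labeled examples. Prediction is cheap and can run online for every new example.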