3. Big Data Definition
• No single standard definition…
“Big Data” is data whose scale, diversity, and complexity require new architecture,
techniques, algorithms, and analytics to manage it and extract value and hidden
knowledge from it…
4. Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
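As a rough check on the figures above, the yearly growth rate they imply can be worked out directly. This is a minimal sketch that assumes growth was constant from year to year, which the slide does not claim; the variable names are illustrative only.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009  # 11 years between the two data points

# Overall multiplier between the two data points (the slide rounds it to 44x).
growth_factor = end_zb / start_zb

# Constant yearly rate that would compound to that multiplier (an assumption).
annual_rate = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under that assumption the data volume grows by a bit over 40% every year, which is what "exponentially" means here: the same percentage each year, applied to an ever larger base.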
5. Characteristics of Big Data:
2-Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting many
types of data
To extract knowledge, all these types of data need to
be linked together
6. Characteristics of Big Data:
3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions mean missed opportunities
• Examples
• E-Promotions: based on your current location, your purchase history, and what you like, send
promotions right now for the store next to you
• Healthcare monitoring: sensors monitor your activities and body; any abnormal
measurements require immediate reaction
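The healthcare example can be sketched as a streaming check that looks at each reading as it arrives instead of waiting for a full data set. This is only an illustration under stated assumptions: the function name `flag_abnormal`, the window size, the 3-standard-deviation threshold, and the heart-rate values are all made up for the example.

```python
from collections import deque
from statistics import mean, stdev

def flag_abnormal(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent average.

    Each reading is handled as it arrives; only the last `window`
    normal values are kept in memory.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)

# Made-up heart-rate readings with one abnormal spike.
heart_rate = [72, 74, 71, 73, 75, 72, 140, 74, 73]
alerts = list(flag_abnormal(heart_rate))
print(alerts)
```

Because `flag_abnormal` is a generator, an alert is produced as soon as the abnormal reading is seen, which is the point of online analytics: the reaction does not wait for a batch job.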
9. Who’s Generating Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the
collected data in a timely manner and in a scalable fashion
14. Recommender Systems
• Movie Problem: Find “Similar” movies to my taste.
• Movies have many “Features” – Western, Clint Eastwood, Tarantino, 90s, …
• A viewer has preferences over these “Features” – likes ‘Western’; hates ‘content based
filtering movies’
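The movie problem above can be sketched as content-based filtering: describe each movie and the viewer as vectors over the same features, then rank movies by similarity to the viewer. This is a minimal sketch, not a production recommender. The feature order, the three example movies, and the viewer profile (likes Westerns, dislikes Tarantino) are assumptions chosen for illustration.

```python
from math import sqrt

# Assumed feature order: [western, clint_eastwood, tarantino, nineties]
movies = {
    "Unforgiven":       [1, 1, 0, 1],
    "Pulp Fiction":     [0, 0, 1, 1],
    "Django Unchained": [1, 0, 1, 0],
}

# Hypothetical viewer: +1 = likes the feature, -1 = dislikes it, 0 = neutral.
viewer = [1, 0, -1, 0]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rank movies from most to least similar to the viewer's taste.
ranked = sorted(movies, key=lambda title: cosine(viewer, movies[title]), reverse=True)
print(ranked)
```

Only the movie features and the viewer's own preferences are used here; no other viewers' ratings are needed, which is what separates this approach from collaborative filtering.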
Netflix Prize
From Wikipedia, the free encyclopedia
The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict
user ratings for films, based on previous ratings without any other information about the users or
films, i.e. without the users or the films being identified except by numbers assigned for the contest.
The competition was held by Netflix, an online DVD-rental service, and was open to anyone not
connected with Netflix (current and former employees, agents, close relatives of Netflix employees,
etc.) or a resident of Cuba, Iran, Syria, North Korea, Burma or Sudan.[1] On 21 September 2009, the
grand prize of US$1,000,000 was given to the BellKor's Pragmatic Chaos team which bested Netflix's
own algorithm for predicting ratings by 10.06%.[2]
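To make the idea of collaborative filtering concrete, here is a minimal user-based sketch that predicts a rating from previous ratings alone, with users and films identified only by numbers as in the contest. It is one simple textbook variant, not the prize-winning method; the toy ratings and the function names `similarity` and `predict` are invented for illustration.

```python
from math import sqrt

# ratings[user_id][movie_id] = stars (toy data, invented for illustration)
ratings = {
    1: {10: 5, 11: 4, 12: 1},
    2: {10: 4, 11: 5, 12: 2, 13: 5},
    3: {10: 1, 11: 2, 12: 5, 13: 1},
}

def similarity(a, b):
    """Cosine similarity over the movies that both users have rated."""
    shared = ratings[a].keys() & ratings[b].keys()
    if not shared:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in shared)
    norm_a = sqrt(sum(ratings[a][m] ** 2 for m in shared))
    norm_b = sqrt(sum(ratings[b][m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    pairs = [(similarity(user, other), seen[movie])
             for other, seen in ratings.items()
             if other != user and movie in seen]
    total = sum(s for s, _ in pairs)
    if total == 0:
        return None  # nobody comparable has rated this movie
    return sum(s * rating for s, rating in pairs) / total

print(predict(1, 13))
```

User 1 has rated films much like user 2 and unlike user 3, so the prediction for film 13 is pulled toward user 2's rating. Practical systems usually subtract each user's average rating first; that step is left out to keep the sketch short.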
18. Quiz #1
• Is Google Search a recommender system?
19. Supervised Learning
Design an Accurate Vending Machine
This is a Classification Problem – This line is called the
Decision Boundary or Separating Hyperplane
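A decision boundary can be learned from labeled examples. The sketch below uses a perceptron, chosen here only because it is one of the simplest linear classifiers; the slide does not name an algorithm. The coin measurements (weight, diameter in made-up units) and the labels (+1 = large coin, -1 = small coin) are assumptions for illustration.

```python
def train_perceptron(samples, labels, epochs=100):
    """Learn weights w and bias b so that the sign of w.x + b matches the labels."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # on the wrong side of (or exactly on) the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is on its correct side
            break
    return w, b

def classify(x, w, b):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Made-up (weight, diameter) measurements: two small coins, two large coins.
coins = [(1, 1), (2, 1), (4, 5), (5, 4)]
kinds = [-1, -1, 1, 1]

w, b = train_perceptron(coins, kinds)
print(w, b, classify((4.5, 4.5), w, b))
```

The learned boundary is the line where `w[0]*weight + w[1]*diameter + b` equals zero; with more than two features the same equation describes a separating hyperplane.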
20. Quiz #2
• Give an example where you think supervised learning is used –
• Hint – Spam vs. Ham in Emails
21. Some Common Supervised Algorithms
• Classification
• Decision Trees
• Random Forest
• Support Vector Machine
• Neural Network
• Logistic Regression
• Regression
• Linear Regression
• Non-linear Regression
• Logistic Regression
• Association Rule Learning
• Arules
• Event Sequence Analysis
28. Clustering
• Cluster: A collection/group of data objects/points
• similar (or related) to one another within the same group
• dissimilar (or unrelated) to the objects in other groups
• Cluster analysis
• find similarities between data according to characteristics underlying the data and
group similar data objects into clusters
• Unsupervised learning
• no predefined classes for a training data set
• Two general tasks: identify the “natural” number of clusters and properly group
objects into “sensible” clusters
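The grouping task can be illustrated with k-means, used here as one common clustering algorithm rather than the one the slides prescribe. This minimal sketch assumes the number of clusters `k` is already known, so it covers only the second task above, and it starts from the first `k` points so the result is repeatable. The data points are made up.

```python
def kmeans(points, k, iterations=100):
    """Group points into k clusters; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # deterministic start: first k points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Made-up 2-D points forming two visibly separate groups.
data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centroids = kmeans(data, k=2)
print(labels)
```

No class labels are supplied anywhere in the input, which is what makes this unsupervised: the groups come only from distances between the points.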
33. Quiz #5
• Which of the below are supervised and which are unsupervised?
• Take a collection of 1000 essays written on the US Economy, and find a way to automatically
group these essays into a small number of groups of essays that are somehow "similar" or
"related".
• Examine a large collection of emails that are known to be spam email, to discover if there
are sub-types of spam mail.
• Given historical data of children's ages and heights, predict children's height as a function of
their age.
• Have a computer examine an audio clip of a piece of music, and classify whether or not
there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical
instruments (and no vocals).
• Given a set of news articles from many different news websites, find out what are the main
topics covered.
• Suppose you are working on weather prediction, and you would like to predict
whether or not it will be raining at 5pm tomorrow. You want to use a learning
algorithm for this. Would you treat this as a classification or a regression problem?
35. Let's start from (Big) Data
• How do you design this system?
• How do you pay for this?
• How do you trust someone to do it
right?
• How expensive will such a system be?
I need data. Good, reusable data. High-quality data. Otherwise
all the smarts are wasted.
36. Here comes BIG Data to help
• Image
• Audio
• Learning
• HUGE data sets