Data science for advanced dummies


Introduction to Big Data
What is Big Data?
What makes data “Big” Data?

Big Data Definition
• No single standard definition…
“Big Data” is data whose scale, diversity, and complexity require new architecture,
techniques, algorithms, and analytics to manage it and extract value and hidden
knowledge from it…

Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
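
As a quick check on the figures above, the sketch below computes the overall growth factor from 0.8 ZB to 35 ZB and the yearly growth rate that would produce it. The constant yearly rate compounding over the 11 years from 2009 to 2020 is an assumption made only for illustration; the slide gives just the two endpoints.

```python
# Figures taken from the slide: 0.8 ZB in 2009, 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009  # 11 years

# Overall multiplier: 35 / 0.8 = 43.75, which the slide rounds to "44x".
growth_factor = end_zb / start_zb

# Assumption (not from the slide): growth compounds at a constant yearly
# rate r, so end = start * (1 + r) ** years.
annual_rate = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under that assumption the data volume grows by roughly two fifths every year, which is what "exponential" means in practice here.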

2-Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be generating/collecting many
types of data
To extract knowledge, all these types of data need to be
linked together

3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missed opportunities
• Examples
  • E-Promotions: based on your current location, your purchase history, and what you like → send
  promotions right now for the store next to you
  • Healthcare monitoring: sensors monitor your activities and body → any abnormal
  measurements require immediate reaction

Who’s Generating Big Data?
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the
collected data in a timely manner and in a scalable fashion

What Technology Do We Have for Big Data?

Which Movie Do You Like?
Designing a movie recommendation system

Can you describe the movie you would like?

Recommender Systems
• Movie Problem: Find “Similar” movies to my taste.
• Movies have many “Features” – Western, Clint Eastwood, Tarantino, 90s, …
• A viewer has preferences – “Features” – Likes ‘Western’; hates ‘content based
filtering movies’
Netflix Prize
From Wikipedia, the free encyclopedia
The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict
user ratings for films, based on previous ratings without any other information about the users or
films, i.e. without the users or the films being identified except by numbers assigned for the contest.
The competition was held by Netflix, an online DVD-rental service, and was open to anyone not
connected with Netflix (current and former employees, agents, close relatives of Netflix employees,
etc.) or a resident of Cuba, Iran, Syria, North Korea, Burma or Sudan.[1] On 21 September 2009, the
grand prize of US$1,000,000 was given to the BellKor's Pragmatic Chaos team which bested Netflix's
own algorithm for predicting ratings by 10.06%.[2]

A Highly Simple Solution
Comedy   Action   Blockbuster   …   Is Tom Cruise the Lead?   Saurav
6        5        0             …   1                         2
7        8        1             …   0                         8
…        …        …             …   …                         …
Saurav’s Score = 0.2*Comedy + 0.1*Action + 10*Blockbuster + … - 0.9*Tom Cruise
Comedy   Action   Blockbuster   …   Is Tom Cruise the Lead?   Saurav
2        8        0             …   0                         7
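
The scoring rule above is a weighted sum of a movie's features. A minimal sketch follows, using only the four weights the slide spells out. The terms the slide elides ("…") are left out, so this partial score will not reproduce the 7 shown for the new movie. The function name and dictionary keys are illustrative, not from the slide.

```python
# Weights for the four features the slide names. The slide's formula also
# has elided terms ("…"); they are omitted here, so the score is partial.
weights = {
    "comedy": 0.2,
    "action": 0.1,
    "blockbuster": 10.0,
    "tom_cruise_lead": -0.9,
}

def partial_score(movie):
    """Weighted sum over the named features only (hypothetical helper)."""
    return sum(weights[name] * movie.get(name, 0.0) for name in weights)

# Feature values of the new movie from the slide's second table.
new_movie = {"comedy": 2, "action": 8, "blockbuster": 0, "tom_cruise_lead": 0}
score = partial_score(new_movie)
print(round(score, 2))
```

A negative weight, such as the one on the Tom Cruise feature, lowers the score whenever that feature is present. That is how a dislike is encoded.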

Quiz #1
• Is Google Search a recommender system?

Supervised Learning
Design an Accurate Vending Machine
This is a Classification Problem – the line separating the classes is called the
Decision Boundary or Separating Hyperplane
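
To make the idea of a decision boundary concrete, here is a minimal sketch of a linear one. The slide's figure is not available, so everything here is assumed for illustration: the two coin features (weight and diameter), the weights, and the bias are all made up. A point is assigned to a class according to which side of the line it falls on.

```python
# Hypothetical boundary for two made-up coin features:
# weight in grams and diameter in millimetres.
# None of these numbers come from the slide.
WEIGHTS = (1.0, 1.0)
BIAS = -27.0

def classify(coin):
    """Return 1 if the coin is on the positive side of the boundary, else 0."""
    activation = sum(w * x for w, x in zip(WEIGHTS, coin)) + BIAS
    return 1 if activation > 0 else 0

small_coin = (2.5, 18.0)  # hypothetical small coin
large_coin = (8.0, 26.0)  # hypothetical large coin
print(classify(small_coin), classify(large_coin))
```

In a real system the weights and bias would be learned from labelled examples rather than written by hand.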

Quiz #2
• Give an example where you think supervised learning is used –
• Hint – Spam vs. Ham in Emails

Some Common Supervised Algorithms
• Classification
  • Decision Trees
  • Random Forest
  • Support Vector Machine
  • Neural Network
  • Logistic Regression
• Regression
  • Linear Regression
  • Non-linear Regression
  • Logistic Regression (despite the name, normally used for classification)
• Association Rule Learning (usually treated as unsupervised)
  • arules (R package)
  • Event Sequence Analysis

In Action
• Handwriting Recognition System
• Classification
• Input?
• Output?
Features (pixel values):
200 200 10 …
200 200 8 …
180 200 20 …
… … … …
Label: 6
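
One way to read the input and output here: the pixel values are the features and the digit is the label. The sketch below uses only the pixel values visible on the slide, so the 3x3 grid is a fragment, not a full image. Pairing that fragment with the label 6 is an assumption based on the slide layout.

```python
# Visible pixel values from the slide; the elided ("…") pixels are left out,
# so this grid is only a fragment of the real image.
pixel_rows = [
    [200, 200, 10],
    [200, 200, 8],
    [180, 200, 20],
]
label = 6  # the digit shown in the label column of the slide

# Input: flatten the grid row by row into a single feature vector.
features = [value for row in pixel_rows for value in row]

# One training example pairs the features (input) with the label (output).
example = (features, label)
print(example)
```

A classifier is then trained on many such pairs and asked to predict the label for feature vectors it has not seen.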

Note the
similarity
Classification Algorithms Try to
Separate items into “Classes”

Quiz #3
• Are driverless cars a learning problem?
• What are the features?
• What is the label?

Flowers
[Figure: tetramerous flower of Ludwigia octovalvis showing petals and sepals]
Sepal length   Sepal width   Petal length   Petal width
5.1            3.5           1.4            0.2
4.9            3.0           1.4            0.2
4.7            3.2           1.3            0.2
4.6            3.1           1.5            0.2
5.0            3.6           1.4            0.2
5.4            3.9           1.7            0.4
4.6            3.4           1.4            0.3
5.0            3.4           1.5            0.2
4.4            2.9           1.4            0.2
4.9            3.1           1.5            0.1
5.4            3.7           1.5            0.2

Clustering
• Cluster: A collection/group of data objects/points
  • similar (or related) to one another within the same group
  • dissimilar (or unrelated) to the objects in other groups
• Cluster analysis
  • find similarities between data according to characteristics underlying the data and
  group similar data objects into clusters
• Unsupervised learning
  • no predefined classes for a training data set
  • Two general tasks: identify the “natural” number of clusters and properly group
  objects into “sensible” clusters
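
The slides do not name a clustering algorithm. As one common choice, here is a minimal sketch of k-means (Lloyd's algorithm), which alternates between assigning points to the nearest centre and moving each centre to the mean of its points. The data, the seeding by the first k points, and the fixed iteration count are all assumptions for illustration. Note that k-means needs the number of clusters up front, so it addresses only the second of the two tasks above.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    # Assumption: the first k points seed the centroids; real code usually
    # picks seeds at random or with k-means++.
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignment, centroids

# Hypothetical 2-D data: two visibly separate groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.9),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centers = kmeans(points, k=2)
print(labels)
```

No labels are supplied at any point: the grouping comes only from distances between the points, which is what makes this unsupervised.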

Quiz #4
• How many types (species) of flowers are there?

Examples of Unsupervised Learning
• Clustering
• Dimensionality Reduction
• Feature Extraction
• Self Organizing Maps

Quiz #5
• Which of the below are supervised and which are unsupervised?
• Take a collection of 1000 essays written on the US Economy, and find a way to automatically
group these essays into a small number of groups of essays that are somehow "similar" or
"related".
• Examine a large collection of emails that are known to be spam email, to discover if there
are sub-types of spam mail.
• Given historical data of children's ages and heights, predict children's height as a function of
their age.
• Have a computer examine an audio clip of a piece of music, and classify whether or not
there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical
instruments (and no vocals).
• Given a set of news articles from many different news websites, find out what are the main
topics covered.
• Suppose you are working on weather prediction, and you would like to predict
whether or not it will be raining at 5pm tomorrow. You want to use a learning
algorithm for this. Would you treat this as a classification or a regression problem?

Let’s start from (Big) Data
• How do you design this system?
• How do you pay for this?
• How do you trust someone to do it right?
• How expensive will such a system be?
I need data. Good, reusable data. High-quality data. Otherwise
all the smarts are wasted.

Here comes BIG Data to help
• Image
• Audio
• Learning
• HUGE data sets
