SlideShare a Scribd company logo
1 of 47
Introduction to Big Data
& Basic Data Analysis
Big Data EveryWhere!
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
– Social Network
How much data?
•
•
•
•
•

Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN’s Large Hydron Collider (LHC) generates 15 PB a
year
640K ought to be
enough for anybody.
Maximilien Brice, © CERN
The Earthscope
• The Earthscope is the world's
largest science project. Designed to
track North America's geological
evolution, this observatory records
data over 3.8 million square miles,
amassing 67 terabytes of data. It
analyzes seismic slips in the San
Andreas fault, sure, but also the
plume of magma underneath
Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44
363598/ns/technology_and_science
-future_of_technology/#.TmetOdQ-uI)
Type of Data
• Relational Data (Tables/Transaction/Legacy
Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …

• Streaming Data
– You can only scan the data once
What to do with these data?
• Aggregation and Statistics
– Data warehouse and OLAP

• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)

• Knowledge discovery
– Data Mining
– Statistical Modeling
Statistics 101
Random Sample and Statistics
• Population: is used to refer to the set or universe of all entities
under study.
• However, looking at the entire population may not be
feasible, or may be too expensive.
• Instead, we draw a random sample from the population, and
compute appropriate statistics from the sample, that give
estimates of the corresponding population parameters of
interest.
Statistic
• Let Si denote the random variable corresponding to
data point xi , then a statistic ˆθ is a function ˆθ : (S1,
S2, · · · , Sn) → R.
• If we use the value of a statistic to estimate a
population parameter, this value is called a point
estimate of the parameter, and the statistic is called
as an estimator of the parameter.
Empirical Cumulative Distribution Function

Where

Inverse Cumulative Distribution Function
Example
Measures of Central Tendency (Mean)
Population Mean:

Sample Mean (Unbiased, not robust):
Measures of Central Tendency
(Median)
Population Median:
or
Sample Median:
Example
Measures of Dispersion (Range)
Range:
Sample Range:



Not robust, sensitive to extreme values
Measures of Dispersion (Inter-Quartile Range)
Inter-Quartile Range (IQR):

Sample IQR:



More robust
Measures of Dispersion
(Variance and Standard Deviation)
Variance:

Standard Deviation:
Measures of Dispersion
(Variance and Standard Deviation)
Variance:

Standard Deviation:

Sample Variance & Standard Deviation:
Univariate Normal Distribution
Multivariate Normal Distribution
OLAP and Data Mining
Warehouse Architecture
Client

Client

Query & Analysis

Metadata

Warehouse

Integration

Source

Source

Source
23
Star Schemas
•

A star schema is a common organization for
data at a warehouse. It consists of:
1. Fact table : a very large accumulation of facts
such as sales.
w Often “insert-only.”

2. Dimension tables : smaller, generally static
information about the entities involved in the
facts.
24
Terms
• Fact table
• Dimension tables
• Measures

product
prodId
name
price

sale
orderId
date
custId
prodId
storeId
qty
amt

customer
custId
name
address
city

store
storeId
city

25
Star
product

prodId
p1
p2

name price
bolt
10
nut
5

sale oderId date
o100 1/7/97
o102 2/7/97
105 3/8/97

customer

custId
53
81
111

store

custId
53
53
111

name
joe
fred
sally

prodId
p1
p2
p1

storeId
c1
c1
c3

address
10 main
12 main
80 willow

qty
1
2
5

storeId
c1
c2
c3

city
nyc
sfo
la

amt
12
11
50

city
sfo
sfo
la
26
Cube
Fact table view:
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2

Multi-dimensional cube:
amt
12
11
50
8

p1
p2

c1
12
11

c2

c3
50

8

dimensions = 2

27
3-D Cube
Fact table view:
sale

prodId
p1
p2
p1
p2
p1
p1

Multi-dimensional cube:
storeId
c1
c1
c3
c2
c1
c2

date
1
1
1
1
2
2

amt
12
11
50
8
44
4

day 2

day 1

p1
p2 c1
p1
12
p2
11

c1
44

c2
4
c2

c3
c3
50

8

dimensions = 3

28
ROLAP vs. MOLAP
• ROLAP:
Relational On-Line Analytical Processing
• MOLAP:
Multi-Dimensional On-Line Analytical
Processing

29
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

date
1
1
1
1
2
2

amt
12
11
50
8
44
4

81

30
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

date
1
1
1
1
2
2

amt
12
11
50
8
44
4

ans

date
1
2

sum
81
48

31
Another Example
• Add up amounts by day, product
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date, prodId
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

date
1
1
1
1
2
2

amt
12
11
50
8
44
4

sale

prodId
p1
p2
p1

date
1
1
2

amt
62
19
48

rollup
drill-down
32
Aggregates
• Operators: sum, count, max, min,
median, ave
• “Having” clause
• Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)

33
What is Data Mining?
• Discovery of useful, possibly
unexpected, patterns in data
• Non-trivial extraction of implicit, previously
unknown and potentially useful information
from data
• Exploration & analysis, by automatic or
semi-automatic means, of large quantities of
data in order to discover meaningful patterns
Data Mining Tasks
•
•
•
•
•
•
•

Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Collaborative Filter [Predictive]
Classification: Definition
• Given a collection of records (training set )

– Each record contains a set of attributes, one of the
attributes is the class.

• Find a model for class attribute as a function
of the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.

– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to
build the model and test set used to validate it.
Decision Trees
Example:
• Conducted survey to see what customers were
interested in new model car
• Want to select customers for advertising campaign

sale

custId
c1
c2
c3
c4
c5
c6

car
taurus
van
van
taurus
merc
taurus

age
27
35
40
22
50
25

city newCar
sf
yes
la
yes
sf
yes
sf
yes
la
no
la
no

training
set

37
Clustering

income
education
age

38
K-Means Clustering

39
Association Rule Mining

sales
records:

tran1
tran2
tran3
tran4
tran5
tran6

cust33
cust45
cust12
cust40
cust12
cust12

p2,
p5,
p1,
p5,
p2,
p9

p5, p8
p8, p11
p9
p8, p11
p9

market-basket
data

• Trend: Products p5, p8 often bough together
• Trend: Customer 12 likes product p9

40
Association Rule Discovery
• Marketing and Sales Promotion:

– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
– Bagels in the antecedent => can be used to see which
products would be affected if the store discontinues
selling bagels.
– Bagels in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with Bagels to promote sale of Potato chips!

• Supermarket shelf management.
• Inventory Managemnt
Collaborative Filtering
• Goal: predict what movies/books/… a person may be interested in,
on the basis of
– Past preferences of the person
– Other people with similar past preferences
– The preferences of such people for a new movie/book/…

• One approach based on repeated clustering
– Cluster people on the basis of preferences for movies
– Then cluster movies on the basis of being liked by the same clusters of
people
– Again cluster people based on their preferences for (the newly created
clusters of) movies
– Repeat above till equilibrium

• Above problem is an instance of collaborative filtering, where users
collaborate in the task of filtering information to find information of
interest

42
Other Types of Mining
• Text mining: application of data mining to textual
documents
– cluster Web pages to find related pages
– cluster pages a user has visited to organize their visit
history
– classify Web pages automatically into a Web directory

• Graph Mining:
– Deal with graph data

43
Data Streams
• What are Data Streams?
– Continuous streams
– Huge, Fast, and Changing
• Why Data Streams?
– The arriving speed of streams and the huge amount of data
are beyond our capability to store them.
– “Real-time” processing
• Window Models
– Landscape window (Entire Data Stream)
– Sliding Window
– Damped Window
• Mining Data Stream

44
A Simple Problem
• Finding frequent items
– Given a sequence (x1,…xN) where xi ∈[1,m], and a real
number θ between zero and one.
– Looking for xi whose frequency > θ
– Naïve Algorithm (m counters)
• The number of frequent items ≤ 1/θ
• Problem: N>>m>>1/θ

P×(Nθ) ≤ N

45
KRP algorithm
─ Karp, et. al (TODS’ 03)

m=12

N=30

Θ=0.35

⌈1/θ⌉ = 3

N/ (⌈1/θ⌉) ≤ Nθ

46
Streaming Sample Problem
• Scan the dataset once
• Sample K records
– Each one has equally probability to be sampled
– Total N record: K/N

More Related Content

What's hot

Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithmsIkutwa
 
GeneticMax: An Efficient Approach to Mining Maximal Frequent Itemsets Based o...
GeneticMax: An Efficient Approach to Mining Maximal Frequent Itemsets Based o...GeneticMax: An Efficient Approach to Mining Maximal Frequent Itemsets Based o...
GeneticMax: An Efficient Approach to Mining Maximal Frequent Itemsets Based o...ITIIIndustries
 
Its all about data mining
Its all about data miningIts all about data mining
Its all about data miningJason Rodrigues
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
What makes a linked data pattern interesting?
What makes a linked data pattern interesting?What makes a linked data pattern interesting?
What makes a linked data pattern interesting?Szymon Klarman
 
Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Arjen de Vries
 

What's hot (7)

Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithms
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
Lecture - Data Mining
 
GeneticMax: An Efficient Approach to Mining Maximal Frequent Itemsets Based o...
GeneticMax: An Efficient Approach to Mining Maximal Frequent Itemsets Based o...GeneticMax: An Efficient Approach to Mining Maximal Frequent Itemsets Based o...
GeneticMax: An Efficient Approach to Mining Maximal Frequent Itemsets Based o...
 
Its all about data mining
Its all about data miningIts all about data mining
Its all about data mining
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
What makes a linked data pattern interesting?
What makes a linked data pattern interesting?What makes a linked data pattern interesting?
What makes a linked data pattern interesting?
 
Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?Recommendation and Information Retrieval: Two Sides of the Same Coin?
Recommendation and Information Retrieval: Two Sides of the Same Coin?
 

Similar to Big Data Real Time Training in Chennai

IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079ibankuk
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onwordSulman Ahmed
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptAravindReddy565690
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learningjaumebp
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introductionamiyadash
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptxXanGwaps
 

Similar to Big Data Real Time Training in Chennai (20)

Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Hadoop PDF
Hadoop PDFHadoop PDF
Hadoop PDF
 
Skillwise Big data
Skillwise Big dataSkillwise Big data
Skillwise Big data
 
IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
Big data
Big dataBig data
Big data
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).ppt
 
BAS 250 Lecture 1
BAS 250 Lecture 1BAS 250 Lecture 1
BAS 250 Lecture 1
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Weka bike rental
Weka bike rentalWeka bike rental
Weka bike rental
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 
Basic Overview of Data Mining
Basic Overview of Data MiningBasic Overview of Data Mining
Basic Overview of Data Mining
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Data analytics courses
Data analytics coursesData analytics courses
Data analytics courses
 
Data science courses
Data science coursesData science courses
Data science courses
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
 
Lecture2 (1).ppt
Lecture2 (1).pptLecture2 (1).ppt
Lecture2 (1).ppt
 

Recently uploaded

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Big Data Real Time Training in Chennai

  • 1. Introduction to Big Data & Basic Data Analysis
  • 2. Big Data EveryWhere! • Lots of data is being collected and warehoused – Web data, e-commerce – purchases at department/ grocery stores – Bank/Credit Card transactions – Social Network
  • 3. How much data? • • • • • Google processes 20 PB a day (2008) Wayback Machine has 3 PB + 100 TB/month (3/2009) Facebook has 2.5 PB of user data + 15 TB/day (4/2009) eBay has 6.5 PB of user data + 50 TB/day (5/2009) CERN’s Large Hydron Collider (LHC) generates 15 PB a year 640K ought to be enough for anybody.
  • 5. The Earthscope • The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44 363598/ns/technology_and_science -future_of_technology/#.TmetOdQ-uI)
  • 6. Type of Data • Relational Data (Tables/Transaction/Legacy Data) • Text Data (Web) • Semi-structured Data (XML) • Graph Data – Social Network, Semantic Web (RDF), … • Streaming Data – You can only scan the data once
  • 7. What to do with these data? • Aggregation and Statistics – Data warehouse and OLAP • Indexing, Searching, and Querying – Keyword based search – Pattern matching (XML/RDF) • Knowledge discovery – Data Mining – Statistical Modeling
  • 9. Random Sample and Statistics • Population: is used to refer to the set or universe of all entities under study. • However, looking at the entire population may not be feasible, or may be too expensive. • Instead, we draw a random sample from the population, and compute appropriate statistics from the sample, that give estimates of the corresponding population parameters of interest.
  • 10. Statistic • Let Si denote the random variable corresponding to data point xi , then a statistic ˆθ is a function ˆθ : (S1, S2, · · · , Sn) → R. • If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called as an estimator of the parameter.
  • 11. Empirical Cumulative Distribution Function Where Inverse Cumulative Distribution Function
  • 13. Measures of Central Tendency (Mean) Population Mean: Sample Mean (Unbiased, not robust):
  • 14. Measures of Central Tendency (Median) Population Median: or Sample Median:
  • 16. Measures of Dispersion (Range) Range: Sample Range:  Not robust, sensitive to extreme values
  • 17. Measures of Dispersion (Inter-Quartile Range) Inter-Quartile Range (IQR): Sample IQR:  More robust
  • 18. Measures of Dispersion (Variance and Standard Deviation) Variance: Standard Deviation:
  • 19. Measures of Dispersion (Variance and Standard Deviation) Variance: Standard Deviation: Sample Variance & Standard Deviation:
  • 22. OLAP and Data Mining
  • 23. Warehouse Architecture Client Client Query & Analysis Metadata Warehouse Integration Source Source Source 23
  • 24. Star Schemas • A star schema is a common organization for data at a warehouse. It consists of: 1. Fact table : a very large accumulation of facts such as sales. w Often “insert-only.” 2. Dimension tables : smaller, generally static information about the entities involved in the facts. 24
  • 25. Terms • Fact table • Dimension tables • Measures product prodId name price sale orderId date custId prodId storeId qty amt customer custId name address city store storeId city 25
  • 26. Star product prodId p1 p2 name price bolt 10 nut 5 sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 customer custId 53 81 111 store custId 53 53 111 name joe fred sally prodId p1 p2 p1 storeId c1 c1 c3 address 10 main 12 main 80 willow qty 1 2 5 storeId c1 c2 c3 city nyc sfo la amt 12 11 50 city sfo sfo la 26
  • 27. Cube Fact table view: sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 Multi-dimensional cube: amt 12 11 50 8 p1 p2 c1 12 11 c2 c3 50 8 dimensions = 2 27
  • 28. 3-D Cube Fact table view: sale prodId p1 p2 p1 p2 p1 p1 Multi-dimensional cube: storeId c1 c1 c3 c2 c1 c2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4 day 2 day 1 p1 p2 c1 p1 12 p2 11 c1 44 c2 4 c2 c3 c3 50 8 dimensions = 3 28
  • 29. ROLAP vs. MOLAP • ROLAP: Relational On-Line Analytical Processing • MOLAP: Multi-Dimensional On-Line Analytical Processing 29
  • 30. Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4 81 30
  • 31. Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4 ans date 1 2 sum 81 48 31
  • 32. Another Example • Add up amounts by day, product • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId sale prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4 sale prodId p1 p2 p1 date 1 1 2 amt 62 19 48 rollup drill-down 32
  • 33. Aggregates • Operators: sum, count, max, min, median, ave • “Having” clause • Using dimension hierarchy – average by region (within store) – maximum by month (within date) 33
  • 34. What is Data Mining? • Discovery of useful, possibly unexpected, patterns in data • Non-trivial extraction of implicit, previously unknown and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
  • 35. Data Mining Tasks • • • • • • • Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive] Collaborative Filter [Predictive]
  • 36. Classification: Definition • Given a collection of records (training set ) – Each record contains a set of attributes, one of the attributes is the class. • Find a model for class attribute as a function of the values of other attributes. • Goal: previously unseen records should be assigned a class as accurately as possible. – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
  • 37. Decision Trees Example: • Conducted survey to see what customers were interested in new model car • Want to select customers for advertising campaign sale custId c1 c2 c3 c4 c5 c6 car taurus van van taurus merc taurus age 27 35 40 22 50 25 city newCar sf yes la yes sf yes sf yes la no la no training set 37
  • 40. Association Rule Mining sales records: tran1 tran2 tran3 tran4 tran5 tran6 cust33 cust45 cust12 cust40 cust12 cust12 p2, p5, p1, p5, p2, p9 p5, p8 p8, p11 p9 p8, p11 p9 market-basket data • Trend: Products p5, p8 often bough together • Trend: Customer 12 likes product p9 40
  • 41. Association Rule Discovery • Marketing and Sales Promotion: – Let the rule discovered be {Bagels, … } --> {Potato Chips} – Potato Chips as consequent => Can be used to determine what should be done to boost its sales. – Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels. – Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips! • Supermarket shelf management. • Inventory Managemnt
  • 42. Collaborative Filtering • Goal: predict what movies/books/… a person may be interested in, on the basis of – Past preferences of the person – Other people with similar past preferences – The preferences of such people for a new movie/book/… • One approach based on repeated clustering – Cluster people on the basis of preferences for movies – Then cluster movies on the basis of being liked by the same clusters of people – Again cluster people based on their preferences for (the newly created clusters of) movies – Repeat above till equilibrium • Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest 42
  • 43. Other Types of Mining • Text mining: application of data mining to textual documents – cluster Web pages to find related pages – cluster pages a user has visited to organize their visit history – classify Web pages automatically into a Web directory • Graph Mining: – Deal with graph data 43
  • 44. Data Streams • What are Data Streams? – Continuous streams – Huge, Fast, and Changing • Why Data Streams? – The arriving speed of streams and the huge amount of data are beyond our capability to store them. – “Real-time” processing • Window Models – Landscape window (Entire Data Stream) – Sliding Window – Damped Window • Mining Data Stream 44
  • 45. A Simple Problem • Finding frequent items – Given a sequence (x1,…xN) where xi ∈[1,m], and a real number θ between zero and one. – Looking for xi whose frequency > θ – Naïve Algorithm (m counters) • The number of frequent items ≤ 1/θ • Problem: N>>m>>1/θ P×(Nθ) ≤ N 45
  • 46. KRP algorithm ─ Karp, et. al (TODS’ 03) m=12 N=30 Θ=0.35 ⌈1/θ⌉ = 3 N/ (⌈1/θ⌉) ≤ Nθ 46
  • 47. Streaming Sample Problem • Scan the dataset once • Sample K records – Each one has equally probability to be sampled – Total N record: K/N