Introduction to Big Data
Sri Kanajan
Big Data
• When data has too much volume, variety, or velocity (the three Vs) to manage
with a traditional RDBMS, you have entered Big Data
• Data Storage and Manipulation, at Scale
– MapReduce, Hadoop, relationship to databases (Framework)
– Key-value stores and NoSQL; tradeoffs of SQL and NoSQL (Database type)
– Entity resolution, record linkage, data cleaning (data integration)
• Analytics (Machine Learning)
– Basic statistical modeling, experiment design, overfitting
– Supervised learning: overview, simple nearest neighbor, decision trees/forests, regression
– Unsupervised learning: k-means, multi-dimensional scaling
– Graph Analytics: PageRank, community detection, recursive queries, iterative processing
– Text Analytics: latent semantic analysis
– Collaborative Filtering: slope-one
• Communicating Results
– Visualization, data products, visual data analytics
Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage, Processing
– Machine Learning – Analytics
– Visualization
Big Data Everywhere!
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
– Social Network
Unknown, hidden relationships within this data!
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a
year
640K ought to be
enough for anybody.
Type of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Unstructured Text Data
– Log data, Comments, User generated text
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF)
• Real time Data
– You can only scan the data once and need to do
analytics quickly
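The single-scan constraint on real-time data can be illustrated with a running summary that is updated as each value arrives. The following is a minimal sketch, assuming the stream is a plain Python iterable of numbers; the names RunningStats and summarize are hypothetical and the technique shown is Welford's one-pass method, which the slides do not name.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean, and variance maintained in a single pass (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Each value is seen exactly once and is not stored afterwards.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one value has arrived.
        return self.m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume any iterable once and return its running statistics."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


# Hypothetical readings; iter() makes it explicit that the data is read only once.
stats = summarize(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Memory use stays constant no matter how long the stream runs, which is the property a scan-once workload needs.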
What does Big Data Give You?
• Without Big Data
– Many separate data warehouses, each on a non-distributed architecture
– Merging databases required modifying data structures and writing custom
code
– Scaling database size was a continual problem
– Any large-scale analytics took days or weeks, plus a large coordination effort
within IT to get database access
– Data analysis was a large effort, so lots of data remained unanalyzed or,
even worse, was never stored
• With Big Data
– Hadoop provides a single view of all databases that can be distributed
– Database size becomes a non-issue
– Ability to perform advanced statistical analysis on very large datasets very
quickly
– Data analysis is the competitive edge for many companies, since barriers to
entry keep dropping as platforms develop
Examples
• Norwegian Food Safety Authority
– accumulates data on all farm animals
– birth, death, movements, medication, samples, ...
• Hafslund
– time series from hydroelectric dams, power prices, meters of individual
customers, ...
• Social Security Administration
– data on individual cases, actions taken, outcomes...
• Statoil
– massive amounts of data from oil exploration, operations, logistics,
engineering, ...
• Retailers
– see Target example above
– also, connection between what people buy, weather forecast, logistics, ...
Big Data
Power of Distribution
45 Minutes! 4.5 Minutes!
Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage, Processing
– Machine Learning – Analytics
– Visualization
Hadoop
• A framework that allows for distributed
processing of large data sets across clusters of
commodity computers using a simple
programming model (i.e., MapReduce)
– Distributed data processing
– Works with structured and unstructured data
– Open source
– Master-slave architecture
– Fault tolerant using commodity hardware
MapReduce
• Programming model on top of Hadoop
• Basic concept is to provide a programming model that
immediately supports parallel processing (SQL on the
other hand does not natively encourage parallel
processing)
• Pig is a framework and programming language for
developing MapReduce jobs
• Note: MapReduce is great for extremely large data
sets with simple relations. SQL is great for medium-sized
data sets with complex relationships
– i.e., you have to choose the right technology for
your problem space
A Simple Example
• Counting words in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: word
    // values: list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
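The word-count pseudocode can be tried out on one machine. The following is a minimal sketch in Python, not Hadoop code: run_word_count is a hypothetical single-process stand-in for the framework, which on a real cluster would run the map and reduce calls in parallel and perform the grouping ("shuffle") between them. The document names and contents are made up for illustration.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Map step: emit (word, 1) for every word in one document."""
    for word in contents.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: add up all counts emitted for one word."""
    return word, sum(counts)


def run_word_count(documents):
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    grouped = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, count in map_fn(doc_name, contents):
            grouped[word].append(count)  # shuffle: group values by key
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


# Hypothetical input: document name -> document contents.
docs = {
    "a.txt": "big data is big",
    "b.txt": "data at scale",
}
word_counts = run_word_count(docs)
```

Because map_fn looks at one document at a time and reduce_fn at one word at a time, neither needs to know about the rest of the data, which is what lets a framework spread the calls across many machines.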
MapReduce
Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage architecture
– Machine Learning – Analytics
– Visualization
Machine Learning
• Essentially ways to analyze data to extract
valuable information with or without training
data
– Prediction
• predicting a variable from data
– Classification
• assigning records to predefined groups
– Clustering
• splitting records into groups based on similarity
– Association learning
• seeing what often appears together with what
– And many others….
Now you have an optimization
metric by which you can automate
the exploration of all possible
hypotheses!
Problems with this approach?
Two kinds of learning
• Supervised
– we have training data with correct answers
– use training data to prepare the algorithm
– then apply it to data without a correct
answer
• Unsupervised
– no training data
– throw data into the algorithm, hope it
makes some kind of sense out of the data
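The contrast between the two kinds of learning can be sketched with the two simplest algorithms named in the course overview: nearest neighbor (supervised) and k-means (unsupervised). This is a minimal illustration, assuming small in-memory 2D points; the data, labels, and starting centers are hypothetical, and the k-means loop is the standard Lloyd iteration with a fixed iteration count rather than a convergence test.

```python
import math


def nearest_neighbor(train, query):
    """Supervised: return the label of the closest labeled training point."""
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label


def k_means(points, centers, iterations=10):
    """Unsupervised: group unlabeled points around k centers (Lloyd's algorithm)."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centers)
        ]
    return centers, clusters


# Supervised: training data comes with correct answers (the labels).
labeled = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 9.5), "large"),
]
prediction = nearest_neighbor(labeled, (2.0, 1.5))

# Unsupervised: the same points with the labels thrown away.
unlabeled = [point for point, _ in labeled]
centers, clusters = k_means(unlabeled, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The supervised call returns a label because labels were supplied; the unsupervised call can only return groups, and it is up to the analyst to decide what those groups mean.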
Example: Collaborative Filtering
• Goal: predict what movies/books/… a person may be interested in,
on the basis of
– Past preferences of the person
– Other people with similar past preferences
– The preferences of such people for a new movie/book/…
• One approach based on repeated clustering
– Cluster people on the basis of preferences for movies
– Then cluster movies on the basis of being liked by the same clusters of
people
– Again cluster people based on their preferences for (the newly created
clusters of) movies
– Repeat above till equilibrium
• Above problem is an instance of collaborative filtering, where users
collaborate in the task of filtering information to find information of
interest
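The goal stated above (predict a person's interest from people with similar past preferences) can be sketched in a few lines. Note that this is not the repeated-clustering approach from the slide; it swaps in a simpler neighborhood method, a similarity-weighted average of other users' ratings. The function names, user names, titles, and ratings are all hypothetical, and ratings are assumed to be positive numbers.

```python
import math


def similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        w = similarity(ratings[user], other_ratings)
        weighted_sum += w * other_ratings[item]
        weight_total += w
    # None signals that nobody comparable has rated the item.
    return weighted_sum / weight_total if weight_total else None


# Hypothetical ratings on a 1-5 scale: user -> {title: rating}.
ratings = {
    "ann": {"Alien": 5, "Brazil": 1},
    "bob": {"Alien": 5, "Brazil": 1, "Casablanca": 4},
    "cho": {"Alien": 1, "Brazil": 5, "Casablanca": 2},
}
predicted = predict(ratings, "ann", "Casablanca")
```

Here ann's past ratings match bob's far more closely than cho's, so the prediction lands nearer to bob's rating of the unseen title than to cho's.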
Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage architecture
– Machine Learning – Analytics
– Visualization
Is this an effective visual
representation?
Better Mapping? Why?
Diagrams Showing O-Ring Damage
That Were Used in the Decision to Launch
Challenger in 1986
Representation of the Same Data
Strategies to Increase the Information
Encoded by Spatial Position
• Composition
– Orthogonal placement of axes
– Creates a 2D metric space
Strategies to Increase the Information
Encoded by Spatial Position
• Alignment
Folding
• Continuation of the Axes
Recursion
Overloading
Conclusion
• Big Data is a huge field that combines
expertise from different domains in order to
find interesting information from data
• Extracting interesting information from data is
the next competitive edge for many
companies as information becomes available
instantly, anywhere

Big data Intro - Presentation to OCHackerz Meetup Group
