BIG DATA
Prepared by
Bhuvaneshwari.P
Research Scholar, VIT University,
Vellore
Introduction
DEFINITION
Big data is defined as the collection of large
and complex datasets that are difficult to
process using database system tools or
traditional data processing application
software.
Mainframe (Kilobytes) → Client/Server (Megabytes) → The Internet (Gigabytes) → [Big data] Mobile, social media… (Zettabytes)
Characteristics of Big data
The characteristics of big data are specified with the 5 V's:
1. Volume – The vast amount of data generated every second. [Kilobytes -> Mega -> Giga -> Tera -> Peta -> Exa -> Zetta -> Yotta]
2. Variety – The different kinds of data generated from different sources.
3. Velocity – The speed at which data is generated, processed, and moved around.
4. Value – Deriving the correct meaning out of the available data.
5. Veracity – The uncertainty and inconsistencies in the data.
Categories of Big data
Big data is categorized into three forms.
1. Structured – Data that can be stored and processed in a predefined format. Ex: tables, RDBMS data.
2. Unstructured – Data without structure or of unknown form. Ex: output returned by a Google search, audio, video, images.
3. Semi-structured – Data that contains both forms. Ex: JSON, CSV, XML, email.
Data types* -> emails, text messages, photos, videos, logs, documents, transactions, click trails, public records, etc.
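The three categories above can be contrasted in a few lines of Python. This is an illustrative sketch only: the records and field names are made up for the example.

```python
import json

# Structured: fixed schema, like an RDBMS row (hypothetical fields)
row = ("C101", "Alice", 2500.00)

# Semi-structured: self-describing but flexible schema (JSON)
doc = json.loads('{"id": "C102", "name": "Bob", "tags": ["vip"]}')

# Unstructured: free text with no predefined form
note = "Customer called about a late delivery; sounded unhappy."

print(row[1], doc["name"], len(note.split()))
```

The structured row has a fixed position for each field, the JSON document names its own fields, and the note has no schema at all.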
Examples of big data
Some examples of big data:
1. Social media: 500+ terabytes of data are generated on Facebook every day, 100,000+ tweets are created every 60 seconds, and 300 hours of video are uploaded to YouTube per minute.
2. Airlines: A single jet engine produces 10+ terabytes of data in 30 minutes of flight time.
Cont..,
3. Stock Exchange – The New York Stock Exchange generates about one terabyte of new trade data every day.
4. Mobile Phones – Every 60 seconds, users generate 698,445+ Google searches, 11,000,000+ instant messages, and 168,000,000 emails.
5. Walmart handles more than 1 million customer transactions every hour.
Sources of big data
1. Activity data – Basic activities such as searches are stored by web browsers, phone usage is recorded by mobile operators, credit card companies record where customers buy, and shops record what they buy.
2. Conversational data – Conversations in emails and on social media sites such as Facebook, Twitter, and so on.
Cont.,
3. Photo and video data – Pictures and videos taken with mobile phones, digital cameras, and CCTV are uploaded heavily to YouTube and social media sites every second.
4. Sensor data – Sensors embedded in devices produce huge amounts of data. Ex: GPS provides the direction and speed of a vehicle.
5. IoT data – Smart TVs, smart watches, smart fridges, etc. Ex: traffic sensors send data to the alarm clock on a smart watch.
Typical Classification
I. Internal data – Supports daily business operations; organizational or enterprise data (structured). Ex: customer data, sales data, ERP, CRM, etc.
II. External data – Analyzed for competitors, the market environment, and technology; social data (unstructured). Ex: the Internet, government, business partners, syndicate data suppliers, etc.
Big data storage
Big data storage is concerned with storing and managing data in a scalable way, satisfying the needs of applications that require access to the data.
 Some of the big data storage technologies are:
1. Distributed file system – Stores large amounts of unstructured data in a reliable way on commodity hardware.
Cont.,
The Hadoop Distributed File System (HDFS) is an integral part of the Hadoop framework; it is designed for large data files and is well suited to quickly ingesting data and bulk processing.
2. NoSQL database – A database that stores and retrieves data modeled in means other than tabular relations, and that lacks ACID transactions.
 Supports both structured and unstructured data
 The data structures used are key-value, wide column, graph, or document
 Less functionality, more performance
 Focuses on scalability, performance, and high availability
Evolution of storage: flat files (no standard implementation) → RDBMS (could not handle big data) → NoSQL.
3. NewSQL database – Provides the scalable performance of NoSQL systems for Online Transaction Processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.
4. Cloud storage – A service model in which data is maintained, managed, and backed up remotely and made available to users over the Internet.
Cont.,
Eliminates the acquisition and management costs of buying and maintaining your own storage infrastructure, increases agility, provides global scale, and delivers "anywhere, anytime" access to data.
Users generally pay for cloud data storage on a per-consumption, "pay as you use" basis.
Data intelligence
Data intelligence – The analysis of various forms of data so that it can be used by companies to expand their services or investments.
Transforming data into information, information into knowledge, and knowledge into value.
Data integration and serialization
Data integration – Combining data residing in different sources and providing users with a unified view of it.
Data serialization – Converting structured data into a format that allows it to be shared or stored in such a way that its original structure can be recovered.
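A minimal sketch of the serialization round trip, using Python's standard json module (any format with the same recover-the-structure property would do):

```python
import json

original = {"id": 7, "items": ["a", "b"], "nested": {"ok": True}}
wire = json.dumps(original)   # serialize: structure -> shareable string
restored = json.loads(wire)   # deserialize: string -> original structure

# The round trip recovers the original structure exactly
assert restored == original
```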
Data monitoring
Data monitoring – Allows an organization to proactively maintain a high, consistent standard of data quality.
• By checking data routinely as it is stored within applications, organizations can avoid the resource-intensive pre-processing of data before it is moved.
• With data monitoring, data quality is checked at creation time rather than before a move.
Data indexing
Data indexing – An index is a data structure added to a file to provide faster access to the data.
• It reduces the number of blocks the DBMS has to check.
• It contains a search key and a pointer. Search key – an attribute or set of attributes used to look up the records in a file.
• Pointer – contains the address where the data is stored.
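The search-key/pointer idea can be sketched in a few lines of Python. This is a toy model (the record layout and offsets are invented for the example): the index maps a key to a "pointer", here a pretend byte offset of the record in a data file, so a lookup avoids scanning every record.

```python
# Toy index: search key (customer id) -> pointer (byte offset of record)
records = [(101, "alice"), (205, "bob"), (309, "carol")]

index = {}
offset = 0
for key, name in records:
    index[key] = offset       # pointer: where the record "lives"
    offset += len(name) + 8   # pretend fixed per-record overhead

def lookup(key):
    return index.get(key)     # O(1) lookup instead of a full scan
```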
Why Big data?
These factors led to the emergence of big data:
1. Increase in storage capacity
2. Increase in processing power
3. Availability of data
4. Deriving insights and driving growth
5. Staying competitive
Benefits of Big Data Processing
1. Businesses gain intelligence for decision making.
2. Better customer service.
3. Early identification of risk in products/services.
4. Improved operational efficiency – product recommendation.
5. Detecting fraudulent behavior.
Applications of Bigdata
 Smarter health care – leverage the health care system with easy access and efficient outcomes
 Multi-channel sales and web display advertisement
 Finance
 Intelligent traffic management
 Manufacturing
 Fraud and risk detection
 Telecom
Analysis Vs Analytics
Analysis – The process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it.
 What happened in the past? It is the process of examining, transforming, and arranging raw data in a specific way to generate useful information from it.
Analytics – A subcomponent of analysis that involves the use of tools and techniques to find novel, valuable, and exploitable patterns. (What will happen in the future?)
Big data analytics
It is the process of
 Collecting, storing, organizing, and analyzing large sets of heterogeneous data to gain insights and discover patterns, correlations, and other useful information
 Faster and better decision making
 Enhanced performance, services, or products
 Cost-effective, next-generation products
Challenges/Opportunity
Roughly 90% of data is unstructured and only 10% is structured; the challenge (and opportunity) is to analyze it and extract meaningful information.
Stages in Big data analytics
I. Identifying the problem
II. Designing data requirements
III. Preprocessing data
IV. Visualizing data
V. Performing analytics over data
Traditional vs Big data analytics
• Traditional analytics: works with well-known, smaller-sized data, built on relational data models.
• Big data analytics: works with data in poorly understood formats, largely semi-structured or unstructured, retrieved from various sources, mostly flat and with no relationships in nature.
Four types of analytics
1. Descriptive Analytics: What happened?
 It is backward-looking and reveals what has occurred in the past with the present data (hindsight)
 Two types: 1) Measures of central tendency (mean, mode, and median)
2) Measures of dispersion (range, variance, and standard deviation)
2. Diagnostic Analytics: Why did this happen? What went wrong?
3. Predictive Analytics: What is likely to happen?
 It predicts what could happen in the future (insight)
 Models used include i) forecasting, ii) simulation, iii) regression, iv) classification, and v) clustering
4. Prescriptive Analytics: What should we do to make it happen?
 It suggests conclusions or actions that can be taken based on the analysis
 Techniques used include i) linear programming, ii) integer programming, iii) mixed integer programming, and iv) nonlinear programming
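The two families of descriptive measures (central tendency and dispersion) can be computed with Python's standard statistics module; the sample numbers below are arbitrary.

```python
from statistics import mean, median, mode, pstdev, pvariance

data = [4, 8, 8, 15, 25]

# Measures of central tendency
central = {"mean": mean(data), "median": median(data), "mode": mode(data)}

# Measures of dispersion (population variance / standard deviation)
spread = {"range": max(data) - min(data),
          "variance": pvariance(data),
          "std_dev": pstdev(data)}
```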
Approach in analytics development
 Identify the data source
 Select the right tools and technology to collect, store, and organize data
 Understand the domain and process the data
 Build a mathematical model for your analytics
 Visualize and validate the result
 Learn, adapt, and rebuild your analytical model
Big data analytics domain
 Web and E-Tailing
 Government
Retail
 Telecommunications
Health care
Finance and banking
Big data techniques
There are seven widely used big data analysis
techniques. They are
1. Association rule learning
2. Classification tree analysis
3. Genetic algorithms
4. Machine learning
5. Regression analysis
6. Sentiment analysis
7. Social network analysis
Association rule learning
 A rule-based machine learning method for discovering interesting relations between variables in large databases.
 To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.
Cont.,
Support – an indication of how frequently the itemset appears in the data set.
Confidence – an indication of how often the rule has been found to be true.
Example rule for a supermarket:
{bread, butter} => {milk} means that if butter and bread are bought, customers also buy milk.
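Support and confidence can be computed directly from a list of transactions. The tiny basket data below is made up to illustrate the {bread, butter} => {milk} rule.

```python
# Each transaction is the set of items bought together (toy data)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def support(itemset):
    # Fraction of transactions containing the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # How often the rule holds when the antecedent is present
    return support(antecedent | consequent) / support(antecedent)

s = support({"bread", "butter", "milk"})          # support of the rule
c = confidence({"bread", "butter"}, {"milk"})     # confidence of the rule
```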
Algorithms for association rule
learning
Some of the familiar algorithms used for mining frequent itemsets are:
1. Apriori algorithm – It uses
a) a breadth-first search strategy to count the support of itemsets
b) a candidate generation function that exploits the downward closure property of support
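Both ideas, breadth-first support counting and downward-closure pruning, fit in a short sketch. This is a toy implementation for small in-memory data, not a production miner:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Toy Apriori: breadth-first support counting plus candidate
    generation that exploits the downward closure property."""
    n = len(transactions)
    frequent = {}
    candidates = list({frozenset([i]) for t in transactions for i in t})
    size = 1
    while candidates:
        # Breadth-first: count support of all same-size candidates at once
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        survivors = {c for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update({c: counts[c] / n for c in survivors})
        size += 1
        # Join survivors; keep only candidates whose subsets all survived
        candidates = list({a | b for a, b in combinations(survivors, 2)
                           if len(a | b) == size
                           and all(frozenset(s) in survivors
                                   for s in combinations(a | b, size - 1))})
    return frequent
```

For example, on five baskets over items a, b, c with a 40% support threshold, it finds all singletons, pairs, and the full triple.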
Equivalence Class Transformation (ECLAT) algorithm
 A depth-first search algorithm using set intersection
 Suitable for serial and parallel execution, with locality-enhancing properties
Frequent Pattern (FP) Growth algorithm
1st phase – the algorithm counts the occurrences of items in the dataset and stores them in a header table.
2nd phase – the FP-tree structure is built by inserting instances. Items in each instance are sorted in descending order of their frequency in the dataset, so that the tree can be processed quickly.
Classification Tree Analysis
It is a type of machine learning algorithm used to predict the class of an object.
It identifies the set of characteristics that best differentiates individuals based on a categorical outcome variable.
Genetic Algorithms
A search-based optimization technique built on the concepts of natural selection and genetics.
In GAs, we have a pool or a population of
possible solutions to the given problem.
These solutions then undergo recombination
and mutation (like in natural genetics),
producing new children, and the process is
repeated over various generations.
Cont.,
 Each individual is assigned a fitness value (based on its objective function value), and fitter individuals are given a higher chance to mate and yield fitter offspring
 Part of the family of evolutionary algorithms
 Three basic operators of a GA: (i) reproduction, (ii) mutation, and (iii) crossover
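The three operators can be seen in a tiny illustrative GA. The encoding (5-bit integers) and the fitness function f(x) = x(31 - x) are invented for the example; real GAs differ mainly in scale and representation.

```python
import random

random.seed(0)  # deterministic for the example

# Maximize f(x) = x * (31 - x) over integers 0..31; x is encoded as 5 bits
def fitness(bits):
    x = int("".join(map(str, bits)), 2)
    return x * (31 - x)

def evolve(pop_size=20, generations=40, mutation_rate=0.05):
    pop = [[random.randint(0, 1) for _ in range(5)] for _ in range(pop_size)]
    for _ in range(generations):
        def pick():  # reproduction: tournament selection favors the fitter
            a, b = random.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            cut = random.randint(1, 4)                 # crossover point
            child = p1[:cut] + p2[cut:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                   # mutation flips bits
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = evolve()
```

The optimum is x = 15 or 16 (fitness 240); over a few dozen generations the population tends toward it.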
Machine Learning
 It is a method of data analysis that automates
analytical model building
It is an application of Artificial Intelligence
based on the idea that machines should be
able to learn and adapt through experience
Within the field of data analytics, machine
learning is a method used to devise complex
models and algorithms that lend themselves
to prediction
Cont.,
• Machine learning is a branch of science that
deals with programming the systems in such a
way that they automatically learn and
improve with experience.
• Learning means recognizing and
understanding the input data and making wise
decisions based on the supplied data.
Cont.,
• It is very difficult to cater to all the decisions
based on all possible inputs. To tackle this
problem, algorithms are developed. These
algorithms build knowledge from specific data
and past experience with the principles of
statistics, probability theory, logic,
combinatorial optimization, search,
reinforcement learning, and control theory.
Learning types
There are several ways to implement machine learning techniques; the most commonly used are:
Supervised learning
Unsupervised learning
Semi-supervised learning
Supervised learning
• Deals with learning a function from available training data: both input and output variables are known, and an algorithm learns the mapping function from input to output [Y = f(X)]
• Analyzes the training data and produces an inferred function, which can be used for mapping new examples
• Some supervised learning algorithms: neural networks, Support Vector Machines (SVMs), Naive Bayes classifiers, random forests, decision trees, and regression
• Ex: classifying spam, voice recognition, regression
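A minimal sketch of learning Y = f(X) from labeled examples, using a 1-nearest-neighbour rule (the training points and labels are made up; this is not a production classifier):

```python
# Labeled training data: (features, label) pairs
train = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
         ((8.0, 9.0), "large"), ((9.5, 8.5), "large")]

def predict(point):
    # Label a new example with the label of its nearest training example
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda ex: dist(ex[0], point))
    return label
```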
Unsupervised Learning
 Makes sense of unlabeled data without any predefined dataset for training: only inputs (X) and no corresponding output variable
 Models the underlying structure or distribution of the data in order to learn more about it
 Most commonly used for clustering similar inputs into logical groups
 Common approaches: k-means, self-organizing maps, and hierarchical clustering
 Techniques: recommendation, association, clustering
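Clustering unlabeled data can be sketched with a tiny k-means (k = 2, one dimension, with a simple min/max initialization chosen for the example):

```python
def kmeans_1d(points, iters=10):
    # Initialize the two centroids at the extremes of the data
    c1, c2 = min(points), max(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted([c1, c2])

# Two obvious groups with no labels attached
centroids = kmeans_1d([1, 2, 3, 10, 11, 12])
```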
Semi Supervised Learning
Problems where you have a large amount of input data (X) and only some of it is labeled.
 Example: a photo archive where only some of the images are labeled and the majority are unlabeled.
Regression Analysis
• It is a set of statistical processes for estimating
the relationships among variables
• Regression analysis helps one understand how
the typical value of the dependent variable (or
'criterion variable') changes when any one of
the independent variables is varied, while the
other independent variables are held fixed.
• Widely used for prediction and forecasting,
where its use has substantial overlap with the
field of machine learning.
Cont.,
• This technique is used for forecasting, time series modeling, and finding causal-effect relationships between variables. For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression.
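The basic estimation step can be shown with simple linear regression via ordinary least squares in closed form. The data points here are arbitrary and chosen to lie on a perfect line so the answer is easy to check.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
```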
Sentiment Analysis / Opinion Mining
Using NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit
Sentiment = feelings
Attitudes – emotions – opinions
Subjective impressions, not facts
*A common use case for this technology is to discover how people feel about a particular topic
Automated extraction of subjective content from digital text, predicting subjectivity such as positive, negative, or neutral
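The positive/negative/neutral prediction can be sketched with a toy lexicon-based scorer. The word lists are invented for the example; real systems use NLP pipelines or trained models rather than a hand-written lexicon.

```python
# Tiny hand-made sentiment lexicons (illustrative only)
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(text):
    words = text.lower().split()
    # Score = positive hits minus negative hits
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```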
Social Network Analysis
• Process of investigating social structures
through the use of networks and graph theory
• It is the mapping and measuring of
relationships and flows between people,
groups, organizations, computers, URLs, and
other connected information/knowledge
entities.
• The nodes in the network are the people and
groups while the links show relationships or
flows between the nodes.
Two types of SNA
• Egocentric analysis
– Focuses on the individual: studies an individual's personal network and its effects on that individual
• Sociocentric analysis
– Focuses on large groups of people; quantifies relationships between people in a group
– Studies patterns of interactions and how these patterns affect the group as a whole
Egocentric Analysis
• Examines local network structure
• Describes the network around a
single node (the ego)
– Number of other nodes (alters)
– Types of connections
• Extracts network features
• Uses these factors to predict health and
longevity, economic success, levels of
depression, access to new opportunities
Sociocentric Analysis
• Quantifies relationships and interactions
between a group of people
• Studies how interactions, patterns of
interactions, and network structure affect
– Concentration of power and resources
– Spread of disease
– Access to new ideas
– Group dynamics
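The node-and-link view above maps directly onto an adjacency structure. A minimal sketch with made-up people: each node's degree (the size of an ego's network of alters) is one of the simplest network features to extract.

```python
# People as nodes, relationships as undirected edges (toy data)
graph = {
    "ann":  {"bob", "cara", "dan"},
    "bob":  {"ann"},
    "cara": {"ann", "dan"},
    "dan":  {"ann", "cara"},
}

def degree(node):
    # Egocentric feature: number of alters around this ego
    return len(graph[node])

most_connected = max(graph, key=degree)
```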
Big data analytics tools and technologies
Hadoop = HDFS + MapReduce
Hive, HBase, Flume, Oozie, Pig, Sqoop, Kafka, Storm, RHadoop, Chukwa
Future role of data
Now: data feeds a Decision Support System (DSS). Future: data feeds a Digital Nervous System (DNS) that continuously cycles through Sense → Interpret → Decide → Act.
History of Hadoop
• 1996–2000: the big data problem is faced by all search engines (Yahoo)
• 2003–04: Google publishes the Google File System and MapReduce papers
• 2005–06: Hadoop spawns at Apache (Doug & Mike)
• 2010: Cloudera
• 2013: next-generation Hadoop / YARN & MapReduce 2
Hadoop
It is an open-source framework used for the distributed storage and processing of big data using the MapReduce programming model.
The core components are:
• Hadoop Common – contains the libraries and utilities needed by other Hadoop modules
• Hadoop Distributed File System (HDFS) – stores data on commodity machines, providing very high aggregate bandwidth across the cluster
• Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users' applications
• Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing
Distributed Computing
The use of commodity hardware and open-source software (scaling out the number of processors) as opposed to expensive proprietary software on expensive hardware (a single large server).
Major Components of Hadoop
Framework
1. HDFS (Hadoop Distributed File System): inspired by the Google File System
2. MapReduce: inspired by Google MapReduce
* Both work on a cluster of systems with a hierarchical architecture
HDFS and MapReduce
[Diagram: a file is split into blocks A, B, and C; the master node (NameNode) tracks how the blocks are distributed, and each block is replicated across several data nodes.]
Master node: monitors the data distributed among the data nodes
Data node: stores the data blocks
* Both are Hadoop daemons: Java programs that run on specific machines
Map Reduce
It is divided into 2 phases:
1. Map – The mapper code is distributed among machines, and each copy works on the data its machine holds (data locality). The locally computed results are aggregated and sent to the reducer.
2. Reduce – The reducer algorithm is applied to the global data to produce the final result. Programmers need to write only the map logic and the reduce logic; the correct distribution of map code to map machines is handled by Hadoop.
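The two phases can be sketched with word count, the canonical MapReduce example. Here both phases run locally in plain Python; Hadoop would distribute the map tasks across machines and shuffle the pairs to reducers.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in this chunk of input
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts per key to produce the final result
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["big data big", "data big"]          # data split across "machines"
mapped = [kv for chunk in chunks for kv in map_phase(chunk)]
counts = reduce_phase(mapped)
```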
Hadoop Ecosystem
Core: HDFS and MapReduce. On top sit HBase, Pig (from Yahoo), Hive (from Facebook), and Sqoop/Flume for data movement.
Pig
 A tool that uses scripting statements to process data
 A simple data-flow language that saves development time and effort
 Typically designed for data scientists with less programming skill
 Developed by Yahoo
Hive
 Provides an SQL-like language that runs on top of MapReduce
 Hive was developed by Facebook for data scientists with less programming skill
 Code written in Pig/Hive gets converted into MapReduce jobs and run on HDFS
Sqoop/Flume
To facilitate the movement of data into Hadoop, Sqoop and Flume are used.
Sqoop is used to move data from relational databases, and Flume is used to ingest data as it is created by external sources.
HBase is a tool that provides real-time database features on top of HDFS.
Mais conteúdo relacionado

Mais procurados

Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
Edureka!
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
smj
 

Mais procurados (20)

Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Data Warehouse
Data Warehouse Data Warehouse
Data Warehouse
 
Introduction data mining
Introduction data miningIntroduction data mining
Introduction data mining
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Big data mining
Big data miningBig data mining
Big data mining
 
Big data
Big dataBig data
Big data
 
Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
 
Big data
Big dataBig data
Big data
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
 
Big data
Big dataBig data
Big data
 
LDM Slides: Conceptual Data Models - How to Get the Attention of Business Use...
LDM Slides: Conceptual Data Models - How to Get the Attention of Business Use...LDM Slides: Conceptual Data Models - How to Get the Attention of Business Use...
LDM Slides: Conceptual Data Models - How to Get the Attention of Business Use...
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
Data analytics
Data analyticsData analytics
Data analytics
 
Big Data Ppt PowerPoint Presentation Slides
Big Data Ppt PowerPoint Presentation Slides Big Data Ppt PowerPoint Presentation Slides
Big Data Ppt PowerPoint Presentation Slides
 
Big data PPT prepared by Hritika Raj (Shivalik college of engg.)
Big data PPT prepared by Hritika Raj (Shivalik college of engg.)Big data PPT prepared by Hritika Raj (Shivalik college of engg.)
Big data PPT prepared by Hritika Raj (Shivalik college of engg.)
 

Semelhante a Big data

UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdf
vvpadhu
 

Semelhante a Big data (20)

Data Science
Data ScienceData Science
Data Science
 
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptxUnit-1 -2-3- BDA PIET 6 AIDS.pptx
Unit-1 -2-3- BDA PIET 6 AIDS.pptx
 
Bigdata
Bigdata Bigdata
Bigdata
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Business Analytics 1 Module 1.pdf
Business Analytics 1 Module 1.pdfBusiness Analytics 1 Module 1.pdf
Business Analytics 1 Module 1.pdf
 
Big data Introduction
Big data IntroductionBig data Introduction
Big data Introduction
 
Real World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining ToolsReal World Application of Big Data In Data Mining Tools
Real World Application of Big Data In Data Mining Tools
 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptx
 
What is Big Data - Edvicon
What is Big Data - EdviconWhat is Big Data - Edvicon
What is Big Data - Edvicon
 
BD1.pptx
BD1.pptxBD1.pptx
BD1.pptx
 
Characterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesCharacterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining Techniques
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Introduction to visualizing Big Data
Introduction to visualizing Big DataIntroduction to visualizing Big Data
Introduction to visualizing Big Data
 
Unit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdfUnit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdf
 
Unit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdfUnit-1 introduction to Big data.pdf
Unit-1 introduction to Big data.pdf
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
 
Sgcp14dunlea
Sgcp14dunleaSgcp14dunlea
Sgcp14dunlea
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
UNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdfUNIT 1 -BIG DATA ANALYTICS Full.pdf
UNIT 1 -BIG DATA ANALYTICS Full.pdf
 

Último

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
Tonystark477637
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Dr.Costas Sachpazis
 

Último (20)

The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

Big data

  • 1. BIG DATA Prepared by Bhuvaneshwari.P Research Scholar, VIT university, Vellore
  • 2. Introduction DEFINITION Big data is defined as the collection of large and complex datasets that are difficult to process using database system tools or traditional data processing application software. Mainframe (Kilobytes) Client /Server (Megabytes) The Internet (Gigabytes) [Big data] Mobile, Social media… (Zettabytes)
  • 3. Characteristics of Big data The characteristics of big data are specified with the 5 V’s: 1. Volume – The vast amount of data generated every second. [Kilobytes -> Mega -> Giga -> Tera -> Peta -> Exa -> Zetta -> Yotta] 2. Variety – The different kinds of data generated from different sources. 3. Velocity – The speed at which data is generated, processed, and moved around. 4. Value – Deriving the correct meaning out of the available data. 5. Veracity – The uncertainty and inconsistencies in the data.
  • 4. Categories of Big data Big data is categorized into three forms. 1. Structured – Data that can be stored and processed in a predefined format. Ex: tables, RDBMS data. 2. Unstructured – Any data without a structure or of unknown form. Ex: output returned by a Google search, audio, video, images. 3. Semi-structured – Data that contains both of the above forms. Ex: JSON, CSV, XML, email. Data types* -> emails, text messages, photos, videos, logs, documents, transactions, click trails, public records, etc.
  • 5. Examples of big data Some examples of big data 1. Social media: 500+ terabytes of data are generated on Facebook every day, 100,000+ tweets are created every 60 seconds, and 300 hours of video are uploaded to YouTube per minute. 2. Airlines: A single jet engine produces 10+ terabytes of data in 30 minutes of flight time.
  • 6. Cont., 3. Stock Exchange – The New York Stock Exchange generates about one terabyte of new trade data every day. 4. Mobile Phones – Every 60 seconds, 698,445+ Google searches, 11,000,000+ instant messages, and 168,000,000 emails are generated by users. 5. Walmart handles more than 1 million customer transactions every hour.
  • 7. Sources of big data 1. Activity data – Basic activities such as searches are stored by the web browser, phone usage is stored by mobile phones, credit card companies store where customers buy, and shops store what they buy. 2. Conversational data – Conversations in emails and on social media sites such as Facebook, Twitter, and so on.
  • 8. Cont., 3. Photo and video data – The pictures and videos taken with mobile phones, digital cameras, and CCTV are uploaded heavily to YouTube and social media sites every second. 4. Sensor data – The sensors embedded in all devices produce huge amounts of data. Ex: GPS provides the direction and speed of a vehicle. 5. IoT data – Smart TVs, smart watches, smart fridges, etc. Ex: traffic sensors send data to the alarm clock in a smart watch.
  • 9. Typical Classification I. Internal data – Supports daily business operations, such as organizational or enterprise data (structured). Ex: customer data, sales data, ERP, CRM, etc. II. External data – Analyzed for competitors, the market environment, and technology, such as social data (unstructured). Ex: Internet, government, business partners, syndicate data suppliers, etc.
  • 10. Big data storage Big data storage is concerned with storing and managing data in a scalable way, satisfying the needs of applications that require access to the data. Some of the big data storage technologies are 1. Distributed file system – Stores large amounts of unstructured data in a reliable way on commodity hardware.
  • 11. Cont., The Hadoop Distributed File System (HDFS) is an integral part of the Hadoop framework; it is designed for large data files and is well suited for quickly ingesting data and bulk processing. 2. NoSQL database – A database that stores and retrieves data modeled in means other than tabular relations; it lacks ACID transactions. Supports both structured and unstructured data.
  • 12. The data structures used are key-value, wide column, graph, or document. Less functionality, more performance: NoSQL focuses on scalability, performance, and high availability. [Evolution: Flat files (no standard implementation) -> RDBMS (could not handle big data) -> NoSQL]
  • 13. 3. NewSQL database – Provides the same scalable performance as NoSQL systems for Online Transaction Processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system. 4. Cloud storage – A service model in which data is maintained, managed, and backed up remotely, and made available to users over the Internet.
  • 14. Cont., Eliminates the acquisition and management costs of buying and maintaining your own storage infrastructure, increases agility, provides global scale, and delivers "anywhere, anytime" access to data. Users generally pay for their cloud data storage on a per-consumption basis ("pay as you use").
  • 15. Data intelligence Data intelligence - Analysis of various forms of data in such a way that it can be used by companies to expand their services or investments Transforming data into information, information into knowledge, and knowledge into value
  • 16. Data integration and serialization Data integration – Combining data residing in different sources and providing users with a unified view of it. Data serialization – The concept of converting structured data into a format that allows it to be shared or stored in such a way that its original structure can be recovered.
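As a quick sketch of serialization, Python's standard json module converts a structured record to text and recovers the original structure on the way back (the record shown is made up for illustration):

```python
import json

# A structured record, e.g. a customer transaction (illustrative data)
record = {"id": 101, "items": ["bread", "butter", "milk"], "total": 7.25}

# Serialize: convert the structure into a text format for storage or sharing
payload = json.dumps(record)

# Deserialize: recover the original structure from the stored format
restored = json.loads(payload)

assert restored == record  # the original structure is recovered intact
```

The same round-trip idea applies to other serialization formats such as XML or Avro; JSON is used here only because it ships with the standard library.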
  • 17. Data monitoring Data monitoring – Allows an organization to proactively maintain a high, consistent standard of data quality. • By checking data routinely as it is stored within applications, organizations can avoid the resource-intensive pre-processing of data before it is moved. • With data monitoring, data quality is checked at creation time rather than before a move.
  • 18. Data indexing Data indexing- It is a data structure that is added to a file to provide faster access to the data. • It reduces the number of blocks that the DBMS has to check. • It contains a search key and a pointer. Search key - an attribute or set of attributes that is used to look up the records in a file. • Pointer - contains the address of where the data is stored in memory.
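The search-key/pointer idea above can be sketched in a few lines: a toy index maps each key to the position of its record, so a lookup follows one pointer instead of scanning every block (the names and data are invented for illustration):

```python
# Records standing in for file blocks (illustrative data)
records = [
    {"id": 7, "name": "Asha"},
    {"id": 3, "name": "Ravi"},
    {"id": 9, "name": "Meena"},
]

# Build the index once: search key (id) -> "pointer" (record position)
index = {rec["id"]: pos for pos, rec in enumerate(records)}

def lookup(key):
    # Follow one pointer instead of scanning every record
    return records[index[key]]

print(lookup(9)["name"])  # Meena
```

A real DBMS index (e.g. a B-tree) keeps the keys sorted and on disk, but the key-to-pointer mapping shown here is the same core idea.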
  • 19. Why Big data? These are the factors that lead to the emergence of big data: 1. Increase in storage capacity 2. Increase in processing power 3. Availability of data 4. Deriving insights and driving growth 5. The need to be competitive
  • 20. Benefits of Big Data Processing 1. Businesses gain intelligence for decision making. 2. Better customer service. 3. Early identification of risk in products/services. 4. Improved operational efficiency – product recommendation. 5. Detecting fraudulent behavior.
  • 21. Applications of Big data Smarter health care – leverages the health care system with easy access and efficient outcomes Multi-channel sales and web display advertising Finance intelligence Traffic management Manufacturing Fraud and risk detection Telecom
  • 22. Analysis Vs Analytics Analysis – The process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it. What happened in the past? It is the process of examining, transforming, and arranging raw data in a specific way to generate useful information from it. Analytics – A subcomponent of analysis that involves the use of tools and techniques to find novel, valuable, and exploitable patterns. (What will happen in the future?)
  • 23. Big data analytics It is the process of collecting, storing, organizing, and analyzing large sets of heterogeneous data to gain insights and discover patterns, correlations, and other useful information. Faster and better decision making. Enhanced performance, service, or product. Cost-effective and next-generation products.
  • 24. Challenges/Opportunity Unstructured data (90%) vs. structured data (10%): the challenge is to analyze & extract meaningful information.
  • 25. Stages in Big data analytics I. Identifying problem II. Designing data requirements III. Preprocessing data IV. Visualizing data and V. Performing analytics over data
  • 26. Traditional Vs Big data analytics Traditional analytics: analytics with well-known data that is smaller in size; built on relational data models. Big data analytics: data in not-well-understood formats, largely semi-structured or unstructured; retrieved from various sources, almost flat, with no relationships in nature.
  • 27. Four types of analytics 1. Descriptive Analytics: What happened? It is backward looking and reveals what has occurred in the past with the present data (hindsight). Two types: 1) Measures of central tendency (mean, mode, and median) 2) Measures of dispersion (range, variance, and standard deviation)
  • 28. 2. Diagnostic Analytics: Why did this happen? What went wrong? 3. Predictive Analytics: What is likely to happen? It predicts what could happen in the future (insight). Several models used are i) Forecasting, ii) Simulation, iii) Regression, iv) Classification, and v) Clustering
  • 29. 4. Prescriptive analytics – What should we do to make it happen? It suggests conclusions or actions that can be taken based on the analysis. Techniques used are i) Linear programming, ii) Integer programming, iii) Mixed integer programming, and iv) Nonlinear programming
  • 30. Approach in analytics development Identify the data source. Select the right tools and technology to collect, store, and organize the data. Understand the domain and process the data. Build a mathematical model for your analytics. Visualize and validate the result. Learn, adapt, and rebuild your analytical model.
  • 31. Big data analytics domain Web and e-tailing Government Retail Telecommunication Health care Finance and banking
  • 32. Big data techniques There are seven widely used big data analysis techniques. They are 1. Association rule learning 2. Classification tree analysis 3. Genetic algorithms 4. Machine learning 5. Regression analysis 6. Sentiment analysis 7. Social network analysis
  • 33. Association rule learning A rule-based machine learning method for discovering interesting relations between variables in large databases. In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.
  • 34. Cont., Support – An indication of how frequently the itemset appears in the dataset. Confidence – An indication of how often the rule has been found to be true. Example rule for a supermarket: {bread, butter} => {milk}. It means that if bread and butter are bought, customers also buy milk.
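The two measures can be computed directly; the sketch below evaluates the {bread, butter} => {milk} rule over a small, made-up set of transactions:

```python
# Market-basket transactions (illustrative data)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "jam"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"bread", "butter"}
consequent = {"milk"}

sup = support(antecedent | consequent)   # how often the whole rule appears
conf = sup / support(antecedent)         # how often the rule holds when it applies

print(f"support={sup:.2f}, confidence={conf:.2f}")  # support=0.40, confidence=0.67
```

A mining algorithm such as Apriori would enumerate candidate rules and keep only those above minimum support and confidence thresholds; this snippet shows just the two measures themselves.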
  • 35. Algorithms for association rule learning Some of the familiar algorithms used for mining frequent itemsets are 1. Apriori algorithm – It uses a) a breadth-first search strategy to count the support of itemsets and b) a candidate generation function that exploits the downward closure property of support.
  • 36. Equivalence class transformation (ECLAT) algorithm A depth-first search algorithm using set intersection. Suitable for serial and parallel execution, with locality-enhancing properties.
  • 37. Frequent Pattern (FP) Growth algorithm 1st phase – The algorithm counts the number of occurrences of items in the dataset and stores them in a header table. 2nd phase – The FP-tree structure is built by inserting instances. Items in each instance are sorted in descending order of their frequency in the dataset, so that the tree can be processed quickly.
  • 38. Classification Tree Analysis A type of machine learning algorithm used to predict the class of an object. It identifies the set of characteristics that best differentiates individuals based on a categorical outcome variable.
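A single split step of classification-tree learning can be sketched as follows: try each threshold on one numeric feature and keep the one that minimises weighted Gini impurity (the data and class labels are invented for illustration):

```python
# (feature value, class label) pairs; two well-separated classes (toy data)
data = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (7.0, "B"), (8.0, "B"), (9.0, "B")]

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows):
    """Try each observed value as a threshold; keep the lowest weighted impurity."""
    best = (None, float("inf"))
    for threshold, _ in rows:
        left = [lbl for x, lbl in rows if x < threshold]
        right = [lbl for x, lbl in rows if x >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[1]:
            best = (threshold, score)
    return best

threshold, impurity = best_split(data)
print(threshold, impurity)  # 7.0 0.0 (a perfect split on this toy data)
```

A full tree learner applies this step recursively to each resulting partition until the leaves are pure enough; libraries such as scikit-learn automate exactly this search.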
  • 39. Genetic Algorithms A search-based optimization technique based on the concepts of natural selection and genetics. In GAs, we have a pool or population of possible solutions to the given problem. These solutions then undergo recombination and mutation (as in natural genetics), producing new children, and the process is repeated over various generations.
  • 40. Cont., Each individual is assigned a fitness value (based on its objective function value), and fitter individuals are given a higher chance to mate and yield more "fit" individuals. GAs are part of the family of evolutionary algorithms. Three basic operators of a GA: (i) Reproduction, (ii) Mutation, and (iii) Crossover.
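A minimal GA with these three operators can be sketched on the classic OneMax problem (evolve a bit string toward all 1s); the population size, mutation rate, and other parameters are illustrative choices, not prescribed values:

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

LENGTH, POP, GENERATIONS, MUT_RATE = 20, 30, 60, 0.02

def fitness(ind):
    # OneMax objective: count the 1 bits
    return sum(ind)

def select(pop):
    # Tournament selection: fitter individuals get a higher chance to mate
    return max(random.sample(pop, 3), key=fitness)

def crossover(a, b):
    # Single-point recombination of two parents
    point = random.randrange(1, LENGTH)
    return a[:point] + b[point:]

def mutate(ind):
    # Flip each bit with a small probability
    return [bit ^ 1 if random.random() < MUT_RATE else bit for bit in ind]

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best))  # selection pressure drives this toward LENGTH
```

Reproduction, crossover, and mutation from the slide map directly onto select, crossover, and mutate above; real applications replace the bit string and fitness function with a problem-specific encoding.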
  • 41. Machine Learning  It is a method of data analysis that automates analytical model building It is an application of Artificial Intelligence based on the idea that machines should be able to learn and adapt through experience Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction
  • 42. Cont., • Machine learning is a branch of science that deals with programming the systems in such a way that they automatically learn and improve with experience. • Learning means recognizing and understanding the input data and making wise decisions based on the supplied data.
  • 43. Cont., • It is very difficult to cater to all the decisions based on all possible inputs. To tackle this problem, algorithms are developed. These algorithms build knowledge from specific data and past experience with the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement learning, and control theory.
  • 44. Learning types There are several ways to implement machine learning techniques; the most commonly used ones are Supervised learning Unsupervised learning Semi-supervised learning
  • 45. Supervised learning • Deals with learning a function from available training data, with known input and output variables. An algorithm is used to learn the mapping function from input to output [Y = f(X)]. • It analyzes the training data and produces an inferred function, which can be used for mapping new examples. • Some supervised learning algorithms are neural networks, Support Vector Machines (SVMs), Naive Bayes classifiers, random forests, decision trees, and regression. • Ex: classifying spam, voice recognition, regression.
  • 46. Unsupervised Learning Makes sense of unlabeled data without having any predefined dataset for its training. There is only input (X) and no corresponding output variable. It models the underlying structure or distribution of the data in order to learn more about it. It is most commonly used for clustering similar inputs into logical groups. Common approaches: k-means, self-organizing maps, and hierarchical clustering. Techniques: recommendation, association, clustering.
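A minimal k-means sketch illustrates the clustering idea: unlabeled 1-D points are grouped by alternating an assignment step and a centroid-update step (the data and initial centroids are invented):

```python
# Unlabeled 1-D points forming two obvious groups (illustrative data)
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [0.0, 10.0]  # initial guesses for k = 2 clusters

for _ in range(10):
    # Assignment step: attach each point to its nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]

print(centroids)  # converges to roughly [1.0, 8.07]
```

No labels were used anywhere: the structure (two groups) is discovered from the inputs alone, which is exactly what the slide means by modeling the underlying distribution of the data.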
  • 47. Semi-Supervised Learning Problems where you have a large amount of input data (X) and only some of the data is labeled. Example: a photo archive where only some of the images are labeled and the majority are unlabeled.
  • 48. Regression Analysis • It is a set of statistical processes for estimating the relationships among variables • Regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. • Widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning.
  • 49. Cont., • This technique is used for forecasting, time series modeling and finding the causal effect relationship between the variables. For example, relationship between rash driving and number of road accidents by a driver is best studied through regression.
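The rash-driving example can be sketched with ordinary least squares: estimate the slope and intercept from paired observations (the numbers below are made up for illustration):

```python
# Paired observations (illustrative data): independent vs dependent variable
x = [1, 2, 3, 4, 5]    # e.g. rash-driving incidents observed
y = [2, 4, 6, 8, 10]   # e.g. road accidents recorded

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Ordinary least squares: slope = covariance(x, y) / variance(x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

def predict(xi):
    # Fitted line: how the dependent variable changes as x varies
    return intercept + slope * xi

print(slope, intercept, predict(6))  # 2.0 0.0 12.0
```

With multiple independent variables, the same idea generalizes to multiple regression, where each coefficient is estimated while the other variables are held fixed, as the slide describes.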
  • 50. Sentiment Analysis / Opinion Mining Using NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit. Sentiment = feelings: attitudes, emotions, opinions. Subjective impressions, not facts.
  • 51. *A common use case for this technology is to discover how people feel about a particular topic. Automated extraction of subjective content from digital text and prediction of its subjectivity, such as positive, negative, or neutral.
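A minimal lexicon-based scorer illustrates the positive/negative/neutral prediction; the tiny word lists are illustrative stand-ins for the large lexicons or trained models real systems use:

```python
# Tiny illustrative sentiment lexicons (real systems use much larger ones)
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment(text):
    """Score a text by counting positive vs negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product"))          # positive
print(sentiment("terrible service and poor quality"))  # negative
```

This word-counting approach misses negation and sarcasm, which is why the slide also lists statistics and machine learning methods; the snippet only shows the basic extract-and-classify pipeline.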
  • 52. Social Network Analysis • Process of investigating social structures through the use of networks and graph theory • It is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities. • The nodes in the network are the people and groups while the links show relationships or flows between the nodes.
  • 53. Two types of SNA • Egocentric Analysis – Focuses on the individual and studies an individual’s personal network and its effects on that individual • Sociocentric Analysis – Focuses on large groups of people – Quantifies relationships between people in a group – Studies patterns of interactions and how these patterns affect the group as a whole
  • 54. Egocentric Analysis • Examines local network structure • Describes the network around a single node (the ego) – Number of other nodes (alters) – Types of connections • Extracts network features • Uses these factors to predict health and longevity, economic success, levels of depression, access to new opportunities
  • 55. Sociocentric Analysis • Quantifies relationships and interactions between a group of people • Studies how interactions, patterns of interactions, and network structure affect – Concentration of power and resources – Spread of disease – Access to new ideas – Group dynamics
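A small sketch of the sociocentric view: build an adjacency list from a toy network and compute each node's degree (number of connections), one of the simplest measures of how power and resources concentrate; the names and edges are invented:

```python
# Toy social network: each pair is a relationship between two people
edges = [("Ana", "Ben"), ("Ana", "Cara"), ("Ana", "Dev"), ("Ben", "Cara")]

# Adjacency list: node -> set of directly connected nodes
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

# Degree centrality: count each node's connections; the max-degree node
# is the network's hub
degree = {node: len(neigh) for node, neigh in adjacency.items()}
hub = max(degree, key=degree.get)
print(hub, degree[hub])  # Ana 3
```

Richer measures (betweenness, closeness, clustering coefficient) build on the same node-and-edge representation; graph libraries such as NetworkX provide them ready-made.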
  • 56. Big data analytics tools and technologies Hadoop = HDFS + MapReduce Hive HBase Flume Oozie Pig Sqoop Kafka Storm RHadoop Chukwa
  • 57. Future role of data Now: Decision Support System. Future: Digital Nervous System (DNS), a loop of Data -> Sense -> Interpret -> Decide -> Act.
  • 58. History of Hadoop 1996-2000: Big data problem faced by all search engines (Yahoo). 2003-04: Google publishes the Google File System and MapReduce papers. 2005-06: Hadoop spawns at Apache (Doug & Mike). 2010: Cloudera. 2013: Next-generation Hadoop / YARN & MapReduce 2.
  • 59. Hadoop An open-source framework used for distributed storage and processing of big data datasets using the MapReduce programming model. • The core components are i) Hadoop Common – contains the libraries and utilities needed by other Hadoop modules;
  • 60. • Hadoop Distributed File System (HDFS) – Stores data on commodity machines, providing very high aggregate bandwidth across the cluster • Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users' applications • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
  • 61. Distributed Computing The use of commodity hardware and open-source software (increasing the number of processors), as against expensive proprietary software on expensive hardware (servers).
  • 62. Major Components of Hadoop Framework 1. HDFS (Hadoop Distributed File System): inspired by the Google File System 2. MapReduce: inspired by Google MapReduce * Both work on clusters of systems with a hierarchical architecture.
  • 64. Master Node: Monitors the data distributed among the data nodes. Data Node: Stores the data blocks. * Both are Hadoop daemons – actually Java programs running on specific machines.
  • 65. Map Reduce It is divided into 2 phases 1. Map – The mapper code is distributed among machines, and each works on the data its system holds (data locality). The locally computed results are aggregated and sent to the reducer.
  • 66. 2. Reduce – The reducer algorithm is applied to the global data to produce the final result. Programmers need to write only the map logic and the reduce logic; the correct distribution of map code to the map machines is handled by Hadoop.
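The map and reduce logic can be simulated in one process: map emits (word, 1) pairs, a shuffle groups them by key, and reduce sums each group; word count is the canonical example (the documents are illustrative):

```python
from collections import defaultdict

# Input "files" (illustrative documents)
documents = ["big data needs big storage", "big data needs fast processing"]

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word; this is the only map
    # logic a programmer would write
    for word in doc.split():
        yield (word, 1)

# Shuffle: group intermediate (key, value) pairs by key; Hadoop does this
# between the map and reduce machines
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

# Reduce: aggregate each group into the final result; this is the only
# reduce logic a programmer would write
result = {word: sum(counts) for word, counts in groups.items()}
print(result["big"], result["data"])  # 3 2
```

In real Hadoop the map calls run on the machines holding the data blocks (data locality) and the shuffle moves data over the network, but the programmer-visible contract is exactly the two functions sketched here.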
  • 67. Hadoop Ecosystem HDFS, MapReduce, HBase, Pig (Yahoo), Hive (Facebook), Sqoop/Flume
  • 68. Pig A tool that uses scripting statements to process data. A simple data-flow language that saves development time and effort. It was typically designed for data scientists who have fewer programming skills. It was developed by Yahoo.
  • 69. Hive It provides an SQL-like language tool that runs on top of MapReduce. Hive was developed by Facebook for data scientists who have fewer programming skills. The code written in Pig/Hive gets converted into MapReduce jobs and runs on HDFS.
  • 70. Sqoop / Flume In order to facilitate the movement of data into Hadoop, Sqoop/Flume is used. Sqoop is used to move data from relational databases, and Flume is used to ingest data as it is created by an external source. HBase is a tool that provides features like a real-time database on top of HDFS.
