This document provides an overview of big data, including its definition, characteristics, categories, sources, storage, analytics, challenges, and opportunities. Big data refers to large and complex datasets that are difficult to process using traditional database management tools. It is characterized by the 5 V's: volume, variety, velocity, value, and veracity. Big data comes from both internal and external sources and can be structured, unstructured, or semi-structured. It requires specialized storage technologies such as Hadoop and NoSQL databases. Big data analytics uses techniques such as machine learning, regression analysis, and social network analysis to gain insights. The growth of big data presents both challenges in processing diverse and voluminous data and opportunities to generate value.
2. Introduction
DEFINITION
Big data is defined as the collection of large and complex datasets that are difficult to process using traditional database management tools or data processing application software.
Growth of data scale: Mainframe (kilobytes) → Client/Server (megabytes) → The Internet (gigabytes) → Mobile and social media (zettabytes), the era of big data.
3. Characteristics of Big data
The characteristics of big data are described by the 5 V's:
1. Volume – the vast amount of data generated every second. [Kilobytes -> Mega -> Giga -> Tera -> Peta -> Exa -> Zetta -> Yotta]
2. Variety – the different kinds of data generated from different sources.
3. Velocity – the speed at which data is generated, processed, and moved around.
4. Value – the ability to draw the correct meaning out of the available data.
5. Veracity – the uncertainty and inconsistencies in the data.
4. Categories of Big data
Big data is categorized into three forms.
1. Structured – data that can be stored and processed in a predefined format. Ex: tables, RDBMS data.
2. Unstructured – any data without a known structure or form. Ex: results returned by a Google search, audio, video, images.
3. Semi-structured – data that contains elements of both forms (see the sketch below). Ex: JSON, CSV, XML, email.
Common data types include emails, text messages, photos, videos, logs, documents, transactions, click trails, public records, etc.
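A minimal Python sketch of how semi-structured data such as JSON carries its structure in keys rather than in a fixed schema; the record shown is made up purely for illustration.

```python
import json

# A hypothetical semi-structured record: fields are self-describing,
# but there is no fixed schema ("tags" is optional and variable-length).
raw = '{"id": 101, "user": "alice", "tags": ["sports", "travel"], "bio": null}'

record = json.loads(raw)           # parse JSON text into a Python dict
print(record["user"])              # -> alice
print(record.get("tags", []))      # -> ['sports', 'travel']

# Serializing back preserves the original structure.
print(json.dumps(record, indent=2))
```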
5. Examples of big data
Some examples of big data
1. Social media: more than 500 terabytes of data are generated on Facebook every day, over 100,000 tweets are created every 60 seconds, and 300 hours of video are uploaded to YouTube per minute.
2. Airlines: a single jet engine produces more than 10 terabytes of data in 30 minutes of flight time.
6. Cont.,
3. Stock exchange: the New York Stock Exchange generates about one terabyte of new trade data every day.
4. Mobile phones: every 60 seconds, users generate more than 698,445 Google searches, 11,000,000+ instant messages, and 168,000,000 emails.
5. Walmart handles more than 1 million customer transactions every hour.
7. Sources of big data
1. Activity data – basic activities such as searches are stored by the web browser, phone usage is logged by mobile phones, credit card companies record where customers buy, and shops record what they buy.
2. Conversational data – conversations in emails and on social media sites such as Facebook, Twitter, and so on.
8. Cont.,
3. Photo and video data – pictures and videos taken with mobile phones, digital cameras, and CCTV are uploaded to YouTube and social media sites in huge volumes every second.
4. Sensor data – the sensors embedded in devices produce huge amounts of data. Ex: GPS provides the direction and speed of a vehicle.
5. IoT data – smart TVs, smart watches, smart fridges, etc. Ex: traffic sensors send data to the alarm clock on a smart watch.
9. Typical Classification
I. Internal data – supports daily business operations; organizational or enterprise data (structured). Ex: customer data, sales data, ERP, CRM, etc.
II. External data – analyzed to understand competitors, the market environment, and technology; typically social data (unstructured). Ex: the Internet, government, business partners, syndicated data suppliers, etc.
10. Big data storage
Big data storage is concerned with storing and
managing data in a scalable way, satisfying
the needs of applications that require access
to the data.
Some of the big data storage technologies are:
1. Distributed file system – stores large amounts of unstructured data reliably on commodity hardware.
11. Cont.,
The Hadoop Distributed File System (HDFS) is an integral part of the Hadoop framework; it is designed for large data files and is well suited for quickly ingesting data and bulk processing.
2. NoSQL database – a database that stores and retrieves data modeled in means other than the tabular relations of relational databases, and typically without full ACID transactions.
Supports both structured and unstructured data.
12. The data structures used are key-value, wide column, graph, or document.
Less functionality, more performance.
NoSQL databases focus on scalability, performance, and high availability.
Evolution of storage: flat files (no standard implementation) → RDBMS (could not handle big data) → NoSQL.
13. 3. NewSQL database – provides the same scalable performance as NoSQL systems for Online Transaction Processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.
4. Cloud storage – a service model in which data is maintained, managed, and backed up remotely and made available to users over the Internet.
14. Cont.,
Cloud storage eliminates the acquisition and management costs of buying and maintaining your own storage infrastructure, increases agility, provides global scale, and delivers "anywhere, anytime" access to data.
Users generally pay for their cloud data storage on a per-consumption, "pay-as-you-use" basis.
15. Data intelligence
Data intelligence - Analysis of various forms of
data in such a way that it can be used by
companies to expand their services or
investments
Transforming data into information,
information into knowledge, and knowledge
into value
16. Data integration and serialization
Data integration- Combining data residing in
different sources and providing users with a
unified view of them
Data serialization – converting structured data into a format that allows it to be shared or stored in such a way that its original structure can be recovered (a small sketch follows).
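A minimal Python sketch of serialization, assuming the standard pickle module is an acceptable format; the order record is hypothetical.

```python
import pickle

# A hypothetical in-memory record with nested structure.
order = {"order_id": 42, "items": [("bread", 2), ("milk", 1)], "total": 5.75}

# Serialize: convert the structured object into bytes that can be
# stored in a file or sent over a network.
blob = pickle.dumps(order)

# Deserialize: the original structure is recovered exactly.
restored = pickle.loads(blob)
assert restored == order
print(restored["items"])   # -> [('bread', 2), ('milk', 1)]
```

The same idea applies to formats such as JSON or XML: the serialized bytes can be stored or shipped elsewhere, and deserializing them recovers the original structure.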
17. Data monitoring
Data monitoring- It allows an organization to
proactively maintain a high, consistent
standard of data quality
• By checking data routinely as it is stored
within applications, organizations can avoid
the resource-intensive pre-processing of data
before it is moved
• With data monitoring, data quality is checked at creation time rather than just before a move.
18. Data indexing
Data indexing- It is a data structure that is
added to a file to provide faster access to the
data.
• It reduces the number of blocks that the DBMS has to check.
• An index entry contains a search key and a pointer (a small sketch follows). Search key – an attribute or set of attributes used to look up records in a file.
• Pointer – contains the address of where the data is stored.
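A toy Python sketch of the index idea: the search key maps to a pointer (here, the record's position), so a lookup does not scan every record. The record layout and field names are invented for illustration.

```python
# Minimal sketch of an index: a mapping from a search key to the
# offset ("pointer") of the record, so a lookup avoids a full scan.
records = [
    {"id": 7,  "name": "Ann"},
    {"id": 3,  "name": "Raj"},
    {"id": 12, "name": "Mei"},
]

# Build the index once: search key (id) -> position of the record.
index = {rec["id"]: pos for pos, rec in enumerate(records)}

def lookup(record_id):
    """Follow the pointer instead of scanning every record."""
    pos = index.get(record_id)
    return records[pos] if pos is not None else None

print(lookup(12))   # -> {'id': 12, 'name': 'Mei'}
```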
19. Why Big data?
These are the factors that led to the emergence of big data:
1. Increase of storage capacity
2. Increase of processing power
3. Availability of data
4. Derive insights and drive growth
5. To be competitive
20. Benefits of Big Data Processing
1. Businesses gain intelligence for decision making.
2. Better customer service.
3. Early identification of risk in products/services.
4. Improved operational efficiency – e.g., product recommendation.
5. Detecting fraudulent behavior.
21. Applications of Bigdata
Smarter health care – improve the health care system with easier access and more efficient outcomes
Multi-channel sales and web display advertisement
Finance
Intelligent traffic management
Manufacturing
Fraud and risk detection
Telecom
22. Analysis Vs Analytics
Analysis – the process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it. It answers "What happened in the past?" and is the process of examining, transforming, and arranging raw data in a specific way to generate useful information from it.
Analytics – a sub-component of analysis that involves the use of tools and techniques to find novel, valuable, and exploitable patterns (What will happen in the future?).
23. Big data analytics
It is the process of collecting, storing, organizing, and analyzing large sets of heterogeneous data to gain insights and discover patterns, correlations, and other useful information.
Faster and better decision making
Enhanced performance, service, or product
Cost-effective and next-generation products
25. Stages in Big data analytics
I. Identifying the problem
II. Designing data requirements
III. Preprocessing data
IV. Visualizing data, and
V. Performing analytics over the data
26. Traditional vs Big data analytics
Traditional analytics: analytics with well-known data that is smaller in size; built on relational data models.
Big data analytics: works with data in formats that are not well understood, largely semi-structured or unstructured; the data is retrieved from various sources and is mostly flat, with little or no relationship in nature.
27. Four types of analytics
1. Descriptive analytics: What happened?
It is backward-looking and reveals what has occurred in the past using the present data (hindsight).
Two types (a small sketch follows): 1) measures of central tendency (mean, mode, and median); 2) measures of dispersion (range, variance, and standard deviation).
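A small Python sketch of the two kinds of descriptive measures, using the standard statistics module on a made-up daily sales series.

```python
import statistics

# Hypothetical daily sales figures used to describe "what happened".
sales = [12, 15, 15, 18, 22, 22, 22, 30]

# Measures of central tendency
print("mean  :", statistics.mean(sales))       # 19.5
print("median:", statistics.median(sales))     # 20.0
print("mode  :", statistics.mode(sales))       # 22

# Measures of dispersion
print("range :", max(sales) - min(sales))      # 18
print("var   :", statistics.pvariance(sales))  # population variance
print("stdev :", statistics.pstdev(sales))     # population standard deviation
```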
28. 2. Diagnostic analytics: Why did this happen? What went wrong?
3. Predictive analytics: What is likely to happen?
It predicts what could happen in the future (insight).
Commonly used models are i) forecasting, ii) simulation, iii) regression, iv) classification, and v) clustering.
29. 4. Prescriptive analytics – What should we do to make it happen?
It suggests conclusions or actions that can be taken based on the analysis.
Techniques used are i) linear programming, ii) integer programming, iii) mixed-integer programming, and iv) non-linear programming.
30. Approach in analytics development
Identify the data source
Select the right tools and technology to collect, store, and organize data
Understand the domain and process the data
Build a mathematical model for your analytics
Visualize and validate the results
Learn, adapt, and rebuild your analytical model
31. Big data analytics domain
Web and E-Tailing
Government
Retail
Telecommunication
Health care
Finance and banking
32. Big data techniques
There are seven widely used big data analysis
techniques. They are
1. Association rule learning
2. Classification tree analysis
3. Genetic algorithms
4. Machine learning
5. Regression analysis
6. Sentiment analysis
7. Social network analysis
33. Association rule learning
A rule-based machine learning method for discovering interesting relations between variables in large databases.
In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.
34. Cont.,
Support – an indication of how frequently the item set appears in the data set.
Confidence – an indication of how often the rule has been found to be true.
Example rule for a supermarket (a small sketch follows):
{bread, butter} => {milk} means that if bread and butter are bought, customers are also likely to buy milk.
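A small Python sketch computing support and confidence for the {bread, butter} => {milk} rule over a handful of invented transactions.

```python
# Support and confidence for the rule {bread, butter} -> {milk},
# computed over a few made-up supermarket baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

antecedent = {"bread", "butter"}
rule_items = antecedent | {"milk"}

n = len(transactions)
support_rule = sum(rule_items <= t for t in transactions) / n   # baskets containing all rule items
support_ante = sum(antecedent <= t for t in transactions) / n   # baskets containing the antecedent
confidence = support_rule / support_ante

print(f"support    = {support_rule:.2f}")   # fraction of baskets with bread, butter and milk
print(f"confidence = {confidence:.2f}")     # of baskets with bread and butter, how many also have milk
```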
35. Algorithms for association rule
learning
Some of the familiar algorithms used for mining frequent item sets are:
1. Apriori algorithm – it uses
a) a breadth-first search strategy to count the support of item sets, and
b) a candidate generation function that exploits the downward closure property of support.
36. Equivalence class transformation
(ECLAT) algorithm
A depth-first search algorithm using set intersection.
Suitable for serial and parallel execution, with locality-enhancing properties.
37. Frequent Pattern (FP) Growth
algorithm
1st phase – the algorithm counts the number of occurrences of each item in the dataset and stores them in a header table.
2nd phase – the FP-tree structure is built by inserting instances. Items in each instance are sorted in descending order of their frequency in the dataset, so that the tree can be processed quickly.
38. Classification Tree Analysis
It is a type of machine learning algorithm used to predict the class of an object (a small sketch follows).
It identifies the set of characteristics that best differentiates individuals based on a categorical outcome variable.
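A minimal sketch using scikit-learn (assumed to be installed) that fits a classification tree on the bundled Iris dataset and predicts the class of a new observation; the sample values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a classification tree on the bundled Iris dataset.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Predict the class of a new, unseen observation (values are illustrative).
sample = [[5.1, 3.5, 1.4, 0.2]]
print(tree.predict(sample))        # e.g. [0], the 'setosa' class
print(tree.feature_importances_)   # which characteristics differentiate best
```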
39. Genetic Algorithms
Search based optimization technique based
on the concepts of natural selection and
genetics
In GAs, we have a pool or a population of
possible solutions to the given problem.
These solutions then undergo recombination
and mutation (like in natural genetics),
producing new children, and the process is
repeated over various generations.
40. Cont.,
Each individual is assigned a fitness value (based on its objective function value), and fitter individuals are given a higher chance to mate and yield more "fit" individuals.
GAs are part of the family of evolutionary algorithms.
The three basic operators of a GA are (i) reproduction, (ii) mutation, and (iii) crossover; a small sketch follows.
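A toy genetic algorithm sketch in Python for the classic "one-max" problem (maximize the number of 1s in a bit string); the population size, mutation rate, and other parameters are arbitrary illustrative choices.

```python
import random

random.seed(0)

# Toy "one-max" problem: evolve a bit string with as many 1s as possible.
GENES, POP, GENERATIONS, MUT_RATE = 20, 30, 40, 0.02

def fitness(ind):
    # Objective function value: number of 1s in the individual.
    return sum(ind)

def select(pop):
    # Tournament selection: fitter individuals get a higher chance to mate.
    return max(random.sample(pop, 3), key=fitness)

def crossover(a, b):
    # Single-point recombination of two parents.
    point = random.randrange(1, GENES)
    return a[:point] + b[point:]

def mutate(ind):
    # Flip each gene with a small probability.
    return [1 - g if random.random() < MUT_RATE else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print(fitness(best), best)
```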
41. Machine Learning
It is a method of data analysis that automates
analytical model building
It is an application of Artificial Intelligence
based on the idea that machines should be
able to learn and adapt through experience
Within the field of data analytics, machine
learning is a method used to devise complex
models and algorithms that lend themselves
to prediction
42. Cont.,
• Machine learning is a branch of science that
deals with programming the systems in such a
way that they automatically learn and
improve with experience.
• Learning means recognizing and
understanding the input data and making wise
decisions based on the supplied data.
43. Cont.,
• It is very difficult to cater to all the decisions
based on all possible inputs. To tackle this
problem, algorithms are developed. These
algorithms build knowledge from specific data
and past experience with the principles of
statistics, probability theory, logic,
combinatorial optimization, search,
reinforcement learning, and control theory.
44. Learning types
There are several ways to implement machine learning techniques; the most commonly used are:
Supervised learning
Unsupervised learning
Semi-supervised learning
45. Supervised learning
• Deals with learning a function from available training data. Both the input and output variables are known; an algorithm is used to learn the mapping function from input to output [Y = f(X)].
• Analyzes the training data and produces an inferred function, which can be used for mapping new examples.
• Some supervised learning algorithms are neural networks, Support Vector Machines (SVMs), Naive Bayes classifiers, random forests, decision trees, and regression.
• Ex: classifying spam, voice recognition, regression.
46. Unsupervised Learning
Makes sense of unlabeled data without any predefined dataset for its training; there is only input (X) and no corresponding output variable.
It models the underlying structure or distribution in the data in order to learn more about the data.
It is most commonly used for clustering similar inputs into logical groups (a small sketch follows).
Common approaches: k-means, self-organizing maps, and hierarchical clustering.
Techniques: recommendation, association, clustering.
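A minimal k-means sketch using scikit-learn (assumed to be installed) that groups unlabeled 2-D points into two clusters; the points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points (invented): two loose groups, no output variable.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [7.9, 7.8], [8.3, 8.1]])

# K-means groups similar inputs into k logical clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster assignment of each point, e.g. [0 0 0 1 1 1]
print(km.cluster_centers_)  # learned structure: one centre per group
```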
47. Semi Supervised Learning
Problems where you have a large amount of input data (X) and only some of the data is labeled.
Example: a photo archive where only some of the images are labeled and the majority are unlabeled.
48. Regression Analysis
• It is a set of statistical processes for estimating
the relationships among variables
• Regression analysis helps one understand how
the typical value of the dependent variable (or
'criterion variable') changes when any one of
the independent variables is varied, while the
other independent variables are held fixed.
• Widely used for prediction and forecasting,
where its use has substantial overlap with the
field of machine learning.
49. Cont.,
• This technique is used for forecasting, time series modeling, and finding the causal-effect relationship between variables. For example, the relationship between rash driving and the number of road accidents caused by a driver is best studied through regression (a small sketch follows).
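A minimal regression sketch with scikit-learn (assumed to be installed) that fits a line relating rash-driving incidents to accidents; the data points are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: rash-driving incidents vs. accidents per driver.
rash_driving = np.array([[0], [1], [2], [3], [4], [5]])   # independent variable
accidents    = np.array([0, 1, 1, 2, 3, 3])               # dependent variable

model = LinearRegression().fit(rash_driving, accidents)

print("slope    :", model.coef_[0])        # change in accidents per extra incident
print("intercept:", model.intercept_)
print("forecast :", model.predict([[6]]))  # predicted accidents for 6 incidents
```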
50. Sentiment Analysis/ Opinion
Mining
Using NLP, statistics, or machine learning
methods to extract, identify, or otherwise
characterize the sentiment content of a text
unit
Sentiment = feelings
Attitudes – Emotions – Opinions
Subjective impressions, not facts
51. A common use case for this technology is to discover how people feel about a particular topic.
It is the automated extraction of subjective content from digital text and prediction of its polarity as positive, negative, or neutral (a small sketch follows).
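A deliberately tiny lexicon-based sentiment scorer in Python; the word lists are invented, and real systems would use NLP libraries or trained machine-learning models as described above.

```python
# Toy lexicon-based polarity scorer (illustrative only).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this phone, the camera is excellent"))   # positive
print(sentiment("terrible battery and poor service"))            # negative
print(sentiment("the parcel arrived on Tuesday"))                # neutral
```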
52. Social Network Analysis
• Process of investigating social structures
through the use of networks and graph theory
• It is the mapping and measuring of
relationships and flows between people,
groups, organizations, computers, URLs, and
other connected information/knowledge
entities.
• The nodes in the network are the people and
groups while the links show relationships or
flows between the nodes.
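A minimal sketch using the networkx library (assumed to be installed), where people are nodes and relationships are edges; the names and links are made up.

```python
import networkx as nx

# A small invented network: nodes are people, edges are relationships.
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Ann", "Cara"), ("Ann", "Dev"),
                  ("Bob", "Cara"), ("Dev", "Eli")])

# Degree centrality: how connected each node is relative to the rest.
print(nx.degree_centrality(G))
# Betweenness centrality: how often a node lies on paths between others.
print(nx.betweenness_centrality(G))
```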
53. Two types of SNA
• Egocentric analysis
– Focuses on the individual and studies an individual's personal network and its effects on that individual
• Sociocentric Analysis
– Focuses on large groups of people – Quantifies
relationships between people in a group
– Studies patterns of interactions and how these
patterns affect the group as a whole
54. Egocentric Analysis
• Examines local network structure
• Describes the network around a
single node (the ego)
– Number of other nodes (alters)
– Types of connections
• Extracts network features
• Uses these factors to predict health and
longevity, economic success, levels of
depression, access to new opportunities
55. Sociocentric Analysis
• Quantifies relationships and interactions
between a group of people
• Studies how interactions, patterns of
interactions, and network structure affect
– Concentration of power and resources
– Spread of disease
– Access to new ideas
– Group dynamics
56. Big data analytics tools and technologies
Hadoop = HDFS + MapReduce
Ecosystem tools: Hive, HBase, Flume, Oozie, Pig, Sqoop, Kafka, Storm, RHadoop, Chukwa
57. Future role of data
Now: data feeds a Decision Support System.
Future: data feeds a Digital Nervous System (DNS) that continuously senses, interprets, decides, and acts.
58. History of Hadoop
1996-2000 – the big data problem is faced by all search engines (e.g., Yahoo!)
2003-04 – Google publishes the Google File System and MapReduce papers
2005-06 – Hadoop spawns at Apache (Doug Cutting and Mike Cafarella)
2010 – Cloudera
2013 – next-generation Hadoop: YARN and MapReduce 2
59. Hadoop
It is an open-source framework for the distributed storage and processing of big data datasets using the MapReduce programming model.
• The core components are: i) Hadoop Common – contains libraries and utilities needed by the other Hadoop modules;
60. • Hadoop Distributed File System (HDFS) –
Stores data on commodity machines,
providing very high aggregate bandwidth
across the cluster
• Hadoop YARN – a platform responsible for
managing computing resources in clusters and
using them for scheduling users' applications
• Hadoop MapReduce – an implementation of
the MapReduce programming model for
large-scale data processing.
61. Distributed Computing
Use of commodity hardware and open-source software (scaling out by increasing the number of processors) instead of expensive proprietary software on expensive server hardware.
62. Major Components of Hadoop
Framework
1. HDFS (Hadoop Distributed File System): inspired by the Google File System
2. MapReduce: inspired by Google MapReduce
* Both work on a cluster of systems with a hierarchical architecture.
64. Master Node: monitors the data distributed among the data nodes.
Data Node: stores the data blocks.
* Both are Hadoop daemons – Java programs that run on specific machines.
65. Map Reduce
It is divided into 2 phases.
1. Map – the mapper code is distributed among machines and works on the data that each machine holds (data locality). The locally computed results are aggregated and sent to the reducer.
66. 2. Reduce – the reducer algorithm is applied to the globally aggregated data to produce the final result.
Programmers need to write only the map logic and the reduce logic; the correct distribution of the map code to the map machines is handled by Hadoop (a word-count sketch in the MapReduce style follows).
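A single-process Python sketch of word count expressed in the MapReduce style; real Hadoop would distribute the map tasks to the machines holding the data blocks and shuffle the intermediate pairs to the reducers.

```python
from collections import defaultdict
from itertools import chain

# Invented documents standing in for data blocks spread across a cluster.
documents = ["big data is big", "data moves fast", "big insights from data"]

def map_phase(doc):
    # Mapper: emit (word, 1) for every word in the local chunk of data.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Reducer: aggregate the counts for each key across all mappers.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

mapped = chain.from_iterable(map_phase(d) for d in documents)
print(reduce_phase(mapped))
# -> {'big': 3, 'data': 3, 'is': 1, 'moves': 1, 'fast': 1, 'insights': 1, 'from': 1}
```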
68. Pig
It is a tool that uses scripting statements to process data.
A simple data-flow language that saves development time and effort.
It was designed primarily for data scientists with less programming experience.
It was developed by Yahoo!.
69. Hive
It provides an SQL-like language that runs on top of MapReduce.
Hive was developed by Facebook for data scientists with less programming experience.
Code written in Pig/Hive is converted into MapReduce jobs and run on data in HDFS.
70. Sqoop/ Flume
In order to facilitate the movement of data into Hadoop, Sqoop and Flume are used.
Sqoop is used to move data from relational databases, and Flume is used to ingest data as it is created by external sources.
HBase is a tool that provides real-time database features on top of data stored in HDFS.