Big data refers to large, complex datasets that are difficult to store and process using traditional database systems. The size at which data counts as "big" depends on the capabilities of the organization handling it. Big data is classified by the 5 V's: volume, variety, velocity, veracity, and value. Volume is the large amount of data generated from sources like social media. Variety means data comes in many formats: structured, semi-structured, and unstructured. Velocity is the speed at which data is generated and processed. Veracity refers to inconsistencies and uncertainties in data from different sources. Value means extracting useful information and insights from big data. The core problems with big data are storing and processing large, complex datasets quickly.
1. What is Big Data?
Big Data is a collection of datasets so large and complex that it becomes difficult for traditional database applications to store and process them.
From what point onwards does big data start?
A common assumption: beyond a certain size, data is said to be big data; otherwise it is small data. But this is not the case.
2. [Diagram] Company A says: "Hey, I want to process my 1 TB of data." Client C (system capacity 2 TB) replies: "I will process your data, as my system's processing capability is up to 2 TB." Client B (system capacity 500 GB) replies: "Sorry, I cannot process your data, because my system's capability for data processing is up to 500 GB only."
3. [Diagram, continued] For the system that is unable to handle processing requests for data larger than 500 GB, the same 1 TB of data is Big Data. Big Data can start from anywhere; it depends upon the capability of the organization.
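The scenario above can be sketched as a tiny check (the sizes come from the diagram; the function name is made up for this sketch). The point is that "big" is relative to each organization's processing capacity:

```python
def is_big_data(data_size_gb: float, capacity_gb: float) -> bool:
    """Data is 'big' for an organization when it exceeds
    what that organization's systems can process."""
    return data_size_gb > capacity_gb

client_data_gb = 1024  # the 1 TB of data from the diagram

# Client C can process up to 2 TB, Client B only up to 500 GB.
print(is_big_data(client_data_gb, 2048))  # False: not big data for C
print(is_big_data(client_data_gb, 500))   # True: big data for B
```

The same dataset is small data for one organization and big data for another.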
4. Classification of Big Data
Big Data is classified using the concept of the 5 V's, which help determine which data will be difficult for us to process and which will not.
The 5 V's are:
Volume
Variety
Velocity
Veracity
Value
Let us understand them one by one.
5. Volume
Volume refers to the amount of data generated.
Let us understand this with a simple scenario.
Suppose a social media platform, say Facebook, has 5 million users. These users exchange pictures, share videos, and send or post messages, thereby generating terabytes or even petabytes of data.
Over time, the number of users is expected to increase, and hence the amount of data generated will grow very large.
Large amounts of data also result in the creation of large files.
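As a back-of-the-envelope illustration of the scenario above (the per-user figure is an assumption, not from the slides), the volume generated by 5 million users can be estimated like this:

```python
users = 5_000_000
avg_data_per_user_mb = 50  # assumed daily pictures/videos/messages per user

total_mb = users * avg_data_per_user_mb
total_tb = total_mb / (1024 * 1024)  # MB -> TB

print(f"{total_tb:.1f} TB generated per day")  # ~238.4 TB
```

Even a modest per-user figure puts the platform in terabytes per day, which is why volume alone strains traditional systems.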
6. Variety
Variety refers to the different types of data that are generated from various sources.
Data can be:
• Structured data is stored with a fixed schema, in the form of records or files, and is easy to query or analyse, e.g. database tables.
• Semi-structured data is not stored in a repository such as an RDBMS, but it carries tags or markers that describe the information within it, e.g. XML documents, log files.
• Unstructured data is not organized into any predefined format, and is therefore hard to query or analyse directly, e.g. photos, videos.
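The three kinds above can be illustrated with small samples (the field names and byte values are made up for this sketch):

```python
import xml.etree.ElementTree as ET

# Structured: fixed schema, directly queryable (like a table row)
row = {"id": 1, "name": "Alice", "city": "Pune"}
print(row["city"])  # direct field access

# Semi-structured: no fixed schema, but tags describe the content
xml_doc = "<user><name>Alice</name><city>Pune</city></user>"
print(ET.fromstring(xml_doc).find("city").text)  # navigable via tags

# Unstructured: raw bytes (e.g. the start of a JPEG image);
# there are no fields to query - the content must be analysed
photo_bytes = b"\xff\xd8\xff\xe0"
```

Moving down the list, each form requires progressively more work before it can be queried.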
7. Velocity
Velocity refers to the speed at which data is generated and processed.
It can be tracked, for example, as the number of active users per unit of time.
More users ultimately result in the generation of larger amounts of data, thereby affecting the speed at which that data can be processed.
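Velocity is commonly quantified as throughput, e.g. events arriving per second (the numbers below are illustrative, not from the slides):

```python
events = 12_000_000        # messages posted in the window (assumed)
window_seconds = 60 * 60   # a one-hour window

events_per_second = events / window_seconds
print(f"{events_per_second:.0f} events/s")  # ~3333 events/s
```

A system must sustain this arrival rate, or data will pile up faster than it can be processed.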
8. Veracity
Data collected from various sources contains many inconsistencies and uncertainties. When useful information is extracted from such a large amount of data and the remaining data is discarded, some data is bound to be lost in the process.
What we have to do is fill in those gaps, then mine and process the data again to achieve the desired goals.
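"Filling in the gaps" can be sketched as simple cleaning of an inconsistent record set; here missing values are replaced with the mean of the known ones (the data is hypothetical):

```python
# Readings from several sources; None marks a missing/uncertain value
readings = [10.0, None, 12.0, None, 14.0]

# Fill each gap with the mean of the values we do trust
known = [r for r in readings if r is not None]
fill = sum(known) / len(known)  # 12.0
cleaned = [r if r is not None else fill for r in readings]

print(cleaned)  # [10.0, 12.0, 12.0, 12.0, 14.0]
```

Real pipelines use richer imputation strategies, but the idea is the same: repair the gaps before mining the data again.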
9. Value
The value of data is the meaningful information it contains.
As the amount of data increases with time, a bigger problem arises: how to extract useful data from this large collection.
First, we have to extract meaningful data from the collection, and then perform some analytics over the extracted data.
The result obtained after analysis should be of some value.
Extracting value from a large amount of data is itself a challenge.
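The two steps above, extracting meaningful data and then running analytics over it, can be sketched as follows (the records and fields are hypothetical):

```python
# Raw collection: only some records are meaningful for our question
posts = [
    {"user": "a", "likes": 120, "spam": False},
    {"user": "b", "likes": 3,   "spam": True},
    {"user": "c", "likes": 85,  "spam": False},
]

# Step 1: extract meaningful data (drop the spam)
useful = [p for p in posts if not p["spam"]]

# Step 2: analytics over the extracted data
avg_likes = sum(p["likes"] for p in useful) / len(useful)
print(avg_likes)  # 102.5
```

The analysis result (here, average engagement of genuine posts) is the "value"; computing it at big-data scale is what makes this V a challenge.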
10. Sources of Big Data
Some of the sources of Big Data are:
• Users
• Systems
• Applications and Sensors
• Social Media
• Small-scale, mid-scale, and large-scale industries, and so on
These sources generate larger and larger amounts of data, at varying speeds and in varying formats. All of these factors create challenges for traditional database systems, hence the term 'BIG DATA'.
11. Problems with Big Data
• Storing exponentially growing, huge datasets
• Processing the data with complex structures i.e. data
can be structured, semi-structured or unstructured
• Speed of data processing
In other words, we can conclude that the big data problem arises from three prime factors: VOLUME, VARIETY, and VELOCITY.
Solution?