2. It describes the large volume of data, both
structured and unstructured, that a business
deals with on a day-to-day basis.
The data may run to petabytes or even
exabytes.
3. Volume. The sheer amount of data that is
generated and stored.
Velocity. The speed at which data arrives and
must be processed.
Variety. Data comes in all types of formats,
from structured, numeric data in traditional
databases to unstructured text documents,
email, video, audio and financial transactions.
4. Variability. In addition to the increasing
velocities and varieties of data, data flows can
be highly inconsistent, with periodic peaks, which
makes it hard to process data efficiently within
a given time period.
Complexity. Today's data comes from multiple
sources, which makes it difficult to link, match,
cleanse and transform data across systems.
It is also complex to connect and correlate
relationships and hierarchies across that data.
5. Hadoop is a framework for handling large
datasets in a distributed computing
environment.
The core of Apache Hadoop consists of a
storage part, known as Hadoop Distributed File
System (HDFS), and a processing part
called MapReduce.
Hadoop splits files into large blocks and
distributes them across nodes in a cluster.
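As a rough illustration of that splitting step, the following
Python sketch (hypothetical, not the actual HDFS code) chops a
byte stream into fixed-size blocks and spreads them round-robin
over a list of made-up node names. Real HDFS commonly defaults
to 128 MB blocks and also replicates each block, which this toy
version skips.

    def split_into_blocks(data, block_size):
        """Yield consecutive fixed-size chunks of the input bytes."""
        for offset in range(0, len(data), block_size):
            yield data[offset:offset + block_size]

    def distribute(blocks, nodes):
        """Assign blocks to nodes round-robin (no replication here)."""
        placement = {node: [] for node in nodes}
        for i, block in enumerate(blocks):
            placement[nodes[i % len(nodes)]].append(block)
        return placement

    if __name__ == "__main__":
        data = b"0123456789" * 5                 # 50 bytes standing in for a big file
        blocks = list(split_into_blocks(data, block_size=16))  # toy block size
        nodes = ["node1", "node2", "node3"]      # hypothetical cluster nodes
        for node, assigned in distribute(blocks, nodes).items():
            print(node, [len(b) for b in assigned])

Running it prints the block sizes landing on each node, including
the short final block (16, 16, 16, 2 bytes for this 50-byte input).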
6. The map task takes a set of data and converts
it into another set of data, where individual
elements are broken down into tuples
(key/value pairs).
The reduce task then takes the output from a
map as its input and combines those data
tuples into a smaller set of tuples.
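To make the two steps concrete, here is a minimal word-count
sketch in plain Python (a stand-in for a real Hadoop job; the
function names are illustrative): the map phase emits (word, 1)
tuples, and the reduce phase combines them into a smaller set
of (word, count) tuples.

    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        """Map: break each input line down into (word, 1) tuples."""
        for line in lines:
            for word in line.split():
                yield (word, 1)

    def reduce_phase(pairs):
        """Reduce: combine map output into a smaller set of tuples."""
        # Hadoop's shuffle/sort groups pairs by key between map and
        # reduce; sorted() plus groupby() stands in for that here.
        keyed = sorted(pairs, key=itemgetter(0))
        for word, group in groupby(keyed, key=itemgetter(0)):
            yield (word, sum(count for _, count in group))

    if __name__ == "__main__":
        lines = ["big data is big", "data is data"]
        print(list(reduce_phase(map_phase(lines))))
        # [('big', 2), ('data', 3), ('is', 2)]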