All About Big Data
By
Sai Venkatesh Attaluri
Head – BD & Big Data Analytics
Netxcell Limited
 Big data is a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools. The
challenges include capture, curation, storage, search, sharing, analysis,
and visualization. The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of data,
allowing correlations to be found to "spot business trends, determine
quality of research, prevent diseases, link legal citations, combat crime,
and determine real-time roadway traffic conditions." (Wikipedia)
 “Any fool can make things bigger, more complex, and more violent. It takes
a touch of genius-and a lot of courage-to move in the opposite direction.” -
Albert Einstein
Big Data - Definition
Simplifying the Definition
• Big data refers to data that is too big to fit on a single
server, too unstructured to fit into a row-and-column
database, or too continuously flowing to fit into a
static data warehouse. - Thomas H. Davenport
• Put another way, big data is the realization of greater
business intelligence by storing, processing, and
analyzing data that was previously ignored due to the
limitations of traditional data management
technologies.
About Big Data
 Every second of every day, businesses generate more data. Researchers
at IDC estimate that by the end of 2013, the amount of stored data will
exceed 4 zettabytes, or 4 billion terabytes.
 All of that big data represents a big opportunity for organizations.
 Big data is a term applied to data sets whose size is beyond the ability of
commonly used software tools to capture, manage, and process the data
within a tolerable elapsed time.
 In simplest terms, "Big Data" refers to the tools, processes and
procedures that allow an organization to create, manipulate, and manage
extremely large data sets, measured in terabytes, petabytes, or even
zettabytes.
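As a quick sanity check on these scales, here is a minimal Python sketch; it assumes decimal (SI) units, so the figures are approximate:

```python
# Back-of-envelope check of the storage scales mentioned above (decimal SI units).
TB = 10**12  # terabyte
PB = 10**15  # petabyte
ZB = 10**21  # zettabyte

stored_2013 = 4 * ZB
print(stored_2013 // TB)  # 4000000000 -> 4 billion terabytes, matching the IDC figure
print(ZB // PB)           # 1000000    -> one zettabyte is a million petabytes
```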
How Does Big Data Differ from
Traditional Transactional Systems?
• Traditional Transaction Systems (TTS) are designed and implemented to track
information whose format and use are known ahead of time. Big Data systems are
deployed when the questions to be asked and the data formats to be examined
aren't known ahead of time.
• Data that resides within the fixed confines of a record or file is known as
structured data. Data that comes from a variety of sources, such as emails,
text documents, videos, photos, audio files, and social media posts, is
referred to as unstructured data.
• TTS don't support unstructured data. Structured data, even in large volumes,
can be entered, stored, queried, and analyzed in a simple and straightforward
manner; this type of data is best served by a traditional transaction database.
TTS Vs Big Data
• Companies whose data workloads are constant and predictable will be better
served by a traditional database. Companies challenged by increasing data
demands will want to take advantage of Big Data's scalable infrastructure,
which allows servers to be added on demand to accommodate growing workloads.
• In cases where organizations rely on time-sensitive data analysis, a
traditional database is the better fit. That's because shorter time-to-insight
isn't about analyzing large unstructured datasets; it's about analyzing
smaller data sets in real or near-real time, which is what traditional
databases are well equipped to do.
• Big Data is designed for large distributed data processing that addresses
every file in the database, and that type of processing takes time. For tasks
where fast performance isn't critical, such as running end-of-day reports to
review daily transactions, scanning historical data, and performing analytics
where a slower time-to-insight is acceptable, Big Data is ideal.
TTS Vs Big Data (Continued)
 Unfortunately, extracting valuable information from big
data isn't as easy as it sounds. Big data amplifies any
existing problems in your infrastructure, processes, or
even the data itself.
 It is also misrepresented by the media, making it difficult
for organizations to determine whether investing in Big Data
will bring the expected results: improved efficiency and
better products and services.
Misconceptions of Big Data
The Promise of Big Data
 Companies recognize that big data contains valuable information that can help them:
 Obtain actionable insights
 Understand product performance
 Deepen customer relationships
 Understand customer behavior
 Prevent threats and fraud
 Identify new revenue opportunities
80-90% of data produced today is unstructured
Evolution of big data
The 4 V's of Big Data: Volume, Variety, Velocity, and Veracity
 To make the most of the information in their systems, companies
must successfully deal with the 4 V's that distinguish big data:
1. Variety
2. Volume
3. Velocity and
4. Veracity.
 The first three—variety, volume and velocity—define big data:
when you have a large volume of data coming in from a wide
variety of applications and formats, and it's moving and changing at
a rapid velocity, that's when you know you have big data.
Definition of the V's
 Volume
– Big Data tools and services are designed to manage extremely large and
growing sources of data that require capabilities beyond those found in
traditional database engines. Ex: extremely large volumes of data
 Variety
– Big Data tools manage an extensive variety of data as well. This means
having the capability to manage structured data, much like the
capabilities offered by a database engine. They go beyond supporting
structured data to working with non-structured data, such as
documents, spreadsheets, presentation decks and the like, as well as log data
coming from operating systems, database engines, application frameworks,
retail point-of-sale systems, mobile communications systems and more.
Ex: structured and unstructured data, images, documents, etc.
Definition of the V's
 Velocity
– The ability to gather, analyze and report on rapidly changing sets of data. In
some cases, this means having the capability to manage data that changes
so rapidly that the updated data cannot be saved to traditional disk drives
before it is changed again.
Simple term: quickly moving data
 Veracity
– Veracity is a measure of the accuracy and trustworthiness of your data.
Veracity is a goal, one that the variety, volume and velocity of big data make
harder to achieve.
Simple term: trust and integrity
Definition of the V's
• 2.5 quintillion bytes of data are generated every day!
– A quintillion is 10^18 (a quick back-of-envelope conversion follows the table below)
• Data comes from many sources.
– Social media sites
– Sensors
– Digital photos
– Business transactions
– Location-based data
Lots of Data
Style of Data | Source of Data | Industry Affected | Function Affected
Large Volume | Online | Financial Services | Marketing
Unstructured | Video | Health Care | Supply Chain
Continuous Flow | Sensor | Manufacturing | Human Resources
Multiple Formats | Genomic | Travel / Transport | Finance
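To make the velocity figure above concrete, a back-of-envelope conversion in Python (decimal units assumed; the daily total is the slide's own estimate):

```python
# Convert 2.5 quintillion bytes per day into a sustained per-second rate.
bytes_per_day = 2.5 * 10**18      # 2.5 quintillion bytes
seconds_per_day = 24 * 60 * 60    # 86,400 seconds

rate_tb_per_s = bytes_per_day / seconds_per_day / 10**12
print(f"~{rate_tb_per_s:.1f} TB generated every second")  # roughly 28.9 TB/s
```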
• Aspects of the way in which users want to interact with their data…
– Totality: Users have an increased desire to process and analyze
all available data
– Exploration: Users apply analytic approaches where the schema
is defined in response to the nature of the query
– Frequency: Users have a desire to increase the rate of analysis
in order to generate more accurate and timely business
intelligence
– Dependency: Users need to balance investment in existing
technologies and skills with the adoption of new techniques
• So in a Nutshell, Big Data is about better analytics
The Need for Big Data
Term | Time Frame | Specific Meaning
Decision Support | 1970-1985 | Use of data analysis to support decision making
Executive Support | 1980-1990 | Focus on data analysis for decisions by senior executives
Online Analytical Processing (OLAP) | 1990-2000 | Software for analyzing multidimensional data tables
Business Intelligence | 1989-2005 | Tools to support data-driven decisions, with emphasis on reporting
Analytics | 2006-2010 | Focus on statistical and mathematical analysis for decisions
Big Data | 2010-present & next 10 years | Focus on very large, unstructured, fast-moving data
Terminology For Using and Analyzing data
 Your company can take advantage of the opportunities available in big data only
when you have processes and solutions that can handle all 4 V's.
 Many of the previous attempts to address the need to gather information from
the rapidly growing, rapidly changing and broad types of data have been based
upon the use of special-purpose, complex and highly expensive computing
systems. Today's Big Data Solutions are built upon a different foundation.
 Rather than trying to use a single very powerful, dedicated database system,
clusters of inexpensive, powerful, industry-standard (x86) systems are
harnessed to attack these very large problems.
 The clustered approach uses commodity systems, storage, and memory. It also
adds the benefit of being more reliable. The failure of any single system in the
cluster will not stop processing.
Technology Shift
Gartner's Visualization on Big Data
• Problems:
– Although there is a massive spike in available data, the percentage of the
data that an enterprise can understand is on the decline
– The data that the enterprise is trying to understand is saturated with
both useful signals and lots of noise
Big Data – Conundrum
Benefits of Big Data
Big Data Platform Manifesto
High Level Architecture of Recognizer
[Architecture diagram: a Big Data platform on Hadoop FM exposing APIs to
3rd-party and enterprise systems; data sources such as OBD, IVR, PCA, Greybox
and others, together with historical data, feed a recommendation engine that
drives business intelligence, churn prediction, and predictive analysis.]
Medium Level Architecture of Recognizer
What is Hadoop?
• Hadoop is a free software framework developed by the
Apache Software Foundation to support distributed
processing of data. Initially Hadoop was developed in the
Java™ language, but today many other languages can be used
to script against it. Hadoop is used as the core platform to
structure Big Data and helps in performing data analytics.
• This distributed processing framework is designed to harness
together the power of many computers, each having its own
processing and storage, and to provide the capability to quickly
process large, distributed data sets.
Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System is designed to
support large data sets made up of rapidly
changing structured and non-structured data.
MapReduce
• MapReduce is a tool designed to allow analysts
and developers to rapidly sift through massive
amounts of data to examine only those data
items that match a specified set of criteria.
Introduction to Hadoop
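To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python, written in the style of Hadoop Streaming, where the mapper emits key/value pairs and the reducer receives them grouped by key. This is an illustrative sketch, not code from the deck; on a real cluster the two functions would run as separate distributed tasks over HDFS data.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit (word, 1) for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Hadoop sorts and shuffles mapper output by key, so equal words arrive
    # together; sorted() stands in for that shuffle here.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data is big", "data moves fast"]
    for word, total in reducer(mapper(sample)):
        print(word, total)  # big 2, data 2, fast 1, is 1, moves 1
```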
Hadoop Components
Sqoop, Flume, ZooKeeper, Oozie, Pig, Mahout, R Connectors, Hive,
MapReduce, HDFS, HBase, MongoDB, Cloudera, Hortonworks, Kafka, YARN,
Cassandra, VMware Player, SQL, NoSQL, MetaStore, Scala, Query Compiler,
Hadoop Cluster, Execution Engine, Ambari
Hadoop Architecture & Components
Apache Hadoop Architecture
• Hadoop uses a master/slave architecture, with the NameNode as the master
and the DataNodes as the slaves.
Apache Sqoop
• Apache Sqoop is a command-line tool for transferring data between relational
databases and Hadoop. Sqoop, similar to other ETL tools, uses schema metadata to
infer data types and ensure type-safe data handling when the data moves from the
source to Hadoop.
Apache HBase
• Apache HBase is a column-oriented key/value data store built to run on top of the
Hadoop Distributed File System (HDFS). HBase is designed to support high table-update
rates and to scale out horizontally in distributed compute clusters. Its focus on scale
enables it to support very large database tables.
Apache Zookeeper
• Apache ZooKeeper is an open source coordination service and application
program interface (API) that allows distributed processes in large systems to
synchronize with each other so that all clients making requests receive
consistent data.
Let Us See Hadoop Components
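As a small illustration of the ZooKeeper coordination described above, here is a sketch using the third-party kazoo client for Python. It assumes an ensemble reachable at localhost:2181; the znode path and payload are hypothetical.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (assumed to be running on localhost:2181).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Every client reading this znode sees a consistent view of the data.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_flag=on")

data, stat = zk.get("/app/config")
print(data.decode(), "version:", stat.version)

zk.stop()
```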
Apache Hive
• Hive is an open-source data warehousing system used to analyze large
datasets stored in Hadoop files. It has three key functions:
data summarization, query, and analysis.
HDFS
• The Hadoop Distributed File System (HDFS) is a distributed file system that
shares some of the features of other distributed file systems. It is used for
storing and retrieving unstructured data.
MapReduce
• MapReduce is a core component of Hadoop and is responsible for
processing jobs in distributed mode.
Pig
• Apache Pig is a platform for analyzing large datasets; it includes a
high-level language for expressing data analysis programs. Pig is
one of the components of the Hadoop ecosystem.
Let Us See Hadoop Components – Contd..
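To show how Hive's SQL-like interface is used in practice, here is a sketch with the third-party PyHive library. It assumes a HiveServer2 instance on localhost:10000 and a hypothetical page_views table; the query compiles into distributed jobs over data in Hadoop.

```python
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# HiveQL looks like SQL but executes as distributed jobs over HDFS data.
cursor.execute(
    "SELECT country, COUNT(*) AS views "
    "FROM page_views GROUP BY country ORDER BY views DESC LIMIT 10"
)
for country, views in cursor.fetchall():
    print(country, views)
```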
NoSQL (Not Only SQL database)
• NoSQL database, also called Not Only SQL, is an approach to data management and
database design that's useful for very large sets of distributed data. NoSQL is especially
useful when an enterprise needs to access and analyze massive amounts of unstructured
data or data that's stored remotely on multiple virtual servers in the cloud.
MongoDB
• The MongoDB database management system is designed for running modern
applications that rely on structured and unstructured data and that support
rapidly changing data.
Apache Cassandra
• Apache Cassandra is a free, open-source, distributed storage system for managing large
amounts of structured data. It differs from traditional relational database management
systems in some significant ways. Cassandra is designed to scale to a very large size across
many commodity servers, with no single point of failure, and provides a simple schema-
optional data model designed to allow maximum power and performance at scale.
Apache Hadoop YARN (Yet Another Resource Negotiator)
• Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management
technology. YARN is one of the key features in the second-generation Hadoop 2 version of
the Apache Software Foundation's open source distributed processing framework.
Let Us See Hadoop Components – Contd..
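As an illustration of the schema-flexible storage MongoDB offers, here is a minimal sketch with the official pymongo driver. It assumes a local mongod on the default port; the database, collection, and document fields are illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

# Documents need no fixed schema: structured fields and free-form payloads coexist.
events.insert_one({"user": "u42", "action": "click", "meta": {"page": "/home"}})
print(events.find_one({"user": "u42"}))
```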
Oozie
• Oozie is a workflow scheduler system to manage Hadoop jobs. It is a server-based Workflow
Engine specialized in running workflow jobs with actions that run Hadoop MapReduce and
Pig jobs. Oozie is implemented as a Java Web-Application that runs in a Java Servlet-
Container.
Apache Ambari
• The Apache Ambari project is aimed at making Hadoop management simpler by developing
software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari
provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Flume
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and
many failover and recovery mechanisms.
Cloudera Impala
• Cloudera Impala is a query engine that runs on Apache Hadoop. Impala brings scalable
parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to
data stored in HDFS and Apache HBase without requiring data movement or transformation.
Let Us See Hadoop Components – Contd..
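To illustrate the low-latency SQL access Impala provides, here is a sketch using the third-party impyla package. It assumes an impalad daemon on localhost:21050 and a hypothetical web_logs table.

```python
from impala.dbapi import connect

conn = connect(host="localhost", port=21050)
cur = conn.cursor()

# Impala executes this in parallel across the cluster, with no MapReduce job
# and no movement of the data out of HDFS or HBase.
cur.execute("SELECT status, COUNT(*) AS n FROM web_logs GROUP BY status")
for status, n in cur.fetchall():
    print(status, n)
```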
Apache Spark
• Apache Spark is an open source parallel processing framework that enables users to run
large-scale data analytics applications across clustered computers. Apache Spark can
process data from a variety of data repositories, including the Hadoop Distributed File
System (HDFS), NoSQL databases and relational data stores such as Apache Hive.
Scala (Scalable Language)
• Scala (Scalable Language) is a software programming language that mixes
object-oriented methods with functional programming capabilities, supporting a
more concise style of programming than other general-purpose languages like
Java and reducing the amount of code developers have to write.
Apache Kafka
• Apache Kafka is a distributed publish-subscribe messaging system designed to replace
traditional message brokers. Originally created and developed by LinkedIn, then open
sourced in 2011, Kafka is currently developed by the Apache Software Foundation to exploit
new data infrastructures made possible by massively parallel commodity clusters.
Jaspersoft
• Jaspersoft provides flexible, cost-effective, and widely deployed business
intelligence software, enabling better decision making through highly
interactive web-based reports, dashboards, and analysis.
Let Us See Hadoop Components – Contd..
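As a taste of the Spark analytics described above, here is a minimal PySpark sketch. It assumes a local Spark installation; the HDFS path and the "action" column are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-analytics").getOrCreate()

# One DataFrame API covers HDFS, NoSQL stores, and relational sources alike.
df = spark.read.json("hdfs:///data/events/*.json")
df.groupBy("action").count().orderBy("count", ascending=False).show(10)

spark.stop()
```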
Hadoop Cluster
• A Hadoop cluster is a special type of computational cluster designed specifically for storing
and analyzing huge amounts of unstructured data in a distributed computing environment.
Distributed File System
• A distributed file system is a client/server-based application that allows clients to access and
process data stored on the server as if it were on their own computer. When a user accesses
a file on the server, the server sends the user a copy of the file, which is cached on the user's
computer while the data is being processed and is then returned to the server.
Catastrophic Failure
• Catastrophic failure is a complete, sudden, often unexpected breakdown in a machine,
electronic system, computer or network. Such a breakdown may occur as a result of a
hardware event such as a disk drive crash, memory chip failure or surge on the power line.
Catastrophic failure can also be caused by software conflicts or malware. Sometimes a single
component in a critical location fails, resulting in downtime for the entire system.
Python
• Python is an interpreted, object-oriented programming language, similar to Perl, that has
gained popularity because of its clear syntax and readability. Python is relatively
easy to learn and portable, meaning its programs run on a number of
operating systems.
Let Us See Hadoop Components – Contd..
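As a small taste of the readability mentioned in the Python entry above, a self-contained snippet (the records are made up for illustration):

```python
# Filter and summarize records in a few lines of plain, portable Python.
records = [
    {"level": "INFO", "msg": "job started"},
    {"level": "ERROR", "msg": "disk full"},
    {"level": "INFO", "msg": "job finished"},
]

errors = [r["msg"] for r in records if r["level"] == "ERROR"]
print(f"{len(errors)} error(s): {errors}")
```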
Hadoop Architecture & Components
The R Environment
• R is an integrated suite of software facilities for data analysis
and graphics. Among other things it has:
• An effective data handling and storage facility,
• A suite of operators for calculations on arrays, in particular
matrices,
• A large, coherent, integrated collection of intermediate
tools for data analysis,
• A set of statistical methodologies and models,
• Graphical facilities for data analysis and display, either
directly at the computer or on hardcopy, and
• A well-developed, simple and effective programming
language which includes conditionals, loops, user-defined
recursive functions, and input and output facilities.
An Introduction to R
Thank you
