All About Big Data
By
Sai Venkatesh Attaluri
Head – BD & Big Data Analytics
Netxcell Limited
 Big data is a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools. The
challenges include capture, curation, storage, search, sharing, analysis,
and visualization. The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of data,
allowing correlations to be found to "spot business trends, determine
quality of research, prevent diseases, link legal citations, combat crime,
and determine real-time roadway traffic conditions." (Wikipedia)
 “Any fool can make things bigger, more complex, and more violent. It takes
a touch of genius-and a lot of courage-to move in the opposite direction.” -
Albert Einstein
Big Data - Definition
Simplifying the Definition
• Big data refers to data that is too big to fit on a single
server, too unstructured to fit into a row-and-column
database, or too continuously flowing to fit into a
static data warehouse. - Thomas H. Davenport
• Put another way, big data is the realization of greater
business intelligence by storing, processing, and
analyzing data that was previously ignored due to the
limitations of traditional data management
technologies.
About Big Data
 Every second of every day, businesses generate more data. Researchers
at IDC estimate that by the end of 2013, the amount of stored data will
exceed 4 zettabytes, or 4 billion terabytes.
 All of that big data represents a big opportunity for organizations.
 Big data is a term applied to data sets whose size is beyond the ability of
commonly used software tools to capture, manage, and process the data
within a tolerable elapsed time.
 In simplest terms, "Big Data" refers to the tools, processes and
procedures that allow an organization to create, manipulate, and manage
extremely large data sets, measured in terabytes, petabytes, or even
zettabytes.
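As a quick sanity check on these scales, here is a minimal Python sketch; it assumes decimal (SI) units, so the figures are approximate:

```python
# Back-of-envelope check of the storage scales mentioned above (decimal SI units).
TB = 10**12  # terabyte
PB = 10**15  # petabyte
ZB = 10**21  # zettabyte

stored_2013 = 4 * ZB
print(stored_2013 // TB)  # 4000000000 -> 4 billion terabytes, matching the IDC figure
print(ZB // PB)           # 1000000    -> one zettabyte is a million petabytes
```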
How Does Big Data Differ from
Traditional Transactional Systems?
• Traditional Transaction Systems (TTS) are designed and implemented to track
information whose format and use are known ahead of time. Big Data systems are
deployed when the questions to be asked and the data formats to be examined
aren't known ahead of time.
• Data that resides within the fixed confines of a record or file is known as
structured data. Data that comes from a variety of sources, such as emails,
text documents, videos, photos, audio files, and social media posts, is
referred to as unstructured data.
• TTS don't support unstructured data. Structured data, even in large volumes,
can be entered, stored, queried, and analyzed in a simple and straightforward
manner; this type of data is best served by a traditional transaction database.
TTS Vs Big Data
• Companies whose data workloads are constant and predictable will be better
served by a traditional database. Companies challenged by increasing data
demands will want to take advantage of Big Data's scalable infrastructure,
which allows servers to be added on demand to accommodate growing workloads.
• In cases where organizations rely on time-sensitive data analysis, a
traditional database is the better fit. That's because shorter time-to-insight
isn't about analyzing large unstructured datasets; it's about analyzing
smaller data sets in real or near-real time, which is what traditional
databases are well equipped to do.
• Big Data is designed for large distributed data processing that addresses
every file in the database, and that type of processing takes time. For tasks
where fast performance isn't critical, such as running end-of-day reports to
review daily transactions, scanning historical data, and performing analytics
where a slower time-to-insight is acceptable, Big Data is ideal.
TTS Vs Big Data (Continued)
 Unfortunately, extracting valuable information from big
data isn't as easy as it sounds. Big data amplifies any
existing problems in your infrastructure, processes, or
even the data itself.
 It is also misrepresented by the media, making it difficult
for organizations to determine whether investing in Big Data
will bring the expected results: improved efficiency and
better products and services.
Misconceptions of Big Data
The Promise of Big Data
 Companies recognize that big data contains valuable information that can help them:
 Obtain actionable insights
 Understand product performance
 Deepen customer relationships
 Understand customer behavior
 Prevent threats and fraud
 Identify new revenue opportunities
80-90% of data produced today is unstructured
Evolution of big data
The 4 V's of Big Data: Volume, Variety, Velocity, and Veracity
 To make the most of the information in their systems, companies
must successfully deal with the 4 V's that distinguish big data:
1. Variety
2. Volume
3. Velocity and
4. Veracity.
 The first three—variety, volume and velocity—define big data:
when you have a large volume of data coming in from a wide
variety of applications and formats, and it's moving and changing at
a rapid velocity, that's when you know you have big data.
Definition of the V's
 Volume
– Big Data tools and services are designed to manage extremely large and
growing sources of data that require capabilities beyond those found in
traditional database engines. Ex: extremely large volumes of data
 Variety
– Big Data tools manage an extensive variety of data as well. This means
having the capability to manage structured data, much like the
capabilities offered by a database engine. They go beyond supporting
structured data to working with non-structured data, such as
documents, spreadsheets, presentation decks and the like, as well as log data
coming from operating systems, database engines, application frameworks,
retail point-of-sale systems, mobile communications systems and more.
Ex: structured and unstructured data, images, documents, etc.
Definition of the V's
 Velocity
– The ability to gather, analyze and report on rapidly changing sets of data. In
some cases, this means having the capability to manage data that changes
so rapidly that the updated data cannot be saved to traditional disk drives
before it is changed again.
Simple term: quickly moving data
 Veracity
– Veracity is a measure of the accuracy and trustworthiness of your data.
Veracity is a goal, one that the variety, volume and velocity of big data make
harder to achieve.
Simple term: trust and integrity
Definition of the V's
• 2.5 quintillion bytes of data are generated every day!
– A quintillion is 10^18 (a quick back-of-envelope conversion follows the table below)
• Data comes from many sources.
– Social media sites
– Sensors
– Digital photos
– Business transactions
– Location-based data
Lots of Data
Style of Data | Source of Data | Industry Affected | Function Affected
Large Volume | Online | Financial Services | Marketing
Unstructured | Video | Health Care | Supply Chain
Continuous Flow | Sensor | Manufacturing | Human Resources
Multiple Formats | Genomic | Travel / Transport | Finance
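To make the velocity figure above concrete, a back-of-envelope conversion in Python (decimal units assumed; the daily total is the slide's own estimate):

```python
# Convert 2.5 quintillion bytes per day into a sustained per-second rate.
bytes_per_day = 2.5 * 10**18      # 2.5 quintillion bytes
seconds_per_day = 24 * 60 * 60    # 86,400 seconds

rate_tb_per_s = bytes_per_day / seconds_per_day / 10**12
print(f"~{rate_tb_per_s:.1f} TB generated every second")  # roughly 28.9 TB/s
```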
• Aspects of the way in which users want to interact with their data…
– Totality: Users have an increased desire to process and analyze
all available data
– Exploration: Users apply analytic approaches where the schema
is defined in response to the nature of the query
– Frequency: Users have a desire to increase the rate of analysis
in order to generate more accurate and timely business
intelligence
– Dependency: Users need to balance investment in existing
technologies and skills with the adoption of new techniques
• So in a Nutshell, Big Data is about better analytics
The Need for Big Data
Term | Time Frame | Specific Meaning
Decision Support | 1970-1985 | Use of data analysis to support decision making
Executive Support | 1980-1990 | Focus on data analysis for decisions by senior executives
Online Analytical Processing (OLAP) | 1990-2000 | Software for analyzing multidimensional data tables
Business Intelligence | 1989-2005 | Tools to support data-driven decisions, with emphasis on reporting
Analytics | 2006-2010 | Focus on statistical and mathematical analysis for decisions
Big Data | 2010-present & next 10 years | Focus on very large, unstructured, fast-moving data
Terminology For Using and Analyzing data
 Your company can take advantage of the opportunities available in big data only
when you have processes and solutions that can handle all 4 V's.
 Many of the previous attempts to address the need to gather information from
the rapidly growing, rapidly changing and broad types of data have been based
upon the use of special-purpose, complex and highly expensive computing
systems. Today's Big Data Solutions are built upon a different foundation.
 Rather than trying to use a single very powerful, dedicated database system,
clusters of inexpensive, powerful, industry-standard (x86) systems are
harnessed to attack these very large problems.
 The clustered approach uses commodity systems, storage, and memory. It also
adds the benefit of being more reliable. The failure of any single system in the
cluster will not stop processing.
Technology Shift
Gartner's Visualization on Big Data
• Problems:
– Although there is a massive spike in available data, the percentage of the
data that an enterprise can understand is on the decline
– The data that the enterprise is trying to understand is saturated with
both useful signals and lots of noise
Big Data – Conundrum
Benefits of Big Data
Big Data Platform Manifesto
High Level Architecture of Recognizer
[Architecture diagram: a Big Data platform on Hadoop FM exposing APIs to
3rd-party and enterprise systems; data sources such as OBD, IVR, PCA, Greybox
and others, together with historical data, feed a recommendation engine that
drives business intelligence, churn prediction, and predictive analysis.]
Medium Level Architecture of Recognizer
What is Hadoop?
• Hadoop is a free software framework developed by the
Apache Software Foundation to support distributed
processing of data. Initially Hadoop was developed in the
Java™ language, but today many other languages can be used
to script against it. Hadoop is used as the core platform to
structure Big Data and helps in performing data analytics.
• This distributed processing framework is designed to harness
together the power of many computers, each having its own
processing and storage, and to provide the capability to quickly
process large, distributed data sets.
Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System is designed to
support large data sets made up of rapidly
changing structured and non-structured data.
MapReduce
• MapReduce is a tool designed to allow analysts
and developers to rapidly sift through massive
amounts of data to examine only those data
items that match a specified set of criteria.
Introduction to Hadoop
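To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python, written in the style of Hadoop Streaming, where the mapper emits key/value pairs and the reducer receives them grouped by key. This is an illustrative sketch, not code from the deck; on a real cluster the two functions would run as separate distributed tasks over HDFS data.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit (word, 1) for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Hadoop sorts and shuffles mapper output by key, so equal words arrive
    # together; sorted() stands in for that shuffle here.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data is big", "data moves fast"]
    for word, total in reducer(mapper(sample)):
        print(word, total)  # big 2, data 2, fast 1, is 1, moves 1
```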
Hadoop Components
Sqoop, Flume, ZooKeeper, Oozie, Pig, Mahout, R Connectors, Hive,
MapReduce, HDFS, HBase, MongoDB, Cloudera, Hortonworks, Kafka, YARN,
Cassandra, VMware Player, SQL, NoSQL, MetaStore, Scala, Query Compiler,
Hadoop Cluster, Execution Engine, Ambari
Hadoop Architecture & Components
Apache Hadoop Architecture
• Hadoop uses a master/slave architecture, with the NameNode as the master
and the DataNodes as the slaves.
Apache Sqoop
• Apache Sqoop is a command-line tool for transferring data between relational
databases and Hadoop. Sqoop, similar to other ETL tools, uses schema metadata to
infer data types and ensure type-safe data handling when the data moves from the
source to Hadoop.
Apache HBase
• Apache HBase is a column-oriented key/value data store built to run on top of the
Hadoop Distributed File System (HDFS). HBase is designed to support high table-update
rates and to scale out horizontally in distributed compute clusters. Its focus on scale
enables it to support very large database tables.
Apache Zookeeper
• Apache ZooKeeper is an open source coordination service and application
program interface (API) that allows distributed processes in large systems to
synchronize with each other so that all clients making requests receive
consistent data.
Let Us See Hadoop Components
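As a small illustration of the ZooKeeper coordination described above, here is a sketch using the third-party kazoo client for Python. It assumes an ensemble reachable at localhost:2181; the znode path and payload are hypothetical.

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (assumed to be running on localhost:2181).
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Every client reading this znode sees a consistent view of the data.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_flag=on")

data, stat = zk.get("/app/config")
print(data.decode(), "version:", stat.version)

zk.stop()
```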
Apache Hive
• Hive is an open-source data warehousing system used to analyze large
datasets stored in Hadoop files. It has three key functions:
data summarization, query, and analysis.
HDFS
• The Hadoop Distributed File System (HDFS) is a distributed file system that
shares some of the features of other distributed file systems. It is used for
storing and retrieving unstructured data.
MapReduce
• MapReduce is a core component of Hadoop and is responsible for
processing jobs in distributed mode.
Pig
• Apache Pig is a platform for analyzing large datasets; it includes a
high-level language for expressing data analysis programs. Pig is
one of the components of the Hadoop ecosystem.
Let Us See Hadoop Components – Contd..
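To show how Hive's SQL-like interface is used in practice, here is a sketch with the third-party PyHive library. It assumes a HiveServer2 instance on localhost:10000 and a hypothetical page_views table; the query compiles into distributed jobs over data in Hadoop.

```python
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# HiveQL looks like SQL but executes as distributed jobs over HDFS data.
cursor.execute(
    "SELECT country, COUNT(*) AS views "
    "FROM page_views GROUP BY country ORDER BY views DESC LIMIT 10"
)
for country, views in cursor.fetchall():
    print(country, views)
```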
NoSQL (Not Only SQL database)
• NoSQL database, also called Not Only SQL, is an approach to data management and
database design that's useful for very large sets of distributed data. NoSQL is especially
useful when an enterprise needs to access and analyze massive amounts of unstructured
data or data that's stored remotely on multiple virtual servers in the cloud.
MongoDB
• The MongoDB database management system is designed for running modern
applications that rely on structured and unstructured data and that support
rapidly changing data.
Apache Cassandra
• Apache Cassandra is a free, open-source, distributed storage system for managing large
amounts of structured data. It differs from traditional relational database management
systems in some significant ways. Cassandra is designed to scale to a very large size across
many commodity servers, with no single point of failure, and provides a simple schema-
optional data model designed to allow maximum power and performance at scale.
Apache Hadoop YARN (Yet Another Resource Negotiator)
• Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management
technology. YARN is one of the key features in the second-generation Hadoop 2 version of
the Apache Software Foundation's open source distributed processing framework.
Let Us See Hadoop Components – Contd..
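As an illustration of the schema-flexible storage MongoDB offers, here is a minimal sketch with the official pymongo driver. It assumes a local mongod on the default port; the database, collection, and document fields are illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

# Documents need no fixed schema: structured fields and free-form payloads coexist.
events.insert_one({"user": "u42", "action": "click", "meta": {"page": "/home"}})
print(events.find_one({"user": "u42"}))
```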
Oozie
• Oozie is a workflow scheduler system to manage Hadoop jobs. It is a server-based Workflow
Engine specialized in running workflow jobs with actions that run Hadoop MapReduce and
Pig jobs. Oozie is implemented as a Java Web-Application that runs in a Java Servlet-
Container.
Apache Ambari
• The Apache Ambari project is aimed at making Hadoop management simpler by developing
software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari
provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Flume
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and
many failover and recovery mechanisms.
Cloudera Impala
• Cloudera Impala is a query engine that runs on Apache Hadoop. Impala brings scalable
parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to
data stored in HDFS and Apache HBase without requiring data movement or transformation.
Let Us See Hadoop Components – Contd..
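To illustrate the low-latency SQL access Impala provides, here is a sketch using the third-party impyla package. It assumes an impalad daemon on localhost:21050 and a hypothetical web_logs table.

```python
from impala.dbapi import connect

conn = connect(host="localhost", port=21050)
cur = conn.cursor()

# Impala executes this in parallel across the cluster, with no MapReduce job
# and no movement of the data out of HDFS or HBase.
cur.execute("SELECT status, COUNT(*) AS n FROM web_logs GROUP BY status")
for status, n in cur.fetchall():
    print(status, n)
```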
Apache Spark
• Apache Spark is an open source parallel processing framework that enables users to run
large-scale data analytics applications across clustered computers. Apache Spark can
process data from a variety of data repositories, including the Hadoop Distributed File
System (HDFS), NoSQL databases and relational data stores such as Apache Hive.
Scala (Scalable Language)
• Scala (Scalable Language) is a software programming language that mixes
object-oriented methods with functional programming capabilities, supporting a
more concise style of programming than other general-purpose languages like
Java and reducing the amount of code developers have to write.
Apache Kafka
• Apache Kafka is a distributed publish-subscribe messaging system designed to replace
traditional message brokers. Originally created and developed by LinkedIn, then open
sourced in 2011, Kafka is currently developed by the Apache Software Foundation to exploit
new data infrastructures made possible by massively parallel commodity clusters.
Jaspersoft
• Jaspersoft provides flexible, cost-effective, and widely deployed business
intelligence software, enabling better decision making through highly
interactive web-based reports, dashboards, and analysis.
Let Us See Hadoop Components – Contd..
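As a taste of the Spark analytics described above, here is a minimal PySpark sketch. It assumes a local Spark installation; the HDFS path and the "action" column are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-analytics").getOrCreate()

# One DataFrame API covers HDFS, NoSQL stores, and relational sources alike.
df = spark.read.json("hdfs:///data/events/*.json")
df.groupBy("action").count().orderBy("count", ascending=False).show(10)

spark.stop()
```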
Hadoop Cluster
• A Hadoop cluster is a special type of computational cluster designed specifically for storing
and analyzing huge amounts of unstructured data in a distributed computing environment.
Distributed File System
• A distributed file system is a client/server-based application that allows clients to access and
process data stored on the server as if it were on their own computer. When a user accesses
a file on the server, the server sends the user a copy of the file, which is cached on the user's
computer while the data is being processed and is then returned to the server.
Catastrophic Failure
• Catastrophic failure is a complete, sudden, often unexpected breakdown in a machine,
electronic system, computer or network. Such a breakdown may occur as a result of a
hardware event such as a disk drive crash, memory chip failure or surge on the power line.
Catastrophic failure can also be caused by software conflicts or malware. Sometimes a single
component in a critical location fails, resulting in downtime for the entire system.
Python
• Python is an interpreted, object-oriented programming language, similar to Perl, that has
gained popularity because of its clear syntax and readability. Python is relatively
easy to learn and portable, meaning its programs run on a number of
operating systems.
Let Us See Hadoop Components – Contd..
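As a small taste of the readability mentioned in the Python entry above, a self-contained snippet (the records are made up for illustration):

```python
# Filter and summarize records in a few lines of plain, portable Python.
records = [
    {"level": "INFO", "msg": "job started"},
    {"level": "ERROR", "msg": "disk full"},
    {"level": "INFO", "msg": "job finished"},
]

errors = [r["msg"] for r in records if r["level"] == "ERROR"]
print(f"{len(errors)} error(s): {errors}")
```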
Hadoop Architecture & Components
The R Environment
• R is an integrated suite of software facilities for data analysis
and graphics. Among other things it has:
• An effective data handling and storage facility,
• A suite of operators for calculations on arrays, in particular
matrices,
• A large, coherent, integrated collection of intermediate
tools for data analysis,
• A set of statistical methodologies and models,
• Graphical facilities for data analysis and display, either
directly at the computer or on hardcopy, and
• A well-developed, simple and effective programming
language which includes conditionals, loops, user-defined
recursive functions, and input and output facilities.
An Introduction to R
Thank you
