Stream Meets Batch for
Smarter Analytics
WHITE PAPER
Abstract
This white paper focuses on dealing with Big Data problems in
real time. It discusses how the traditional batch paradigm and
real time paradigm can work together to deliver smarter, quicker
and better insights on large volumes of data. It also talks about
additions to existing solutions to deal with low latency use cases.
The paper also guides you on picking the right strategy and right
technology stack to address real-time, Big Data analytics
problems.
Impetus Technologies, Inc.
www.impetus.com
Table of Contents
Introduction
The archive data analytics platform
    Emerging use cases of Batch processing
    Downsides of Batch processing
Live data analytics platform
The stream processing system
    Benefits of stream processing
    Interesting use cases of live data analytics
Integration of archive data and live data analysis
Smarter Data Ingestion
Adaptive analysis
Case Study: Auto categorize news articles
Summary
Introduction
Evolution in digital media and technologies has led to exponential growth in
the volume of data produced by mankind. The data has grown to exabytes and
is expanding daily. As digital technologies touch every aspect of our lives,
data is being generated from posts on social media sites, e-mails, digital
pictures, online videos, sensor data for climate information, GPS signals, cell
phone data, browsing data, transactional data of online shoppers, etc. We
categorize this kind of data as Big Data.
Enterprises are identifying smarter ways to extract valuable information out of
this data. This information can be used to predict market trends,
optimize business processes, create effective campaigns and improve user
services. Analysis of the data falls into two broad classes: real-time and
historic. Together, both kinds of analysis provide a 360-degree view of valuable
information.
A significant amount of work has been done in the area of historic or batch
analytics, with the evolution of solutions built over the Hadoop platform or
similar platforms like R Analytics and HPCC. However, enterprises are lagging
behind in the area of real-time analysis of Big Data, and even more so in the
combination of the two. The real challenge is to mine historic information
for insights and to find smarter ways of applying those insights to real-time
data.
This paper primarily focuses on Big Data real-time processing strategies to
enable existing platforms to handle low latency use cases. It empowers
businesses to gain quick insight and in turn maximize Return on Investment
(ROI).
The Archive Data Analytics Platform
Batch or archive data processing is the most widely used approach for analyzing
large volumes of data. In batch processing, the data is aggregated into a single
entity called a batch or a job. The biggest caveat of batching is that it won't
give you a partial result of the analysis; for results, you have to wait until the
batch processing is done. Batch analysis is best suited for deeper analysis that
requires a full view of the data. Consider an e-commerce web site, where
the requirement is to recommend products that match each user's tastes in order
to maximize sales.
Emerging use cases of Batch processing
• Deeper analytics
• Classification of data
• Clustering of data
• Recommendations on user tastes
Downsides of Batch processing
• High latency results
• Classification of data
• Bigger hardware requirements
• Limited ad-hoc capabilities
Live Data Analytics Platform
Time is the key. Analytics solutions for domains like defense, credit card fraud
detection, intelligence, law enforcement, online trading and security need to
quickly analyze, identify and react to patterns of threats by continuously
processing the enormous amounts of data generated from network logs, e-mails,
social media feeds, sensor data, web feeds and many other sources. For
such applications, a timely response is critical to the business; information
that arrives with high latency is of no use.
Enterprises need a revolutionary upgrade in their capabilities to extract,
transform, analyze and quickly respond to the huge volume of data coming in
real time. Today, many enterprises are struggling to manage and analyze
massive and growing volumes of data in real time.
Lately, a few technologies and tools have emerged to meet the challenges of
analyzing high volumes of data in real time or near real time. This section
discusses a few of the existing approaches and their downsides:
1. Relational database management systems: RDBMSs have been
available for years for OLTP as well as data warehouse classes of
applications. However, they do not scale or perform well for high-volume
streaming data because of indexing limitations.
2. Main memory databases: These modified versions of DBMSs target the same
set of functionalities as traditional DBMSs but with higher throughput,
by storing data in main memory instead of physical storage. Like
traditional systems, they also fall short when it comes to Big Data
requirements.
3. Rule engines: Sales and marketing have ‘repurposed’ them to deal with
Big Data real-time applications. Their downside is the lack of a suitable
storage system, and hence the need for separate infrastructure for
persistence of data.
The Stream Processing System
The stream processing system is a completely new paradigm well suited for
handling continuous data. It offers higher scalability, performance and
flexibility than traditional approaches. Conventional systems run ad hoc queries
over stored static data, whereas a stream processing system runs static
(standing) queries over continuous unbounded data. A stream processing system is
to continuous unbounded data what a DBMS is to structured stored data.
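The contrast between a one-off query over stored data and a standing query over an unbounded stream can be sketched in a few lines of Python. This is only an illustration and is not tied to any particular stream processing product; the function names `batch_count` and `standing_count` and the sample events are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def batch_count(stored_events: List[str]) -> Counter:
    """One-off query: needs the complete data set, returns one final result."""
    return Counter(stored_events)


def standing_count(event_stream: Iterable[str]) -> Iterator[Counter]:
    """Standing query: defined once, yields an updated result per event."""
    counts: Counter = Counter()
    for event in event_stream:  # the stream may never end
        counts[event] += 1
        yield counts.copy()     # a partial result is available immediately


# Hypothetical sample events, used for both styles of query.
events = ["login", "purchase", "login"]

final = batch_count(events)
partials = list(standing_count(iter(events)))
```

The batch function can only answer once all events are available, while the generator produces a usable partial answer after every event, which is the property that low latency use cases depend on.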
The stream processing platform consists of three major components: input
connectors, output connectors and ETL components. Various types of
incoming data from multiple sources are pulled into the platform using input
connectors. In the next stage, the data is cleansed, filtered, transformed,
clustered, classified or correlated, and the resulting information is used for
notifications, reporting and analyses.
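As a rough sketch of how these three components fit together, the following Python example chains an input connector, an ETL stage and an output connector using generators. The record layout (`user`, `amount`), the function names and the in-memory sink are assumptions made for illustration; a real platform would read from and write to external systems.

```python
import json
from typing import Iterable, Iterator


def input_connector(raw_lines: Iterable[str]) -> Iterator[dict]:
    """Input connector: pull raw records and parse them, skipping bad JSON."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # cleansing: drop records that cannot be parsed


def etl_stage(records: Iterable[dict]) -> Iterator[dict]:
    """ETL component: filter and transform each record as it flows through."""
    for record in records:
        if not isinstance(record, dict) or "amount" not in record:
            continue  # filtering: ignore records without the field we need
        yield {
            "user": record.get("user", "unknown"),
            "amount_cents": int(round(float(record["amount"]) * 100)),
        }


def output_connector(records: Iterable[dict], sink: list) -> None:
    """Output connector: deliver results to a sink (here, a plain list)."""
    for record in records:
        sink.append(record)


# Hypothetical input: one valid record, one unparseable line, one incomplete record.
raw = ['{"user": "a", "amount": "12.50"}', "not json", '{"user": "b"}']
sink: list = []
output_connector(etl_stage(input_connector(raw)), sink)
```

Because each stage is a generator, records flow through one at a time and nothing has to wait for the input to end.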
Benefits of stream processing
• Online accumulation
• Real-time analytics
• Live BI capabilities
• Smart ingestion into the data warehouse (see Smarter Data Ingestion)
Interesting use cases of live data analytics
Fraud detection – Analysis of millions of real-time credit card transactions to
detect and prevent fraud using predictive algorithms. Also, text in
insurance claim documents can be analyzed to identify probable fraud cases.
Patient health monitoring – An analytical solution can capture streams of data
coming from medical equipment that monitors a patient’s heart rate, blood
pressure, sugar levels and temperature, and predict whether an infection or
complication may occur.
Omni-channel retail – Data from various independent sources can be analyzed
to enhance the shoppers’ experience by recommending products, customized
campaigns, and location-based offerings.
Integration of Archive Data and Live Data Analysis
Both archive data analysis and live data analysis can handle their own class of
use cases, and they complement each other. At times, enterprises require close
integration of both platforms to get a full 360 degree view of the information.
This section focuses on the benefits of integrating these two classes of
platforms.
Smarter Data Ingestion
Recently, an interesting trend has been found in Big Data repositories. Much of
the data stored in a data warehouse is of very little or no business use and will
never appear in business reports. This is sometimes called the ‘Big Data fetish’
problem. To overcome this problem, it is essential to identify what is to be
stored and store only what is relevant to the business.
Streaming systems can be used to address the Big Data fetish problem. Data
coming from various data sources can be cleansed, extracted, transformed,
filtered and normalized in the streaming system. Processed data can then be
persisted in data warehouses for deeper analytics. This approach will reduce the
overall cost of data storage by a significant amount.
An example can be seen in e-mail or SMS processing use cases such as
Lawyers.com. In this use case, significant storage savings can be achieved by
identifying spam and corrupted messages and discarding them before they reach
the data store. This can be achieved using streams.
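The idea of filtering in the stream before persisting can be sketched as follows. This is a minimal Python illustration under stated assumptions: the message layout (`sender`, `body`), the keyword list in `SPAM_MARKERS` and the helper names are all hypothetical, and the keyword check is a stand-in for whatever spam detection a real system would use.

```python
from typing import Iterable, Iterator

SPAM_MARKERS = ("free money", "click here")  # hypothetical rule set


def is_corrupted(message: dict) -> bool:
    """Treat a message as corrupted if sender or body is missing or empty."""
    return not message.get("sender") or not message.get("body")


def is_spam(message: dict) -> bool:
    """Very naive keyword check; a real system would use a trained model."""
    body = message["body"].lower()
    return any(marker in body for marker in SPAM_MARKERS)


def ingest(messages: Iterable[dict]) -> Iterator[dict]:
    """Yield only messages worth persisting, normalized for the warehouse."""
    for message in messages:
        # Corruption is checked first, so is_spam always sees a body.
        if is_corrupted(message) or is_spam(message):
            continue
        yield {
            "sender": message["sender"].strip().lower(),
            "body": message["body"].strip(),
        }


# Hypothetical incoming messages: one good, one spam, two corrupted.
incoming = [
    {"sender": "Client@Example.com ", "body": "Please review the contract."},
    {"sender": "x@example.com", "body": "FREE MONEY, click here!"},
    {"sender": "", "body": "lost header"},
    {"sender": "y@example.com"},
]
warehouse = list(ingest(incoming))
```

Only one of the four incoming messages reaches the store, which is where the storage savings come from.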
Adaptive Analysis
Live data analytics platforms and archive data analytics platforms can
exchange data with each other. They can also be used smartly to exchange or
share intelligence. Absorbing this intelligence helps improve the effectiveness,
accuracy and quality of analysis. Exchange of intelligence can be
achieved in two ways:
1. Archive to live exchange: Deeper analytics algorithms, such as
recommendation, classification, clustering, statistical and pattern-finding
algorithms, are applied over huge volumes of data accumulated
over long periods of time. Examples include generating a classification model
from historic e-mails or building item similarity models for
recommendations. The generated model can later be used by a
corresponding component in the stream to gain quick, real-time insight
from continuous data. For instance, an incoming e-mail or document
stream can be classified or categorized in real time. In this scenario,
deeper analysis helps live streams in decision making.
2. Live to archive exchange: The stream validates unbounded incoming
data using models generated by the batch processing platform. If
conflicts exceed a threshold level, the stream processing platform
can signal to the batch platform that it is time to update the model or build a
new one. For instance, if we are categorizing incoming documents with
a Wikipedia categorization model and the percentage of documents in the default
or unidentified category goes beyond a threshold level, then
the stream can signal the batch processing platform to rebuild the
categorization model using a new set of documents. In this scenario, the
archive platform is assisted by the live platform for better quality of
analytics.
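The live to archive signal described in the second point can be sketched as a small monitor that tracks how often the stream falls back to the unidentified category. This is one possible approach, not a prescribed design: the class name `UnknownRateMonitor`, the sliding window, the label `"unknown"` and the threshold values are all assumptions chosen for illustration.

```python
from collections import deque


class UnknownRateMonitor:
    """Tracks the share of 'unknown' labels over the last `window` items."""

    def __init__(self, window: int = 100, threshold: float = 0.3) -> None:
        self.recent = deque(maxlen=window)  # 1 = unknown, 0 = recognized
        self.threshold = threshold

    def observe(self, label: str) -> bool:
        """Record one label; return True when a model rebuild should be requested."""
        self.recent.append(1 if label == "unknown" else 0)
        window_full = len(self.recent) == self.recent.maxlen
        rate = sum(self.recent) / len(self.recent)
        # Only signal once a full window has been seen, to avoid noisy triggers.
        return window_full and rate > self.threshold


# Hypothetical label sequence produced by the stream-side classifier.
monitor = UnknownRateMonitor(window=4, threshold=0.5)
labels = ["sports", "unknown", "politics", "unknown", "unknown", "unknown"]
signals = [monitor.observe(label) for label in labels]
```

With a window of four and a threshold of 0.5, the monitor stays quiet while half or fewer of the recent documents are unidentified and starts signalling once three of the last four are.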
Case Study: Auto Categorize News Articles
This section describes how the integration concepts explained above can be
applied in real-world use cases. Consider an example of auto categorization of
news article streams or feeds coming from different data sources. A flow
diagram of this use case is shown below:
Incoming news article streams and feeds are first cleansed and parsed to extract
meaningful data, with the garbage data being discarded. In the second stage,
the extracted data is pushed into the batch processing platform for deeper
analytics. At the same time, the stream processing platform categorizes the news
articles using the model generated by the batch processing platform. If the
percentage of articles in the default or unknown category crosses a threshold
limit, the stream platform can ask the batch platform to generate a new
model. Once the batch platform is done with model generation, the updated
model is pushed to the stream platform, which resumes categorizing documents in
real time.
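The loop described in this case study can be simulated end to end in a few lines of Python. This is a sketch under stated assumptions, not the implementation used in the case study: the keyword-voting model is a deliberately simple stand-in for a real categorization model, and the function names, sample documents, `threshold` and `min_seen` values are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(labeled_docs):
    """Batch side: map each word to the category it appears in most often."""
    word_counts = defaultdict(Counter)
    for text, category in labeled_docs:
        for word in text.lower().split():
            word_counts[word][category] += 1
    return {word: counts.most_common(1)[0][0]
            for word, counts in word_counts.items()}


def categorize(model, text):
    """Stream side: vote over known words; fall back to 'unknown'."""
    votes = Counter(model[w] for w in text.lower().split() if w in model)
    return votes.most_common(1)[0][0] if votes else "unknown"


def run_stream(articles, model, retrain, threshold=0.5, min_seen=2):
    """Categorize each article; request a new model from the batch side when
    the share of 'unknown' results under the current model exceeds threshold."""
    results, seen, unknown = [], 0, 0
    for text in articles:
        label = categorize(model, text)
        results.append(label)
        seen += 1
        if label == "unknown":
            unknown += 1
        if seen >= min_seen and unknown / seen > threshold:
            model = retrain()   # batch platform regenerates the model
            seen = unknown = 0  # start measuring the new model afresh
    return results


# Hypothetical training sets: the newer one adds a 'business' category.
initial_docs = [("team wins match", "sports"),
                ("election vote results", "politics")]
newer_docs = initial_docs + [("stocks market rally", "business")]

articles = ["team wins", "stocks rally", "market dips", "stocks market surge"]
results = run_stream(articles, build_model(initial_docs),
                     retrain=lambda: build_model(newer_docs))
```

In this run the first model recognizes only sports and politics, so the two business articles fall into the unknown category, the threshold is crossed, and the regenerated model categorizes the final article correctly.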
Summary
An ideal analytics platform is one which can
support offline analytics as well as online or real-time analytics with equal ease.
These are two completely different paradigms which not only complement each
other but also assist each other for effective analytics. Together, they can
provide effective, quick and 360-degree insight into large data. Having this
integration strategy in place can empower the platform to target almost any type
of use case.
This paper describes different integration points where these paradigms can
interact with each other for delivering smart, quick and complete analytics over
Big Data.
About Impetus
Impetus Technologies is a leading provider of Big Data solutions for the
Fortune 500®. We help customers effectively manage the “3-Vs” of Big Data
and create new business insights across their enterprises.
Website: www.bigdata.impetus.com | Email: bigdata@impetus.com
© 2013 Impetus Technologies, Inc. All rights reserved. Product and company names
mentioned herein may be trademarks of their respective companies.
May 2013

Real-world Applications of Streaming Analytics- StreamAnalytix Webinar
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
 
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
Accelerating Hadoop Solution Lifecycle and Improving ROI- Impetus On-demand W...
 
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
Deep Learning: Evolution of ML from Statistical to Brain-like Computing- Data...
 
SPARK USE CASE- Distributed Reinforcement Learning for Electricity Market Bi...
SPARK USE CASE-  Distributed Reinforcement Learning for Electricity Market Bi...SPARK USE CASE-  Distributed Reinforcement Learning for Electricity Market Bi...
SPARK USE CASE- Distributed Reinforcement Learning for Electricity Market Bi...
 
Enterprise Ready Android and Manageability- Impetus Webcast
Enterprise Ready Android and Manageability- Impetus WebcastEnterprise Ready Android and Manageability- Impetus Webcast
Enterprise Ready Android and Manageability- Impetus Webcast
 
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
Real-time Streaming Analytics: Business Value, Use Cases and Architectural Co...
 
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
Leveraging NoSQL Database Technology to Implement Real-time Data Architecture...
 
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
Maturity of Mobile Test Automation: Approaches and Future Trends- Impetus Web...
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Webinar maturity of mobile test automation- approaches and future trends
Webinar  maturity of mobile test automation- approaches and future trendsWebinar  maturity of mobile test automation- approaches and future trends
Webinar maturity of mobile test automation- approaches and future trends
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
The Shared Elephant - Hadoop as a Shared Service for Multiple Departments – I...
 
Performance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastPerformance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus Webcast
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Stream Meets Batch for Smarter Analytics- Impetus White Paper

  • 1. Stream Meets Batch for Smarter Analytics (White Paper)
    Abstract
    This white paper focuses on dealing with Big Data problems in real time. It discusses how the traditional batch paradigm and the real-time paradigm can work together to deliver smarter, quicker and better insights on large volumes of data. It also covers additions to existing solutions for low-latency use cases, and guides you in picking the right strategy and technology stack for real-time Big Data analytics problems.
    Impetus Technologies, Inc. www.impetus.com
  • 2. Table of Contents
    Introduction
    The archive data analytics platform
      Emerging use cases of Batch processing
      Downsides of Batch processing
    Live data analytics platform
    The stream processing system
      Benefits of stream processing
      Interesting use cases of live data analytics
    Integration of archive data and live data analysis
    Smarter Data Ingestion
    Adaptive analysis
    Case Study: Auto categorize news articles
    Summary

    Introduction
    Evolution in digital media and technologies has led to exponential growth in the volume of data produced by mankind. The data has grown to exabytes and is expanding daily. As digital technologies touch every aspect of our lives, data is generated from social media posts, e-mails, digital pictures, online videos, climate sensors, GPS signals, cell phones, browsing sessions, online shopping transactions and more. We categorize such data as Big Data. Enterprises are identifying smarter ways to extract valuable information out of this data.
    This valuable information can be used to predict market trends, optimize business processes, create effective campaigns and improve user services. Analysis of the data falls into two broad classes: real-time and historic. Together, both kinds of analysis provide a 360-degree view of valuable information.
  • 3. A significant amount of work has been done in the area of historic or batch analytics, with the evolution of solutions built on the Hadoop platform or similar platforms like R Analytics and HPCC. However, enterprises are lagging behind in real-time analysis of Big Data, and even more so in combining the two. The real challenge is to find insights in historic information and then use those insights effectively with real-time data.
    This paper primarily focuses on Big Data real-time processing strategies that enable existing platforms to handle low-latency use cases. These strategies help businesses gain quick insight and, in turn, maximize Return on Investment (ROI).

    The Archive Data Analytics Platform
    Batch or archive data processing is the most widely used approach for analyzing big volumes of data. In batch processing, the data is aggregated into a single entity called the batch or a job. The biggest caveat in batching is that it won't give you a partial result of the analysis: for results, you have to wait until the batch processing is done. Batch analysis is best suited to deeper analysis that requires a full view of the data. Consider an e-commerce web site that must recommend products matching each user's taste in order to maximize sales.

    Emerging use cases of Batch processing
    - Deeper analytics
    - Classification of data
    - Clustering of data
    - Recommendations on user tastes

    Downsides of Batch processing
    - High latency results
    - Bigger hardware requirements
    - Limited ad-hoc capabilities

    Live Data Analytics Platform
    Time is the key. Analytics solutions for domains like defense, credit card fraud detection, intelligence, law enforcement, online trading and security need to
  • 4. quickly analyze, identify and react to the patterns of threats by continuously processing the enormous amounts of data generated from network logs, e-mails, social media feeds, sensor data, web feeds and many other sources. For such applications, a timely response is the key to the business; high-latency information is of no use.
    Enterprises need a revolutionary upgrade in their capabilities to extract, transform, analyze and quickly respond to the huge volume of data coming in real time. Today, many enterprises are struggling to manage and analyze massive and growing volumes of data in real time. Lately, a few technologies and tools have emerged to meet the challenges of analyzing high volumes of data in real time or near real time. This section covers a few of the existing approaches and their downsides:
    1. Relational database management systems: RDBMSs have been available for years for OLTP as well as data warehouse classes of applications. However, they do not scale or perform well for high-volume streaming data because of indexing limitations.
    2. Main memory databases: These modified versions of DBMSs target the same set of functionalities as traditional DBMSs but with higher throughput, by storing data in main memory instead of physical storage. Like traditional systems, they also fail when it comes to Big Data requirements.
    3. Rule engines: Sales and marketing teams have 'repurposed' them to deal with Big Data real-time applications. Their downside is the lack of a suitable storage system, and hence the need for separate infrastructure to persist data.
  • 5. The Stream Processing System
    The stream processing system is a completely new paradigm well suited to handling continuous data. It offers higher scalability, performance and flexibility than traditional approaches. Conventional systems run ad hoc, changing queries over stored static data, whereas a stream processing system runs fixed, standing queries over continuous unbounded data. A stream processing system is to continuous unbounded data what a DBMS is to structured stored data.
    The stream processing platform consists of three major components: input (data import) connectors, output connectors and ETL components. Various types of incoming data from multiple sources are pulled into the platform using input connectors. In the next stage, the data is cleansed, filtered, transformed, clustered, classified or correlated, and the resulting information is used for notifications, reporting and analyses.
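The three components described above can be sketched as a chain of Python generators. This is only an illustrative sketch, not the architecture of any particular product: the function names, the record layout and the "error" keyword rule are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Stage = Callable[[Iterator[Record]], Iterator[Record]]


def input_connector(source: Iterable[str]) -> Iterator[Record]:
    """Input connector: pull raw lines from a source and wrap them as records."""
    for line in source:
        yield {"raw": line}


def cleanse(records: Iterator[Record]) -> Iterator[Record]:
    """ETL stage: drop empty records and normalize the text."""
    for record in records:
        text = record["raw"].strip()
        if text:
            yield {"text": text.lower()}


def classify(records: Iterator[Record]) -> Iterator[Record]:
    """ETL stage: attach a label using a hypothetical keyword rule."""
    for record in records:
        label = "alert" if "error" in record["text"] else "normal"
        yield {**record, "label": label}


def output_connector(records: Iterator[Record], sink: List[Record]) -> None:
    """Output connector: deliver processed records to a sink (here, a list)."""
    for record in records:
        sink.append(record)


def run_pipeline(source: Iterable[str], stages: List[Stage], sink: List[Record]) -> None:
    """Wire input connector -> ETL stages -> output connector."""
    stream = input_connector(source)
    for stage in stages:
        stream = stage(stream)
    output_connector(stream, sink)


sink: List[Record] = []
run_pipeline(["  Disk ERROR on node 7 ", "", "heartbeat ok"], [cleanse, classify], sink)
for record in sink:
    print(record)
```

Because each stage is a generator, a record flows through every stage as soon as it arrives, which mirrors the standing-query-over-moving-data idea; a production system would replace the list source and sink with real connectors.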
  • 6. Benefits of stream processing
    • Online accumulation
    • Real-time analytics
    • Live BI capabilities
    • Smart ingestion into the data warehouse (details in the next section)

    Interesting use cases of live data analytics
    Fraud detection: Millions of real-time credit card transactions can be analyzed with predictive algorithms to detect and prevent fraud. Text in insurance claim documents can also be analyzed to identify probable fraud cases.
    Patient health monitoring: An analytical solution can capture streams of data from medical equipment that monitors a patient's heart rate, blood pressure, sugar levels and temperature, and predict whether an infection or complication may occur.
    Omni-channel retail: Data from various independent sources can be analyzed to enhance the shopper's experience through product recommendations, customized campaigns and location-based offerings.
  • 7. Integration of Archive Data and Live Data Analysis
    Archive data analysis and live data analysis each handle their own class of use cases, and they complement each other. At times, enterprises require close integration of both platforms to get a full 360-degree view of the information. This section focuses on the benefits of integrating these two classes of platforms.

    Smarter Data Ingestion
    Recently, an interesting trend has been observed in Big Data repositories: much of the data stored in a data warehouse is of little or no business use and will never appear in business reports. This is sometimes called the 'Big Data fetish' problem. To overcome it, it is essential to identify what should be stored and to store only what is relevant to the business.
    Streaming systems can be used to address the Big Data fetish problem. Data coming from various data sources can be cleansed, extracted, transformed, filtered and normalized in the streaming system. The processed data can then be persisted in data warehouses for deeper analytics. This approach can significantly reduce the overall cost of data storage. An example is e-mail or SMS processing in use cases such as Lawyers.com, where considerable storage optimization can be achieved by identifying spam and corrupted messages in the stream before they are written to the data store.
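The ingestion idea above can be illustrated with a minimal Python sketch that filters a message stream before anything is persisted. Everything specific here is an assumption for illustration: the spam markers, the corruption check and the in-memory list standing in for a data warehouse are hypothetical, and a real deployment would use trained spam models and a real store.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical spam phrases; a real system would use a trained classifier.
SPAM_MARKERS = ("win a prize", "click here")


def normalize(message: str) -> str:
    """Collapse whitespace and lowercase the message."""
    return " ".join(message.split()).lower()


def is_corrupted(message: str) -> bool:
    """Treat empty messages or ones with Unicode replacement characters as corrupted."""
    return not message.strip() or "\ufffd" in message


def is_spam(message: str) -> bool:
    """Flag a normalized message that contains any known spam marker."""
    return any(marker in message for marker in SPAM_MARKERS)


@dataclass
class IngestStats:
    stored: int = 0
    dropped: int = 0


def smart_ingest(stream: Iterable[str], warehouse: List[str]) -> IngestStats:
    """Persist only clean, relevant messages; count what was filtered out."""
    stats = IngestStats()
    for raw in stream:
        if is_corrupted(raw):
            stats.dropped += 1
            continue
        message = normalize(raw)
        if is_spam(message):
            stats.dropped += 1
            continue
        warehouse.append(message)  # stand-in for a write to the data warehouse
        stats.stored += 1
    return stats


warehouse: List[str] = []
stats = smart_ingest(
    ["Need  advice on a Lease", "CLICK HERE to win a prize", "", "bad \ufffd bytes"],
    warehouse,
)
print(stats, warehouse)
```

The dropped count gives a rough measure of the storage saved by filtering in the stream rather than after loading.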
  • 8. Adaptive Analysis
    Live data analytics platforms and archive data analytics platforms can exchange data with each other. They can also be used smartly to exchange or share intelligence, which improves the effectiveness, accuracy and quality of analysis on both sides. The exchange of intelligence can be achieved in two ways:
    1. Archive to live exchange: Deeper analytics algorithms such as recommendation, classification, clustering, statistical and pattern-finding algorithms are applied over huge volumes of data accumulated over long periods of time. Examples include generating a classification model over historic e-mails or building item-similarity models for recommendations. The generated model can later be used by a corresponding component in the stream to derive quick, real-time insight from continuous data. For instance, an incoming e-mail or document stream can be classified or categorized in real time. In this scenario, deeper analysis helps the live stream in decision making.
    2. Live to archive exchange: The stream validates unbounded incoming data using models generated by the batch processing platform. If conflicts go beyond a threshold level, the stream processing platform can signal to the batch platform that it is time to update or rebuild the model. For instance, if we are categorizing incoming documents with the Wikipedia categorization model and the percentage of documents falling into the default or unidentified category goes beyond a threshold level, the stream can signal the batch processing platform to rebuild the categorization model using a new set of documents. In this scenario, the archive platform is assisted by the live platform for better quality of analytics.
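Both directions of exchange can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the keyword-to-category dictionary stands in for a real batch-trained model, and the names `train_batch_model`, `StreamCategorizer`, the window size and the 0.5 threshold are hypothetical choices, not values from the paper.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Tuple

Model = Dict[str, str]  # keyword -> category; stand-in for a real classifier


def train_batch_model(archive: Iterable[Tuple[str, str]]) -> Model:
    """Batch side: toy 'training' that maps each word of a labeled document to its category."""
    model: Model = {}
    for text, category in archive:
        for word in text.lower().split():
            model[word] = category
    return model


class StreamCategorizer:
    """Stream side: applies the batch model and tracks the recent share of unknowns."""

    def __init__(self, model: Model, threshold: float = 0.5, window: int = 4) -> None:
        self.model = model
        self.threshold = threshold
        self.recent: Deque[bool] = deque(maxlen=window)  # True means "unknown"

    def categorize(self, text: str) -> str:
        for word in text.lower().split():
            if word in self.model:
                self.recent.append(False)
                return self.model[word]
        self.recent.append(True)
        return "unknown"

    def needs_rebuild(self) -> bool:
        """Live to archive: signal once the unknown ratio in a full window exceeds the threshold."""
        if len(self.recent) < (self.recent.maxlen or 0):
            return False
        return sum(self.recent) / len(self.recent) > self.threshold

    def update_model(self, model: Model) -> None:
        """Archive to live: install a freshly built model and reset the window."""
        self.model = model
        self.recent.clear()


archive = [("election vote", "politics"), ("match goal", "sports")]
categorizer = StreamCategorizer(train_batch_model(archive))

headlines = ["Vote count rises", "Late goal wins", "Quantum chip unveiled",
             "New chip ships", "Chip demand soars"]
labels = [categorizer.categorize(h) for h in headlines]
rebuild_requested = categorizer.needs_rebuild()

if rebuild_requested:
    # The batch platform retrains on an extended archive and pushes the model back.
    archive.append(("chip quantum", "technology"))
    categorizer.update_model(train_batch_model(archive))

relabeled = categorizer.categorize("Chip prices fall")
print(labels, rebuild_requested, relabeled)
```

A sliding window is used so that the rebuild signal reflects recent traffic rather than the whole history; the right window size and threshold depend on the workload.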
  • 9. Case Study: Auto Categorize News Articles
    This section describes how the integration concepts explained above can be applied in a real-world use case. Consider the auto-categorization of news article streams or feeds coming from different data sources. A flow diagram of this use case is shown below:
    Incoming news article streams and feeds are first cleansed and parsed to extract meaningful data, and the garbage data is discarded. In the second stage, the extracted data is pushed into the batch processing platform for deeper analytics. At the same time, the stream processing platform categorizes the news articles using the model generated by the batch processing platform. If the percentage of articles in the default or unknown category crosses a threshold limit, the stream platform can ask the batch platform to regenerate the model. Once the batch platform is done with model generation, the updated model is pushed to the stream platform, which resumes categorizing documents in real time.
  • 10. Summary
    In conclusion, an ideal analytics platform is one that supports offline analytics as well as online or real-time analytics with equal ease. These are two completely different paradigms that not only complement each other but also assist each other for effective analytics. Together, they can provide effective, quick, 360-degree insight into large data. Having this integration strategy in place can empower the platform to target almost any type of use case. This paper describes the integration points where these paradigms can interact to deliver smart, quick and complete analytics over Big Data.

    About Impetus
    Impetus Technologies is a leading provider of Big Data solutions for the Fortune 500®. We help customers effectively manage the "3-Vs" of Big Data and create new business insights across their enterprises.
    Website: www.bigdata.impetus.com | Email: bigdata@impetus.com
    © 2013 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies. May 2013