Real time data ingestion and Hybrid Cloud

Great Ideas….Simple Solutions
Data Ingestion Platform (DiP)
Neeraj Sabharwal @allaboutbdata

About me
• Head of Cloud, Data & Analytics @Xavient
• Spent a couple of years @Hortonworks
• Over a decade in Cloud & Data domain
• Started career as Oracle DBA
Disclosure: more memes coming up…

Agenda
Platform
Data Access
Hybrid Cloud

Before we start …
** Near real time is OK, as I am easygoing, but no more waiting hours or days for data

Problem
UI/API Platform
Data Access
No… near real-time access
Cloud

Shifting the gear – Let’s get technical

Streaming Blueprint
Data Collection
Messaging Tier
Streaming Engine
Analysis Tier
In-memory Data Store
Data Access

Messaging Bus
• Open-source message broker
• Unified, high-throughput, low-latency platform for handling real-time data feeds
• Massively scalable pub/sub message queue architected as a distributed transaction log
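To make the "distributed transaction log" idea concrete, here is a minimal single-process sketch, assuming a toy in-memory log. The class `TopicLog`, the group names and the sample events are all hypothetical; this is not the broker's client API, only an illustration of an append-only log where each subscriber keeps its own offset.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for one topic partition: an append-only list."""

    def __init__(self):
        self._records = []
        # Each consumer group tracks its own read position (offset).
        self._offsets = defaultdict(int)

    def publish(self, record):
        """Append a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return unread records for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for event in ["click:home", "click:cart", "click:pay"]:
    log.publish(event)

# Two independent subscribers read the same log without affecting each other.
dashboard_batch = log.poll("dashboard")
archiver_first = log.poll("archiver", max_records=2)
archiver_second = log.poll("archiver", max_records=2)
```

Because records are never removed on read, a new subscriber can start from offset 0 and replay everything, which is what separates a log from a classic queue.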

Emotions

Streaming engines
• Storm - Distributed real-time computation system for processing large volumes of high-velocity data
• Flink - Streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
• Apex - Enterprise-grade unified stream and batch processing engine
• Spark Streaming - Apache Spark's language-integrated API for stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python

CTM

Platform (DiP)

Features
• Easy to use UI
• Multiple streaming engines
• Supports XML, JSON and TSV data formats
• Manual data entry via UI
• Upload files for batch processing
• Hybrid Cloud
• Batch and real-time views of data
• Data visualization and analytics
• YARN features
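As an illustration of the multi-format support listed above, here is a minimal sketch of turning XML, JSON and TSV payloads into one common record shape. The function `parse_record`, the `FIELDS` schema and the sample payloads are hypothetical and are not taken from the DiP code base.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_record(payload, fmt, tsv_fields=None):
    """Turn one xml/json/tsv payload into a flat dict of strings."""
    if fmt == "json":
        return {key: str(value) for key, value in json.loads(payload).items()}
    if fmt == "xml":
        root = ET.fromstring(payload)
        # Assumes one level of child elements: <record><id>1</id>...</record>
        return {child.tag: (child.text or "") for child in root}
    if fmt == "tsv":
        if tsv_fields is None:
            raise ValueError("tsv payloads need a list of field names")
        row = next(csv.reader(io.StringIO(payload), delimiter="\t"))
        return dict(zip(tsv_fields, row))
    raise ValueError(f"unsupported format: {fmt}")


FIELDS = ["id", "user", "text"]  # hypothetical schema for the TSV input
from_json = parse_record('{"id": 1, "user": "amy", "text": "hi"}', "json")
from_xml = parse_record(
    "<record><id>1</id><user>amy</user><text>hi</text></record>", "xml"
)
from_tsv = parse_record("1\tamy\thi", "tsv", tsv_fields=FIELDS)
```

Normalizing at the edge means everything downstream (streaming engine, storage, reporting) only has to handle one record shape.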

Use Cases – Any Data
• Sentiment Analysis
• Log Analysis
• Clickstream Analysis
• Analyze Machine and Sensor Data
• Social Media and Customer Sentiment

UI
https://techblog.xavient.com/

What was in the previous slide? Is that for real?
No more memes… enough now :)

DiP Technology Stack
Messaging System: Apache Kafka
Target System: HDFS, NoSQL, Apache Hive
Reporting System: Apache Phoenix, Apache Zeppelin
Source System: Web Client
Streaming APIs: Apache Apex, Apache Flink, Apache Spark and Apache Storm
Programming Language: Java
IDE: Eclipse
Build Tool: Apache Maven
Operating System: CentOS 7

DiP High Level Architecture

DiP using Storm
Apache Storm features
• Multiple processing paradigms - real-time, interactive and batch processing
• Reliable - each unit of data (tuple) will be processed at least once or exactly once
• Fast and scalable - parallel calculations are run across a cluster of machines
• Fault-tolerant - workers are restarted automatically if they die
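The "at least once" guarantee above can be sketched in plain Python, assuming a simple replay-until-acknowledged loop. `run_at_least_once` and `flaky_bolt` are hypothetical names, and this is not Storm's API; it only shows why a tuple can be seen more than once under this guarantee.

```python
from collections import deque


def run_at_least_once(tuples, process, max_attempts=3):
    """Replay a tuple until process() acknowledges it by returning True."""
    pending = deque((item, 1) for item in tuples)
    delivered = []  # every delivery attempt, including replays
    failed = []     # tuples given up on after max_attempts
    while pending:
        item, attempt = pending.popleft()
        delivered.append(item)
        if process(item):
            continue  # acknowledged: nothing more to do for this tuple
        if attempt < max_attempts:
            pending.append((item, attempt + 1))  # not acknowledged: replay
        else:
            failed.append(item)
    return delivered, failed


seen = {}


def flaky_bolt(item):
    """Hypothetical processing step that fails the first time it sees 'b'."""
    seen[item] = seen.get(item, 0) + 1
    return not (item == "b" and seen[item] == 1)


delivered, failed = run_at_least_once(["a", "b", "c"], flaky_bolt)
```

Here "b" is delivered twice, so downstream writes need to be idempotent (or deduplicated) unless the engine is configured for exactly-once semantics.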

DiP using Spark Streaming
Spark Streaming features
• Multiple processing paradigms - batch and interactive
• Ease of use - offers high-level operators usable from Java, Scala and Python
• Fault tolerance - lost work and operator state can be recovered with no extra code
• Code reusability - the same code can be used for batch processing, to join streams against historical data, or to run ad-hoc queries on stream state
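The code-reusability point can be illustrated without any Spark dependency. Below is a minimal sketch, assuming plain Python lists stand in for batch data and for a stream cut into micro-batches; `count_words`, `micro_batches` and the sample log lines are hypothetical and this is not the Spark Streaming API.

```python
from collections import Counter


def count_words(lines):
    """Business logic written once: word counts for any iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def micro_batches(records, size):
    """Cut an incoming sequence into fixed-size micro-batches."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


history = ["error disk full", "info job done"]                 # batch data
incoming = ["error disk full", "warn slow", "error timeout"]   # streamed data

# Batch path: call the function once over the whole data set.
batch_counts = count_words(history)

# Streaming path: the same function runs on each micro-batch and the
# per-batch results are merged into running state.
running = Counter()
for batch in micro_batches(incoming, size=2):
    running += count_words(batch)
```

The streaming result equals what a single batch run over the same records would produce, which is the property that lets one code path serve both views.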

DiP using Apex
Apache Apex features
• Modular - Malhar, a library of operators, comes bundled with Apex for quick development cycles
• Supports both stream and batch processing
• Supports operator exchange at runtime
• Supports fault tolerance and dynamic scaling

DiP using Flink
Apache Flink features
Multiple processing paradigms - distributed stream and batch processing
Several APIs for creating applications are supported:
• DataStream API for unbounded streams, embedded in Java and Scala
• DataSet API for static data, embedded in Java, Scala and Python
• Table API with a SQL-like expression language, embedded in Java and Scala
Fault tolerance for distributed computations over data streams

DiP-Druid Architecture (High Level)
Credit: https://imply.io/docs/latest/
https://techblog.xavient.com/kafka-druid-integration-with-ingestion-dip-real-time-data

Data Access
Apache Zeppelin / Custom UI
• Data stored on HDFS as Hive external tables
• Data stored on HBase as Phoenix views

Custom UI “Co-Dev”
• Integrated with Elasticsearch
• Enterprise security and SSO
• Recommendation model based on user profile, tags and activity
• Chat
• Blog/Droplet features
• Task creation and follow-up
• Notifications
• Smartphone app

DiP @ Hallwaze.com

Get involved
https://github.com/XavientInformationSystems/Data-Ingestion-Platform
Co-Dev: Reach out if you want to customize the platform, choose the right streaming engine for your latency needs and use case, or build a custom UI/reporting layer.

Hybrid Cloud

Hadoop and Cloud

Apache Falcon
Diagram: DiP and Hadoop on-prem, Hadoop in the cloud
Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters. It can be used to replicate data from one cluster to another.

Kafka Mirroring
The Kafka mirroring feature is used to create a replica of an existing cluster, for example to replicate an active datacenter into a passive datacenter. Kafka provides the MirrorMaker tool for mirroring the source cluster into a target cluster.
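Conceptually, mirroring is a consumer on the source cluster feeding a producer on the target cluster. Here is a minimal sketch of that loop, assuming plain Python lists stand in for a topic on each side; the `mirror` function and the sample records are hypothetical and this is not the mirror maker tool itself.

```python
def mirror(source, target, committed_offset):
    """Copy records past committed_offset from source to target.

    Returns the new offset so the next run resumes where this one stopped.
    """
    new_records = source[committed_offset:]
    target.extend(new_records)
    return committed_offset + len(new_records)


on_prem = ["order-1", "order-2"]   # stand-in for a topic in the source cluster
cloud = []                         # stand-in for the same topic in the target

offset = mirror(on_prem, cloud, 0)
on_prem.append("order-3")          # new data arrives on-prem
offset = mirror(on_prem, cloud, offset)
```

Tracking the committed offset is what keeps a restarted mirror from copying the whole topic again.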

Kafka Mirroring – Hybrid Cloud Environment

Cassandra
Diagram: DiP with Cassandra on-prem and Cassandra in the cloud
• RDBMS migration
• DSE Advanced Replication
• Kafka

WIP
• Integration with Kafka Connect and Kafka Streams
• Data munging and validation
• Machine learning
• Search - Elasticsearch, Solr

Thanks!
@allaboutbdata
nsabharwal@xavient.com
