Data platform architecture
1. DATA ARCHITECTURE & ROAD MAP
NEXT GENERATION DATA PLATFORM AND ARCHITECTURAL PATTERNS
BY SUDHEER KONDLA
SENIOR DATA PLATFORM ARCHITECT
2. OVERVIEW
• Define the problem
• Understand the problem
• Articulate the problem
• Craft a solution
3. DATA ARCHITECTURE SOLUTION
• To solve a real-time, high-volume data problem with low-latency response times, we need a data
platform capable of capturing, ingesting and streaming data, and optionally storing it for batch
analytics. In most real-time streaming platforms the data is short-lived: it is processed to build
predictive models that let marketing offer real-time recommendations, and can then be discarded. The
following characteristics are expected:
• Fast Data
• Require fast ingestion
• Real-time analytics
• Fast action
• Time to value
• Benefits
• Capture and use (or discard – time to live or purge)
• Insights real or near real-time
• Agile and Responsive
• Expressive
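The "capture and use (or discard – time to live or purge)" point can be illustrated with a small sketch. This is a toy in-memory store, not any particular product's API; the class name `TTLStore`, its methods and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLStore:
    """Toy in-memory store in which every record expires after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be shown without sleeping
        self._records: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        # Capture: keep the value together with its expiry time.
        self._records[key] = (self.clock() + self.ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        # Use: return the value only while it is still alive.
        entry = self._records.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._records[key]  # lazily discard on read
            return None
        return value

    def purge(self) -> int:
        """Discard all expired records and return how many were dropped."""
        now = self.clock()
        expired = [k for k, (exp, _) in self._records.items() if now >= exp]
        for key in expired:
            del self._records[key]
        return len(expired)
```

Stores such as Cassandra offer per-record time-to-live natively; the sketch only shows the idea that short-lived streaming data is used while fresh and dropped afterwards.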
5. ECOSYSTEM & INFRASTRUCTURE
• Multiple Data Sources:
• Web/app logs, Twitter (trending) and other social media, blogs, SOR (internal systems), HDFS
• Ingestion/Streaming
• Apache Flume (log capture/aggregation), Kafka (event streaming, data pipelines & messaging)
• Stream Analytics
• Spark/Storm API
• Data Store/Persistence
• HDFS, Cassandra, S3, Hive
• Infrastructure
• IaaS (Cloud) or On-premise or Hybrid Private Cloud
• Orchestration
• Mesos
7. REAL-TIME DATA PIPELINES
Real-time data pipeline: Kafka → Spark → Cassandra
• Kafka: collect data into Kafka (channel data)
• Spark: process micro-batches (aggregate, predict & act)
• Cassandra: persist data for later use (historical, analytics)
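The three pipeline stages (collect, process micro-batches, persist) can be sketched in plain Python. This is a simplified stand-in, not Kafka, Spark or Cassandra code: a `deque` plays the channel, a `Counter` plays the store, and batches are cut by record count rather than by time interval. All function names are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Dict, Iterator, List

Event = Dict[str, str]


def micro_batches(channel: Deque[Event], batch_size: int) -> Iterator[List[Event]]:
    """Drain the channel in small fixed-size batches."""
    while channel:
        n = min(batch_size, len(channel))
        yield [channel.popleft() for _ in range(n)]


def aggregate(batch: List[Event]) -> Counter:
    """Count events per user within one micro-batch."""
    return Counter(event["user"] for event in batch)


def run_pipeline(events: List[Event], batch_size: int = 3) -> Dict[str, int]:
    channel: Deque[Event] = deque(events)  # stage 1: collect into the channel
    store: Counter = Counter()             # stage 3 target: persisted totals
    for batch in micro_batches(channel, batch_size):
        # stage 2: process one micro-batch, then merge it into the store
        store.update(aggregate(batch))
    return dict(store)
```

Because each batch is aggregated on its own and then merged, the per-user totals are available incrementally instead of only after all data has arrived.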
9. CHOOSING THE RIGHT ECOSYSTEM
• Kafka:
• Distributed pub-sub messaging and data pipeline system
• Designed for processing real-time activity streams (logs, metrics)
• When to use: real-time decision making, working with streams of continuous data
• Why Kafka: Persistent messaging, High throughput, Fault tolerant.
• Spark:
• What is it: A distributed computing framework that scales out and integrates real-time data from many event
streams (Kafka, Flume, HDFS, S3, Twitter and other sources)
• Event-driven, asynchronous, scalable, type-safe and fault-tolerant
• Where does it fit:
• When you need real-time decision making: recommendations, fraud detection, real-time forecasting
• Why Spark Streaming:
• Provides high throughput and reliable processing of live data streams
• Batch, iterative and streaming workloads on the same platform
• Well suited to machine learning
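The "persistent messaging" property of a pub-sub log can be sketched as follows. This is a toy single-partition model, not the Kafka client API: `TopicLog`, `publish`, `poll` and `seek` are hypothetical names chosen for illustration. The point is that reading does not delete messages, so each consumer group tracks its own offset and can replay.

```python
from typing import Dict, List, Tuple


class TopicLog:
    """Toy single-partition topic backed by an append-only list."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._offsets: Dict[str, int] = {}  # next offset to read, per group

    def publish(self, message: str) -> int:
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group: str, max_records: int = 10) -> List[Tuple[int, str]]:
        """Return the next records for this group and advance its offset."""
        start = self._offsets.get(group, 0)
        end = min(start + max_records, len(self._messages))
        records = [(i, self._messages[i]) for i in range(start, end)]
        self._offsets[group] = end  # messages themselves stay in the log
        return records

    def seek(self, group: str, offset: int) -> None:
        """Move a group's position, for example back to 0 to replay."""
        self._offsets[group] = offset
```

A stream processor reading from such a log can restart after a failure and resume from its last offset, which is one reason a persistent log pairs well with micro-batch processing.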
10. CHOOSING THE RIGHT ECOSYSTEM
• Cassandra:
• What is it: Distributed database with high availability (multi-master, high write throughput)
• When to use: When you need to scale out, when data is needed in multiple data centers (geographic locations),
and when you need continuous availability and fast response times.
• Why Cassandra: Easy to scale out, high throughput, continuous availability, no single points of failure (SPOFs).
Integrates easily with Spark, including Spark Streaming, and with Solr search.
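The "easy to scale out, no SPOFs" claim rests on spreading each row over several nodes. A minimal sketch of that idea, assuming a simple hash ring with one token per node: the `Ring` class and `token` function are hypothetical, MD5 is used only as a stable stand-in hash (Cassandra's default partitioner is different), and virtual nodes and data-center awareness are left out.

```python
import bisect
import hashlib
from typing import List, Tuple


def token(value: str) -> int:
    """Map a string to a position on the ring (stable stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: one token per node, replicas placed clockwise."""

    def __init__(self, nodes: List[str], replication_factor: int = 3) -> None:
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds number of nodes")
        self.replication_factor = replication_factor
        self._ring: List[Tuple[int, str]] = sorted((token(n), n) for n in nodes)
        self._tokens: List[int] = [t for t, _ in self._ring]

    def replicas(self, partition_key: str) -> List[str]:
        """Return the nodes that hold copies of the given partition key."""
        # First node whose token is >= the key's token, wrapping around.
        start = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self.replication_factor)]
```

With a replication factor of 3, any single node can fail and two copies of every row remain reachable, which is the sense in which the design has no single point of failure.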