Real-time Analytics
Kafka, Apache Samza, Hadoop YARN, Druid, Tranquility and Metabase.
Leandro Totino Pereira
DevOps/Cloud Engineer
Agenda
• What is analytics?
• How can we get pattern data?
• Ad-hoc solution
• ETL types
• Real-time streaming
• What is Kafka?
• Apache Hadoop YARN
• Druid
• Tranquility
• Business intelligence web application
What is analytics?
Analytics is the discovery, interpretation, and communication of meaningful patterns in data.
It can be used in the following scenarios:
• Data-driven decisions
• Forecasting future results
• Reporting
• Machine learning
• Metrics/monitoring
• Optimizing data
How can we get pattern data?
In computing, extract, transform, load (ETL) refers to a process in database usage, especially
in data warehousing. Alternatively, you can use interactive ad-hoc analysis, where a single tool
performs the ETL steps across multiple data sources.
Ad-hoc solution
Presto – Multiple database support: MySQL, PostgreSQL, S3, Cassandra, HDFS, etc.
Apache Drill – Multiple NoSQL database support: MongoDB, HBase, HDFS, S3, etc.
Advantages
• No need to build complex infrastructure for analytics
• No need to extract information to other systems
Disadvantages
• All ETL steps are done at once
• Data cleansing is complex
• Information is extracted from production servers
ETL types
Batch mode extracts data with copy tools, run as jobs, to populate a data warehouse such as
HDFS, on top of which business analytics can finally be built. Real-time streaming, on the other
hand, performs the ETL in real time.
Conclusion
In my view, batch mode is suited only to legacy systems that cannot migrate to real-time
streaming, or to small systems.
Real-Time Streaming
Real-Time Streaming topology
You can extract data with a tool called Flume or directly from your applications. Flume can
collect data from various types of sources and output it to Kafka and HDFS.
What is Kafka?
Kafka is a distributed messaging system that provides fast, highly scalable, and redundant
messaging through a pub-sub model.
The basic architecture of Kafka is organized around a few key terms: topics, producers,
consumers, and brokers.
• A topic is the container with which messages are associated. It is divided into a number of partitions.
• Producers are responsible for publishing data/messages into a topic.
• Consumers are responsible for getting messages from a topic.
• Each node in the cluster is called a Kafka broker.
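The four terms above can be illustrated with a small in-memory model. This is only a sketch and not a real Kafka client: the class `ToyBroker`, its methods, and the topic and key names are hypothetical, and the CRC32 hash merely stands in for whatever partitioner a real producer uses.

```python
from collections import defaultdict
import zlib


class ToyBroker:
    """In-memory stand-in for a Kafka broker; not a real Kafka client."""

    def __init__(self, partitions_per_topic=3):
        self.partitions_per_topic = partitions_per_topic
        # topic name -> list of partitions, each an append-only list of messages
        self.topics = defaultdict(
            lambda: [[] for _ in range(self.partitions_per_topic)]
        )

    def publish(self, topic, key, value):
        """Producer side: append a message to the partition chosen by its key."""
        partitions = self.topics[topic]
        index = zlib.crc32(key.encode("utf-8")) % len(partitions)
        partitions[index].append((key, value))
        return index

    def fetch(self, topic, partition, offset):
        """Consumer side: read every message at or after `offset`."""
        return self.topics[topic][partition][offset:]


broker = ToyBroker(partitions_per_topic=3)
p1 = broker.publish("metrics", "compute-3", "cpu=0.42")
p2 = broker.publish("metrics", "compute-3", "cpu=0.57")

# Messages with the same key land in the same partition, in publish order.
messages = broker.fetch("metrics", p1, offset=0)
print(p1 == p2, messages)
```

The consumer tracks its own offset here, which mirrors the idea that a consumer reads a partition sequentially from a position it controls.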
Apache Hadoop YARN (Yet Another Resource Negotiator)
Client
Submits an application/job.
Resource Manager
Tracks the Node Managers and the resources available in the cluster. Monitors
application masters.
Node Manager
Provides computational resources and manages application containers.
Application Master
Negotiates appropriate resources for containers. Monitors the containers and their resource
consumption.
Container
Runs the application spawned by the application master.
What is Samza?
Apache Samza is a distributed stream processing framework (it runs as an application on YARN).
It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance,
processor isolation, security, and resource management. It is commonly used to transform,
clean up, and normalize data before it is saved to the data warehouse.
You can transform and clean up data across several jobs, forwarding it from one job to the next
through Kafka topics. For example, if the message “I'm Leandro and I'm system engineer” reaches
Samza job 1, that job can normalize it to “name: Leandro, and I'm system engineer”, and Samza
job 2 can then transform it to “name: Leandro, job: system engineer”.
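The two-job example above can be sketched as two plain functions chained together. This models only the transformation logic. In a real deployment each stage would be a Samza task reading from one Kafka topic and writing to another, and the function names and regular expressions here are assumptions made for illustration.

```python
import re


def job1_extract_name(message):
    """First stage: turn "I'm <name>" into "name: <name>,"."""
    return re.sub(r"^I'm (\w+)", r"name: \1,", message)


def job2_extract_job(message):
    """Second stage: turn "and I'm <job>" into "job: <job>"."""
    return re.sub(r"and I'm (.+)$", r"job: \1", message)


raw = "I'm Leandro and I'm system engineer"
after_job1 = job1_extract_name(raw)       # what job 1 would forward to the next topic
after_job2 = job2_extract_job(after_job1)  # what job 2 would emit
print(after_job1)
print(after_job2)
```

Keeping each stage small makes it possible to insert, replace, or reorder jobs by rewiring topics rather than rewriting one large transformation.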
Samza Hadoop Integration
The YARN web UI shows a lot of information about the cluster, such as resource usage and
availability, the number of jobs and their status, and information about application masters and containers.
Samza workflow
You start a job on the YARN grid by running the Samza script run-job.sh with a specific
configuration file for each job. In the config file you must set the job name, the location of
the YARN package file, the task class that provides the process method, the Kafka input topic
name, and so on.
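As a rough illustration of the settings listed above, the sketch below renders a per-job properties file from a dictionary. The key names are the ones I recall from Samza's configuration reference and should be checked against the Samza version in use. Every value (job name, package path, class name, topic) is a placeholder, and `render_properties` is a hypothetical helper.

```python
def render_properties(settings):
    """Render a dict as Java-style .properties text, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# All values are placeholders for illustration only.
job_settings = {
    "job.factory.class": "org.apache.samza.job.yarn.YarnJobFactory",
    "job.name": "normalize-name",
    "yarn.package.path": "hdfs://namenode/path/to/samza-job-package.tar.gz",
    "task.class": "com.example.NormalizeNameTask",
    "task.inputs": "kafka.raw-messages",
}

properties_text = render_properties(job_settings)
print(properties_text)
```

The resulting file is what would be passed to run-job.sh, one file per job.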
Druid – Real-time and historical data warehouse
Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data
aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of
data. Druid is most commonly used to power user-facing analytic applications.
Sub-second OLAP Queries
Druid’s unique architecture enables rapid multi-dimensional filtering, ad-hoc attribute
groupings, and extremely fast aggregations.
Real-time Streaming Ingestion
Druid employs lock-free ingestion to allow for simultaneous ingestion and querying of high
dimensional, high volume data sets. Explore events immediately after they occur.
Power Analytic Applications
Druid has numerous features built for multi-tenancy. Power user-facing analytic applications
designed to be used by thousands of concurrent users.
Cost Effective
Druid is extremely cost effective at scale and has numerous features built in for cost
reduction. Trade off cost and performance with simple configuration knobs.
Highly Available
Druid is used to back SaaS implementations that need to be up all the time. Druid supports
rolling updates so your data is still available and queryable during software updates. Scale up
or down without data loss.
Scalable
Existing Druid deployments handle trillions of events, petabytes of data, and thousands of
queries every second.
Source: http://druid.io/druid.htm
Druid architecture
Druid Components
Historical nodes commonly form the backbone of a Druid cluster. Historical nodes download immutable segments locally and serve
queries over those segments. The nodes have a shared nothing architecture and know how to load segments, drop segments, and
serve queries on segments.
Broker nodes are what clients and applications query to get data from Druid. Broker nodes are responsible for scattering
queries and gathering and merging results. Broker nodes know what segments live where.
Coordinator nodes manage segments on historical nodes in a cluster. Coordinator nodes tell historical nodes to load new
segments, drop old segments, and move segments to load balance.
Real-time processing in Druid can currently be done using standalone realtime nodes or using the indexing service. The real-time logic is
common between these two services. Real-time processing involves ingesting data, indexing the data (creating segments), and handing
segments off to historical nodes. Data is queryable as soon as it is ingested by the realtime processing logic. The hand-off process is also
lossless; data remains queryable throughout the entire process.
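The broker's scatter and gather role described above can be pictured with a toy model. This is a sketch under stated assumptions and not Druid code: the node names, segment names, rows, and both functions are hypothetical, and the "query" is simply a sum of a value per host.

```python
from collections import Counter

# Hypothetical segment placement: historical node -> segments it serves.
# Each segment is a list of (host, value) rows.
historical_nodes = {
    "historical-1": {
        "seg-a": [("compute-3", 2), ("compute-4", 1)],
    },
    "historical-2": {
        "seg-b": [("compute-3", 5)],
        "seg-c": [("compute-4", 7)],
    },
}


def query_node(segments):
    """Historical side: aggregate value per host over the local segments."""
    partial = Counter()
    for rows in segments.values():
        for host, value in rows:
            partial[host] += value
    return partial


def broker_query(nodes):
    """Broker side: scatter to every node, then gather and merge the partials."""
    merged = Counter()
    for segments in nodes.values():
        merged.update(query_node(segments))
    return dict(merged)


totals = broker_query(historical_nodes)
print(totals)
```

Each node only touches its own segments, which is the shared-nothing property, and the broker needs nothing more than the partial results to produce the final answer.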
Querying Druid data
Requests and responses are in JSON format. In this example we retrieve the values of the
metrics field for the host compute-3.
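A request of that kind might look like the following sketch, which builds a Druid native timeseries query as JSON. The datasource name `metrics`, the dimension `host`, the metric `value`, and the interval are assumptions, since the slide does not show the actual schema. Only the host value compute-3 comes from the slide.

```python
import json

# Hypothetical datasource, dimension, and metric names.
query = {
    "queryType": "timeseries",
    "dataSource": "metrics",
    "granularity": "minute",
    "intervals": ["2017-01-01T00:00:00Z/2017-01-02T00:00:00Z"],
    "filter": {"type": "selector", "dimension": "host", "value": "compute-3"},
    "aggregations": [
        {"type": "doubleSum", "name": "value_sum", "fieldName": "value"}
    ],
}

body = json.dumps(query, indent=2)
print(body)
```

The JSON body would typically be sent with an HTTP POST to a broker node, which returns a JSON array of time buckets with the aggregated values.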
Tranquility – Sending events to Druid
Tranquility is a tool that reads the final processed data from Kafka topics and writes it into
Druid datasources.
You must know the structure of the incoming data and how it will be stored in the Druid
datasource, so you must map dimensions and metrics in the Tranquility configuration file.
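The mapping of dimensions and metrics can be sketched as the data schema portion of such a configuration, built here as a Python dictionary and printed as JSON. The field names (`timestamp`, `host`, `metric`, `value`) and the datasource name are assumptions about the incoming events, and the exact layout of the surrounding Tranquility file depends on the version in use.

```python
import json

# Hypothetical field names for an incoming event such as
# {"timestamp": "...", "host": "compute-3", "metric": "cpu", "value": 0.42}
data_schema = {
    "dataSource": "metrics",
    "parser": {
        "type": "string",
        "parseSpec": {
            "format": "json",
            "timestampSpec": {"column": "timestamp", "format": "auto"},
            # Dimensions: fields you filter and group by.
            "dimensionsSpec": {"dimensions": ["host", "metric"]},
        },
    },
    # Metrics: fields that are aggregated at ingestion time.
    "metricsSpec": [
        {"type": "count", "name": "count"},
        {"type": "doubleSum", "name": "value_sum", "fieldName": "value"},
    ],
    "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "hour",
        "queryGranularity": "none",
    },
}

schema_json = json.dumps(data_schema, indent=2)
print(schema_json)
```

Any field not listed as a dimension or a metric would not be stored in this sketch, which is why the incoming structure has to be known in advance.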
Business intelligence web application
Business intelligence web applications let users explore and visualize the data in the data
warehouse and create reports easily.
Superset – An amazing tool developed by Airbnb that lets users create great reports, but we
ran into some limitations when querying raw data rather than aggregated data. Installation
requires many Python pip modules.
Tableau – We didn't have an opportunity to test it, but it is an enterprise/commercial solution
and looks like the most complete.
Metabase – Easy to install and operate. Setting up reports is pretty straightforward.
Metabase - Open source business intelligence tool
Get the jar file, run it, and access the web UI at
https://<Address>:3000
Add a database/datasource connection in the web UI.
Use Ask Question to build a report/analysis.
Thank you!
Questions?
More information:
LinkedIn:
https://www.linkedin.com/in/leandro-totino-pereira
Facebook:
https://www.facebook.com/leandro.totinopereira

Mais conteúdo relacionado

Mais procurados

BI Consultancy - Data, Analytics and Strategy
BI Consultancy - Data, Analytics and StrategyBI Consultancy - Data, Analytics and Strategy
BI Consultancy - Data, Analytics and StrategyShivam Dhawan
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceData Science Thailand
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsSkillspeed
 
The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science ProcessVishal Patel
 
Social media analytics powered by data science
Social media analytics powered by data scienceSocial media analytics powered by data science
Social media analytics powered by data scienceNavin Manaswi
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
Democratizing Data at Airbnb
Democratizing Data at AirbnbDemocratizing Data at Airbnb
Democratizing Data at AirbnbNeo4j
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewSivashankar Ganapathy
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analyticsSSaudia
 
Data Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsData Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsIke Ellis
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataJoey Li
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data AnalyticsUtkarsh Sharma
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 

Mais procurados (20)

BI Consultancy - Data, Analytics and Strategy
BI Consultancy - Data, Analytics and StrategyBI Consultancy - Data, Analytics and Strategy
BI Consultancy - Data, Analytics and Strategy
 
Data Mesh
Data MeshData Mesh
Data Mesh
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science Process
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Social media analytics powered by data science
Social media analytics powered by data scienceSocial media analytics powered by data science
Social media analytics powered by data science
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Democratizing Data at Airbnb
Democratizing Data at AirbnbDemocratizing Data at Airbnb
Democratizing Data at Airbnb
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analytics
 
Data Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsData Modeling on Azure for Analytics
Data Modeling on Azure for Analytics
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 

Semelhante a Real time analytics

Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...AboutYouGmbH
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworksIJDKP
 
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionBig Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionEtu Solution
 
data analytics lecture4.pptx
data analytics lecture4.pptxdata analytics lecture4.pptx
data analytics lecture4.pptxNamrataBhatt8
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist SoftServe
 
IRJET- Secured Hadoop Environment
IRJET- Secured Hadoop EnvironmentIRJET- Secured Hadoop Environment
IRJET- Secured Hadoop EnvironmentIRJET Journal
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 

Semelhante a Real time analytics (20)

Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionBig Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
 
data analytics lecture4.pptx
data analytics lecture4.pptxdata analytics lecture4.pptx
data analytics lecture4.pptx
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
paper
paperpaper
paper
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
IRJET- Secured Hadoop Environment
IRJET- Secured Hadoop EnvironmentIRJET- Secured Hadoop Environment
IRJET- Secured Hadoop Environment
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop
HadoopHadoop
Hadoop
 

Mais de Leandro Totino Pereira

Backup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipesBackup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipesLeandro Totino Pereira
 
Discover/Register Everything in consul
Discover/Register Everything in consulDiscover/Register Everything in consul
Discover/Register Everything in consulLeandro Totino Pereira
 
Monitoring at scale - Sensu Kafka Kafka-connect Cassandra PrestoDB
Monitoring at scale - Sensu Kafka Kafka-connect Cassandra PrestoDBMonitoring at scale - Sensu Kafka Kafka-connect Cassandra PrestoDB
Monitoring at scale - Sensu Kafka Kafka-connect Cassandra PrestoDBLeandro Totino Pereira
 
Gocd – Kubernetes/Nomad Continuous Deployment
Gocd – Kubernetes/Nomad Continuous DeploymentGocd – Kubernetes/Nomad Continuous Deployment
Gocd – Kubernetes/Nomad Continuous DeploymentLeandro Totino Pereira
 
Linkerd – Service mesh with service Discovery backend
Linkerd – Service mesh with service Discovery backendLinkerd – Service mesh with service Discovery backend
Linkerd – Service mesh with service Discovery backendLeandro Totino Pereira
 
DynomiteDB - No spof High-availability Redis cluster solution
DynomiteDB -  No spof High-availability Redis cluster solutionDynomiteDB -  No spof High-availability Redis cluster solution
DynomiteDB - No spof High-availability Redis cluster solutionLeandro Totino Pereira
 
DalmatinerDB and cockroachDB monitoring plataform
DalmatinerDB and cockroachDB monitoring plataformDalmatinerDB and cockroachDB monitoring plataform
DalmatinerDB and cockroachDB monitoring plataformLeandro Totino Pereira
 

Mais de Leandro Totino Pereira (9)

Backup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipesBackup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipes
 
Zabbix at scale with Elasticsearch
Zabbix at scale with ElasticsearchZabbix at scale with Elasticsearch
Zabbix at scale with Elasticsearch
 
Discover/Register Everything in consul
Discover/Register Everything in consulDiscover/Register Everything in consul
Discover/Register Everything in consul
 
Monitoring at scale - Sensu Kafka Kafka-connect Cassandra PrestoDB
Monitoring at scale - Sensu Kafka Kafka-connect Cassandra PrestoDBMonitoring at scale - Sensu Kafka Kafka-connect Cassandra PrestoDB
Monitoring at scale - Sensu Kafka Kafka-connect Cassandra PrestoDB
 
Automate schedule
Automate scheduleAutomate schedule
Automate schedule
 
Gocd – Kubernetes/Nomad Continuous Deployment
Gocd – Kubernetes/Nomad Continuous DeploymentGocd – Kubernetes/Nomad Continuous Deployment
Gocd – Kubernetes/Nomad Continuous Deployment
 
Linkerd – Service mesh with service Discovery backend
Linkerd – Service mesh with service Discovery backendLinkerd – Service mesh with service Discovery backend
Linkerd – Service mesh with service Discovery backend
 
DynomiteDB - No spof High-availability Redis cluster solution
DynomiteDB -  No spof High-availability Redis cluster solutionDynomiteDB -  No spof High-availability Redis cluster solution
DynomiteDB - No spof High-availability Redis cluster solution
 
DalmatinerDB and cockroachDB monitoring plataform
DalmatinerDB and cockroachDB monitoring plataformDalmatinerDB and cockroachDB monitoring plataform
DalmatinerDB and cockroachDB monitoring plataform
 

Último

Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 

Último (20)

Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
IVE Industry Focused Event - Defence Sector 2024

Real time analytics

  • 1. Real-time Analytics: Kafka, Apache Samza, Hadoop YARN, Druid, Tranquility and Metabase. Leandro Totino Pereira, DevOps/Cloud Engineer
  • 2. Agenda: What is analytics?; How can we get pattern data?; Ad-hoc solution; ETL types; Real-time streaming; What is Kafka?; Apache Hadoop YARN; Druid; Tranquility; Business intelligence web application
  • 3. What is analytics? Analytics is the discovery, interpretation, and communication of meaningful patterns in data. It can be used in the following scenarios: data-driven decisions, forecasting future results, reporting, machine learning, metrics/monitoring, and data optimization.
  • 4. How can we get pattern data? In computing, extract, transform, load (ETL) refers to a process used in databases and especially in data warehousing. Alternatively, you can use interactive ad-hoc analysis, where a single tool performs ETL across multiple data sources.
  • 5. Ad-hoc solution. Presto: multiple database support (MySQL, PostgreSQL, S3, Cassandra, HDFS, etc.). Apache Drill: multiple NoSQL database support (MongoDB, HBase, HDFS, S3, etc.). Advantages: no need to build complex infrastructure for analytics; no need to extract information to other systems. Disadvantages: all ETL steps are done at once; data cleansing is complex; information is extracted from production servers.
  • 6. ETL types. Batch mode extracts data using copy tools, run as jobs, to populate a data warehouse such as HDFS, on top of which we can finally build business analytics. Real-time streaming, on the other hand, performs ETL in real time. Conclusion: in my perspective, batch mode is only for legacy systems that cannot migrate to real-time streaming, or for small ones.
  • 8. Real-Time Streaming topology. You can extract data with a tool called Flume or directly from your applications. Flume can collect data from various types of sources and output it to Kafka and HDFS.
  • 9. What is Kafka? Kafka is a distributed messaging system providing fast, highly scalable and redundant messaging through a pub-sub model. The basic architecture of Kafka is organized around a few key terms: topics, producers, consumers, and brokers. A topic is the container with which messages are associated; it is divided into a number of partitions. Each node in the cluster is called a Kafka broker. Producers are responsible for publishing data/messages to a topic. Consumers are responsible for getting messages from a topic.
  • 10. Apache Hadoop YARN (Yet Another Resource Negotiator). Client: submits an application/job. Node Manager: provides computational resources and manages application containers. Application Master: monitors the containers and their resource consumption, and negotiates appropriate resources for containers. Container: runs the application spawned by the Application Master. Resource Manager: checks the Node Managers and the available resources in the cluster, and monitors the Application Masters.
  • 11. What is Samza? Apache Samza is a distributed stream processing framework (an application manager on YARN). It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It is commonly used to transform, clean up, and normalize data before saving it to a data warehouse. You can transform/clean up data in one job and forward it to the next job through Kafka topics. For example, if the message “I'm Leandro and I'm system engineer” reaches Samza job 1, that job can normalize it to “name: Leandro, and I'm system engineer”, and Samza job 2 can then transform it to “name: Leandro, job: system engineer”.
  • 12. Samza Hadoop Integration. The YARN web UI shows a lot of information about your cluster, such as used and available resources, the number of jobs and their status, and information about application masters and containers.
  • 13. Samza workflow. You start a job on the YARN grid by running the Samza script run-job.sh with a specific configuration file for each job. In the config file you must set the job name, the location of the YARN package file, the task class that provides the process method, the Kafka input topic name, etc.
  • 14. Druid: real-time and historical data warehouse. Druid provides low-latency (real-time) data ingestion, flexible data exploration, and fast data aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of data. Druid is most commonly used to power user-facing analytic applications. Sub-second OLAP queries: Druid’s unique architecture enables rapid multi-dimensional filtering, ad-hoc attribute groupings, and extremely fast aggregations. Real-time streaming ingestion: Druid employs lock-free ingestion to allow for simultaneous ingestion and querying of high-dimensional, high-volume data sets. Explore events immediately after they occur. Power analytic applications: Druid has numerous features built for multi-tenancy. Power user-facing analytic applications designed to be used by thousands of concurrent users. Cost effective: Druid is extremely cost effective at scale and has numerous features built in for cost reduction. Trade off cost and performance with simple configuration knobs. Highly available: Druid is used to back SaaS implementations that need to be up all the time. Druid supports rolling updates so your data is still available and queryable during software updates. Scale up or down without data loss. Scalable: existing Druid deployments handle trillions of events, petabytes of data, and thousands of queries every second. Source: http://druid.io/druid.htm
  • 16. Druid Components. Historical nodes commonly form the backbone of a Druid cluster. Historical nodes download immutable segments locally and serve queries over those segments. The nodes have a shared-nothing architecture and know how to load segments, drop segments, and serve queries on segments. Broker nodes are what clients and applications query to get data from Druid. Broker nodes are responsible for scattering queries and gathering and merging results. Broker nodes know what segments live where. Coordinator nodes manage segments on historical nodes in a cluster. Coordinator nodes tell historical nodes to load new segments, drop old segments, and move segments to load balance. Real-time processing in Druid can currently be done using standalone real-time nodes or using the indexing service. The real-time logic is common between these two services. Real-time processing involves ingesting data, indexing the data (creating segments), and handing segments off to historical nodes. Data is queryable as soon as it is ingested by the real-time processing logic. The hand-off process is also lossless; data remains queryable throughout the entire process.
  • 17. Querying Druid data. Requests and responses are in JSON format. In this example we are getting the values of the metrics field for the host compute-3.
  • 18. Tranquility: sending events to Druid. Tranquility is a tool that reads the final processed data from Kafka topics and writes it into Druid datasources. You must know the structure of the incoming data and how it will be stored in the Druid datasource; therefore, you must map dimensions and metrics in the Tranquility configuration file.
  • 19. Business intelligence web application. Business intelligence web applications let users explore and visualize the data in a data warehouse and create reports easily. Superset: an amazing tool developed by Airbnb that lets users create awesome reports, but we ran into some limitations when querying raw data rather than aggregated data. Installation requires many Python pip modules. Tableau: we didn't have an opportunity to test it, but it is an enterprise/commercial solution and looks like the most complete. Metabase: easy to install and operate. Setting up reports is pretty straightforward.
  • 20. Metabase: open source business intelligence tool. Get the jar file, run it, and access it at https://<Address>:3000. Add a database/datasource connection in the web UI. Ask a question to build a report/analysis.
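
The Kafka terms from slide 9 (topic, partition, broker, producer, consumer) can be illustrated with a small in-memory model. This is only a sketch and does not use a real Kafka client: the class names, the topic name `events`, and the CRC32-based partition choice are all assumptions made for illustration and are not how Kafka itself hashes keys.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A named container of messages, split into partitions (append-only lists)."""
    name: str
    num_partitions: int
    partitions: list = field(default_factory=list)

    def __post_init__(self):
        self.partitions = [[] for _ in range(self.num_partitions)]


class Broker:
    """One node of the cluster; here it simply holds topics in memory."""
    def __init__(self):
        self.topics = {}

    def create_topic(self, name, num_partitions):
        self.topics[name] = Topic(name, num_partitions)


class Producer:
    """Publishes messages to a topic, choosing the partition from the key."""
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic_name, key, value):
        topic = self.broker.topics[topic_name]
        # Assumption: CRC32 of the key picks the partition (not Kafka's real hash).
        partition = zlib.crc32(key.encode("utf-8")) % topic.num_partitions
        topic.partitions[partition].append(value)
        return partition


class Consumer:
    """Reads messages from a topic, remembering one offset per partition."""
    def __init__(self, broker, topic_name):
        self.topic = broker.topics[topic_name]
        self.offsets = [0] * self.topic.num_partitions

    def poll(self):
        messages = []
        for index, log in enumerate(self.topic.partitions):
            messages.extend(log[self.offsets[index]:])
            self.offsets[index] = len(log)  # advance past what was just read
        return messages


broker = Broker()
broker.create_topic("events", num_partitions=3)
producer = Producer(broker)
consumer = Consumer(broker, "events")

first = producer.send("events", key="compute-3", value="cpu=0.42")
second = producer.send("events", key="compute-3", value="cpu=0.57")
producer.send("events", key="compute-7", value="cpu=0.10")

received = consumer.poll()
print(first == second, sorted(received))
```

Messages with the same key land in the same partition, and the consumer only sees each message once because it tracks an offset per partition.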
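
The two-job example from slide 11 can be sketched in plain Python. This is not Samza code: the deques stand in for Kafka topics, and the topic names, function names and regular expressions are assumptions chosen only to reproduce the message transformation described on the slide.

```python
import re
from collections import deque

# In-memory stand-ins for Kafka topics (hypothetical names).
topics = {"raw": deque(), "stage1": deque(), "final": deque()}


def job1(message: str) -> str:
    """First job: turn the leading "I'm <name>" into "name: <name>,"."""
    return re.sub(r"^I'm (\w+)", r"name: \1,", message, count=1)


def job2(message: str) -> str:
    """Second job: turn ", and I'm <job>" into ", job: <job>"."""
    return re.sub(r", and I'm (.+)$", r", job: \1", message, count=1)


def run_stage(job, source, sink):
    """Drain the source topic, apply the job, and forward to the sink topic."""
    while topics[source]:
        topics[sink].append(job(topics[source].popleft()))


topics["raw"].append("I'm Leandro and I'm system engineer")
run_stage(job1, "raw", "stage1")
intermediate = topics["stage1"][0]
run_stage(job2, "stage1", "final")
result = topics["final"][0]

print(intermediate)
print(result)
```

Each job only knows its input and output topic, which is what lets the stages be deployed and scaled separately.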
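
Slide 17 says that Druid requests and responses are JSON. The query shown on the original slide is not in the transcript, so the following is only a sketch of what such a request body might look like, using Druid's native timeseries query with a selector filter. The datasource name `metrics`, the dimension name `host`, the metric field `value`, the aggregator choice and the interval are all assumptions.

```python
import json


def build_timeseries_query(datasource, host, metric, interval):
    """Return a Druid native timeseries query (as a dict) filtered to one host."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "minute",
        # Keep only rows whose (assumed) "host" dimension equals the given host.
        "filter": {"type": "selector", "dimension": "host", "value": host},
        "aggregations": [
            {"type": "doubleSum", "name": metric + "_sum", "fieldName": metric}
        ],
        "intervals": [interval],
    }


query = build_timeseries_query(
    datasource="metrics",   # hypothetical datasource name
    host="compute-3",       # host named on the slide
    metric="value",         # hypothetical metric field
    interval="2017-01-01T00:00:00Z/2017-01-02T00:00:00Z",  # hypothetical interval
)
body = json.dumps(query, indent=2)
print(body)
```

The resulting JSON would typically be sent as an HTTP POST with `Content-Type: application/json` to a broker node, which returns the result as JSON as well. The sketch stops at building the body so that it runs without a cluster.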