1 ITC CORPORATE PRESENTATION © IT Convergence 2017 • All rights reserved
A Presentation for:
Getting Started with Hadoop, Spark, Hive and Kafka
Edelweiss Kammermann
New York
March 8th 2018
IT CONVERGENCE SNAPSHOT
EXTENSIVE EXPERTISE ACROSS THE GLOBE
Over 600 Customer Engagements in More Than 50 Countries
About me
✓ Computer Engineer, BI and Data Integration Specialist
✓ Over 20 years of consulting and project management experience in Oracle technology
✓ Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG)
✓ Vice President of LAOUC (Latin America Oracle User Community)
✓ BI Manager at IT Convergence
✓ Writer and frequent speaker at international conferences: Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, etc.
✓ Oracle ACE Director
✓ Oracle Big Data Implementation Specialist
Uruguay
3 Membership Tiers
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate
bit.ly/OracleACEProgram
500+ Technical Experts Helping Peers Globally
Nominate yourself or someone you know: acenomination.oracle.com
Connect:
@oracleace
Facebook.com/oracleaces
oracle-ace_ww@oracle.com
Index
What is Big Data?
Hadoop
Hive
Spark
What is Big Data?
✓ Volume: Large amounts of data
✓ Variety: Different data types and formats, including unstructured and semi-structured data
✓ Velocity: Speed at which data is created and/or consumed
✓ Veracity: Quality and accuracy of the data
✓ Value: Data has intrinsic value, but it must be discovered
Hadoop
✓ An open source software platform for distributed storage and processing
✓ Manages huge volumes of unstructured data
✓ Parallel processing of large data sets
✓ Highly scalable
✓ Fault-tolerant
✓ Two main components:
  ✓ HDFS: Hadoop Distributed File System, for storing information
  ✓ MapReduce: programming framework that processes information
HDFS Architecture (Simplified)
Client: requests operations such as reading or writing data.
NameNode: manages metadata and access control. It knows where the data is (which DataNodes contain the blocks of each file) and keeps this information in memory.
DataNodes: store and retrieve data (blocks) on client request.
HDFS: Writing Data
1. Client: divides the file into fixed-size blocks (usually 64 or 128 MB).
2. Client to NameNode: for each block, asks which DataNodes it can write to, specifying the block size and replication factor.
3. NameNode to Client: for each block, provides the DataNode addresses, sorted by increasing distance.
HDFS: Writing Data
4. Client to first DataNode: sends the block data and the list of DataNodes.
5. Replication pipeline: each DataNode sends the data to the following DataNode in the list.
6. DataNodes to NameNode: each DataNode sends "Done" once the block data is written to disk.
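The write path described on these two slides can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not HDFS code: the DataNode names, the distance table, the replication factor of 3 and the helper functions are all hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the block sizes the slide mentions
REPLICATION_FACTOR = 3          # assumed value for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Client side: return the size in bytes of each block of the file."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def choose_datanodes(distances, replication=REPLICATION_FACTOR):
    """NameNode side: pick the closest DataNodes, sorted by increasing distance."""
    ranked = sorted(distances, key=distances.get)
    return ranked[:replication]


def write_block(block_id, pipeline, storage):
    """Replication pipeline: each DataNode stores the block, then the next one does."""
    done = []
    for node in pipeline:
        storage.setdefault(node, []).append(block_id)
        done.append(node)  # stands in for the "Done" message sent to the NameNode
    return done


# Hypothetical cluster: DataNode name -> network distance from the client
distances = {"dn1": 4, "dn2": 0, "dn3": 2, "dn4": 6}
storage = {}
blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
for block_id, _size in enumerate(blocks):
    write_block(block_id, choose_datanodes(distances), storage)
```

A 300 MB file becomes two full blocks plus a smaller final block, and every block lands on the three closest DataNodes.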
HDFS: Reading Data
1. Client to NameNode: asks for a specific file.
2. NameNode to Client: sends the list of blocks of the file and the list of DataNodes for each block.
3. Client to DataNode: downloads data from the nearest DataNode (sends the block number).
4. DataNode to Client: sends the data for the required block.
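The read path can be sketched the same way. The metadata table, block contents and distances below are made up for illustration; a real client obtains them from the NameNode and the DataNodes over the network.

```python
# Hypothetical NameNode metadata: file -> ordered blocks -> DataNodes holding a replica
block_locations = {
    "/tmp/words": [
        ("blk_0", ["dn1", "dn3"]),
        ("blk_1", ["dn2", "dn3"]),
    ]
}
# Hypothetical DataNode contents and distances from the client
datanode_blocks = {
    "dn1": {"blk_0": b"hello "},
    "dn2": {"blk_1": b"world"},
    "dn3": {"blk_0": b"hello ", "blk_1": b"world"},
}
distance_from_client = {"dn1": 2, "dn2": 4, "dn3": 1}


def read_file(path):
    """Fetch each block from the nearest DataNode that holds it, in block order."""
    data = b""
    sources = []
    # The NameNode answers with the block list and the replicas of each block
    for block_id, replicas in block_locations[path]:
        # The client picks the nearest replica and asks it for the block
        nearest = min(replicas, key=distance_from_client.get)
        data += datanode_blocks[nearest][block_id]
        sources.append(nearest)
    return data, sources


content, used_nodes = read_file("/tmp/words")
```

Note that the NameNode only hands out locations; the block data itself travels directly between the DataNodes and the client.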
HDFS: Fault Tolerance
✓ Node Failure
  ✓ DataNodes send a heartbeat every 3 seconds
  ✓ If the NameNode receives no heartbeat from a DataNode for 10 minutes, it considers that node dead
✓ Communication Failure
  ✓ Detected when the sender receives no ACK from a DataNode after several retries
✓ Data Corruption
  ✓ DataNodes send block reports to the NameNode that leave out the blocks found to be corrupted (checksum validation)
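As a rough sketch of the node-failure rule, the snippet below flags DataNodes whose last heartbeat is too old. It assumes the stock HDFS formula for the timeout (twice the recheck interval plus ten heartbeat intervals, with a recheck interval of 5 minutes), which gives 630 seconds, close to the 10 minutes quoted above. The timestamps and node names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 3        # heartbeat period from the slide
RECHECK_INTERVAL_S = 5 * 60     # assumed default recheck interval

# Assumed formula: 2 * recheck interval + 10 * heartbeat interval
DEAD_TIMEOUT_S = 2 * RECHECK_INTERVAL_S + 10 * HEARTBEAT_INTERVAL_S


def dead_nodes(last_heartbeat, now):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return sorted(
        node for node, seen in last_heartbeat.items()
        if now - seen > DEAD_TIMEOUT_S
    )


# Hypothetical timestamps (in seconds) of the last heartbeat seen per DataNode
last_heartbeat = {"dn1": 1000, "dn2": 350, "dn3": 998}
failed = dead_nodes(last_heartbeat, now=1001)
```

Once a node is flagged, the NameNode can schedule new replicas for the blocks that node held.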
HDFS: High Availability
✓ Standby NameNode (active-standby configuration)
✓ NameNodes use shared storage
✓ DataNodes send block reports to both NameNodes
Diagram: Active NameNode and Passive NameNode, both attached to Shared Storage
HDFS: Command Examples
✓ hadoop fs -ls
✓ hadoop fs -put <local_path> <hdfs_path>
✓ hadoop fs -get <hdfs_path> <local_path>
✓ hadoop fs -cat <hdfs_path>
✓ hadoop fs -rm -r <hdfs_path>
✓ hadoop fs -copyFromLocal <local_path> <hdfs_path>
MapReduce
✓ Processes data from HDFS
✓ A MapReduce program is composed of
  ✓ Map() method: performs filtering and sorting of the <key, value> inputs
  ✓ Reduce() method: summarizes the <key, value> pairs provided by the Mappers
✓ Code can be written in many languages (Perl, Python, Java, etc.)
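To make the Map and Reduce roles concrete, here is a minimal single-process word count in Python. It only imitates the programming model; the function names and sample lines are invented for this sketch, and no Hadoop API is involved.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle and sort: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: summarize the values for one key (here, sum the counts)."""
    return key, sum(values)


lines = ["Hadoop stores data", "Spark processes data", "Hadoop scales"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

On a cluster, the map calls run in parallel next to the data blocks and the framework performs the shuffle between nodes; the logic per record stays this small.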
MapReduce Example
MapReduce Code Example
Hadoop Demo
But…
MapReduce has a steep learning curve…
How can we analyze Big Data with a familiar language?
Hive
✓ Open source data warehouse software built on top of Apache Hadoop
✓ Analyzes and queries data stored in HDFS
✓ Structures the data into tables
✓ Tools for simple ETL
✓ SQL-like queries (HiveQL)
✓ Uses MapReduce as its execution engine
✓ Metadata is stored in an RDBMS
HiveQL
✓ UPDATE, INSERT, DELETE
✓ Limited transaction support
✓ Indexes supported
✓ Multi-table insert support
✓ SQL-92 join support
✓ Read-only views
Hive: Code Example
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
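A concrete query that follows this template might look like the one below. Since a Hive installation is not assumed here, the sketch runs the statement against an in-memory SQLite database as a stand-in; it uses only clauses that the two dialects share (no CLUSTER BY, DISTRIBUTE BY or SORT BY). The table, columns and rows are hypothetical.

```python
import sqlite3

# Hypothetical table and rows, used only to make the query concrete
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("ana", "eng", 2000), ("ben", "eng", 3000), ("cat", "ops", 1500),
     ("dan", "ops", 500), ("eve", "hr", 2500)],
)

# Uses only clauses from the template: SELECT, FROM, WHERE, GROUP BY, HAVING, LIMIT
query = """
SELECT dept, COUNT(*) AS headcount
FROM employees
WHERE salary > 1000
GROUP BY dept
HAVING COUNT(*) > 1
LIMIT 10
"""
rows = conn.execute(query).fetchall()
conn.close()
```

In Hive the same statement would be compiled into jobs that run over files in HDFS rather than executed by a local database engine.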
Hive: Pros & Cons
✓ Pros
  ✓ Familiarity with SQL
  ✓ Interactive
  ✓ Connection through JDBC/ODBC drivers
✓ Cons
  ✓ High latency
  ✓ No query cache
  ✓ Only supports equality joins (equi-joins)
Hive Demo
But…
Hive has high latency…
What if I want better performance and to analyze real-time data?
Spark
✓ Apache Spark is a fast, in-memory data processing engine
✓ Provides native bindings for Java, Scala, Python and R
✓ Supports SQL, streaming data, machine learning and graph processing
✓ Can run standalone, on Hadoop, or on Apache Mesos
Spark vs MapReduce
✓ Spark's main advantages over MapReduce
  ✓ Speed
    ✓ Can perform tasks up to 100 times faster if all the data fits in memory
    ✓ Otherwise it can still be more than 10 times faster
  ✓ Spark API (developer friendly)
Spark: Code Example
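// Assumes an existing SparkSession named sparkSession
val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")
// Split each line into words, pair each word with 1, then sum the counts per word
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///tmp/words_agg")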
Spark: Components
✓ Spark Core
✓ Spark Streaming
✓ Spark SQL
✓ MLlib
✓ GraphX
Spark: Resilient Distributed Dataset (RDD)
✓ A programming abstraction for a collection of objects
✓ Cannot be modified (immutable)
✓ Can be split across a computing cluster
✓ Can be created from text files, SQL databases, NoSQL databases (Cassandra, MongoDB, etc.)
✓ Operations on RDDs
  ✓ Can be split across the cluster and executed as a parallel batch process
  ✓ Fast and scalable parallel processing
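The properties above (immutable, partitioned, transformed in parallel) can be imitated with a toy class in plain Python. `MiniRDD` is a hypothetical name invented for this sketch and is not part of Spark; unlike real RDDs, it evaluates eagerly and runs on threads in a single process.

```python
from concurrent.futures import ThreadPoolExecutor


class MiniRDD:
    """Toy stand-in for an RDD: an immutable collection split into partitions."""

    def __init__(self, partitions):
        # Tuples make the stored data read-only
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_list(cls, items, num_partitions):
        items = list(items)
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, func):
        """Apply func to every element, one partition per task; returns a new MiniRDD."""
        with ThreadPoolExecutor() as pool:
            return MiniRDD(pool.map(lambda part: [func(x) for x in part], self._partitions))

    def filter(self, predicate):
        """Keep matching elements; the original MiniRDD is left untouched."""
        return MiniRDD([x for x in part if predicate(x)] for part in self._partitions)

    def collect(self):
        return [x for part in self._partitions for x in part]


numbers = MiniRDD.from_list(range(10), num_partitions=3)
squares_of_even = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
```

Each transformation returns a new object instead of changing the existing one, which is what lets a cluster recompute a lost partition from its inputs.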
Spark Streaming
✓ Takes data as it comes in and processes it in near real time
  ✓ Example: Internet of Things applications
✓ Breaks the stream down into individual parts called micro-batches
  ✓ Each micro-batch is processed as a small RDD
✓ Reliable: "checkpoints" store data to disk periodically for fault tolerance
✓ Windowing operations: compute results across a longer time period than the batch interval
  ✓ Example: top sales from the past 2 hours
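Micro-batching and windowing can be sketched without Spark. The example below groups a made-up stream of sales events into 10-second batches and reports the top seller over a sliding window of two batches; the event data, interval and function names are assumptions chosen for illustration.

```python
from collections import Counter, deque


def micro_batches(events, batch_interval):
    """Group (timestamp, product) events into consecutive batches of batch_interval seconds."""
    batches = {}
    for timestamp, product in events:
        batches.setdefault(timestamp // batch_interval, []).append(product)
    return [batches[key] for key in sorted(batches)]


def windowed_top_seller(batches, window_length):
    """After each batch, report the top product over the last window_length batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    results = []
    for batch in batches:
        window.append(Counter(batch))
        totals = sum(window, Counter())
        results.append(totals.most_common(1)[0])
    return results


# Hypothetical sales stream: (seconds since start, product sold)
events = [(1, "tea"), (2, "tea"), (11, "coffee"), (12, "coffee"),
          (13, "coffee"), (21, "tea"), (31, "milk"), (32, "milk")]
batches = micro_batches(events, batch_interval=10)
top_per_window = windowed_top_seller(batches, window_length=2)
```

The window spans more time than a single batch, so a product can stay on top for a while after its own batch has passed.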
Spark Demo
But…
What if I want to integrate Big Data with my other systems?
Integration Challenge
RDBMS
Hadoop
NOSQL
Website
Kafka
✓ Distributed streaming platform
✓ Decouples data streams
✓ Fault-tolerant
✓ High performance
✓ Horizontally scalable
Diagram labels: RDBMS, Hadoop, NoSQL, Website
How Kafka Works: Kafka Core
Diagram: Source Systems (RDBMS, NoSQL, Website, Apps) → Producers → Kafka → Consumers → Target Systems (Hadoop, RDBMS, NoSQL, Analytic Tools)
How Kafka Works: Extended API
Diagram: Source Systems (RDBMS, NoSQL, Website, Apps) → Kafka Connect Source → Kafka → Kafka Connect Sink → Target Systems (Hadoop, RDBMS, NoSQL, Analytic Tools)
How Kafka Works: Topics & Partitions
✓ Messages are stored in topics
  ✓ A concept similar to a database table
✓ Topics
  ✓ Are identified by a unique name
  ✓ Are split into partitions (for redundancy and performance)
✓ Partitions
  ✓ Each partition is ordered
  ✓ When a message arrives at a partition, it is assigned an ID called the offset
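The relationship between topics, partitions and offsets can be sketched as a list of append-only logs. The `Topic` class below is hypothetical and runs in memory only. Routing by a CRC32 hash of the key is an assumption made for the sketch; Kafka's own partitioner uses a different hash.

```python
import zlib


class Topic:
    """Toy topic: a named set of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Store a message and return (partition, offset)."""
        # Assumption: route by a hash of the key so equal keys share a partition
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1  # the offset is the position in the log


orders = Topic("orders", num_partitions=3)
first = orders.append("customer-42", "order created")
second = orders.append("customer-42", "order paid")
```

Offsets are meaningful only inside one partition: ordering is guaranteed per partition, not across the whole topic.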
How Kafka Works: Brokers
✓ Brokers = servers in a Kafka cluster
✓ Are identified by an ID number
✓ Contain topic partitions
✓ At least 3 brokers per cluster are recommended
Diagram: Broker 1, Broker 2 and Broker 3 hosting Topic 2 (Partitions 1 and 2) and Topic 3 (Partitions 0 and 1)
How Kafka Works: Replication Factor
✓ A topic should have a replication factor > 1 (usually 2 or 3) to avoid losing data
Diagram: Topic 3 Partition 0 and Topic 2 Partition 1 each appear on two of the three brokers (Broker 1, Broker 2, Broker 3)
Diagram (Broker 1 no longer shown): Topic 3 Partition 0 and Topic 2 Partition 1 remain available on Broker 2 and Broker 3
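Why a replication factor above 1 matters can be checked with a small simulation. The round-robin placement below is a simplification chosen for this sketch, not Kafka's actual assignment algorithm, and the partition and broker names are illustrative.

```python
def assign_replicas(partitions, brokers, replication_factor):
    """Place each partition on replication_factor distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    placement = {}
    for index, partition in enumerate(partitions):
        placement[partition] = [
            brokers[(index + offset) % len(brokers)]
            for offset in range(replication_factor)
        ]
    return placement


def available_partitions(placement, failed_brokers):
    """Partitions that still have at least one replica on a live broker."""
    return sorted(
        partition for partition, replicas in placement.items()
        if any(broker not in failed_brokers for broker in replicas)
    )


partitions = ["topic2-p1", "topic3-p0"]
brokers = ["broker1", "broker2", "broker3"]
replicated = assign_replicas(partitions, brokers, replication_factor=2)
unreplicated = assign_replicas(partitions, brokers, replication_factor=1)
```

With two replicas, losing `broker1` leaves both partitions readable; with a single replica, the partition that lived on `broker1` becomes unavailable.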
How Kafka Works: Producers
✓ Producers write data to topics
✓ Can choose the type of ACK expected from the partition
  ✓ ACK=0 (no ack)
  ✓ ACK=1 (partition leader only)
  ✓ ACK=All (all the replicas)
Diagram: a Producer writing to Topic 1, with Partition 0 and Partition 1 hosted on Broker 1 and Broker 2; each partition has its own offset sequence (0 to 7 and 0 to 4 in the figure)
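The three acknowledgement levels trade latency for durability. The helper functions below are hypothetical and only model which replicas must confirm a write before the producer treats it as sent; in Kafka itself, "all" refers to the replicas that are currently in sync.

```python
def required_acks(acks, leader, followers):
    """Replicas that must confirm a write before the producer treats it as sent."""
    if acks == "0":
        return []                           # fire and forget
    if acks == "1":
        return [leader]                     # partition leader only
    if acks == "all":
        return [leader] + list(followers)   # leader plus every replica
    raise ValueError(f"unsupported acks value: {acks!r}")


def survives_leader_loss(acks, followers):
    """Is a confirmed write guaranteed to exist on another replica if the leader dies?"""
    return acks == "all" and len(followers) > 0


leader, followers = "broker1", ["broker2", "broker3"]
waits = {level: required_acks(level, leader, followers) for level in ("0", "1", "all")}
```

Waiting for fewer confirmations makes sends faster, at the cost of possibly losing messages that only the leader had stored.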
How Kafka Works: Consumers
✓ Consumers read data from a topic
✓ A consumer reads
  ✓ In order within each partition
  ✓ In parallel across partitions
Diagram: a Consumer reading from Topic 1, with Partition 0 and Partition 1 hosted on Broker 1 and Broker 2; each partition has its own offset sequence (0 to 7 and 0 to 4 in the figure)
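A consumer's bookkeeping boils down to remembering the next offset per partition. The toy `Consumer` below is a hypothetical in-memory model: it visits partitions one after another rather than in parallel, and the topic contents are invented.

```python
class Consumer:
    """Toy consumer that remembers the next offset to read for each partition."""

    def __init__(self, partitions):
        self.partitions = partitions
        self.next_offset = [0] * len(partitions)

    def poll(self, max_records=2):
        """Read up to max_records new messages from every partition."""
        records = []
        for index, log in enumerate(self.partitions):
            start = self.next_offset[index]
            chunk = log[start:start + max_records]
            records.extend((index, start + i, value) for i, value in enumerate(chunk))
            self.next_offset[index] = start + len(chunk)
        return records


# Hypothetical topic with two partitions
topic = [["a0", "a1", "a2"], ["b0"]]
consumer = Consumer(topic)
first_poll = consumer.poll()
second_poll = consumer.poll()
```

Within a partition the messages always come back in offset order, while nothing is promised about the relative order of messages from different partitions.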
Kafka Demo
Want to install these tools?
✓ Hadoop
  ✓ https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
  ✓ https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
✓ Hive
  ✓ https://www.tutorialspoint.com/hive/hive_installation.htm
✓ Spark
  ✓ https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
✓ Kafka
  ✓ https://www.tutorialspoint.com/apache_kafka/apache_kafka_installation_steps.htm
  ✓ https://kafka.apache.org/quickstart
Want to play with these tools?
✓ Oracle pre-built VM: Big Data Lite
  ✓ http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
✓ Cloudera Quickstart VMs
  ✓ https://www.cloudera.com/downloads/quickstart_vms/5-12.html
✓ Apache Kafka Docker container
  ✓ https://github.com/Landoop/fast-data-dev
Questions?
52 ITC CORPORATE PRESENTATION © IT Convergence 2017 • All rights reserved
53 ITC CORPORATE PRESENTATION © IT Convergence 2017 • All rights reserved

Mais conteúdo relacionado

Mais procurados

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
Apache Phoenix: Transforming HBase into a SQL Database
Apache Phoenix: Transforming HBase into a SQL DatabaseApache Phoenix: Transforming HBase into a SQL Database
Apache Phoenix: Transforming HBase into a SQL Database
DataWorks Summit
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Simplilearn
 

Mais procurados (20)

Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Big data-cheat-sheet
Big data-cheat-sheetBig data-cheat-sheet
Big data-cheat-sheet
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
How Impala Works
How Impala WorksHow Impala Works
How Impala Works
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hive
HiveHive
Hive
 
What is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | EdurekaWhat is HDFS | Hadoop Distributed File System | Edureka
What is HDFS | Hadoop Distributed File System | Edureka
 
Apache Phoenix: Transforming HBase into a SQL Database
Apache Phoenix: Transforming HBase into a SQL DatabaseApache Phoenix: Transforming HBase into a SQL Database
Apache Phoenix: Transforming HBase into a SQL Database
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
 
Sqoop
SqoopSqoop
Sqoop
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 

Semelhante a Getting started with Hadoop, Hive, Spark and Kafka

Semelhante a Getting started with Hadoop, Hive, Spark and Kafka (20)

The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersThe Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
 
There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9
 
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...
 
Real World Modern Development Use Cases with RackHD and Adobe
Real World Modern Development Use Cases with RackHD and AdobeReal World Modern Development Use Cases with RackHD and Adobe
Real World Modern Development Use Cases with RackHD and Adobe
 
Big Data Best Practices on GCP
Big Data Best Practices on GCPBig Data Best Practices on GCP
Big Data Best Practices on GCP
 
Big Data Best Practices on GCP
Big Data Best Practices on GCPBig Data Best Practices on GCP
Big Data Best Practices on GCP
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
 
OpenStack Days Krakow
OpenStack Days KrakowOpenStack Days Krakow
OpenStack Days Krakow
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
CISCO - Presentation at Hortonworks Booth - Strata 2014
CISCO - Presentation at Hortonworks Booth - Strata 2014CISCO - Presentation at Hortonworks Booth - Strata 2014
CISCO - Presentation at Hortonworks Booth - Strata 2014
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Big data overview by Edgars
Big data overview by EdgarsBig data overview by Edgars
Big data overview by Edgars
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFW
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study Material
 
Big Data Infrastructure
Big Data InfrastructureBig Data Infrastructure
Big Data Infrastructure
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
Managing ScaleIO as Software on Mesos
Managing ScaleIO as Software on MesosManaging ScaleIO as Software on Mesos
Managing ScaleIO as Software on Mesos
 

Mais de Edelweiss Kammermann

Mais de Edelweiss Kammermann (15)

AWDC para desarrolladores y data scientists
AWDC para desarrolladores y data scientists AWDC para desarrolladores y data scientists
AWDC para desarrolladores y data scientists
 
Oracle Autonomous Data Warehouse Cloud and Data Visualization
Oracle Autonomous Data Warehouse Cloud and Data VisualizationOracle Autonomous Data Warehouse Cloud and Data Visualization
Oracle Autonomous Data Warehouse Cloud and Data Visualization
 
Working with Oracle Big Data Cloud Compute Edition and Apache Zeppelin
Working with Oracle Big Data Cloud Compute Edition and Apache ZeppelinWorking with Oracle Big Data Cloud Compute Edition and Apache Zeppelin
Working with Oracle Big Data Cloud Compute Edition and Apache Zeppelin
 
Moving OBIEE to Oracle Analytics Cloud
Moving OBIEE to Oracle Analytics CloudMoving OBIEE to Oracle Analytics Cloud
Moving OBIEE to Oracle Analytics Cloud
 
Como elegir entre BI Cloud, Data Visualization and Oracle Analytics Cloud Ser...
Como elegir entre BI Cloud, Data Visualization and Oracle Analytics Cloud Ser...Como elegir entre BI Cloud, Data Visualization and Oracle Analytics Cloud Ser...
Como elegir entre BI Cloud, Data Visualization and Oracle Analytics Cloud Ser...
 
Oracle Analytics Cloud lo nuevo de Oracle BI en la nube
Oracle Analytics Cloud  lo nuevo de Oracle BI en la nubeOracle Analytics Cloud  lo nuevo de Oracle BI en la nube
Oracle Analytics Cloud lo nuevo de Oracle BI en la nube
 
Data Visualization Tips for Oracle BICS and DVCS
Data Visualization Tips for Oracle BICS and DVCSData Visualization Tips for Oracle BICS and DVCS
Data Visualization Tips for Oracle BICS and DVCS
 
Empowering Business Users: OBIEE 12c Visual Analyzer and Data Mashup
Empowering Business Users: OBIEE 12c Visual Analyzer and Data MashupEmpowering Business Users: OBIEE 12c Visual Analyzer and Data Mashup
Empowering Business Users: OBIEE 12c Visual Analyzer and Data Mashup
 
Integrating Oracle Data Integrator with Oracle GoldenGate 12c
Integrating Oracle Data Integrator with Oracle GoldenGate 12cIntegrating Oracle Data Integrator with Oracle GoldenGate 12c
Integrating Oracle Data Integrator with Oracle GoldenGate 12c
 
Integración de Oracle Data Integrator con Oracle GoldenGate 12c
Integración de Oracle Data Integrator  con Oracle GoldenGate 12cIntegración de Oracle Data Integrator  con Oracle GoldenGate 12c
Integración de Oracle Data Integrator con Oracle GoldenGate 12c
 
OBIEE 11.1.1.7: Upgrade y Nuevas Características
OBIEE 11.1.1.7: Upgrade y Nuevas CaracterísticasOBIEE 11.1.1.7: Upgrade y Nuevas Características
OBIEE 11.1.1.7: Upgrade y Nuevas Características
 
Integrando Oracle BI, BPM y BAM 11g: El ciclo completo de la información
Integrando Oracle BI, BPM y BAM 11g:  El ciclo  completo de la informaciónIntegrando Oracle BI, BPM y BAM 11g:  El ciclo  completo de la información
Integrando Oracle BI, BPM y BAM 11g: El ciclo completo de la información
 
Integrating Oracle BI, BPM and BAM 11g: The complete cycle of information
Integrating Oracle BI, BPM and BAM 11g: The complete cycle of informationIntegrating Oracle BI, BPM and BAM 11g: The complete cycle of information
Integrating Oracle BI, BPM and BAM 11g: The complete cycle of information
 
Bi Publisher 11g: Only good news
Bi Publisher 11g: Only good newsBi Publisher 11g: Only good news
Bi Publisher 11g: Only good news
 
OBI11g: la versión mas esperada
OBI11g: la versión mas esperadaOBI11g: la versión mas esperada
OBI11g: la versión mas esperada
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Getting started with Hadoop, Hive, Spark and Kafka

  • 1. 1 ITC CORPORATE PRESENTATION © IT Convergence 2017 • All rights reserved A Presentation for: Getting Started with Hadoop, Spark, Hive and Kafka Edelweiss Kammermann New York March 8th 2018
  • 2. IT Convergence Snapshot
  • 3. Extensive Expertise Across the Globe: over 600 customer engagements in more than 50 countries
  • 4. About me
      • Computer Engineer, BI and Data Integration Specialist
      • Over 20 years of consulting and project management experience in Oracle technology
      • Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG)
      • Vice President of LAOUC (Latin America Oracle User Community)
      • BI Manager at IT Convergence
      • Writer and frequent speaker at international conferences: Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, etc.
      • Oracle ACE Director
      • Oracle Big Data Implementation Specialist
  • 5. Uruguay
  • 6. Oracle ACE Program: 500+ technical experts helping peers globally (bit.ly/OracleACEProgram)
      • 3 membership tiers: Oracle ACE Director, Oracle ACE, Oracle ACE Associate
      • Nominate yourself or someone you know: acenomination.oracle.com
      • Connect: @oracleace, Facebook.com/oracleaces, oracle-ace_ww@oracle.com
  • 7. Index
      • What is Big Data?
      • Hadoop
      • Hive
      • Spark
  • 8. What is Big Data?
      • Volume: large amounts of data
      • Variety: different data types and formats, including unstructured and semi-structured data
      • Velocity: the speed at which data is created and/or consumed
      • Veracity: the quality and accuracy of the data
      • Value: data has intrinsic value, but it must be discovered
  • 10. Hadoop
      • An open source software platform for distributed storage and processing
      • Manages huge volumes of unstructured data
      • Parallel processing of large data sets
      • Highly scalable
      • Fault-tolerant
      • Two main components:
          • HDFS: Hadoop Distributed File System, for storing information
          • MapReduce: programming framework that processes information
  • 11. HDFS Architecture (Simplified)
      • Client: issues requests such as reading or writing data
      • NameNode: manages metadata and access control; knows where the data is (which DataNodes contain the blocks of each file) and keeps this information in memory
      • DataNodes: store and retrieve data (blocks) on client request
  • 12. HDFS: Writing Data (Client, NameNode, DataNodes)
      • Step 1: the client divides the file into fixed-size blocks (usually 64 or 128 MB)
      • Step 2: for each block, the client asks the NameNode which DataNodes it can write to, specifying the block size and replication factor
      • Step 3: for each block, the NameNode provides the DataNode addresses, sorted by increasing distance
  • 13. HDFS: Writing Data (continued)
      • Step 4: the client sends the block data and the list of nodes to the first DataNode
      • Step 5: each DataNode sends the data on to the following DataNode (replication pipeline)
      • Step 6: each DataNode sends Done to the NameNode once the block data is written to hard disk
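
The write path on the two "HDFS: Writing Data" slides can be sketched in a few lines of plain Python. This is only an in-memory illustration, not the HDFS client API: the names `split_into_blocks`, `choose_pipeline` and `write_block`, the node names and the distance values are all hypothetical, and a real NameNode uses rack-aware placement rather than a simple sort.

```python
from dataclasses import dataclass
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the two sizes the slide mentions


@dataclass(frozen=True)
class Block:
    index: int
    offset: int
    length: int


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Block]:
    """Step 1: divide the file into fixed-size blocks (the last one may be shorter)."""
    blocks: List[Block] = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append(Block(len(blocks), offset, length))
        offset += length
    return blocks


def choose_pipeline(distances: Dict[str, int], replication: int) -> List[str]:
    """Steps 2-3: stand-in for the NameNode's answer, nearest DataNodes first.

    `distances` is a hypothetical node -> distance map; ties are broken by name.
    """
    ranked = sorted(distances, key=lambda node: (distances[node], node))
    return ranked[:replication]


def write_block(block: Block, pipeline: List[str], storage: Dict[str, List[int]]) -> List[str]:
    """Steps 4-6: each node in the pipeline stores the block, then reports Done."""
    done: List[str] = []
    for node in pipeline:  # the data travels from one node to the next
        storage.setdefault(node, []).append(block.index)
        done.append(node)
    return done
```

With these assumptions, a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and each block is written along its own pipeline of `replication` nodes.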
  • 14. HDFS: Reading Data (Client, NameNode, DataNode)
      • Step 1: the client asks the NameNode for a specific file
      • Step 2: the NameNode sends the list of blocks of the file and the list of DataNodes for each block
      • Step 3: the client downloads the data from the nearest DataNode (sending the block number)
      • Step 4: the DataNode sends the data for the required block
  • 15. HDFS: Fault Tolerance
      • Node failure
          • DataNodes send a heartbeat every 3 seconds
          • If the NameNode doesn't receive a heartbeat for 10 minutes, it considers that node dead
      • Communication failure
          • Detected when the sender does not receive an ACK from the DataNode after several retries
      • Data corruption
          • DataNodes send block reports to the NameNode that exclude corrupted blocks (checksum validation)
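
The node-failure rule above amounts to a timestamp check. The sketch below is a minimal illustration under that reading: `HeartbeatMonitor` is a hypothetical class, not part of Hadoop, and the two constants simply reuse the figures from the slide.

```python
HEARTBEAT_INTERVAL_S = 3   # DataNodes send a heartbeat every 3 seconds (from the slide)
DEAD_AFTER_S = 10 * 60     # silence of 10 minutes marks a node dead (from the slide)


class HeartbeatMonitor:
    """Hypothetical NameNode-side bookkeeping of the last heartbeat per DataNode."""

    def __init__(self, dead_after_s: int = DEAD_AFTER_S) -> None:
        self.dead_after_s = dead_after_s
        self.last_seen = {}  # node name -> time of last heartbeat, in seconds

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> list:
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.dead_after_s
        )
```

Passing `now` explicitly instead of reading the system clock keeps the sketch deterministic; a node that last reported at second 0 is listed as dead from second 601 onwards.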
  • 16. HDFS: High Availability
      • Standby NameNode (active-standby configuration)
      • NameNodes use shared storage
      • DataNodes send block reports to both NameNodes
      • Diagram: Active NameNode and Passive NameNode, both attached to Shared Storage
  • 17. HDFS: Command Examples
      • hadoop fs -ls
      • hadoop fs -put <local_path> <hdfs_path>
      • hadoop fs -get <hdfs_path> <local_path>
      • hadoop fs -cat <hdfs_path>
      • hadoop fs -rm -r <hdfs_path> (replaces the deprecated -rmr)
      • hadoop fs -copyFromLocal <local_path> <hdfs_path>
  • 18. MapReduce
      • Processes data from HDFS
      • A MapReduce program is composed of:
          • Map() method: performs filtering and sorting of the <key, value> inputs
          • Reduce() method: summarizes the <key, value> pairs provided by the Mappers
      • Code can be written in many languages (Perl, Python, Java, etc.)
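
The Map() and Reduce() roles can be illustrated with the classic word count, written here in plain Python rather than against the Hadoop API. The function names are hypothetical, and the `shuffle` step stands in for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list) -> dict:
    """Group all values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: summarize all the values collected for one key."""
    return key, sum(values)


def word_count(lines: list) -> dict:
    """Run the three steps in sequence on a small in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))
```

On a cluster the mappers and reducers run in parallel on different nodes; here everything runs in one process so the data flow is easy to follow.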
  • 19. MapReduce Example
  • 20. MapReduce Code Example
  • 21. Hadoop Demo
  • 22. But… MapReduce has a steep learning curve. How can we analyze Big Data with a familiar language?
  • 23. Hive
      • Open source data warehouse software on top of Apache Hadoop
      • Analyzes and queries data stored in HDFS
      • Structures the data into tables
      • Tools for simple ETL
      • SQL-like queries (HiveQL)
      • Uses MapReduce as its execution engine
      • Metadata is stored in an RDBMS
  • 24. HiveQL
      • UPDATE, INSERT, DELETE
      • Limited transaction support
      • Indexes supported
      • Multi-table insert support
      • SQL-92 join support
      • Read-only views
  • 25. Hive: Code Example
      SELECT [ALL | DISTINCT] select_expr, select_expr, ...
      FROM table_reference
      [WHERE where_condition]
      [GROUP BY col_list]
      [HAVING having_condition]
      [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
      [LIMIT number];
  • 26. Hive: Pros & Cons
      • Pros
          • Familiarity with SQL
          • Interactive
          • Connection through JDBC/ODBC drivers
      • Cons
          • High latency
          • No query cache
          • Only supports equi-joins
  • 27. Hive Demo
  • 28. But… Hive has high latency. What if I want better performance and to analyze real-time data?
  • 29. Spark
      • Apache Spark is a fast, in-memory data processing engine
      • Provides native bindings for Java, Scala, Python and R
      • Supports SQL, streaming data, machine learning and graph processing
      • Can run standalone, on Hadoop, or on Apache Mesos
  • 30. Spark vs MapReduce
      • Main advantages of Spark over MapReduce:
          • Speed
              • Can perform tasks up to 100 times faster if all the data fits in memory
              • Otherwise it can still be more than 10 times faster
          • Spark API (developer friendly)
  • 31. Spark: Code Example
      val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")
      val counts = textFile.flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs:///tmp/words_agg")
  • 32. Spark: Components
      • Spark Core
      • Spark Streaming
      • Spark SQL
      • MLlib
      • GraphX
  • 33. Spark: Resilient Distributed Dataset (RDD)
      • A programming abstraction representing a collection of objects
      • Immutable (cannot be modified)
      • Can be split across a computing cluster
      • Can be created from text files, SQL databases, NoSQL databases (Cassandra, MongoDB, etc.)
      • Operations on RDDs can be split across the cluster and executed as a parallel batch process
      • Fast and scalable parallel processing
  • 34. Spark Streaming
      • Takes the data as it comes in and processes it in near real time
          • Example: Internet of Things applications
      • Breaks the stream down into individual parts called micro-batches
          • Processed together as small RDDs
      • Reliable: "checkpoints" store data to disk periodically for fault tolerance
      • Windowing operations: compute results across a longer time period than your batch interval
          • Example: top sales from the past 2 hours
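
Micro-batching and windowing can be pictured without Spark at all. The following plain-Python sketch is an illustration only: the function names and the `(timestamp, key, amount)` event shape are assumptions, timestamps are taken to be non-negative integers in seconds, and the window is expressed as a number of batches rather than as a duration.

```python
from collections import Counter, deque


def micro_batches(events: list, batch_interval: int) -> list:
    """Cut a stream of (timestamp, key, amount) events into consecutive batches.

    Batch i holds the events with batch_interval * i <= timestamp < batch_interval * (i + 1);
    intervals without events yield empty batches.
    """
    if not events:
        return []
    last_batch = max(ts for ts, _, _ in events) // batch_interval
    batches = [[] for _ in range(last_batch + 1)]
    for ts, key, amount in events:
        batches[ts // batch_interval].append((key, amount))
    return batches


def windowed_totals(batches: list, window_batches: int) -> list:
    """After each batch, total the amounts per key over the last `window_batches` batches."""
    window = deque(maxlen=window_batches)  # the oldest batch drops out automatically
    results = []
    for batch in batches:
        window.append(batch)
        totals = Counter()
        for recent in window:
            for key, amount in recent:
                totals[key] += amount
        results.append(dict(totals))
    return results
```

The "top sales from the past 2 hours" example corresponds to a window that spans many batch intervals: each new batch updates the result while batches older than the window are discarded.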
  • 35. Spark Demo
  • 36. But… What if I want to integrate Big Data with my other systems?
  • 37. Integration Challenge
      • Diagram: RDBMS, Hadoop, NoSQL and Website, each needing to exchange data with the others
  • 38. Kafka
      • Diagram: Kafka sits between RDBMS, Hadoop, NoSQL and Website
      • Distributed streaming platform
      • Decouples data streams
      • Fault-tolerant
      • High performance
      • Horizontally scalable
  • 39. How Kafka works: Kafka Core
      • Source systems (RDBMS, NoSQL, Website, Apps) write to Kafka through Producers
      • Target systems (Hadoop, RDBMS, NoSQL, Analytic Tools) read from Kafka through Consumers
  • 40. How Kafka works: Extended API
      • Kafka Connect Source brings in data from the source systems (RDBMS, NoSQL, Website, Apps)
      • Kafka Connect Sink delivers data to the target systems (Hadoop, RDBMS, NoSQL, Analytic Tools)
  • 41. How Kafka works: Topics & Partitions
      • Messages are stored in Topics
          • A similar concept to a database table
      • Topics
          • Are identified by a unique name
          • Are split into Partitions (for redundancy and performance)
      • Partitions
          • Each partition is ordered
          • When a message arrives at a partition, it is assigned an id: the Offset
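
A topic can be pictured as a set of append-only lists, one per partition, where the offset is simply a message's position in its list. The sketch below is an in-memory illustration, not the Kafka client API: the `Topic` class is hypothetical, and choosing the partition with a CRC32 hash of the key is an assumption made for the example (the slide does not say how a partition is chosen).

```python
import zlib


class Topic:
    """Hypothetical in-memory topic: a name plus a list of ordered partitions."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> tuple:
        """Store a message and return (partition, offset).

        Assumption: the partition is derived from a hash of the key, so
        messages with the same key always land in the same partition.
        """
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1  # the offset is the position in the log

    def read(self, partition: int, offset: int) -> list:
        """Return the messages of one partition starting at `offset`, in order."""
        return self.partitions[partition][offset:]
```

Ordering is only guaranteed inside a partition: two messages with the same key receive consecutive offsets in one log, while messages in different partitions have no defined order relative to each other.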
  • 42. How Kafka works: Brokers
      • Brokers = servers in a Kafka cluster
      • Are identified by an ID number
      • Contain topic partitions
      • At least 3 Brokers per cluster are recommended
      • Diagram: the partitions of Topic 2 and Topic 3 spread across Broker 1, Broker 2 and Broker 3
  • 43. How Kafka works: Replication Factor
      • A Topic should have a Replication Factor > 1 (usually 2 or 3) to avoid losing data
      • Diagram: Topic 3 Partition 0 and Topic 2 Partition 1 each stored twice across Broker 1, Broker 2 and Broker 3
  • 45. How Kafka works: Replication Factor (after a broker failure)
      • A Topic should have a Replication Factor > 1 (usually 2 or 3) to avoid losing data
      • Diagram: with Broker 1 gone, Topic 3 Partition 0 and Topic 2 Partition 1 are still available on Broker 2 and Broker 3
  • 46. How Kafka works: Producers
      • Producers write data into Topics
      • Can choose the type of ACK expected from the partition:
          • ACK=0 (no acknowledgement)
          • ACK=1 (partition leader only)
          • ACK=All (all the replicas)
      • Diagram: a Producer writing to Topic 1 Partition 0 (offsets 0 to 7) and Topic 1 Partition 1 (offsets 0 to 4), hosted on Broker 1 and Broker 2
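
The three ACK levels differ only in what the producer waits for before treating a write as successful. The sketch below models that with hypothetical `Replica` and `send` names; it is a simplification and not the Kafka producer API. In particular, a real cluster replicates asynchronously and counts only in-sync replicas for the "all" level, whereas here every listed follower must be up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """Hypothetical copy of one partition, hosted on one broker."""
    up: bool = True
    log: List[str] = field(default_factory=list)

    def append(self, value: str) -> bool:
        """Store the message if the broker is reachable; report success."""
        if self.up:
            self.log.append(value)
        return self.up


def send(value: str, leader: Replica, followers: List[Replica], acks: str) -> bool:
    """Return True when the producer considers the write acknowledged."""
    leader_ok = leader.append(value)
    # Followers only receive the message if the leader accepted it.
    followers_ok = [f.append(value) for f in followers] if leader_ok else []
    if acks == "0":    # no acknowledgement: fire and forget
        return True
    if acks == "1":    # wait for the partition leader only
        return leader_ok
    if acks == "all":  # wait for the leader and every replica
        return leader_ok and all(followers_ok)
    raise ValueError(f"unknown acks level: {acks!r}")
```

The trade-off is latency against durability: level 0 reports success even when nothing was stored, while "all" fails the write as soon as a single replica is unavailable in this simplified model.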
  • 47. How Kafka works: Consumers
      • Consumers read data from a Topic
      • A Consumer reads:
          • In order within each partition
          • In parallel across partitions
      • Diagram: a Consumer reading from Topic 1 Partition 0 (offsets 0 to 7) and Topic 1 Partition 1 (offsets 0 to 4), hosted on Broker 1 and Broker 2
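
Reading "in order within each partition" comes down to remembering one offset per partition and always continuing from there. The `poll` function below is a hypothetical, single-process sketch of that bookkeeping, not the Kafka consumer API; partitions are plain lists and `max_records` is an illustrative parameter.

```python
def poll(partitions: list, positions: dict, max_records: int = 2) -> list:
    """Fetch up to `max_records` unread messages per partition.

    `positions` maps partition number -> next offset to read and is advanced
    in place, so a later call continues where this one stopped.
    Returns (partition, offset, message) tuples.
    """
    fetched = []
    for partition, log in enumerate(partitions):
        start = positions.get(partition, 0)
        records = log[start:start + max_records]
        fetched.extend(
            (partition, start + i, message) for i, message in enumerate(records)
        )
        positions[partition] = start + len(records)
    return fetched
```

Because each partition keeps its own position, the partitions could just as well be read by separate workers in parallel; the order of messages inside any single partition is preserved either way.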
  • 48. Kafka Demo
  • 49. Want to install those tools?
      • Hadoop
          • https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
          • https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
      • Hive
          • https://www.tutorialspoint.com/hive/hive_installation.htm
      • Spark
          • https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
      • Kafka
          • https://www.tutorialspoint.com/apache_kafka/apache_kafka_installation_steps.htm
          • https://kafka.apache.org/quickstart
  • 50. Want to play with those tools?
      • Oracle pre-built VM: Big Data Lite
          • http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
      • Cloudera Quickstart VMs
          • https://www.cloudera.com/downloads/quickstart_vms/5-12.html
      • Apache Kafka Docker container
          • https://github.com/Landoop/fast-data-dev
  • 51. Questions?