Getting started with Hadoop, Hive, Spark and Kafka
1. ITC CORPORATE PRESENTATION © IT Convergence 2017 • All rights reserved
A Presentation for:
Getting Started with Hadoop, Spark, Hive and Kafka
Edelweiss Kammermann
New York
March 8th 2018
2. IT CONVERGENCE SNAPSHOT
3. EXTENSIVE EXPERTISE ACROSS THE GLOBE
Over 600 Customer Engagements In More Than 50 Countries
4. About me
• Computer Engineer, BI and Data Integration Specialist
• Over 20 years of consulting and project management experience in Oracle technology
• Co-founder and Vice President of Uruguayan Oracle User Group (UYOUG)
• Vice President of LAOUC (Latin America Oracle User Community)
• BI Manager at ITConvergence
• Writer and frequent speaker at international conferences: Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, etc.
• Oracle ACE Director
• Oracle Big Data Implementation Specialist
5. Uruguay
6. 500+ Technical Experts Helping Peers Globally
bit.ly/OracleACEProgram
3 Membership Tiers
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate
Nominate yourself or someone you know: acenomination.oracle.com
Connect:
• @oracleace
• Facebook.com/oracleaces
• oracle-ace_ww@oracle.com
7. Index
What is Big Data?
Hadoop
Hive
Spark
8. What is Big Data?
• Volume: large amounts of data
• Variety: different data types and formats, including unstructured and semi-structured data
• Velocity: the speed at which data is created and/or consumed
• Veracity: the quality and accuracy of the data
• Value: data has intrinsic value, but it must be discovered
10. Hadoop
• An open source software platform for distributed storage and processing
• Manages huge volumes of unstructured data
• Parallel processing of large data sets
• Highly scalable
• Fault-tolerant
• Two main components:
  • HDFS: Hadoop Distributed File System, for storing information
  • MapReduce: programming framework that processes information
11. HDFS Architecture (Simplified)
• Client: requests operations such as reading or writing data
• NameNode: manages metadata and access control. It knows where the data is (which DataNodes contain the blocks of each file) and keeps this information in memory
• DataNodes: store and retrieve data (blocks) on client request
12. HDFS: Writing Data
[Diagram: Client, NameNode and DataNodes, with numbered steps]
• Client: divides the file into fixed-size blocks (usually 64 or 128 MB)
• Client: for each block, asks the NameNode which DataNodes it can write to, specifying block size and replication factor
• NameNode: for each block, provides the DataNode addresses, sorted by increasing distance
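The first part of the write path can be sketched in plain Python. This is a toy model, not the HDFS client API: the function names, the node names and the distance values are hypothetical, and the real NameNode placement policy is rack-aware rather than a simple nearest-first sort.

```python
from __future__ import annotations

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the sizes quoted on the slide


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size in bytes of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def choose_datanodes(distances: dict[str, int], replication: int) -> list[str]:
    """Toy NameNode answer: the `replication` closest DataNodes,
    sorted by increasing distance (ties broken by node name)."""
    ranked = sorted(distances, key=lambda node: (distances[node], node))
    return ranked[:replication]


# A 300 MB file becomes two full blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)

# Hypothetical distances from the client to four DataNodes.
targets = choose_datanodes({"dn1": 4, "dn2": 0, "dn3": 2, "dn4": 2}, replication=3)
```

Only the last block of a file is smaller than the block size, which is why the sketch appends the remainder as a separate entry.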
13. HDFS: Writing Data
[Diagram: Client, NameNode and DataNodes, with numbered steps and the replication pipeline]
• Client: sends the block data and the list of nodes to the first DataNode
• Replication pipeline: each DataNode sends the data on to the following DataNode
• Each DataNode sends Done to the NameNode once the block data is written to hard disk
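The replication pipeline can be illustrated with a short recursive function. This is a minimal sketch under simplifying assumptions: `write_block`, the node names and the in-memory `disks` dictionary are hypothetical, and real DataNodes stream packets concurrently rather than forwarding a whole block at once.

```python
from __future__ import annotations


def write_block(data: bytes, pipeline: list[str], disks: dict[str, list[bytes]]) -> list[str]:
    """Write data on the first node, then forward it down the pipeline.

    Returns the nodes that reported Done, in the order they wrote the block.
    """
    if not pipeline:
        return []
    first, rest = pipeline[0], pipeline[1:]
    disks.setdefault(first, []).append(data)          # store on this node's "disk"
    return [first] + write_block(data, rest, disks)   # forward to the next node


disks: dict[str, list[bytes]] = {}
done = write_block(b"block-0", ["dn2", "dn3", "dn4"], disks)
```

The client only talks to the first DataNode; every further copy is made by the nodes themselves, which keeps the client's outgoing traffic constant regardless of the replication factor.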
14. HDFS: Reading Data
[Diagram: Client, NameNode and DataNode, with numbered steps]
• Client: asks the NameNode for a specific file
• NameNode: sends the list of blocks of the file and the list of DataNodes for each block
• Client: downloads the data from the nearest DataNode (sending the block number)
• DataNode: sends the data for the required block
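The read path can be sketched the same way. Everything named here is hypothetical (the metadata dictionary, the distance table and the in-memory block store); the sketch only shows the lookup-then-fetch-nearest logic.

```python
# Hypothetical NameNode metadata: file path -> list of (block id, DataNodes holding it).
block_locations = {
    "/tmp/words": [(0, ["dn1", "dn3"]), (1, ["dn2", "dn3"])],
}

# Hypothetical distance from the client to each DataNode.
distance = {"dn1": 2, "dn2": 4, "dn3": 0}

# Hypothetical DataNode storage: (node, block id) -> block contents.
storage = {
    ("dn1", 0): b"hello ", ("dn3", 0): b"hello ",
    ("dn2", 1): b"world", ("dn3", 1): b"world",
}


def read_file(path: str) -> bytes:
    """Ask for the block list, then fetch each block from the nearest DataNode."""
    chunks = []
    for block_id, nodes in block_locations[path]:        # answer from the NameNode
        nearest = min(nodes, key=distance.__getitem__)   # pick the closest replica
        chunks.append(storage[(nearest, block_id)])      # download that block
    return b"".join(chunks)


content = read_file("/tmp/words")
```

Note that the NameNode never serves file data itself; it only answers the metadata question of where each block lives.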
15. HDFS: Fault Tolerance
• Node failure
  • DataNodes send a heartbeat every 3 seconds
  • If the NameNode receives no heartbeat from a node for 10 minutes, it considers that node dead
• Communication failure
  • Detected when the sender receives no ACK from a DataNode after several retries
• Data corruption
  • DataNodes send block reports to the NameNode that leave out corrupted blocks (detected by checksum validation)
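Two of these checks are easy to sketch. The heartbeat interval and timeout come from the slide; the function names and sample data are hypothetical, and `zlib.crc32` merely stands in for whatever checksum HDFS actually uses.

```python
from __future__ import annotations

import zlib

HEARTBEAT_INTERVAL = 3   # seconds between heartbeats (from the slide)
DEAD_AFTER = 10 * 60     # seconds of silence before a node is considered dead


def dead_nodes(last_heartbeat: dict[str, float], now: float) -> list[str]:
    """Nodes whose last heartbeat is older than the timeout."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > DEAD_AFTER)


def healthy_blocks(blocks: dict[int, bytes], checksums: dict[int, int]) -> list[int]:
    """Block ids whose contents still match the checksum recorded at write time."""
    return sorted(b for b, data in blocks.items() if zlib.crc32(data) == checksums[b])


# dn2 has been silent for 650 seconds, which exceeds the 600 second timeout.
dead = dead_nodes({"dn1": 1000.0, "dn2": 350.0}, now=1000.0)

# Block 1 was recorded as b"xyZ" but now reads b"xyz", so it is left out of the report.
recorded = {0: zlib.crc32(b"abc"), 1: zlib.crc32(b"xyZ")}
report = healthy_blocks({0: b"abc", 1: b"xyz"}, recorded)
```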
16. HDFS: High Availability
• Standby NameNode (active-standby configuration)
• NameNodes use shared storage
• DataNodes send block reports to both NameNodes
[Diagram: Active NameNode and Passive NameNode, both connected to Shared Storage]
17. HDFS: Command Examples
• hadoop fs -ls
• hadoop fs -put <local_path> <hdfs_path>
• hadoop fs -get <hdfs_path> <local_path>
• hadoop fs -cat <hdfs_path>
• hadoop fs -rm -r <hdfs_path>
• hadoop fs -copyFromLocal <local_path> <hdfs_path>
18. MapReduce
• Processes data from HDFS
• A MapReduce program is composed of
  • Map() method: performs filtering and sorting of the <key, value> inputs
  • Reduce() method: summarizes the <key, value> pairs provided by the Mappers
• Code can be written in many languages (Perl, Python, Java, etc.)
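The Map and Reduce roles can be illustrated with a word count written in plain Python. This is a single-process sketch of the programming model, not code that runs on a Hadoop cluster; the function names and the sample lines are hypothetical.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a <word, 1> pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: summarize the values collected for one key."""
    return key, sum(values)


lines = ["Hadoop stores data", "Spark processes data", "hadoop scales"]
mapped = [pair for line in lines for pair in map_fn(line)]
counts = dict(reduce_fn(k, vs) for k, vs in sorted(shuffle(mapped).items()))
```

On a cluster the map calls run in parallel on the nodes holding the input blocks, and the shuffle moves each key's values to the reducer responsible for it.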
19. MapReduce Example
20. MapReduce Code Example
21. Hadoop Demo
22. But…
MapReduce has a steep learning curve…
How can we analyze Big Data with a familiar language?
23. Hive
• Open source data warehouse software on top of Apache Hadoop
• Analyzes and queries data stored in HDFS
• Structures the data into tables
• Tools for simple ETL
• SQL-like queries (HiveQL)
• Metadata is stored in an RDBMS
• Uses MapReduce as its execution engine
24. HiveQL
• UPDATE, INSERT, DELETE
• Limited transaction support
• Indexes supported
• Multi-table insert support
• SQL-92 join support
• Read-only views
25. Hive: Code Example
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
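A concrete query built from this template may help. The sketch below runs it with Python's built-in `sqlite3` module rather than Hive, on the assumption that the clauses used (WHERE, GROUP BY, HAVING, LIMIT) behave the same way in both; the Hive-specific CLUSTER BY, DISTRIBUTE BY and SORT BY clauses are left out because SQLite does not have them. The `sales` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 80), ("north", 50), ("south", 40),
     ("east", 200), ("east", -5), ("south", 30)],
)

# Uses only clauses from the template that are also standard SQL.
QUERY = """
SELECT region, SUM(amount) AS total
FROM sales
WHERE amount > 0
GROUP BY region
HAVING SUM(amount) > 100
LIMIT 10
"""

# Sorted in Python so the result order does not depend on the engine.
rows = sorted(conn.execute(QUERY).fetchall())
```

In Hive the same statement would be compiled into jobs over files in HDFS instead of being evaluated against a local database.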
26. Hive: Pros & Cons
• Pros
  • Familiarity with SQL
  • Interactive
  • Connection through JDBC/ODBC drivers
• Cons
  • High latency
  • Doesn't have a query cache
  • Only supports equi-joins
27. Hive Demo
28. But…
Hive has high latency…
What if I want better performance and to analyze real-time data?
29. Spark
• Apache Spark is a fast, in-memory data processing engine
• Provides native bindings for Java, Scala, Python and R
• Supports SQL, streaming data, machine learning and graph processing
• Can run standalone, on Hadoop, or on Apache Mesos
30. Spark vs MapReduce
• Spark's main advantages over MapReduce
  • Speed
    • Can perform tasks up to 100 times faster if all the data fits in memory
    • Otherwise it can still be more than 10 times faster
  • Spark API (developer friendly)
31. Spark: Code Example
// Word count. Assumes an existing SparkSession named sparkSession.
val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")
// Split each line into words, pair each word with 1, then sum the counts per word.
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///tmp/words_agg")
32. Spark: Components
• Spark Core
• Spark Streaming
• Spark SQL
• MLlib
• GraphX
[Diagram: Spark Streaming, Spark SQL, MLlib and GraphX built on top of Spark Core]
33. Spark: Resilient Distributed Dataset (RDD)
• A programming abstraction that represents a collection of objects
• Cannot be modified (immutable)
• Can be split across a computing cluster
• Can be created from text files, SQL databases, NoSQL databases (Cassandra, MongoDB, etc.)
• Operations on RDDs
  • Can be split across the cluster and executed in a parallel batch process
  • Fast and scalable parallel processing
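The two key properties, immutability and partitioned parallel execution, can be mimicked with a tiny class. `MiniRDD` is a hypothetical teaching aid, not part of Spark: it uses local threads instead of cluster nodes and evaluates eagerly, whereas Spark transformations are lazy.

```python
from concurrent.futures import ThreadPoolExecutor


class MiniRDD:
    """A toy immutable, partitioned collection."""

    def __init__(self, partitions):
        # Tuples make the stored data read-only.
        self._partitions = tuple(tuple(p) for p in partitions)

    def map(self, fn):
        """Apply fn to every element, one task per partition, and return a NEW MiniRDD."""
        with ThreadPoolExecutor() as pool:
            new_parts = list(pool.map(lambda part: [fn(x) for x in part], self._partitions))
        return MiniRDD(new_parts)

    def collect(self):
        """Gather all partitions back into a single list."""
        return [x for part in self._partitions for x in part]


nums = MiniRDD([[1, 2], [3, 4], [5]])
squares = nums.map(lambda x: x * x)
```

Because `map` never changes `nums`, a lost partition of `squares` could always be recomputed from its parent, which is the idea behind the "resilient" in the name.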
34. Spark Streaming
• Takes data as it comes in and processes it in near real time
  • Example: Internet of Things applications
• Breaks the stream down into individual parts called micro-batches
  • Processed together as small RDDs
• Reliable: "checkpoints" store data to disk periodically for fault tolerance
• Windowing operations: compute results across a longer time period than the batch interval
  • Example: top sales from the past 2 hours
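Micro-batching and windowing can be sketched in a few lines of plain Python. The function names, the event tuples and the intervals are hypothetical, and the sketch ignores batch intervals in which no event arrived.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches, one per batch interval."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // batch_interval, []).append(value)
    return [batches[k] for k in sorted(batches)]


def windowed_sums(batches, window_len):
    """For each batch, sum the values of the last window_len batches (sliding window)."""
    return [
        sum(sum(b) for b in batches[max(0, i - window_len + 1): i + 1])
        for i in range(len(batches))
    ]


# Five sales events with timestamps in seconds, batched every 2 seconds.
events = [(0, 5), (1, 3), (2, 7), (3, 1), (5, 4)]
batches = micro_batches(events, batch_interval=2)

# Each result covers two batches, that is, a 4 second window.
sums = windowed_sums(batches, window_len=2)
```

The window is what lets a query such as "top sales from the past 2 hours" span many batch intervals while still being updated every batch.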
35. Spark Demo
36. But…
What if I want to integrate Big Data with my other systems?
37. Integration Challenge
[Diagram: RDBMS, Hadoop, NoSQL, Website]
38. Kafka
[Diagram: RDBMS, Hadoop, NoSQL, Website]
• Distributed streaming platform
• Decouples data streams
• Fault-tolerant
• High performance
• Horizontally scalable
39. How Kafka Works: Kafka Core
[Diagram: Source Systems (RDBMS, NoSQL, Website, Apps) → Producers → Kafka → Consumers → Target Systems (Hadoop, RDBMS, NoSQL, Analytic Tools)]
40. How Kafka Works: Extended API
[Diagram: Source Systems (RDBMS, NoSQL, Website, Apps) → Kafka Connect Source → Kafka → Kafka Connect Sink → Target Systems (Hadoop, RDBMS, NoSQL, Analytic Tools)]
41. How Kafka Works: Topics & Partitions
• Messages are stored in Topics
  • Similar in concept to a database table
• Topics
  • Are identified by a unique name
  • Are split into Partitions (for redundancy and performance)
• Partitions
  • Each partition is ordered
  • When a message arrives at a partition it is assigned an id, called its Offset
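A topic with partitions and offsets can be modelled with a list of lists. The `Topic` class below is a hypothetical in-memory sketch, not the Kafka client API, and `zlib.crc32` is only a stand-in for the hash that a real Kafka partitioner applies to the message key.

```python
import zlib


class Topic:
    """A toy topic: a named set of ordered, append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Store value in the partition chosen from the key; return (partition, offset)."""
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append(value)
        offset = len(self.partitions[partition]) - 1  # position within the partition
        return partition, offset


orders = Topic("orders", num_partitions=3)
first = orders.append("user-1", "order created")
second = orders.append("user-1", "order paid")
```

Messages with the same key land in the same partition, so their relative order is preserved; ordering across different partitions is not guaranteed.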
42. How Kafka Works: Brokers
• Brokers = servers in a Kafka cluster
• Are identified by an ID number
• Contain topic partitions
• Recommended to have at least 3 brokers in a cluster
[Diagram: partitions of Topic 2 and Topic 3 spread across Broker 1, Broker 2 and Broker 3]
43. How Kafka Works: Replication Factor
• A topic should have a replication factor > 1 (usually 2 or 3) to avoid losing data
[Diagram: Topic 3 Partition 0 and Topic 2 Partition 1, each stored twice across Broker 1, Broker 2 and Broker 3]
45. How Kafka Works: Replication Factor
[Diagram: only Broker 2 and Broker 3 remain; Topic 3 Partition 0 and Topic 2 Partition 1 are still available]
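The effect of the replication factor on a broker failure can be checked with a small simulation. The round-robin placement below is a hypothetical simplification, not Kafka's actual assignment algorithm, and the partition and broker names are illustrative.

```python
def assign_replicas(partitions, brokers, replication_factor):
    """Place each partition on `replication_factor` consecutive brokers (round-robin)."""
    return {
        part: [brokers[(i + r) % len(brokers)] for r in range(replication_factor)]
        for i, part in enumerate(partitions)
    }


def surviving_replicas(assignment, live_brokers):
    """For each partition, the replicas that sit on brokers that are still alive."""
    return {p: [b for b in reps if b in live_brokers] for p, reps in assignment.items()}


partitions = ["topic2-p1", "topic3-p0"]
brokers = [1, 2, 3]

# Replication factor 2: every partition survives the loss of Broker 1.
replicated = surviving_replicas(assign_replicas(partitions, brokers, 2), {2, 3})

# Replication factor 1: the partition that lived only on Broker 1 is lost.
unreplicated = surviving_replicas(assign_replicas(partitions, brokers, 1), {2, 3})
```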
46. How Kafka Works: Producers
• Producers write data into Topics
• Can choose the type of acknowledgment (ACK) required from the partition
  • ACK=0 (no ack)
  • ACK=1 (only partition leader)
  • ACK=All (all the replicas)
[Diagram: a Producer writing to Topic 1 on Broker 1 and Broker 2; Partition 0 holds offsets 0 to 7 and Partition 1 holds offsets 0 to 4]
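The three acknowledgment levels trade latency for durability, which a small decision function makes explicit. This is a hypothetical model of the rule, not the Kafka producer API; the real ACK=All setting waits for the in-sync replicas rather than for every replica unconditionally.

```python
def is_acknowledged(acks, leader_written, followers_written, follower_count):
    """Decide whether the producer treats a send as successful.

    acks: "0", "1" or "all", mirroring the three options on the slide.
    """
    if acks == "0":
        return True                      # fire and forget: never waits
    if acks == "1":
        return leader_written            # only the partition leader must have the data
    if acks == "all":
        return leader_written and followers_written == follower_count
    raise ValueError(f"unknown acks value: {acks!r}")


# The leader has the message, but only one of two followers has copied it so far.
state = dict(leader_written=True, followers_written=1, follower_count=2)
results = {acks: is_acknowledged(acks, **state) for acks in ("0", "1", "all")}
```

With ACK=0 a message can be lost without the producer noticing; with ACK=All the producer waits longer but the message survives the loss of the leader.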
47. How Kafka Works: Consumers
• Consumers read data from a Topic
• A consumer reads
  • In order within each partition
  • In parallel across partitions
[Diagram: a Consumer reading from Topic 1 on Broker 1 and Broker 2; Partition 0 holds offsets 0 to 7 and Partition 1 holds offsets 0 to 4]
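Reading in order within a partition comes down to remembering one offset per partition. The `Consumer` class below is a hypothetical in-memory sketch, not the Kafka consumer API; the partition contents mirror the offsets in the diagram (0 to 7 and 0 to 4).

```python
class Consumer:
    """A toy consumer that tracks its own read position in every partition."""

    def __init__(self, partitions):
        self.partitions = partitions                 # partition id -> list of messages
        self.offsets = {p: 0 for p in partitions}    # next offset to read, per partition

    def poll(self, partition, max_records=2):
        """Return the next records of one partition and advance its offset."""
        start = self.offsets[partition]
        records = self.partitions[partition][start:start + max_records]
        self.offsets[partition] += len(records)
        return records


consumer = Consumer({0: list("abcdefgh"), 1: list("vwxyz")})
batch_1 = consumer.poll(0)
batch_2 = consumer.poll(1)
batch_3 = consumer.poll(0)
```

Each partition advances independently, so several consumers can share the work by taking different partitions while order is still kept inside each one.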
48. Kafka Demo
49. Want to install those tools?
• Hadoop
  • https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
  • https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
• Hive
  • https://www.tutorialspoint.com/hive/hive_installation.htm
• Spark
  • https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
• Kafka
  • https://www.tutorialspoint.com/apache_kafka/apache_kafka_installation_steps.htm
  • https://kafka.apache.org/quickstart
50. Want to play with those tools?
• Oracle pre-built VM: Big Data Lite
  • http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
• Cloudera Quickstart VMs
  • https://www.cloudera.com/downloads/quickstart_vms/5-12.html
• Apache Kafka Docker container
  • https://github.com/Landoop/fast-data-dev
51. Questions?