https://www.learntek.org/big-data-and-hadoop-training/
Learntek is a global online training provider for Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing, and other IT and Management courses.
Big Data Hadoop Training
What is Hadoop?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a
distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes
of storage capacity. Its distributed file system facilitates rapid data transfer rates among nodes and allows the
system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic
system failure, even if a significant number of nodes become inoperative.
Why Hadoop?
Large Volumes of Data:
Hadoop can quickly store and process huge amounts of data of any variety (structured, unstructured, and semi-structured).
With data volumes and varieties constantly increasing, especially from social media and the Internet of Things
(IoT), that’s a key consideration.
Fault Tolerance:
Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically
redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored
automatically.
Flexibility:
Unlike a traditional relational database, you don’t have to preprocess data before storing it. You can store as much
data as you want and decide how to use it later. That includes unstructured data like text, images, and videos.
Low Cost:
The open-source framework is free and uses commodity hardware to store large quantities of data.
Scalability:
You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
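The Fault Tolerance point above can be illustrated with a small toy model. The sketch below is plain Python, not Hadoop code: the function names `place_blocks` and `readable_blocks` and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy). It only shows why keeping several copies of each block lets data stay readable when nodes fail.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Toy placement: put each block on `replication` distinct nodes, round-robin.

    This is an illustrative assumption, not the HDFS placement algorithm.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            nodes[(block + i) % len(nodes)] for i in range(replication)
        }
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {block for block, holders in placement.items() if holders - failed}


nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(num_blocks=10, nodes=nodes, replication=3)

# Two of the five nodes fail; with three replicas per block,
# every block still has at least one surviving copy.
survivors = readable_blocks(placement, failed={"node2", "node4"})
print(len(survivors))  # 10
```

With three replicas per block, any two nodes can fail without losing data in this model. Losing three nodes at once can make some blocks unreadable, which is why the replication factor is a tunable setting.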
Copyright @ 2018 Learntek. All Rights Reserved.
The following topics will be covered in our Big Data and Hadoop Online Training
Big Data Hadoop Training Topics :
Hadoop Introduction:
Big Data Hadoop Training : Introduction to Data and System
Types of Data
Traditional ways of dealing with large data and their problems
Types of Systems & Scaling
What is Big Data
Challenges in Big Data
Challenges in Traditional Application
New Requirements
What is Hadoop? Why Hadoop?
Brief history of Hadoop
Features of Hadoop
Hadoop and RDBMS
Hadoop Ecosystem overview
Hadoop Installation :
Installation in detail
Creating Ubuntu image in VMware
Downloading Hadoop
Installing SSH
Configuring Hadoop, HDFS & MapReduce
Download, Installation & Configuration Hive
Download, Installation & Configuration Pig
Download, Installation & Configuration Sqoop
Configuring Hadoop in Different Modes
Hadoop Distributed File System (HDFS) :
File System – Concepts
Blocks
Replication Factor
Version File
Safe mode
Namespace IDs
Purpose of Name Node
Purpose of Data Node
Purpose of Secondary Name Node
Purpose of Job Tracker
Purpose of Task Tracker
HDFS Shell Commands – copy, delete, create directories etc.
Reading and Writing in HDFS
Differences between Unix commands and HDFS commands
Read / Write in HDFS – Internal Process between Client, Name Node & Data Nodes
Accessing HDFS using Java API
Various Ways of Accessing HDFS
Understanding HDFS Java classes and methods
Admin: Commissioning / Decommissioning Data Node
Balancer
Replication Policy
Network Distance / Topology Script
MapReduce Programming :
About MapReduce
Understanding block and input splits
MapReduce Data types
Understanding Writable
Data Flow in MapReduce Application
Understanding MapReduce problem on datasets
MapReduce and Functional Programming
Writing MapReduce Application
Understanding Mapper function
Understanding Reducer Function
Understanding Driver
Usage of Combiner
Understanding Partitioner
Usage of Distributed Cache
Passing the parameters to mapper and reducer
Analyzing the Results
Log files
Input Formats and Output Formats
Counters, Skipping Bad and unwanted Records
Writing Joins in MapReduce with 2 Input files. Join Types.
Execute MapReduce Job – Insights.
Exercises on MapReduce.
Job Scheduling: Type of Schedulers.
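The Mapper, Reducer, and Driver roles listed above can be sketched with the classic word count example. This is a minimal in-process sketch in plain Python, not the Hadoop API: the function names `mapper`, `shuffle`, `reducer`, and `run_job` are hypothetical, and a real job would typically be written against Hadoop's Java classes and run across a cluster.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all values by key between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def run_job(lines):
    """Driver: wire the map, shuffle, and reduce phases together."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


lines = ["big data hadoop", "hadoop stores big data", "hadoop scales"]
word_counts = run_job(lines)
print(word_counts["hadoop"])  # 3
```

Because the reduce step here is a plain sum, the same `reducer` logic could also serve as a combiner that pre-aggregates counts on the map side before the shuffle.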
Hive
Hive concepts
Schema on Read VS Schema on Write
Hive architecture
Install and configure Hive on cluster
Meta Store – Purpose & Type of Configurations
Different type of tables in Hive
Buckets
Partitions
Joins in Hive
Hive Query Language
Hive Data Types
Data Loading into Hive Tables
Hive Query Execution
Hive library functions
Hive UDF
Hive Limitations
Pig
Pig basics
Install and configure PIG on a cluster
PIG Library functions
Pig Vs Hive
Write sample Pig Latin scripts
Modes of running PIG
Running in Grunt shell
Running as Java program
PIG UDFs
HBase :
HBase concepts
HBase architecture
Region server architecture
File storage architecture
HBase basics
Column access
Scans
HBase use cases
Install and configure HBase on a multi node cluster
Create database, Develop and run sample applications
Access data stored in HBase using Java API
Sqoop :
Install and configure Sqoop on cluster
Connecting to RDBMS
Installing MySQL
Import data from MySQL to Hive
Export data to MySQL
Internal mechanism of import/export
Oozie :
Introduction to OOZIE
Oozie architecture
XML file specifications
Specifying Workflow
Control nodes
Oozie job coordinator
Flume
Introduction to Flume
Configuration and Setup
Flume Sink with example
Channel
Flume Source with example
Complex Flume architecture
Zookeeper :
Introduction to Zookeeper
Challenges in distributed Applications
Coordination
ZooKeeper : Design Goals
Data Model and Hierarchical namespace
Client APIs
YARN
Hadoop 1.0 Limitations
MapReduce Limitations
History of Hadoop 2.0
HDFS 2: Architecture
HDFS 2: Quorum based storage
HDFS 2: High availability
HDFS 2: Federation
YARN Architecture
Classic vs YARN
YARN Apps
YARN multitenancy
YARN Capacity Scheduler
Prerequisites :
Knowledge of any programming language, databases, and the Linux operating system. Core Java or Python
knowledge is helpful.