Big data - Online Training

The following topics will be covered in our Big Data Online Training:
Copyright @ 2015 Learntek. All Rights Reserved.

What is Hadoop?
Hadoop is a free, Java-based programming framework that supports the
processing of large data sets in a distributed computing environment. It is an
open-source project of the Apache Software Foundation. Hadoop makes it
possible to run applications on systems with thousands of nodes and
thousands of terabytes of storage capacity. Its distributed file system
provides rapid data transfer rates among nodes and lets the system keep
operating when a node fails. This approach lowers the risk of catastrophic
system failure, even if a significant number of nodes become inoperative.

Why Hadoop?
• Large Volumes of Data: Ability to store and process huge amounts of data of every variety
(structured, unstructured and semi-structured), quickly. With data volumes and varieties
constantly increasing, especially from social media and the Internet of Things (IoT), that’s a
key consideration.
• Computing Power: Hadoop’s distributed computing model processes big data fast. The more
computing nodes you use, the more processing power you have.
• Fault Tolerance: Data and application processing are protected against hardware failure. If a
node goes down, jobs are automatically redirected to other nodes so that the
distributed computation does not fail. Multiple copies of all data are stored automatically.
• Flexibility: Unlike with a traditional relational database, you don’t have to preprocess data
before storing it. You can store as much data as you want and decide how to use it later. That
includes unstructured data such as text, images and videos.
• Low Cost: The open-source framework is free and uses commodity hardware to store large
quantities of data.
• Scalability: You can easily grow your system to handle more data simply by adding nodes.
Little administration is required.
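
The fault-tolerance point can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the node names, the helper functions `place_blocks` and `readable_blocks`, and the random placement are hypothetical, and real HDFS placement is rack-aware rather than random. It shows why storing three copies of each block on distinct nodes means that losing any two nodes cannot make a block unreadable.

```python
import random

def place_blocks(num_blocks, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (random placement, an assumption)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in range(num_blocks)}

def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {block for block, holders in placement.items() if holders - failed}

# Hypothetical six-node cluster holding 20 blocks, three replicas each.
nodes = [f"node{i}" for i in range(6)]
placement = place_blocks(num_blocks=20, nodes=nodes)

# Two nodes fail; every block still has at least one live replica.
failed = {"node0", "node1"}
print(len(readable_blocks(placement, failed)))  # prints 20
```

With more failed nodes than spare replicas some blocks could become unreadable, which is why a real cluster re-replicates under-replicated blocks after a failure.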

Big Data Hadoop Training: Hadoop Introduction
• Introduction to Data and System
• Types of Data
• Traditional way of dealing with
large data and its problems
• Types of Systems & Scaling
• What is Big Data
• Challenges in Big Data
• Challenges in Traditional
Application
• New Requirements
• What is Hadoop? Why Hadoop?
• Brief history of Hadoop
• Features of Hadoop
• Hadoop and RDBMS
• Hadoop Ecosystem overview

Hadoop Installation
• Installation in detail
• Creating Ubuntu image in VMware
• Downloading Hadoop
• Installing SSH
• Configuring Hadoop, HDFS &
MapReduce
• Download, Installation &
Configuration of Hive, Pig
and Sqoop
• Configuring Hadoop in Different
Modes

Hadoop Distributed File System (HDFS)
• File System – Concepts
• Blocks
• Replication Factor
• Version File
• Safe mode
• Namespace IDs
• Purpose of Name Node
• Purpose of Data Node
• Purpose of Secondary Name
Node
• Purpose of Job Tracker
• Purpose of Task Tracker
• HDFS Shell Commands –
copy, delete, create
directories etc.
• Reading and Writing in HDFS
• Difference between Unix
commands and HDFS
commands
• Hadoop Admin Commands
• Hands on exercise with Unix
and HDFS commands
• Read / Write in HDFS –
Internal Process between
Client, NameNode &
DataNodes.
• Accessing HDFS using Java
API
• Various Ways of Accessing
HDFS
• Understanding HDFS Java
classes and methods
• Admin: Commissioning /
Decommissioning DataNode
• Balancer
• Replication Policy
• Network Distance / Topology
Script
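
The Blocks and Replication Factor topics come down to simple arithmetic, sketched here. The helper name `hdfs_footprint` is hypothetical, and the defaults of a 128 MB block size and a replication factor of 3 are assumptions matching common Hadoop 2 settings (Hadoop 1 commonly used 64 MB blocks); both are configurable per cluster.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (blocks, block replicas, raw storage in MB) for one file.

    The last block only occupies the bytes it actually holds, so raw
    storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication, file_size_mb * replication

# A 1 GB file: 8 blocks, 24 block replicas, 3 GB of raw cluster storage.
print(hdfs_footprint(1024))  # prints (8, 24, 3072)
```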

MapReduce Programming
• About MapReduce
• Understanding block and
input splits
• MapReduce Data types
• Understanding Writable
• Data Flow in MapReduce
Application
• Understanding MapReduce
problem on datasets
• MapReduce and Functional
Programming
• Writing MapReduce
Application
• Understanding Mapper
function
• Understanding Reducer
Function
• Understanding Driver
• Usage of Combiner
• Understanding Partitioner
• Usage of Distributed Cache
• Passing the parameters to
mapper and reducer
• Analysing the Results
• Log files
• Input Formats and Output
Formats
• Counters, Skipping Bad and
unwanted Records
• Writing Joins in MapReduce
with 2 Input files. Join Types.
• Execute MapReduce Job –
Insights.
• Exercises on MapReduce.
• Job Scheduling: Types of
Schedulers.
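
The data flow named in this module (mapper, combiner, partitioner, reducer) can be sketched in plain Python with a word count. This is not Hadoop API code: every function name here is hypothetical, the whole job runs in one process, and the byte-sum partitioner is only a deterministic stand-in for a hash partitioner.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def combiner(pairs):
    """Pre-aggregate counts on the map side to shrink shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()

def partition(key, num_reducers):
    """Pick the reducer for a key (stand-in for a hash partitioner)."""
    return sum(key.encode()) % num_reducers

def reducer(key, values):
    """Sum all partial counts that arrived for one key."""
    return key, sum(values)

def run_job(splits, num_reducers=2):
    # Shuffle: one bucket per reducer, grouping values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        mapped = (pair for line in split for pair in mapper(line))
        for key, value in combiner(mapped):
            buckets[partition(key, num_reducers)][key].append(value)
    result = {}
    for bucket in buckets:
        for key in sorted(bucket):  # reducers see their keys in sorted order
            word, total = reducer(key, bucket[key])
            result[word] = total
    return result

# Two input splits, each a list of lines.
splits = [["big data big cluster"], ["data node", "big data"]]
counts = run_job(splits)
print(counts["big"], counts["data"], counts["node"])  # prints 3 3 1
```

The combiner is safe here because addition is associative and commutative, so summing per split first gives the same totals as summing everything in the reducer.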

Hive
• Hive concepts
• Schema on Read vs Schema on
Write
• Hive architecture
• Install and configure Hive on
cluster
• Metastore – Purpose & Types of
Configurations
• Different types of tables in Hive
• Buckets
• Partitions
• Joins in Hive
• Hive Query Language
• Hive Data Types
• Data Loading into Hive Tables
• Hive Query Execution
• Hive library functions
• Hive UDF
• Hive Limitations

Pig
• Pig basics
• Install and configure Pig on a cluster
• Pig library functions
• Pig vs Hive
• Write sample Pig Latin scripts
• Modes of running Pig
• Running in Grunt shell
• Running as Java program
• Pig UDFs

HBase
• HBase concepts
• HBase architecture
• Region server architecture
• File storage architecture
• HBase basics
• Column access
• Scans
• HBase use cases
• Install and configure HBase on a
multi node cluster
• Create database, Develop and
run sample applications
• Access data stored in HBase
using Java API

Sqoop
• Install and configure Sqoop on cluster
• Connecting to RDBMS
• Installing MySQL
• Import data from MySQL to Hive
• Export data to MySQL
• Internal mechanism of import/export

Oozie
• Introduction to Oozie
• Oozie architecture
• XML file specifications
• Specifying Workflow
• Control nodes
• Oozie job coordinator

Flume
• Introduction to Flume
• Configuration and Setup
• Flume Sink with example
• Channel
• Flume Source with example
• Complex Flume architecture

ZooKeeper
• Introduction to ZooKeeper
• Challenges in distributed Applications
• Coordination
• ZooKeeper: Design Goals
• Data Model and Hierarchical namespace
• Client APIs

YARN
• Hadoop 1.0 Limitations
• MapReduce Limitations
• History of Hadoop 2.0
• HDFS 2: Architecture
• HDFS 2: Quorum based storage
• HDFS 2: High availability
• HDFS 2: Federation
• YARN Architecture
• Classic vs YARN
• YARN Apps
• YARN multitenancy
• YARN Capacity Scheduler

Prerequisites:
• Knowledge of any programming language, databases and the
Linux operating system. Core Java or Python knowledge is helpful.
