4. Hadoop Distributed File System (HDFS)
Runs on clusters of inexpensive disks
Write-once data
Stores data in blocks across multiple disks
NameNode is responsible for managing metadata about the actual data
Linux-like CLI for management of files
Since it’s Open Source, customization is
possible
5. Solving computations by breaking everything into Map or Reduce
jobs
Input and output of jobs is always in Key/Value pairs
Map Input might be a line from a file <LineNumber, LineText>:
<224, “Hello World. Hello World”>
Map Output might be an entry for each word occurrence:
<“Hello”, 1>, <“World”, 1>, <“Hello”, 1>, <“World”, 1>
Reduce input would be the output from the Mapper
Reduce output might be the count of occurrence of each word:
<“Hello”, 2>, <“World”, 2>
Generally MapReduce jobs are written in Java (a minimal sketch follows below)
Internally Hadoop does a lot of processing to make this seamless
All data stored in HDFS (except log files)
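To make the flow concrete, here is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API; the input and output paths are placeholders passed on the command line, and punctuation handling is omitted for brevity:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: <LineNumber, LineText> in, <word, 1> out
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Note: this minimal sketch does not strip punctuation
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // e.g., <"Hello", 1>
      }
    }
  }

  // Reducer: <word, [1, 1, ...]> in, <word, count> out
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));  // e.g., <"Hello", 2>
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```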
6. Yet Another Resource Negotiator
By itself not much
Allows a variety of tools to conveniently run within the Hadoop cluster (MapReduce, HBase, Spark, Storm, Solr, etc.)
Think of YARN as the operating system for Hadoop
Users generally interact with individual tools within YARN rather than directly with YARN
7. MapReduce doesn’t perform well with iterative
algorithms (e.g., graph analysis)
Spark overcomes that flaw …
Supports multipass/iterative algorithms by
reducing/eliminating reads/writes to disk
A replacement for MapReduce
Three principles of Spark operations:
Resilient Distributed Dataset (RDD): the data
Transformation: modifies an RDD or creates a new RDD
Action: analyzes an RDD and returns a single result
Scala is the preferred language for Spark
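Although Scala is preferred, Spark also exposes a Java API. Below is a minimal sketch of the three principles, assuming a Spark 2.x-style Java RDD API and running in local mode for illustration; the sample line reuses the MapReduce example above:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkSketch {
  public static void main(String[] args) {
    // local[*] is for testing only; a real job would target the cluster
    SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // RDD: the data (built here from a local collection; hdfs:// paths also work)
      JavaRDD<String> lines = sc.parallelize(
          Arrays.asList("Hello World. Hello World"));

      // Transformation: creates a new RDD; nothing executes yet (lazy)
      JavaRDD<String> words =
          lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

      // Action: triggers execution and returns a single result to the driver
      long count = words.count();
      System.out.println("word count = " + count);
    }
  }
}
```

Because transformations are lazy and intermediate RDDs can be cached in memory, iterative algorithms avoid the per-pass disk reads/writes that slow MapReduce down.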
8. Part of Apache Hadoop YARN
Performance gains
Optimal resource management
Plan reconfiguration at runtime
Dynamic physical data flow decisions
9. An abstraction built on top of Hadoop
Essentially an ETL tool
Use “simple” PigLatin script to create ETL jobs
Pig will convert jobs to Hadoop M/R jobs
Takes away the “pain” of writing Java M/R jobs
Can perform joins, summaries, etc.
Input/Output all within HDFS
Can also write external functions (UDF) and call
them from PigLatin
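As a sketch of what this looks like in practice, the word-count ETL above can be expressed in PigLatin and driven through Pig's Java PigServer API; the script and the /user/demo paths are illustrative only, and a working Hadoop/Pig configuration is assumed to be on the classpath:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    // Each registerQuery line is PigLatin; Pig compiles the plan to M/R jobs
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("lines = LOAD '/user/demo/input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "/user/demo/output");  // output lands back in HDFS
  }
}
```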
10. Allows the use of stdin and stdout (linux) as
input and outputs for your M/R jobs
What this means is that you can use C,
Python, and other languages
All the internal work (e.g., shuffling) still
happens within the Hadoop cluster
Mainly useful when Java skills are limited
11. Collection of machine-learning algorithms
that run on Hadoop
Possible to write your own algorithms in
traditional Java M/R jobs …
… why bother when they exist in Mahout?
Algorithms include: k-means clustering, latent Dirichlet allocation, logistic-regression-based classifier, random forest decision tree classifier, etc.
12. Machine Learning Library (MLLib) for Spark
Similar to Mahout, but specifically for Spark
(Remember Spark is not MapReduce)
Algorithms include: Linear SVM and logistic
regression, k-means clustering, multinomial
naïve Bayes, Dimensionality reduction, etc.
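A minimal k-means sketch via MLlib's Java API is below; it assumes the RDD-based org.apache.spark.mllib package, and the toy 2-D points are invented for illustration (a real job would load vectors from HDFS):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Two obvious clusters of toy points
      JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)));

      int k = 2, iterations = 20;
      KMeansModel model = KMeans.train(points.rdd(), k, iterations);
      for (Vector center : model.clusterCenters()) {
        System.out.println("cluster center: " + center);
      }
    }
  }
}
```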
13. Still not fully developed
Provides basic authorization in Hadoop
Provides role-based authorization
Works at the application level (the application needs to call the APIs)
Works with Hive, Solr and Impala
Drawback: possible to write an M/R job to access non-authorized data
15. Security Gateway to manage access
History of Hadoop suggests that security was
an afterthought
Each tool had own security implementation
Knox overcomes that complexity
Provides gateway between external (to Hadoop)
apps and internal apps
Authorization, authentication, and auditing
Works with AD and LDAP
16. Transfers data between HDFS and relational
DBs
A very simple command-line tool
Export data from HDFS to RDBMS
Import data from RDBMS to HDFS
Transfers executed as M/R jobs in Hadoop
Filtering possible
Additional options for file formats, delimiters, etc.
17. Data collection and aggregation
Works well with log data
Moves large data files from various servers
into Hadoop cluster
Supports “complex” multihop flows
Key implementation features: source,
channel, sink
Job configuration done via a .config file
18. Data movement between Hadoop clusters
Basically, it can copy an entire cluster
Primary Usage:
Moving data from test to dev environments
“Dual Ingestion” using two clusters in case one
fails
19. Stream Ingestion (instead of
batch processing)
Quickly perform
transformations of very large
number of small records
Workflow, called topology,
includes spouts as inputs and
bolts as transformations.
Usage:
transform a stream of tweets
into a stream of trending
topics
Bolts can do a lot of work: aggregate, communicate with databases, perform joins, etc.
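A minimal topology sketch in Java, assuming a Storm 1.x-style API; the spout and bolt are toy stand-ins (a fixed sentence in place of a live tweet stream), and the local cluster is for testing only:

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class TopologySketch {

  // Spout: the input; here it just emits a fixed sentence once per second
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) {
      this.collector = c;
    }
    public void nextTuple() {
      Utils.sleep(1000);
      collector.emit(new Values("hello storm"));
    }
    public void declareOutputFields(OutputFieldsDeclarer d) {
      d.declare(new Fields("sentence"));
    }
  }

  // Bolt: the transformation; splits each sentence into words
  public static class SplitBolt extends BaseBasicBolt {
    public void execute(Tuple t, BasicOutputCollector c) {
      for (String w : t.getStringByField("sentence").split(" ")) {
        c.emit(new Values(w));
      }
    }
    public void declareOutputFields(OutputFieldsDeclarer d) {
      d.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout());
    builder.setBolt("words", new SplitBolt()).shuffleGrouping("sentences");

    // Local test run; production code submits via StormSubmitter instead
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("sketch", new Config(), builder.createTopology());
    Utils.sleep(10_000);   // let it run briefly, then shut down
    cluster.shutdown();
  }
}
```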
20. A Distributed Messaging framework
Fast, scalable, and durable
Single cluster can serve as central data
backbone
Messages are persisted on disk and replicated
across clusters
Uses include: traditional messaging, website
activity tracking, centralized feeds of
operational data
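A minimal producer sketch using Kafka's Java client; the broker address and the page-views topic (standing in for website activity tracking) are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // "page-views" is a hypothetical topic; Kafka persists and replicates it
      producer.send(new ProducerRecord<>("page-views", "user42", "/index.html"));
    }
  }
}
```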
21. Provision, monitoring, and management of a
Hadoop cluster
GUI-based tool
Features
Step-by-step wizard for installing services
Start, stop, configure services
Dashboard for monitoring health and status
Ganglia for metrics collection
Nagios for system alerts
22. Another data abstraction layer
Use HDFS files as tables
Almost SQL-like, but more Hive-like
Add partitions
Users don’t have to worry about location or
format of data
23. IT Infrastructure monitoring
Web based interface
Detection of outages and problems
Send alerts via email or SMS
Automatic restart provisioning
24. PUPPET
Node management tool
Puppet uses declarative
syntax
Configuration file identifies
programs; Puppet
determines their
availability
Broken down as:
Resources, manifests, and
modules
CHEF
Node management tool
Chef uses imperative
syntax
Resource might specify a
certain requirement (a
specific directory is
needed)
Broken down as:
Resources, recipes and
cookbooks
25. Allows coordination between nodes
Sharing “small” amounts of state and config
data
For example, share connection string
Highly scalable and reliable
Some built-in protection against using it as a datastore
Use API to extend use to other areas like
implementing security
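A minimal sketch of sharing a connection string through ZooKeeper's Java client; the ensemble address, znode paths, and JDBC string are placeholders, and the znodes are assumed not to exist yet:

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSketch {
  public static void main(String[] args) throws Exception {
    // zk1:2181 is a placeholder ensemble address; the watcher is a no-op here
    ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event -> { });

    // Publish a small piece of shared state: a connection string
    zk.create("/config", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    byte[] conn = "jdbc:mysql://db1:3306/app".getBytes(StandardCharsets.UTF_8);
    zk.create("/config/connString", conn,
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any node in the cluster can now read the same value
    byte[] read = zk.getData("/config/connString", false, null);
    System.out.println(new String(read, StandardCharsets.UTF_8));
    zk.close();
  }
}
```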
26. A workflow scheduler
Like typical schedulers, you can create
relatively complex rules around jobs
Start, stop, suspend, restart jobs
Control both jobs and tasks
27. Another monitoring tool
Provides a high-level overview of cluster
Computing capability, data transfers, storage
usage
Has support for add-ins for additional
features
Used within Ambari
28. Feed management and data processing
platform
Feed retention, replications, archival
Supports workflows
Integration with Hive/HCatalog
Feeds can be any type of data (e.g., emails)
29. Key-value store
Scales well with efficient storage
Distributed database
Peer-to-peer system
30. NoSQL database with random access
Excellent for sparse data
Behaves like a key-value store
Key + number of bins/columns
Only one datatype: byte string
Concept of column families for similar data
Has a CLI, but can be accessed from Java and Pig
Not meant for transactional systems
Limited built-in functionality
Key functions must be added at application level
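A minimal random-access sketch with the HBase Java client; it assumes a users table with an info column family already exists, and it shows that row keys, qualifiers, and values are all byte strings:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {

      // Everything is a byte string: row key, column family, qualifier, value
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("Alice"));
      table.put(put);

      // Random access by key
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```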
31. Name-value db with cell-level security
Developed by the NSA, but now with Apache
Excellent for multitenant storage
Set column visibility rules for user “labels”
Scales well to petabytes of data
Retrieval operations in seconds
32. In-memory cache
Fast access of large data for short time
Traditional approach to sharing data in HDFS
is to use replicated join (send data to each
node)
Memcached provides a “pool” of memory
across the nodes and stores data in that pool
Effectively a distributed memory pool
Much more efficient than replicating data
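A minimal sketch using the spymemcached Java client (one common choice; the server address and the cached lookup value are placeholders):

```java
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class MemcachedSketch {
  public static void main(String[] args) throws Exception {
    // cache1:11211 is a placeholder; a real pool would list several servers
    MemcachedClient client =
        new MemcachedClient(new InetSocketAddress("cache1", 11211));

    // Cache a lookup table once instead of replicating it to every node
    client.set("countryCodes", 3600, "US=1,UK=44,DE=49");  // 1-hour expiry
    Object cached = client.get("countryCodes");
    System.out.println(cached);
    client.shutdown();
  }
}
```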
33. Document Warehouse
Allows searching of text documents
Blur uses HDFS stack; Solr doesn’t
Users can query data based on indexes
34. JSON document-oriented database
Most popular NoSQL db
Supports secondary indexes
Does not run on Hadoop Stack
Concept of documents (rows) and collections
(tables)
Very scalable … extends simple key-value
storage
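A minimal sketch of the document/collection model using the modern Java sync driver; the connection string, database, collection, and field names are placeholders:

```java
import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class MongoSketch {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> users =                 // collection ~ table
          client.getDatabase("app").getCollection("users");

      // document ~ row; schema-free JSON-style fields
      users.insertOne(new Document("name", "Alice").append("age", 30));

      Document found = users.find(new Document("name", "Alice")).first();
      System.out.println(found.toJson());
    }
  }
}
```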
35. Interact directly with HDFS data using HQL
HQL similar to SQL (syntax and commands)
HQL queries converted to M/R jobs
HQL does not support:
Updates/Deletes
Transactions
Non-equality joins
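A minimal sketch of querying Hive from Java over JDBC (via HiveServer2); the host and the words table are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // hive1:10000 is a placeholder HiveServer2 address
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive1:10000/default", "user", "");
         Statement stmt = conn.createStatement()) {

      // HQL looks like SQL; behind the scenes it is compiled to M/R jobs
      ResultSet rs = stmt.executeQuery(
          "SELECT word, COUNT(*) FROM words GROUP BY word");
      while (rs.next()) {
        System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
      }
    }
  }
}
```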
36. SQL Access to Hadoop Data
In-memory model for execution (like Spark)
No MapReduce functionality
Much faster than traditional HDFS access
Supports HQL; also supports Java and Scala APIs
Can also run MLLib algorithms
37. A Graph database (think extended relationships)
Facebook, LinkedIn, Twitter, etc. use graphs to determine your friends and likely friends
The science of graph theory is a bit complicated
If John is a friend of Mary, Mary is a friend of Tom, and Tom is a friend of Alice …
Find friends who are two paths (degrees) from John; a nightmare to do with SQL
Finding relationships from email exchanges
38. Relational database layer over HBASE
Provides JDBC driver to access data
SQL query converted into HBase scans
Produces regular JDBC resultsets
Versioning support to ensure correct schema
is used
Good performance
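A minimal sketch of using Phoenix through its JDBC driver; the ZooKeeper host and table are placeholders, and UPSERT is Phoenix's combined insert/update statement:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
    // zk1 is a placeholder ZooKeeper quorum host for the HBase cluster
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1");
         Statement stmt = conn.createStatement()) {

      stmt.executeUpdate(
          "CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, name VARCHAR)");
      stmt.executeUpdate("UPSERT INTO users VALUES (1, 'Alice')");
      conn.commit();  // Phoenix batches mutations until commit

      // The SQL is translated into HBase scans; results come back as JDBC rows
      ResultSet rs = stmt.executeQuery("SELECT id, name FROM users");
      while (rs.next()) {
        System.out.println(rs.getLong(1) + " -> " + rs.getString(2));
      }
    }
  }
}
```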