Zahid Mian
Part of the Brown-bag Series
 Core Technologies
   HDFS
   MapReduce
   YARN
   Spark
 Data Processing
   Pig
   Mahout
   Hadoop Streaming
   MLLib
 Security
   Sentry
   Kerberos
   Knox
 ETL
   Sqoop
   Flume
   DistCp
   Storm
 Monitoring
   Ambari
   HCatalog
   Nagios
   Puppet
   Chef
   ZooKeeper
   Oozie
   Ganglia
 Databases
   Cassandra
   HBase
   Accumulo
   Memcached
   Blur
   Solr
   MongoDB
   Hive
   SparkSQL
   Giraph
 Hadoop Distributed File System (HDFS)
 Runs on clusters of inexpensive disks
 Write-once data
 Stores data in blocks across multiple disks
 NameNode responsible for managing
metadata about the actual data
 Linux-like CLI for management of files
 Since it’s Open Source, customization is
possible
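To make the block and NameNode bullets concrete, here is a toy Python sketch. It is not the HDFS API: the tiny block size, the replication factor, the DataNode names, and the round-robin placement are all assumptions chosen for readability. Real HDFS uses far larger blocks and its own placement policy.

```python
# Toy model of HDFS block storage; every name here is illustrative, not an HDFS API.
BLOCK_SIZE = 4                     # bytes per block (assumption; real blocks are far larger)
REPLICATION = 2                    # replicas kept per block (assumption)
DATANODES = ["dn1", "dn2", "dn3"]  # hypothetical DataNode names


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks):
    """Build NameNode-style metadata: block index -> DataNodes holding a replica.

    Round-robin placement is a simplification for illustration.
    """
    return {
        b: [DATANODES[(b + r) % len(DATANODES)] for r in range(REPLICATION)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(b"hello hdfs")
metadata = place_blocks(len(blocks))
print(blocks)    # the data itself, which would be spread across DataNodes
print(metadata)  # what the NameNode tracks: where each block lives
```

The point of the split is that the NameNode only holds the small `metadata` mapping, while the bytes in `blocks` live on the DataNodes.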
MapReduce
 Solves computations by breaking everything into Map or Reduce jobs
 Input and output of jobs are always Key/Value pairs
 Map input might be a line from a file, <LineNumber, LineText>:
   <224, “Hello World. Hello World”>
 Map output might be one pair per word occurrence:
   <“Hello”, 1>, <“World”, 1>, <“Hello”, 1>, <“World”, 1>
 Reduce input would be the output from the Mapper, grouped by key
 Reduce output might be the count of occurrences of each word:
   <“Hello”, 2>, <“World”, 2>
 Generally MapReduce jobs are written in Java
 Internally Hadoop does a lot of processing to make this seamless
 All data stored in HDFS (except log files)
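The word-count flow above can be sketched in plain Python. This is a single-process illustration, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up here, and `shuffle` stands in for the grouping work Hadoop does internally between the two phases.

```python
import re
from collections import defaultdict


def map_phase(line_number, line_text):
    """Emit a <word, 1> pair for every word in the line (the key is unused here)."""
    return [(word, 1) for word in re.findall(r"[A-Za-z]+", line_text)]


def shuffle(pairs):
    """Group values by key, standing in for Hadoop's internal shuffle-and-sort."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Collapse all the 1s emitted for a word into a single total."""
    return (word, sum(counts))


mapped = map_phase(224, "Hello World. Hello World")
reduced = [reduce_phase(w, c) for w, c in sorted(shuffle(mapped).items())]
print(mapped)   # [('Hello', 1), ('World', 1), ('Hello', 1), ('World', 1)]
print(reduced)  # [('Hello', 2), ('World', 2)]
```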
YARN
 Yet Another Resource Negotiator
 By itself, it does not do much
 Allows a variety of tools to conveniently run within the Hadoop cluster (MapReduce, HBase, Spark, Storm, Solr, etc.)
 Think of YARN as the operating system for Hadoop
 Users generally interact with individual tools within YARN rather than directly with YARN
Spark
 MapReduce doesn’t perform well with iterative algorithms (e.g., graph analysis)
 Spark overcomes that flaw …
 Supports multipass/iterative algorithms by reducing/eliminating reads/writes to disk
 A replacement for MapReduce
 Three principles of Spark operations:
   Resilient Distributed Dataset (RDD): the data
   Transformation: creates a new RDD from an existing one (RDDs are immutable)
   Action: computes over an RDD and returns a result
 Scala is the preferred language for Spark
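The split between transformations and actions can be illustrated with a toy class. This is deliberately not the Spark API: `ToyRDD` is a hypothetical single-machine stand-in that only mimics the idea that transformations are recorded lazily and that an action triggers the computation.

```python
class ToyRDD:
    """Tiny stand-in for an RDD: immutable data plus a chain of pending steps."""

    def __init__(self, data, steps=()):
        self._data = tuple(data)
        self._steps = tuple(steps)

    # Transformations: return a new ToyRDD; nothing is computed yet.
    def map(self, fn):
        return ToyRDD(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._data, self._steps + (("filter", fn),))

    def _compute(self):
        """Apply the recorded steps, in order, as one lazy pipeline."""
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the pending steps and return a plain Python value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


numbers = ToyRDD(range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # [4, 16]
print(numbers.count())         # 5, the original dataset is unchanged
```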
Tez
 Part of Apache Hadoop YARN
 Performance gains
 Optimal resource management
 Plan reconfiguration at runtime
 Dynamic physical data flow decisions
Pig
 An abstraction built on top of Hadoop
 Essentially an ETL tool
 Use “simple” Pig Latin scripts to create ETL jobs
 Pig converts the scripts to Hadoop M/R jobs
 Takes away the “pain” of writing Java M/R jobs
 Can perform joins, summaries, etc.
 Input/Output all within HDFS
 Can also write user-defined functions (UDFs) and call them from Pig Latin
Hadoop Streaming
 Allows the use of stdin and stdout (Linux) as input and output for your M/R jobs
 What this means is that you can use C,
Python, and other languages
 All the internal work (e.g., shuffling) still
happens within the Hadoop cluster
 Only useful if Java skills are weak
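As a sketch of what a streaming job looks like in Python: the mapper and reducer below exchange tab-separated `key<TAB>value` lines, which is the default convention for Hadoop Streaming. In a real job each function would live in its own script, reading `sys.stdin` and printing to stdout. The `run_locally` helper is a hypothetical stand-in that imitates the sort Hadoop performs between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; assumes lines arrive sorted by key, as Hadoop delivers them."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process for a quick check."""
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["hello world", "hello hadoop"]))
# ['hadoop\t1', 'hello\t2', 'world\t1']
```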
Mahout
 Collection of machine-learning algorithms that run on Hadoop
 Possible to write your own algorithms in traditional Java M/R jobs …
 … why bother when they exist in Mahout?
 Algorithms include: k-means clustering, latent Dirichlet allocation, logistic-regression-based classifier, random forest decision tree classifier, etc.
 Machine Learning Library (MLLib) for Spark
 Similar to Mahout, but specifically for Spark
 (Remember Spark is not MapReduce)
 Algorithms include: linear SVM and logistic regression, k-means clustering, multinomial naïve Bayes, dimensionality reduction, etc.
 Still not fully developed
Sentry
 Provides basic authorization in Hadoop
 Provides role-based authorization
 Works at the application level (the application needs to call the APIs)
 Works with Hive, Solr and Impala
 Drawback: it is possible to write an M/R job that accesses non-authorized data
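A minimal sketch of role-based authorization enforced at the application level. This is not the Sentry API: the group, role, and table names are invented, and the group-to-role-to-privilege lookup is only a simplified picture of the idea. It also shows why the drawback exists: the check only protects callers that go through `query`.

```python
# Toy role-based check; role, group, and table names are made up for illustration.
ROLE_PRIVILEGES = {
    "analyst": {("sales", "SELECT")},
    "etl": {("sales", "SELECT"), ("sales", "INSERT")},
}
GROUP_ROLES = {"finance-team": {"analyst"}, "ingest": {"etl"}}


def is_authorized(groups, table, action):
    """True if any role granted to any of the user's groups holds the privilege."""
    roles = set().union(*(GROUP_ROLES.get(g, set()) for g in groups))
    return any((table, action) in ROLE_PRIVILEGES.get(r, set()) for r in roles)


def query(groups, table, action):
    """The application must call the check itself; code that skips it is not stopped."""
    if not is_authorized(groups, table, action):
        raise PermissionError(f"{action} on {table} denied")
    return f"{action} on {table} allowed"


print(query(["finance-team"], "sales", "SELECT"))          # SELECT on sales allowed
print(is_authorized(["finance-team"], "sales", "INSERT"))  # False
```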
Kerberos
 Provides secure authentication
 Tedious to set up and maintain
Knox
 Security gateway to manage access
 History of Hadoop suggests that security was an afterthought
 Each tool had its own security implementation
 Knox overcomes that complexity
 Provides gateway between external (to Hadoop)
apps and internal apps
 Authorization, authentication, and auditing
 Works with AD and LDAP
Sqoop
 Transfers data between HDFS and relational DBs
 A very simple command-line tool
   Export data from HDFS to RDBMS
   Import data from RDBMS to HDFS
 Transfers are executed as M/R jobs in Hadoop
 Filtering possible
 Additional options for file formats, delimiters, etc.
Flume
 Data collection and aggregation
 Works well with log data
 Moves large data files from various servers
into Hadoop cluster
 Supports “complex” multihop flows
 Key implementation features: source,
channel, sink
 Job configuration done via a .config file
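The source, channel, and sink roles can be sketched with a queue. This is a toy model, not Flume itself: in Flume these pieces are wired together in an agent configuration file rather than in code, and the plain list here merely stands in for an HDFS destination.

```python
from queue import Queue


def source(log_lines, channel):
    """Source: turn incoming log lines into events and put them on the channel."""
    for line in log_lines:
        channel.put({"body": line})


def sink(channel, store):
    """Sink: drain the channel and write each event to its destination."""
    while not channel.empty():
        store.append(channel.get()["body"])


channel = Queue()    # Channel: buffers events between source and sink
hdfs_stand_in = []   # a plain list standing in for a file in HDFS

source(["GET /index.html 200", "GET /missing 404"], channel)
sink(channel, hdfs_stand_in)
print(hdfs_stand_in)  # ['GET /index.html 200', 'GET /missing 404']
```

A multihop flow would chain these: the sink of one agent feeds the source of the next.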
DistCp
 Data movement between Hadoop clusters
 Basically, it can copy an entire cluster
 Primary usage:
   Moving data from test to dev environments
   “Dual ingestion” using two clusters in case one fails
Storm
 Stream ingestion (instead of batch processing)
 Quickly performs transformations on a very large number of small records
 A workflow, called a topology, includes spouts as inputs and bolts as transformations
 Usage:
   Transform a stream of tweets into a stream of trending topics
 Bolts can do a lot of work: aggregate, communicate with databases, perform joins, etc.
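The tweets-to-trending-topics example can be sketched as a chain of Python generators. This only illustrates the spout/bolt shape and is not the Storm API: a fixed list stands in for a live feed, and the whole thing runs once on one machine, whereas a real topology runs continuously across a cluster.

```python
from collections import Counter


def tweet_spout():
    """Spout: the stream's input. A fixed list stands in for a live feed."""
    yield from ["loving #hadoop", "#storm and #hadoop", "just #storm things", "#hadoop"]


def hashtag_bolt(tweets):
    """Bolt: transform each tweet into zero or more hashtag records."""
    for tweet in tweets:
        for word in tweet.split():
            if word.startswith("#"):
                yield word.lower()


def trending_bolt(hashtags, top_n=2):
    """Bolt: aggregate hashtags into a ranked list of trending topics."""
    return Counter(hashtags).most_common(top_n)


# The topology wires the spout into the chain of bolts.
trending = trending_bolt(hashtag_bolt(tweet_spout()))
print(trending)  # [('#hadoop', 3), ('#storm', 2)]
```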
Kafka
 A distributed messaging framework
 Fast, scalable, and durable
 A single cluster can serve as the central data backbone
 Messages are persisted on disk and replicated within the cluster
 Uses include: traditional messaging, website
activity tracking, centralized feeds of
operational data
Ambari
 Provisioning, monitoring, and management of a Hadoop cluster
 GUI-based tool
 Features
 Step by step wizard for installing services
 Start, stop, configure services
 Dashboard for monitoring health and status
 Ganglia for metrics collection
 Nagios for system alerts
HCatalog
 Another data abstraction layer
 Use HDFS files as tables
 Almost SQL-like, but more Hive-like
 Add partitions
 Users don’t have to worry about location or
format of data
Nagios
 IT infrastructure monitoring
 Web-based interface
 Detection of outages and problems
 Send alerts via email or SMS
 Automatic restart provisioning
PUPPET
 Node management tool
 Puppet uses declarative syntax
 Configuration file identifies programs; Puppet determines their availability
 Broken down as: resources, manifests, and modules
CHEF
 Node management tool
 Chef uses imperative syntax
 A resource might specify a certain requirement (e.g., a specific directory is needed)
 Broken down as: resources, recipes, and cookbooks
ZooKeeper
 Allows coordination between nodes
 Sharing “small” amounts of state and config
data
 For example, share connection string
 Highly scalable and reliable
 Some built-in protection from using it as a
datastore
 Use API to extend use to other areas like
implementing security
Oozie
 A workflow scheduler
 Like typical schedulers, you can create
relatively complex rules around jobs
 Start, stop, suspend, restart jobs
 Control both jobs and tasks
Ganglia
 Another monitoring tool
 Provides a high-level overview of the cluster
   Computing capability, data transfers, storage usage
 Has support for add-ins for additional features
 Used within Ambari
Falcon
 Feed management and data processing platform
 Feed retention, replication, archival
 Supports workflows
 Integration with Hive/HCatalog
 Feeds can be any type of data (e.g., Emails)
Cassandra
 Key-value store
 Scales well, with efficient storage
 Distributed database
 Peer-to-peer system
HBase
 NoSQL database with random access
 Excellent for sparse data
 Behaves like a key-value store
   Key + number of bins/columns
   Only one datatype: byte string
 Concept of column families for similar data
 Has a CLI, but can also be accessed from Java and Pig
 Not meant for transactional systems
 Limited built-in functionality
 Key functions must be added at application level
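The data model described above (row key, column families, byte-string values, sparse rows) can be pictured with a small Python class. `ToyTable` is a hypothetical illustration, not an HBase client, and the `family:qualifier` column naming is used here only as a convention for the sketch.

```python
from collections import defaultdict


class ToyTable:
    """Sparse table: row key -> 'family:qualifier' -> bytes. Illustrative only."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed up front
        self.rows = defaultdict(dict)   # absent columns simply take no space

    def put(self, row_key, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)


users = ToyTable(families=["info", "stats"])
users.put("user#42", "info:name", b"Mary")
users.put("user#42", "stats:logins", str(7).encode())  # the app encodes numbers itself
users.put("user#99", "info:name", b"Tom")

print(users.get("user#42", "stats:logins"))  # b'7'
print(users.get("user#99", "stats:logins"))  # None, the row is sparse
```

Because every value is a byte string, turning `b'7'` back into a number is the application's job, which is one example of functionality that has to be added at the application level.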
Accumulo
 Name-value DB with cell-level security
 Developed by the NSA, but now an Apache project
 Excellent for multitenant storage
 Set column visibility rules for user “labels”
 Scales well, at petabytes of data
 Retrieval operations in seconds
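Cell-level security can be sketched as a label check on every cell during a scan. This is a simplified model, not the Accumulo API: here a cell is visible only if the user holds all of its labels, whereas Accumulo's visibility expressions are richer. The rows, labels, and values are invented for illustration.

```python
def visible(cell_labels, user_auths):
    """A cell is readable only if the user holds every label on it (AND-only toy rule)."""
    return set(cell_labels) <= set(user_auths)


# Each cell: (row, column, visibility labels, value). All data here is made up.
cells = [
    ("patient1", "name", {"public"}, "Alice"),
    ("patient1", "diagnosis", {"medical"}, "flu"),
    ("patient1", "billing", {"finance", "audit"}, "$120"),
]


def scan(user_auths):
    """Return only the cells this user's authorizations allow them to see."""
    return [(row, col, val) for row, col, labels, val in cells if visible(labels, user_auths)]


print(scan({"public", "medical"}))
# [('patient1', 'name', 'Alice'), ('patient1', 'diagnosis', 'flu')]
print(scan({"public", "finance"}))
# [('patient1', 'name', 'Alice')]
```

Two users scanning the same row get different results, which is what makes this model a fit for multitenant storage.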
Memcached
 In-memory cache
 Fast access to large data for a short time
 Traditional approach to sharing data in HDFS
is to use replicated join (send data to each
node)
 Memcached provides a “pool” of memory
across the nodes and stores data in that pool
 Effectively a distributed memory pool
 Much more efficient than replicating data
Blur and Solr
 Document warehouse
 Allows searching of text documents
 Blur uses the HDFS stack; Solr doesn’t
 Users can query data based on indexing
MongoDB
 JSON document-oriented database
 Most popular NoSQL db
 Supports secondary indexes
 Does not run on Hadoop Stack
 Concept of documents (rows) and collections
(tables)
 Very scalable … extends simple key-value
storage
Hive
 Interact directly with HDFS data using HQL
 HQL similar to SQL (syntax and commands)
 HQL queries converted to M/R jobs
 HQL does not support:
 Updates/Deletes
 Transactions
 Non-equality joins
SparkSQL
 SQL access to Hadoop data
 In-memory model for execution (like Spark)
 No MapReduce functionality
 Much faster than traditional HDFS access
 Supports HQL; also offers Java and Scala APIs
 Can also run MLLib algorithms
Giraph
 A graph processing framework (think extended relationships)
 Facebook, LinkedIn, Twitter, etc. use graphs to determine your friends and likely friends
 The science of graph theory is a bit complicated
   If John is a friend of Mary; Mary is a friend of Tom; Tom is a friend of Alice …
   Find friends who are two hops (degrees) from John; a nightmare to do with SQL
 Finding relationships from email exchanges
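The "two degrees from John" question is a breadth-first search over the friendship graph. The sketch below runs on one machine with a plain dictionary and is not Giraph code. It also assumes friendship is symmetric, which the slide does not state.

```python
from collections import deque

# Friendship graph from the example; symmetry is an assumption of this sketch.
friends = {
    "John": ["Mary"],
    "Mary": ["John", "Tom"],
    "Tom": ["Mary", "Alice"],
    "Alice": ["Tom"],
}


def within_degrees(graph, start, degrees):
    """Return {person: distance} for everyone reachable in at most `degrees` hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == degrees:
            continue  # do not expand past the requested depth
        for friend in graph.get(person, []):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return distance


reachable = within_degrees(friends, "John", 2)
two_away = [person for person, d in reachable.items() if d == 2]
print(two_away)  # ['Tom']
```

In SQL the same question needs one self-join per hop, which is why it becomes painful as the number of degrees grows.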
Phoenix
 Relational database layer over HBase
 Provides JDBC driver to access data
 SQL query converted into HBase scans
 Produces regular JDBC resultsets
 Versioning support to ensure correct schema
is used
 Good performance
Hadoop Technologies

Hadoop Technologies

  • 1. Zahid Mian Part of the Brown-bag Series
  • 2.  Core Technologies  HDFS  MapReduce  YARN  Spark  Data Processing  Pig  Mahout  Hadoop Streaming  MLLib  Security  Sentry  Kerberos  Knox  ETL  Sqoop  Flume  DistCp  Storm
  • 3.  Monitoring  Ambari  HCatalog  Nagios  Puppet  Chef  ZooKeeper  Oozie  Ganglia  Databases  Cassandra  HBase  Accumulo  Memcached  Blur  Solr  MongoDB  Hive  SparkSQL  Giraph
  • 4.  Hadoop Distributed File System (HDFS)  Runs on clusters of inexpensive disks  Write-once data  Stores data in blocks across multiple disks  NameNode responsible for managing metadata about the actual data  Linux-like CLI for management of files  Since it’s Open Source, customization is possible
  • 5.  Solving computations by breaking everything into Map or Reduce jobs  Input and output of jobs are always Key/Value pairs  Map input might be a line from a file <LineNumber, LineText>:  <224, “Hello World. Hello World”>  Map output might be an instance of each word:  <“Hello”, 1>, <“World”, 1>, <“Hello”, 1>, <“World”, 1>  Reduce input would be the output from the Mapper  Reduce output might be the count of occurrences of each word:  <“Hello”, 2>, <“World”, 2>  Generally MapReduce jobs are written in Java  Internally Hadoop does a lot of processing to make this seamless  All data stored in HDFS (except log files)
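The word-count flow on this slide can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce stages running in one process, not Hadoop's Java API; the function names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are made up for the example.

```python
from collections import defaultdict


def map_words(line_number, line_text):
    """Map step: emit a <word, 1> pair for every word in one input line."""
    for word in line_text.replace(".", " ").split():
        yield (word, 1)


def shuffle(pairs):
    """Group values by key, standing in for the work Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: collapse all the 1s for a word into a single count."""
    return (word, sum(counts))


def word_count(records):
    """Run map, shuffle, and reduce over <LineNumber, LineText> records."""
    mapped = [pair for number, text in records for pair in map_words(number, text)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(mapped).items())


result = word_count([(224, "Hello World. Hello World")])
print(result)
```

On a real cluster the map and reduce functions run on different nodes and the shuffle moves data across the network; here everything happens in memory.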
  • 6.  Yet Another Resource Negotiator  By itself not much  Allows a variety of tools to conveniently run within the Hadoop cluster (MapReduce, HBase, Spark, Storm, Solr, etc.)  Think of YARN as the operating system for Hadoop  Users generally interact with individual tools within YARN rather than directly with YARN
  • 7.  MapReduce doesn’t perform well with iterative algorithms (e.g., graph analysis)  Spark overcomes that flaw …  Supports multipass/iterative algorithms by reducing/eliminating reads/writes to disk  A replacement for MapReduce  Three principles of Spark operations:  Resilient Distributed Dataset (RDD): The Data  Transformation: Creates a new RDD from an existing one (RDDs themselves are immutable)  Action: analyzes an RDD and returns a single result  Scala is the preferred language for Spark
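The split between transformations and actions is easier to see in a toy model. The sketch below is plain Python, not the Spark API: `ToyRDD` is a hypothetical stand-in that only records transformations in a plan and does no work until an action such as `count` or `reduce` is called.

```python
from functools import reduce as fold


class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable data plus a recorded plan."""

    def __init__(self, data, plan=()):
        self._data = tuple(data)
        self._plan = tuple(plan)

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._data, self._plan + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._data, self._plan + (("filter", predicate),))

    def _compute(self):
        """Apply the recorded plan lazily, one step after another."""
        items = iter(self._data)
        for kind, fn in self._plan:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the plan and return a single result.
    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, fn):
        return fold(fn, self._compute())


numbers = ToyRDD(range(1, 6))
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
total = even_squares.reduce(lambda a, b: a + b)
print(total, even_squares.count(), numbers.count())
```

The original `numbers` object is untouched by the chained calls, which mirrors how each transformation yields a new dataset rather than changing the old one.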
  • 8.  Part of Apache Hadoop YARN  Performance gains  Optimal resource management  Plan reconfiguration at runtime  Dynamic physical data flow decisions
  • 9.  An abstraction built on top of Hadoop  Essentially an ETL tool  Use “simple” PigLatin script to create ETL jobs  Pig will convert jobs to Hadoop M/R jobs  Takes away the “pain” of writing Java M/R jobs  Can perform joins, summaries, etc.  Input/Output all within HDFS  Can also write external functions (UDF) and call them from PigLatin
  • 10.  Allows the use of stdin and stdout (Linux) as input and output for your M/R jobs  What this means is that you can use C, Python, and other languages  All the internal work (e.g., shuffling) still happens within the Hadoop cluster  Only useful if Java skills are weak
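A streaming job boils down to two programs that exchange text lines. The sketch below assumes the usual convention of tab-separated key/value lines, with the reducer receiving its input sorted by key; the `mapper`, `reducer`, and `run_locally` names are illustrative, and `run_locally` merely imitates the sort that the cluster would perform between the two steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; assumes the input lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate the cluster's sort/shuffle so the pair can be tried without Hadoop."""
    return list(reducer(sorted(mapper(lines))))


output = run_locally(["a b a", "b a"])
print(output)
```

In an actual job each function would sit in its own script, reading `sys.stdin` and printing to stdout, and Hadoop would handle the sorting and distribution between them.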
  • 11.  Collection of machine-learning algorithms that run on Hadoop  Possible to write your own algorithms in traditional Java M/R jobs …  … why bother when they exist in Mahout?  Algorithms include: k-means clustering, latent Dirichlet allocation, logistic-regression-based classifier, random forest decision tree classifier, etc.
  • 12.  Machine Learning Library (MLLib) for Spark  Similar to Mahout, but specifically for Spark  (Remember Spark is not MapReduce)  Algorithms include: Linear SVM and logistic regression, k-means clustering, multinomial naïve Bayes, Dimensionality reduction, etc.
  • 13.  Still not fully developed  Provides basic authorization in Hadoop  Provides role-based authorization  Works at the application level (the application needs to call the APIs)  Works with Hive, Solr and Impala  Drawback: possible to write an M/R job to access non-authorized data
  • 14.  Provides Secure Authentication  Tedious to set up and maintain
  • 15.  Security Gateway to manage access  History of Hadoop suggests that security was an afterthought  Each tool had own security implementation  Knox overcomes that complexity  Provides gateway between external (to Hadoop) apps and internal apps  Authorization, authentication, and auditing  Works with AD and LDAP
  • 16.  Transfers data between HDFS and relational DBs  A very simple command line tool  Export data from HDFS to RDBMS  Import data from RDBMS to HDFS  Transfers executed as M/R jobs in Hadoop  Filtering possible  Additional options for file formats, delimiters, etc.
  • 17.  Data collection and aggregation  Works well with log data  Moves large data files from various servers into Hadoop cluster  Supports “complex” multihop flows  Key implementation features: source, channel, sink  Job configuration done via a .config file
  • 18.  Data movement between Hadoop clusters  Basically it can copy an entire cluster  Primary Usage:  Moving data from test to dev environments  “Dual Ingestion” using two clusters in case one fails
  • 19.  Stream Ingestion (instead of batch processing)  Quickly perform transformations of very large number of small records  Workflow, called topology, includes spouts as inputs and bolts as transformations.  Usage:  transform a stream of tweets into a stream of trending topics  Bolts can do a lot of work: aggregate, communicate with Databases, joins, etc.
  • 20.  A Distributed Messaging framework  Fast, scalable, and durable  Single cluster can serve as central data backbone  Messages are persisted on disk and replicated across clusters  Uses include: traditional messaging, website activity tracking, centralized feeds of operational data
  • 21.  Provision, monitoring, and management of a Hadoop cluster  GUI based tool  Features  Step by step wizard for installing services  Start, stop, configure services  Dashboard for monitoring health and status  Ganglia for metrics collection  Nagios for system alerts
  • 22.  Another data abstraction layer  Use HDFS files as tables  Almost SQL-like, but more Hive-like  Add partitions  Users don’t have to worry about location or format of data
  • 23.  IT Infrastructure monitoring  Web based interface  Detection of outages and problems  Send alerts via email or SMS  Automatic restart provisioning
  • 24. PUPPET  Node management tool  Puppet uses declarative syntax  Configuration file identifies programs; Puppet determines their availability  Broken down as: Resources, manifests, and modules CHEF  Node management tool  Chef uses imperative syntax  Resource might specify a certain requirement (a specific directory is needed)  Broken down as: Resources, recipes and cookbooks
  • 25.  Allows coordination between nodes  Sharing “small” amounts of state and config data  For example, share connection string  Highly scalable and reliable  Some built-in protection from using it as a datastore  Use API to extend use to other areas like implementing security
  • 26.  A workflow scheduler  Like typical schedulers, you can create relatively complex rules around jobs  Start, stop, suspend, restart jobs  Control both jobs and tasks
  • 27.  Another monitoring tool  Provides a high-level overview of cluster  Computing capability, data transfers, storage usage  Has support for add-ins for additional features  Used within Ambari
  • 28.  Feed management and data processing platform  Feed retention, replications, archival  Supports workflows  Integration with Hive/HCatalog  Feeds can be any type of data (e.g., emails)
  • 29.  Key-value store  Scales well with efficient storage  Distributed database  Peer-to-peer system
  • 30.  NoSQL database with random access  Excellent for sparse data  Behaves like a key-value store  Key + number of bins/columns  Only one datatype: byte string  Concept of column families for similar data  Has CLI, but can be accessed from Java and Pig  Not meant for transactional systems  Limited built-in functionality  Key functions must be added at application level
  • 31.  Name-value db with cell-level security  Developed by NSA, but now with Apache  Excellent for multitenant storage  Set column visibility rules for user “labels”  Scales well, at petabytes of data  Retrieval operations in seconds
  • 32.  In-memory cache  Fast access of large data for short time  Traditional approach to sharing data in HDFS is to use replicated join (send data to each node)  Memcached provides a “pool” of memory across the nodes and stores data in that pool  Effectively a distributed memory pool  Much more efficient than replicating data
  • 33.  Document Warehouse  Allows searching of text documents  Blur uses HDFS stack; Solr doesn’t  Users can query data based on indexing
  • 34.  JSON document-oriented database  Most popular NoSQL db  Supports secondary indexes  Does not run on Hadoop Stack  Concept of documents (rows) and collections (tables)  Very scalable … extends simple key-value storage
  • 35.  Interact directly with HDFS data using HQL  HQL similar to SQL (syntax and commands)  HQL queries converted to M/R jobs  HQL does not support:  Updates/Deletes  Transactions  Non-equality joins
  • 36.  SQL Access to Hadoop Data  In-memory model for execution (like Spark)  No MapReduce functionality  Much faster than traditional HDFS access  Supports HQL; also support for Java, Scala APIs  Can also run MLLib algorithms
  • 37.  A Graph database (think extended relationships)  Facebook, LinkedIn, Twitter, etc. use graphs to determine your friends and likely friends  The science of graph theory is a bit complicated  If John is a friend of Mary; Mary is a friend of Tom; Tom is a friend of Alice …  Find friends who are two paths (degrees) from John; nightmare to do with SQL  Finding relationships from email exchanges
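The "two degrees from John" question from this slide turns into a short breadth-first search once the friendships are held as a graph. The sketch below is single-machine Python for illustration only, not how a distributed graph engine would execute it; `friends_at_distance` is a made-up helper name.

```python
from collections import deque


def friends_at_distance(friendships, start, degrees):
    """Return the people exactly `degrees` friendship hops away from `start`."""
    # Build an undirected adjacency map from the friendship pairs.
    graph = {}
    for a, b in friendships:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    # Breadth-first search, recording the shortest hop count to each person.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == degrees:
            continue  # no need to look further than the requested depth
        for friend in graph.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)

    return {person for person, hops in distance.items() if hops == degrees}


friendships = [("John", "Mary"), ("Mary", "Tom"), ("Tom", "Alice")]
two_away = friends_at_distance(friendships, "John", 2)
print(two_away)
```

The same question in SQL needs one self-join per degree, which is why the slide calls it a nightmare as the number of degrees grows.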
  • 38.  Relational database layer over HBase  Provides JDBC driver to access data  SQL query converted into HBase scans  Produces regular JDBC result sets  Versioning support to ensure correct schema is used  Good performance