Overview of Big Data
Hadoop Ecosystem and NoSQL Databases
Khanderao Kand
CTO, GloMantra Inc.
Entrepreneur and Technologist
Twitter: @khanderao
Big Data

The dominant trend for 2013 will, once again, be Big Data

Gartner reports it as a must-have technology for “competitive
advantage by 2015”

IDC, in its report Worldwide Big Data Technology and Services
2012-2015, forecasts that the Big Data market will grow from
$3.2 billion in 2010 to $16.9 billion in 2015.

By 2016, revenue from the big data sector will approach $24
billion, reaching $48.3 billion by 2018.
The image was taken from the Atacama desert in western South America by Yuri
Beletsky (Las Campanas Observatory, Carnegie Institution for Science) on July 11, 2012.
Copyright Yuri Beletsky
Alignment…

Explosion of data from site logs, search engines, social
media…

Google published papers on MapReduce and the Google File
System, which inspired Doug Cutting, then working on Apache
Lucene and Nutch; Hadoop was born

Yahoo took it further with a 1,000-node cluster in 2008

It became possible to process very large data sets on
commodity hardware

Open source under Apache
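Big Data Stack

[Figure: two charts, each with Speed and Scale as axes.
Chart 1 (analytics tools): Matlab, SAS, SPSS, R, SciPy, Mahout; also labeled "Patents".
Chart 2 (data stores and processing engines): kdb, Esper, S4, MySQL, MongoDB, HBase, Hadoop.]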
Big Data Architecture

[Figure: layered architecture diagram.
Top layer: Analytics Products, Apps, BI, BI Tools - Dev, Visualization.
Middle layer: Lucene/Nutch, Hadoop MapReduce, NoSQL (Hadoop based), RDBMS, NoSQL, SOLR.
Inputs: unstructured data; structured data (RDBMS, data logs, streams).
Supporting components: ETL and Data Integration, Workflow & Scheduler, System Admin & Monitoring.]
HDFS
Large Data Set
Write Once – Read Many
Fault Tolerant
Distributed File System
Name Node – Data Node
Fixed Size Data Blocks
Checksum
Files – Sequence of blocks
Replicated over Balanced Cluster
Heartbeat Report from Nodes
[Figure labels: Client 1, Client 2, NameNode, Read, Write, Rack 1 … Rack N, Replication.]
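
The block, checksum, and replication ideas on this slide can be sketched in a few lines of Python. This is a toy model, not HDFS code: the block size, replica count, node names, and the round-robin placement are all assumptions chosen for illustration, and CRC32 merely stands in for whatever checksum the file system uses.

```python
import zlib

BLOCK_SIZE = 8    # bytes; deliberately tiny, an assumption for this sketch
REPLICATION = 3   # assumed replica count


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a file into fixed-size blocks; only the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication: int = REPLICATION):
    """Build a NameNode-style table: block index -> (checksum, replica nodes).

    Placement is plain round-robin over the node list (an assumption).
    """
    replication = min(replication, len(nodes))
    table = {}
    for index, block in enumerate(blocks):
        replicas = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        table[index] = (zlib.crc32(block), replicas)
    return table


def verify(block: bytes, expected_checksum: int) -> bool:
    """A reader recomputes the checksum to detect a corrupted block."""
    return zlib.crc32(block) == expected_checksum


data = b"hello big data world"
blocks = split_into_blocks(data)
table = place_blocks(blocks, ["node-a", "node-b", "node-c", "node-d"])
```

Because every block is held by several nodes, a reader that finds a checksum mismatch on one replica can fall back to another.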
Map Reduce




•   Two-step approach to solving a problem: Map, then Reduce
•   Move the code to the data
•   The Map step processes data on the nodes
•   The Reduce step aggregates the results from all Map nodes using a reduce algorithm
•   The JobTracker distributes and tracks tasks
•   A TaskTracker on each processing node communicates task status to the JobTracker
•   Inspired by Functional Programming
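
The two steps can be illustrated with the classic word count, written here as a single-process Python sketch. The function names (`map_step`, `shuffle`, `reduce_step`, `run_job`) are invented for this example; in Hadoop the framework performs the grouping between the steps and runs the tasks on many nodes.

```python
from collections import defaultdict


def map_step(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as happens between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map over every line, group by key, then reduce each group."""
    intermediate = [pair for line in lines for pair in map_step(line)]
    return dict(reduce_step(k, v) for k, v in shuffle(intermediate).items())


counts = run_job(["big data", "Big Hadoop data"])
```

Each call to `map_step` depends only on its own line, which is what lets the Map step run independently on every node that holds part of the input.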
Hadoop Ecosystem

[Figure: layered diagram of the Hadoop ecosystem, top to bottom.
Consumers: BI Analytics, Apps, RDBMS
Workflow / Orchestration: Chukwa, Oozie, Flume
Data Access: Avro, Pig, Hive, Sqoop, HBase
Processing: MapReduce, HCatalog
Storage: HDFS
Cross-cutting: ZooKeeper; Network; Security, Recovery, Infra; Nagios, Ganglia]
Apache Hive

SQL-like HiveQL

Warehousing Apps

Compiles to MapReduce Tasks

Facebook, Netflix, etc.
Apache Pig Latin
Higher Level scripting above Map Reduce

Procedural (unlike SQL) but as easy as SQL

Constructs like FOREACH, GROUP

Supports User Defined Functions

From Yahoo

Good for integrating and writing Hadoop jobs
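
The procedural, step-by-step style can be imitated in plain Python. This is not Pig Latin: the records, the `to_minutes` function (standing in for a user-defined function), and the variable names are all hypothetical, and each step only mirrors the role of the construct named in its comment.

```python
from collections import defaultdict

# Hypothetical input records: (user, url, seconds_on_page)
visits = [
    ("alice", "/home", 12),
    ("bob", "/home", 3),
    ("alice", "/docs", 40),
    ("carol", "/docs", 25),
]


def to_minutes(seconds):
    """Stand-in for a user-defined function (UDF)."""
    return round(seconds / 60, 2)


# Step 1, in the role of FILTER: keep visits longer than 5 seconds.
long_visits = [v for v in visits if v[2] > 5]

# Step 2, in the role of GROUP: collect the durations for each URL.
by_url = defaultdict(list)
for user, url, seconds in long_visits:
    by_url[url].append(seconds)

# Step 3, in the role of FOREACH: produce one output record per group.
report = {url: to_minutes(sum(secs)) for url, secs in by_url.items()}
```

Each step names its intermediate result, which is the main contrast with a single declarative SQL statement.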
Sqoop
Data Bulk Load

Data Import Export

RDBMS and NoSQL

HDFS, HBase

Data Sliced

Slices transferred via map-only jobs
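
A rough Python sketch of the slicing idea: a key range is cut into contiguous slices and each slice is copied independently, with no reduce step. The function names, the `id` column, and the in-memory rows are assumptions for illustration; this does not reproduce Sqoop's actual splitting code.

```python
def slice_key_range(min_id, max_id, num_slices):
    """Split the inclusive range [min_id, max_id] into contiguous slices."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_slices)
    slices, start = [], min_id
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more slices requested than there are keys
        slices.append((start, start + size - 1))
        start += size
    return slices


def import_slice(rows, bounds):
    """One map-only task: copy the rows whose id falls inside its slice."""
    low, high = bounds
    return [row for row in rows if low <= row["id"] <= high]


# Hypothetical source table with ids 1..10, imported by 4 parallel tasks.
rows = [{"id": i, "value": f"row-{i}"} for i in range(1, 11)]
slices = slice_key_range(1, 10, 4)
imported = [row for bounds in slices for row in import_slice(rows, bounds)]
```

Since the slices do not overlap and together cover the whole range, the tasks need no coordination and their outputs can simply be written side by side.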
Chukwa & Flume

Hadoop Subproject

Large scale log processing

On MapReduce

Collection and analysis

Batch Oriented

Components:
  Agents
  Collectors
  MR Jobs for Parsing & Archiving
  HICC: Hadoop Infrastructure Care Center web app
Big “Fast” Data
Real-time ad hoc query:

Once again inspired by Google: Percolator and Dremel

Cloudera: Impala
  SQL-like queries on HDFS
  Lower latency
  Bypasses MapReduce

Apache Drill
NoSQL Databases
Document Databases: MongoDB, CouchDB

Column Databases: Cassandra, HBase

KV Pair:

Graph DB: Neo4j
MongoDB
Document Oriented

Flexible: no fixed schema

Distributed: sharding based on different policies

Fault tolerant via replication

Easy to install and use

JSON-like documents stored in BSON format

JavaScript-based queries

Drivers for Java, Python, and other languages

Open source, supported by 10gen

Fast Read
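
Two of these points, schema-free documents and sharding by policy, can be sketched with the Python standard library alone. Nothing here calls MongoDB: the documents, the shard count, and the functions `hash_shard` and `range_shard` are hypothetical, and MD5 is used only as a convenient stable hash.

```python
import hashlib
import json

NUM_SHARDS = 3  # assumed cluster size for this sketch


def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash policy: spread keys evenly, regardless of their order."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def range_shard(key: str, boundaries) -> int:
    """Range policy: shard i holds keys below boundaries[i]; the last holds the rest."""
    for index, upper in enumerate(boundaries):
        if key < upper:
            return index
    return len(boundaries)


# Documents in one collection need not share the same fields.
documents = [
    {"_id": "u1", "name": "Ada", "languages": ["Java", "Python"]},
    {"_id": "u2", "name": "Lin", "address": {"city": "Pune"}},
]

shards = {i: [] for i in range(NUM_SHARDS)}
for doc in documents:
    shards[hash_shard(doc["_id"])].append(json.dumps(doc))
```

A hash policy balances load, while a range policy keeps neighboring keys on the same shard, which helps range scans.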
CouchDB
Document Oriented
JSON format
HTTP/REST interface
MapReduce views in JavaScript
Replication support
Multi-version concurrency control (MVCC)
Written in Erlang
Fast Write – Read
Good Availability
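
Multi-version concurrency control can be sketched as an in-memory store that accepts an update only when the writer presents the revision it last read. The class and exception names are invented for this example, and the integer revisions are a simplification; CouchDB itself exposes this over HTTP with string revision identifiers.

```python
class ConflictError(Exception):
    """Raised when a writer's revision is stale."""


class RevisionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        """Return the current revision and a copy of the document body."""
        revision, body = self._docs[doc_id]
        return revision, dict(body)

    def put(self, doc_id, body, rev=None):
        """Store a new version; rev must match the current revision."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = RevisionedStore()
first = store.put("doc", {"n": 1})               # create
second = store.put("doc", {"n": 2}, rev=first)   # update with fresh revision
```

A second writer still holding `first` would get a `ConflictError` and has to re-read the document before retrying, so no locks are needed for readers.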
Apache Cassandra
Based on Amazon Dynamo

Column oriented

Theoretically unlimited columns

Each column is a tuple: (name, value, timestamp)

Columns organized into column families

Not Hadoop based (unlike HBase)

All nodes are equal, so easier to configure and manage

Parallel writes

Used by Netflix, etc.
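
The (name, value, timestamp) column tuple suggests how replicas written in parallel can be reconciled: for each column, keep the version with the newest timestamp. The sketch below is a simplified model with invented names (`Column`, `reconcile`, `merge_rows`) and made-up data, not Cassandra's implementation.

```python
from collections import namedtuple

Column = namedtuple("Column", ["name", "value", "timestamp"])


def reconcile(a, b):
    """Keep whichever version of a column carries the newer timestamp."""
    return a if a.timestamp >= b.timestamp else b


def merge_rows(replica_a, replica_b):
    """Merge two replicas of one row, column by column."""
    merged = dict(replica_a)
    for name, column in replica_b.items():
        merged[name] = reconcile(merged[name], column) if name in merged else column
    return merged


# Two replicas of the same row; each has seen a different subset of writes.
replica_1 = {
    "email": Column("email", "old@example.com", 100),
    "city": Column("city", "Pune", 120),
}
replica_2 = {
    "email": Column("email", "new@example.com", 150),
    "phone": Column("phone", "555-0100", 130),
}
row = merge_rows(replica_1, replica_2)
```

Rows may hold different sets of columns, which is why `merge_rows` works per column rather than per row.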
Apache HBase
Modeled on Google Bigtable

Column oriented

Columns of a column family are stored together, rather than all columns of a row

Table schema predefines the column families

However, columns can be added at runtime

Fault tolerant

Runs on HDFS

Works with MapReduce jobs

Interfaces via REST, Avro, Thrift

Facebook’s messaging platform
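
The schema rule above, column families fixed up front with columns added freely at runtime, can be modeled with nested dictionaries. The `Table` class and its methods are hypothetical and only mimic the data model; they are not the HBase client API.

```python
class Table:
    def __init__(self, families):
        self.families = set(families)  # fixed when the table is created
        self.rows = {}                 # row_key -> {family: {qualifier: value}}

    def put(self, row_key, family, qualifier, value):
        """Write one cell; the family must exist, the qualifier may be new."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {f: {} for f in self.families})
        row[family][qualifier] = value

    def get_family(self, row_key, family):
        """Read all columns of one family for a row, as stored together."""
        return dict(self.rows[row_key][family])


messages = Table(families=["meta", "body"])
messages.put("msg-1", "meta", "sender", "alice")
messages.put("msg-1", "meta", "priority", "high")  # new column, no schema change
messages.put("msg-1", "body", "text", "hello")
```

Reading one family touches only that family's data, which is the practical benefit of storing a column family together.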
