1. Overview of Big Data, Hadoop Ecosystem and NoSQL Databases
Khanderao Kand
CTO GloMantra Inc.
Entrepreneur and Technologist
Twitter @khanderao
2. Big Data
The dominant trend for 2013 will, once again, be Big Data
Gartner reports it is a must-have technology for “competitive advantage by 2015”
IDC, in its report Worldwide Big Data Technology and Services 2012-2015, forecasts that the Big Data market will grow from $3.2 billion in 2010 to $16.9 billion in 2015.
By 2016, revenue from the big data sector will approach $24
billion, reaching $48.3 billion by 2018.
3. The image was taken from the Atacama desert in western South America by Yuri
Beletsky (Las Campanas Observatory, Carnegie Institution for Science) on July 11, 2012.
Copyright Yuri Beletsky
4. Alignment…
Explosion of data from site logs, search engines, social
media…
Google published papers on MapReduce and the Google File System, which inspired Doug Cutting, then working on Apache Lucene and Nutch; Hadoop was born
Yahoo took it further with 1000 nodes in 2008
Possible to process very large data sets on commodity hardware
Apache Open source
5. Big Data Stack
[Chart: tools positioned along Speed and Scale axes. Labels shown: Patents, Matlab, SAS, SPSS, R, SciPy, Mahout, kdb, Esper, S4, MySQL, MongoDB, HBase, Hadoop]
6. Big Data Architecture
[Diagram: layered architecture. Top layer: Analytics, Products, Apps, BI, BI Tools, Dev, Visualization. Processing and storage: Lucene, Nutch, SOLR, Hadoop MapReduce, Hadoop-based NoSQL, NoSQL, RDBMS. Inputs: Unstructured Data, Structured Data, RDBMS, Data logs, Streams. Supporting components: ETL, Data Integration, Workflow Scheduler, System Admin & Monitoring]
7. HDFS
Large Data Set
Write Once – Read Many
Fault Tolerant
Distributed File System
Name Node – Data Node
Fixed Size Data Blocks
Checksum
Files – Sequence of blocks
Replicated over Balanced Cluster
Heartbeat Report from Nodes
[Diagram labels: Client 1, Client 2, NameNode, Read, Write, Replication, Rack 1 … Rack N]
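The block, checksum, and replication ideas on this slide can be illustrated with a toy sketch in plain Python. This is not the HDFS API: the function names, the 8-byte block size, the CRC32 checksum, and the round-robin placement are all assumptions chosen to keep the example small.

```python
import zlib

BLOCK_SIZE = 8      # bytes; deliberately tiny, real block sizes are far larger
REPLICATION = 3     # assumed replication factor for this sketch


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks, each stored with a CRC32 checksum."""
    blocks = []
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        blocks.append({"id": len(blocks), "data": chunk, "crc": zlib.crc32(chunk)})
    return blocks


def place_replicas(blocks, nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin (a simplification)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in blocks:
        start = block["id"] % len(nodes)
        placement[block["id"]] = [nodes[(start + k) % len(nodes)] for k in range(replication)]
    return placement


def verify(block) -> bool:
    """Recompute the checksum to detect a corrupted block."""
    return zlib.crc32(block["data"]) == block["crc"]


blocks = split_into_blocks(b"write once, read many")
placement = place_replicas(blocks, ["node-a", "node-b", "node-c", "node-d"])
```

A file is then just the ordered sequence of its blocks, and a reader can fall back to another replica when `verify` fails for one copy.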
8. Map Reduce
• Two-step approach to solving a problem: Map and Reduce
• Move the code to the data
• Map step processes data on the nodes
• Reduce step aggregates results from all Map nodes with a reduce algorithm
• JobTracker distributes and tracks tasks
• TaskTrackers on processing nodes communicate task status to the JobTracker
• Inspired by Functional Programming
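The two steps can be sketched as a single-process word count in Python. This is only an illustration of the programming model, not Hadoop code: the function names are hypothetical, each inner list stands in for the data held on one node, and the explicit shuffle step is the grouping a framework would perform between Map and Reduce.

```python
from collections import defaultdict
from functools import reduce


def map_step(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: aggregate every value seen for one key."""
    return key, reduce(lambda a, b: a + b, values)


def run_job(splits):
    """Run map over every split, then reduce each key group."""
    mapped = [pair for split in splits for line in split for pair in map_step(line)]
    return dict(reduce_step(key, values) for key, values in shuffle(mapped).items())


counts = run_job([["big data big"], ["data on hadoop"]])
```

The functional-programming influence shows in the shape of the code: `map_step` and `reduce_step` are pure functions, so they can run on any node in any order.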
11. Apache Pig Latin
Higher-level scripting above MapReduce
Procedural (unlike SQL) but easy like SQL
Constructs like FOREACH, GROUP
Supports User Defined Functions
From Yahoo
Good for integrating and writing Hadoop jobs
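To show what a procedural, step-by-step dataflow looks like compared with a single declarative SQL statement, here is a rough Python analogue of a GROUP followed by a FOREACH. It is not Pig Latin and does not use any Pig API; `group_by`, `foreach`, the `total_bytes` user-defined function, and the sample log records are all hypothetical.

```python
from collections import defaultdict


def group_by(relation, field):
    """Analogue of GROUP: collect records into one bag per distinct field value."""
    bags = defaultdict(list)
    for record in relation:
        bags[record[field]].append(record)
    return dict(bags)


def foreach(groups, generate):
    """Analogue of FOREACH ... GENERATE: produce one output tuple per group."""
    return [generate(key, bag) for key, bag in groups.items()]


def total_bytes(bag):
    """A hypothetical user-defined function applied to each bag."""
    return sum(record["bytes"] for record in bag)


logs = [
    {"user": "ann", "bytes": 200},
    {"user": "bob", "bytes": 50},
    {"user": "ann", "bytes": 500},
]

# Each statement names an intermediate result, as a Pig script would.
grouped = group_by(logs, "user")
report = foreach(grouped, lambda user, bag: (user, len(bag), total_bytes(bag)))
```

Each intermediate relation has a name and can be inspected or reused, which is the main practical difference from writing one nested SQL query.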
12. Sqoop
Data Bulk Load
Data Import Export
RDBMS and NoSQL
HDFS, HBase
Data is sliced
Slices transferred via map-only jobs
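The slicing idea can be sketched as follows: take the minimum and maximum of a numeric key column, divide that range into contiguous slices, and let each map task import one slice. This is a simplified Python illustration, not Sqoop's code; the function names and the `orders` table with an `id` column are assumptions.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous, near-equal slices."""
    total = max_id - min_id + 1
    if total <= 0 or num_mappers <= 0:
        raise ValueError("empty key range or no mappers")
    count = min(num_mappers, total)
    base, extra = divmod(total, count)
    ranges = []
    low = min_id
    for i in range(count):
        size = base + (1 if i < extra else 0)   # spread the remainder over the first slices
        high = low + size - 1
        ranges.append((low, high))
        low = high + 1
    return ranges


def slice_query(table, column, low, high):
    """Build the query one map task would run for its slice (illustrative only)."""
    return f"SELECT * FROM {table} WHERE {column} >= {low} AND {column} <= {high}"


queries = [slice_query("orders", "id", low, high) for low, high in split_ranges(1, 10, 4)]
```

Because every slice is independent, no reduce step is needed: each map task writes its own rows straight to the target store.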
13. Chukwa & Flume
Hadoop Subproject
Large scale log processing
On MapReduce
Collection and analysis
Batch Oriented
Components:
Agents
Collectors
MR Jobs for Parsing & Archiving
HICC: Hadoop Infrastructure Care Center web app
14. Big “Fast” Data
Real-time ad hoc queries:
Once again inspired by Google: Percolator and Dremel
Cloudera: Impala
SQL-like queries on HDFS
Lower latency
Bypasses MapReduce
Apache Drill
16. MongoDB
Document Oriented
Flexible - No Fixed Schema
Distributed – Sharding based on different policies
Fault Tolerant via Replication
Easy to install and use
JSON – BSON format storage
JavaScript-based queries
Java, Python, and other languages
Open source, supported by 10gen
Fast Read
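The document model and sharding can be illustrated without a running server. The sketch below is plain Python, not the MongoDB driver API: `TinyDocStore`, `shard_for`, and the hash-based shard policy are assumptions, and the sample documents are made up.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard from a stable hash of the key (one possible sharding policy)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class TinyDocStore:
    """Toy document store: documents are dicts and need not share the same fields."""

    def __init__(self, num_shards=2):
        self.shards = [[] for _ in range(num_shards)]

    def insert(self, doc):
        self.shards[shard_for(doc["_id"], len(self.shards))].append(doc)

    def find(self, query):
        """Return documents whose fields equal every key/value pair in the query."""
        return [
            doc
            for shard in self.shards
            for doc in shard
            if all(doc.get(field) == value for field, value in query.items())
        ]


store = TinyDocStore(num_shards=3)
store.insert({"_id": 1, "name": "hadoop", "lang": "java"})
store.insert({"_id": 2, "name": "mahout", "lang": "java", "tags": ["ml"]})
store.insert({"_id": 3, "name": "scipy", "lang": "python"})
java_docs = store.find({"lang": "java"})
```

The second document carries a `tags` field that the others lack, which is the point of a schema-free store: no table definition has to change first.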
18. Apache Cassandra
Based on Amazon Dynamo
Column oriented
Theoretically infinite columns
Columns as tuples (name, value, timestamp)
Organized as column families
Not Hadoop based (unlike HBase)
Equal nodes, easier to configure and manage
Parallel writes
Used by Netflix, etc.
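The column-as-tuple model can be sketched in a few lines of Python. This is an in-memory illustration, not the Cassandra client API: the `Column` and `ColumnFamily` classes are hypothetical, and the rule that the newest timestamp wins (with ties keeping the existing value) is a simplifying assumption of this sketch.

```python
from typing import Any, NamedTuple


class Column(NamedTuple):
    name: str
    value: Any
    timestamp: int


class ColumnFamily:
    """Toy column family: each row key maps to its own, possibly different, set of columns."""

    def __init__(self):
        self.rows = {}

    def write(self, row_key, name, value, timestamp):
        row = self.rows.setdefault(row_key, {})
        current = row.get(name)
        # Keep the write with the newest timestamp; ties keep the existing value (assumption).
        if current is None or timestamp > current.timestamp:
            row[name] = Column(name, value, timestamp)

    def read(self, row_key):
        return {name: col.value for name, col in self.rows.get(row_key, {}).items()}


users = ColumnFamily()
users.write("user1", "email", "old@example.com", timestamp=100)
users.write("user1", "email", "new@example.com", timestamp=200)
users.write("user1", "email", "stale@example.com", timestamp=150)  # arrives late, ignored
users.write("user2", "city", "Pune", timestamp=120)                # different columns per row
```

Carrying a timestamp on every column lets writes arrive at any node in any order and still converge on the same value.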
19. Apache HBase
Modeled on Google Bigtable
Column Oriented
Columns of a column family are stored together, as opposed to all columns of a row
Table schema predefines column families
However, columns can be added at runtime
Fault Tolerant
Runs on HDFS
Integrates with MapReduce
Interface via REST, Avro, Thrift
Facebook's messaging platform
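The storage layout can be sketched as one separate store per column family. The Python below is an illustration only, not the HBase client API: `TinyTable` and its methods are hypothetical, and it assumes that column families are fixed when the table is created while column names inside a family can be added freely at write time.

```python
class TinyTable:
    """Toy table with one store per column family, so a family's columns live together."""

    def __init__(self, families):
        # Column families are declared up front; each gets its own store.
        self.stores = {family: {} for family in families}

    def put(self, row_key, family, qualifier, value):
        if family not in self.stores:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers (column names) are not declared anywhere; any new one is accepted.
        self.stores[family].setdefault(row_key, {})[qualifier] = value

    def get_family(self, row_key, family):
        """Read one family for one row without touching the other families."""
        return dict(self.stores[family].get(row_key, {}))


messages = TinyTable(families=["meta", "body"])
messages.put("msg1", "meta", "sender", "ann")
messages.put("msg1", "meta", "sent_at", "2012-07-11")   # new column added at runtime
messages.put("msg1", "body", "text", "hello")
meta = messages.get_family("msg1", "meta")
```

A query that needs only the `meta` columns reads only that store, which is the benefit of grouping a family's columns together instead of storing whole rows.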