This document contains a presentation on integrating Hadoop with NoSQL databases. It discusses using Sqoop to transfer data between Hadoop and NoSQL databases like Couchbase and MongoDB, and gives examples of Sqoop imports and exports between these systems. The presentation also highlights key use cases and benefits of using Hadoop and NoSQL databases together for applications involving large datasets.
2. Goto Night CPH, June 6th 2013
How to integrate Hadoop with your NoSQL database?
Tugdual “Tug” Grall, Technical Evangelist
Monday, June 10, 13
3. About Me
• Tugdual “Tug” Grall
• Couchbase: Technical Evangelist
• eXo: CTO
• Oracle: Developer/Product Manager
• Mainly Java/SOA Developer in consulting firms
• Web: @tgrall, http://blog.grallandco.com, tgrall
• NantesJUG co-founder
• Pet Project: http://www.resultri.com
4. Big Data
[Chart: data volume in trillions of gigabytes (zettabytes), y-axis 0 to 2.00, for 2000, 2006 and 2011. Source: IDC 2011 Digital Universe Study (http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm)]
• High data variety and velocity
• Unstructured and semi-structured data: text, log files, click streams, blogs, tweets, audio, video, etc.
• Structured data
• More flexible data model required
5. $30B Database Market Being Disrupted
• All new database growth will be NoSQL
[Chart: relational technology share of the database market, 95% in 2013 and “<50%?” in 2027; other segments: NoSQL technology, other]
6. Operational vs. Analytic Databases
• Analytic databases (Cloudera, Hortonworks): get insights from data
• Real-time, interactive databases (NoSQL: Couchbase, Mongo): fast access to data
7. [Bar chart of survey responses]
• Lack of flexibility/rigid schemas: 49%
• Inability to scale out data: 35%
• Performance challenges: 29%
• Cost: 16%
• All of these: 12%
• Other: 11%
Source: Couchbase Survey, December 2011, n = 1351.
9. What is Hadoop?
• Highly scalable
• Unstructured data
• Open source
• Big Data Operating System
• Changing the World One Petabyte at a Time
10. What is Hadoop?
• Simplest unit of compute and storage
[Diagram: CPU, Disks, Application, Data]
11. What is Hadoop?
• And when it grows?
[Diagram: Application, Data]
12. What is Hadoop?
• And when it grows more?
13. What is Hadoop?
• NoSQL to the rescue
[Diagram: Application, Data]
14. What is Hadoop?
• Hadoop is a different paradigm
[Diagram: Application, Data]
16. Hadoop and NoSQL
17. Ad and offer targeting
[Diagram, steps 1 to 3: events; profiles, campaigns; profiles, real-time campaign statistics]
40 milliseconds to respond with the decision.
18. Moving Parts
[Diagram: Ad Targeting Platform, Couchbase Server Cluster, Hadoop Cluster; logs reach Hadoop via flume flow; sqoop import and sqoop export connect Couchbase and Hadoop]
19. Content & Recommendation Targeting
[Diagram, steps 1 to 3: events; user profiles; make recommendations. Components: Content Oriented Site, Legacy Relational Database]
20. Moving Parts
[Diagram: Content Driven Web Site, Couchbase Server Cluster, Hadoop Cluster, Original RDBMS; logs reach Hadoop via flume flow; sqoop import (shown twice) and sqoop export]
To keep up with changing needs for richer, more targeted content delivered very quickly to larger and larger audiences, the data behind content-driven sites is shifting to Couchbase.
Hadoop excels at complex analytics that may involve multiple processing steps and incorporate a number of different data sources.
21. What is Sqoop?
Sqoop is a tool designed to transfer data between Hadoop and relational databases.
You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
sqoop.apache.org
22. What is Sqoop?
• Traditional ETL
[Diagram: Application, Data, T (transform), Data]
23. What is Sqoop?
• A different paradigm
[Diagram: Data, Application, Data]
24. What is Sqoop?
• A very scalable different paradigm
[Diagram: Data and Application pairs repeated across nodes]
25. What is Sqoop?
• Where did the Transform go?
[Diagram: Application, Data, with groups of T (transform) steps spread across the nodes]
26. What is Sqoop?
• Sqoop: “SQL-Hadoop”
  Default connection is via JDBC
• Lots of custom connectors
  Couchbase, VoltDB, Vertica
  Teradata, Netezza
  Oracle, MySQL, Postgres
27. Sqoop: Import
sqoop import --connect jdbc:mysql://rdbms1.demo.com/CRM \
  --table customers
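The import command above is a plain list of option and value pairs, so a script can assemble it. The following is a minimal sketch, not part of the original deck: the helper `sqoop_command` is hypothetical, and it only emits options that appear on these slides (`--connect`, `--table`, and keyword options such as `--export-dir`).

```python
import shlex


def sqoop_command(action, connect, table, **options):
    """Build the argument list for a `sqoop import` or `sqoop export` call.

    Hypothetical helper for illustration; it does not validate that Sqoop
    accepts the extra options passed in.
    """
    if action not in ("import", "export"):
        raise ValueError("action must be 'import' or 'export'")
    argv = ["sqoop", action, "--connect", connect, "--table", table]
    # Keyword options become "--flag value" pairs: export_dir -> --export-dir.
    for name, value in options.items():
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


import_cmd = sqoop_command(
    "import", "jdbc:mysql://rdbms1.demo.com/CRM", "customers"
)
# Render the list as a single shell-safe command line.
print(shlex.join(import_cmd))
```

On a machine where Sqoop is installed, the list could be handed to `subprocess.run(import_cmd, check=True)`; keeping it as a list avoids shell-quoting problems with connection strings.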
28. Sqoop: Export
sqoop export --connect jdbc:mysql://rdbms1.demo.com/ANALYTICS \
  --table sales \
  --export-dir /user/hive/warehouse/zip_profits \
  --input-fields-terminated-by '\0001'
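The `--input-fields-terminated-by` option tells Sqoop which character separates fields in the files under `--export-dir`. For Hive warehouse text files this is commonly Ctrl-A (octal 001). The sketch below shows what splitting such a record looks like; the sample row and its two columns are assumptions made up for illustration, not data from the talk.

```python
# Ctrl-A (octal 001), the usual field separator in Hive text files.
FIELD_DELIMITER = "\x01"


def parse_record(line):
    """Split one delimited text record into its fields."""
    return line.rstrip("\n").split(FIELD_DELIMITER)


# Hypothetical zip_profits row: zip code, then profit.
sample = "90210" + FIELD_DELIMITER + "1250.75\n"
fields = parse_record(sample)
print(fields)
```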
29. Sqoop: Import
sqoop import --connect http://localhost:8091/pools \
  --table DUMP
30. Sqoop: Import
[Diagram: Sqoop Client, Metadata, Launches, MapReduce Job with three Map tasks, each paired with HDFS]
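The diagram shows one job fanned out over several map tasks. For a JDBC source, the idea is that each map task reads its own slice of the table. The sketch below illustrates even range splitting over an integer key; it is a simplification under that assumption, and `split_ranges` is a hypothetical name rather than Sqoop's actual split logic.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices,
    one per map task, with sizes differing by at most one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` slices take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than keys: leave this mapper idle
        ranges.append((start, start + size - 1))
        start += size
    return ranges


splits = split_ranges(1, 10, 3)
print(splits)
```

Each tuple is the inclusive key range one map task would read before writing its part of the output to HDFS.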
31. Sqoop: Export
sqoop export --connect http://localhost:8091/pools \
  --table DUMP \
  --export-dir /user/hive/profiles/recommendation \
  --username social
32. Sqoop: Export
[Diagram: Sqoop Client, Metadata, Launches, MapReduce Job with three Map tasks, each paired with HDFS]
33. Demonstration
35. Couchbase Server Core Principles
• Easy Scalability: grow cluster without application changes, without downtime, with a single click
• Consistent High Performance: consistent sub-millisecond read and write response times with consistent high throughput
• Always On 24x365: no downtime for software upgrades, hardware maintenance, etc.
• Flexible Data Model: JSON document model with no fixed schema.
36. Couchbase Handles Real World Scale
37. Couchbase Server 2.0
[Architecture diagram]
Data Manager:
• Query API, Query Engine (port 8092)
• Moxi, Memcapable 1.0 (port 11211)
• Couchbase EP Engine, Memcapable 2.0 (port 11210)
• Memcached
• New Persistence Layer (storage interface)
Cluster Manager (Erlang/OTP):
• Heartbeat, process monitor, global singleton supervisor, configuration manager, rebalance orchestrator, node health monitor (labelled “on each node” and “one per cluster”)
• vBucket state and replication manager
• http: REST management API/Web UI
• Ports: HTTP 8091, Erlang port mapper 4369, Distributed Erlang 21100-21199
38. Couchbase Server 2.0
[Architecture diagram repeated from the previous slide, without the Data Manager and Cluster Manager labels]
39. The Classic Order Entry Structure
http://martinfowler.com/bliki/AggregateOrientedDatabase.html
Relational databases were not designed with clusters in mind, which is why people have cast around for an alternative. Storing aggregates as fundamental units makes a lot of sense for running on a cluster.
40. Aggregate by Comparison
o::1001
{
  "uid": "ji22jd",
  "customer": "Ann",
  "line_items": [
    { "sku": "0321293533", "quan": 3, "unit_price": 48.0 },
    { "sku": "0321601912", "quan": 1, "unit_price": 39.0 },
    { "sku": "0131495054", "quan": 1, "unit_price": 51.0 }
  ],
  "payment": { "type": "Amex", "expiry": "04/2001", "last5": 12345 }
}
• Easy to distribute data
• Makes sense to application programmers
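Because the whole order lives in one aggregate, an application can load it in a single read and work on it without joins. The sketch below parses the order shown above and totals its line items. It is an illustration only: SKUs are written as strings here because JSON numbers cannot carry leading zeros, and no database call is involved.

```python
import json

# The order aggregate from the slide, as one JSON document.
order_json = """
{
  "uid": "ji22jd",
  "customer": "Ann",
  "line_items": [
    {"sku": "0321293533", "quan": 3, "unit_price": 48.0},
    {"sku": "0321601912", "quan": 1, "unit_price": 39.0},
    {"sku": "0131495054", "quan": 1, "unit_price": 51.0}
  ],
  "payment": {"type": "Amex", "expiry": "04/2001", "last5": 12345}
}
"""

order = json.loads(order_json)

# Everything needed for the total is inside the one document.
total = sum(item["quan"] * item["unit_price"] for item in order["line_items"])
print(order["customer"], total)
```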
41. Basic Operations
• Docs distributed evenly across servers
• Each server stores both active and replica docs
  Only one server is active for a given doc at a time
• Client library provides app with simple interface to database
• Cluster map provides map to which server doc is on
  App never needs to know
• App reads, writes, updates docs
• Multiple app servers can access same document at same time
User Configured Replica Count = 1
[Diagram: COUCHBASE SERVER CLUSTER with SERVER 1, SERVER 2 and SERVER 3, each holding ACTIVE and REPLICA docs; APP SERVER 1 and APP SERVER 2, each with a COUCHBASE Client Library and CLUSTER MAP, issue READ/WRITE/UPDATE]
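The cluster map idea can be sketched in a few lines: hash the document key to a partition (a vBucket), then look up which server holds the active copy and which holds the replica. This is a simplified illustration assuming 1024 partitions, round-robin placement and a plain CRC32 hash; Couchbase's actual hashing and placement differ in detail, and all names below are hypothetical.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed partition count for this sketch


def build_cluster_map(servers, num_vbuckets=NUM_VBUCKETS):
    """Give each vBucket an active server and one replica on the next server."""
    cluster_map = []
    for vb in range(num_vbuckets):
        active = servers[vb % len(servers)]
        replica = servers[(vb + 1) % len(servers)]
        cluster_map.append((active, replica))
    return cluster_map


def locate(key, cluster_map):
    """Return (active, replica) servers for a document key."""
    vb = zlib.crc32(key.encode("utf-8")) % len(cluster_map)
    return cluster_map[vb]


servers = ["server1", "server2", "server3"]
cluster_map = build_cluster_map(servers)
active, replica = locate("o::1001", cluster_map)
print(active, replica)
```

The application only passes the key; the lookup decides which server to contact, which is why the app never needs to know where a document lives.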
42. Query Indexing
• Indexing work is distributed amongst nodes
  Large data set possible
  Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes
[Diagram: COUCHBASE SERVER CLUSTER with SERVER 1, SERVER 2 and SERVER 3, each holding ACTIVE and REPLICA docs; APP SERVER 1 and APP SERVER 2, each with a COUCHBASE Client Library and CLUSTER MAP]
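The last two bullets describe a scatter-gather pattern: each node answers from its own index, and the partial results are merged. The sketch below models each node's index as a sorted list of (key, doc id) pairs; the data and function names are assumptions for illustration, not Couchbase's API.

```python
import heapq


def query_node(local_index, low, high):
    """Return the sorted (key, doc_id) pairs in [low, high] from one node."""
    return [(key, doc) for key, doc in local_index if low <= key <= high]


def scatter_gather(node_indexes, low, high):
    """Query every node, then merge the already-sorted partial results."""
    partials = [query_node(index, low, high) for index in node_indexes]
    return list(heapq.merge(*partials))


# Hypothetical per-node indexes, each sorted by key.
node_indexes = [
    [(10, "doc5"), (40, "doc2")],
    [(20, "doc4"), (50, "doc7")],
    [(30, "doc1"), (60, "doc8")],
]

result = scatter_gather(node_indexes, 20, 50)
print(result)
```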
43. Demonstration
44. Map Reduce ...
Left column (batch MapReduce, as in Hadoop):
• Deal with “Big Data”
• “More” is better than “Faster”
• Batch Oriented
• Usually used to “extract/transform” data
• Fully distributed Map, Shuffle, Reduce
≠
Right column (MapReduce inside the database):
• Distributed
• Executed where the document is
• Deal with “indexing” data
• As fast as possible
• Use to query the data in the Database
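The “Map, Shuffle, Reduce” sequence named above can be shown in miniature on a single machine. The sketch below counts events per user from a few log lines; the log format and names are assumptions for illustration, and a real Hadoop job would run each phase distributed across the cluster.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (user, 1) pair for every log line."""
    for record in records:
        user, _event = record.split(",")
        yield user, 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical log lines in "user,event" form.
log_lines = ["ann,click", "bob,view", "ann,view", "ann,click"]
counts = reduce_phase(shuffle_phase(map_phase(log_lines)))
print(counts)
```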
45. Conclusion
• Big Data and Big Users working together
• Use Hadoop to store “everything”
  Batch oriented
  Complex data processing
• MapReduce
• Expose a subset of the dataset to your application
  Real-time analytics
  Low latency
  Simple data interactions and queries
46. Q&A
We’re Hiring! couchbase.com/careers
@tgrall
tug@couchbase.com