Kathleen Ting, Technical Account Manager @ Cloudera and Sqoop Committer
Unlocking data stored in an organization's RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk covers how to deploy and apply Sqoop in your environment, as well as how to transfer data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we'll show you how to keep table data and Hadoop in sync by importing data incrementally, and how to customize transferred data by calling various database functions.
Apache Sqoop: Unlocking Hadoop for Your Relational Database
1. Unlocking Hadoop for Your Relational DB
Kathleen Ting | @kate_ting
Technical Account Manager, Cloudera | Sqoop PMC Member
Hadoop User Group UK
10 April 2014
2. Who Am I?
• Started 3 yr ago as 1st Cloudera Support Eng
• Now manages Cloudera’s 2 largest customers
• Sqoop Committer, PMC Member
• Co-Author of the Apache Sqoop Cookbook
3. What is Sqoop?
• Apache Top-Level Project
• SQl to hadOOP
• Tool to transfer data from relational databases
  • Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To/From Hadoop ecosystem
  • HDFS (text, sequence file), Hive, HBase, Avro
4. Why Sqoop?
• Efficient/Controlled resource utilization
  • Concurrent connections, Time of operation
• Datatype mapping and conversion
  • Automatic, and User override
• Metadata propagation
  • Sqoop Record
  • Hive Metastore
  • Avro
5. Agenda
Sqoop 1
• Sqoop 1 Architecture
• Sqoop 1 Command Line
• Sqoop 1 Examples
• Sqoop 1 Challenges
• Troubleshooting Sqoop 1
• Common Sqoop 1 Issues
  • Protecting Your Password
  • Sqoop Works on CLI Not in Oozie
  • Choosing Proper Connector
  • Overriding Type Mapping
Sqoop 2
• Sqoop 2 Architecture
• Sqoop 2 Design Goals
• Sqoop 2 UI in Hue
Resources
10. Sqoop 1 Challenges
• Cryptic, contextual command line arguments
• Security concerns
• Type mapping is not clearly defined
• Client needs access to Hadoop binaries/configuration and database
• JDBC model is enforced
11. Troubleshooting Sqoop 1
• Versions: Sqoop, Hadoop, OS, JDBC
• Console log after running with the --verbose flag
  • Capture the entire output via sqoop import … &> sqoop.log
• Entire Sqoop command including the options-file if applicable
• Expected output and actual output
• Table definition
• Small input data set that triggers the problem
  • Especially with export, malformed data is often the culprit
• Hadoop task logs
  • Often the task logs contain further information describing the problem
• Permissions on input files
12. Troubleshooting Sqoop 1
Imported table has more rows than source table?
• Data contains char used as Hive’s delimiters
• Clean up data
  • --hive-drop-import-delims
    • Removes \n, \t, and \01 char
  • --hive-delims-replacement "SPECIAL"
    • Replaces \n, \t, and \01 char with string SPECIAL
• Not restricted to Hive - any import job using text files
• Ensure output files have one line per imported row
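The row-count mismatch above can be reproduced without Sqoop or Hive. The following is a minimal Python sketch, not Sqoop's implementation: the helper names (`drop_delims`, `replace_delims`, `to_text_file`) and the sample rows are hypothetical, and the delimiter set is simply the one listed on the slide. It shows how an embedded newline turns two source rows into three text lines, and how dropping or replacing the delimiter characters restores one line per row.

```python
# Delimiter characters as listed on the slide: newline, tab, and \01 (Ctrl-A).
HIVE_DELIMS = ("\n", "\t", "\x01")


def drop_delims(value):
    """Remove delimiter characters (similar in spirit to --hive-drop-import-delims)."""
    for ch in HIVE_DELIMS:
        value = value.replace(ch, "")
    return value


def replace_delims(value, replacement):
    """Swap delimiter characters for a string (similar in spirit to --hive-delims-replacement)."""
    for ch in HIVE_DELIMS:
        value = value.replace(ch, replacement)
    return value


def to_text_file(rows, clean=None):
    """Serialize rows as text: \x01 between fields, newline after each row."""
    lines = []
    for row in rows:
        fields = [clean(f) if clean else f for f in row]
        lines.append("\x01".join(fields))
    return "\n".join(lines) + "\n"


# Hypothetical sample data: the second row has a newline inside a column value.
rows = [("1", "plain note"), ("2", "note with\nembedded newline")]

raw = to_text_file(rows)
cleaned = to_text_file(rows, clean=drop_delims)

print(raw.count("\n"))      # line count without cleaning
print(cleaned.count("\n"))  # line count after dropping delimiters
```

Counting newline characters in the serialized text stands in for counting rows as a line-oriented reader would see them.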
14. Common Sqoop 1 Issues
• Protecting Your Password
• Sqoop Works on CLI Not in Oozie
• Choosing Proper Connector
• Overriding Type Mapping
18. Sqoop Works on CLI Not in Oozie
Character parameter '|' has multiple characters; only the first will be used.
Got error creating database manager: java.io.IOException: No manager for connect string: "jdbc:teradata..."
19. Sqoop Works on CLI Not in Oozie
sqoop import --password "spEci@l$" --connect 'jdbc:x:/yyy;db=sqoop'
• Remove all escaping that you’ve added for the shell
• Use <arg> tags instead of a <command> tag, as the content of each <arg> is treated as one parameter
• Put all -D parameters into the configuration section
• Install the driver into the workflow’s lib/ directory or the shared action library /user/oozie/share/lib/sqoop/
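Why shell quoting has to be removed can be illustrated with a short Python sketch. It assumes, as the Oozie Sqoop action documentation describes, that a <command> element is split on spaces with no shell involved, while each <arg> element is passed through as a single argument. The helper `to_oozie_args` is hypothetical and only formats text; it is not part of Oozie or Sqoop. The leading `sqoop` token is dropped on the assumption that the action supplies the program itself.

```python
import shlex
from xml.sax.saxutils import escape

# The command from the slide, quoted the way it would be typed into a shell.
cli = "sqoop import --password \"spEci@l$\" --connect 'jdbc:x:/yyy;db=sqoop'"

# A shell strips the quotes before Sqoop ever sees the arguments.
shell_tokens = shlex.split(cli)

# Plain whitespace splitting (no shell) leaves the quotes in as literal characters.
naive_tokens = cli.split()


def to_oozie_args(command_line):
    """Turn a shell-style command line into one <arg> element per argument."""
    tokens = shlex.split(command_line)
    if tokens and tokens[0] == "sqoop":
        tokens = tokens[1:]  # assumption: the action names the program, not the args
    return ["<arg>{}</arg>".format(escape(t)) for t in tokens]


print(shell_tokens[3])   # password as the shell would pass it
print(naive_tokens[3])   # password with the quotes still attached
for line in to_oozie_args(cli):
    print(line)
```

With whitespace splitting, the quote characters become part of the password and of the connect string, which is one way a job that works on the CLI can fail once it is moved into a workflow.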
21. Choosing Proper Connector
• JDBC driver is a dependency for all three connectors
• Sqoop automatically chooses the most optimal connector (OraOop, built-in, Generic JDBC Connector)
• Or explicitly choose: --connection-manager com.quest.oraoop.OraOopConnManager
23. Overriding Type Mapping
--map-column-java parameter
• comma separated list of key-value pairs
• key = exact column name
• value = target Java type
sqoop import --map-column-java c1=Float,c2=String,c3=String ...
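When the column-to-type overrides are kept in code or configuration, the flag value can be assembled programmatically. This is a minimal sketch under stated assumptions: `map_column_java` is a hypothetical helper, not part of Sqoop, and it only builds the comma separated key=value string described above, rejecting column names that would break that format.

```python
from typing import Dict, List


def map_column_java(mapping: Dict[str, str]) -> List[str]:
    """Build the --map-column-java arguments from {column name: Java type}."""
    if not mapping:
        return []
    pairs = []
    for column, java_type in mapping.items():
        # A comma or equals sign in the key would be ambiguous in the flag value.
        if "," in column or "=" in column:
            raise ValueError("unsupported column name: {!r}".format(column))
        pairs.append("{}={}".format(column, java_type))
    return ["--map-column-java", ",".join(pairs)]


# Column names and types taken from the slide's example.
args = map_column_java({"c1": "Float", "c2": "String", "c3": "String"})
print(" ".join(args))
```

Returning a list of arguments rather than a single string keeps the flag and its value as separate parameters, which also suits the one-parameter-per-<arg> style used in Oozie workflows.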
26. Sqoop 2 Design Goals
• Security and Separation of Concerns
  • Role based access and use
• Ease of extension
  • No low-level Hadoop knowledge needed
  • No functional overlap between Connectors
• Ease of Use
  • Uniform functionality
  • Domain specific interactions
27. Sqoop 2 UI in Hue
• Troubleshooting
  • sqoop.log file is located in @LOGDIR@ and the rest should be in server/logs/*
  • Look for catalina.out, catalina.log, localhost-*.log