Motivation
● Everybody wants to jump into Big Data
● Everybody wants their new setup to be cheap
– Cloud is an excellent option for this
● These environments generally start as a PoC
– They should later be re-implemented properly
– Sometimes they never are
● You may need to move your Hadoop cluster
– You want to reduce costs
– You need more performance
– Because of corporate policy
– For legal reasons
● But moving big data volumes is a problem!
– Example: 20 TB at 10 MB/s is about 2 million seconds ≈ 23 days
Classic UNIX methods
● Well-known file transfer technologies:
– (s)FTP
– Rsync
– NFS + cp
● You need to set up a staging area
● This acts as an intermediate space between
Hadoop and the classic UNIX world
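As a sketch, a minimal staging flow might look like this (paths and hostnames are hypothetical; rsync shown, but (s)FTP or NFS + cp work the same way):

    # Export from HDFS to the staging area on an edge node
    hdfs dfs -get /data/mytable /staging/mytable
    # Ship the files to the new environment
    rsync -av /staging/mytable newcluster-edge:/staging/
    # Import into the new cluster's HDFS
    hdfs dfs -put /staging/mytable /data/mytable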
● Disadvantages:
– Needs a big staging area
– Transfer times are slow
– Single nodes act as bottlenecks
– Metadata needs to be copied separately
– Everything must be stopped during the copy to avoid
data loss
– Total downtime: several hours or days (and it only grows with the data volume)
Using Amazon S3
● AWS S3 storage is also an option for staging
● Cheaper than VM disks
● Available almost everywhere
● An access key is needed
– Create an IAM user with S3-only permissions
● Transfer is done using distcp
– (We'll see more about this later)
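As a hedged sketch with the s3a connector (bucket name and paths are hypothetical; the credentials can also live in core-site.xml instead of the command line):

    hadoop distcp \
      -Dfs.s3a.access.key=AKIA... \
      -Dfs.s3a.secret.key=... \
      hdfs:///data/mytable s3a://migration-bucket/mytable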
Distcp
● Distcp copies data between two Hadoop clusters
● No staging area needed (Hadoop native)
● High throughput
● Metadata needs to be copied separately
● Clusters need to be connected
– Via VPN for the hdfs:// protocol
– NAT can be used with webhdfs:// (see the examples below)
● Kerberos complicates matters
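Typical invocations, assuming hypothetical NameNode hostnames and Hadoop 2.x default ports:

    # Native RPC: needs the VPN and matching Hadoop versions
    hadoop distcp hdfs://old-nn:8020/data hdfs://new-nn:8020/data
    # webhdfs: HTTP-based, version-independent, NAT-friendly
    hadoop distcp webhdfs://old-nn:50070/data hdfs://new-nn:8020/data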
Remote cluster access
● As a side note, remote filesystems can also be
used outside distcp
● For example, as LOCATION for Hive tables
● While we're at it...
● We can transform data
– For example, convert files to Parquet (see the sketch below)
● Is this the right time?
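A sketch combining both ideas, with hypothetical table and host names: an external table reading from the remote cluster, rewritten as Parquet locally:

    hive -e "
      CREATE EXTERNAL TABLE events_remote (id BIGINT, payload STRING)
      LOCATION 'hdfs://old-nn:8020/data/events';
      CREATE TABLE events STORED AS PARQUET
      AS SELECT * FROM events_remote;
    "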
Requirements
● Install servers in the new platform
– Enough to hold ALL data
– Same OS + config as original platform
– Config management tools are helpful for this
● Set up connectivity
– VPN (private networking) is needed
● Rack-aware configuration: new nodes need to be on a new rack (see the sketch after this list)
● System times and time zones should be consistent
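For the rack-aware part, one hedged sketch is a script-based topology mapper (IP ranges and paths are hypothetical); it is referenced from core-site.xml via net.topology.script.file.name and must print one rack per argument:

    #!/bin/bash
    # /etc/hadoop/conf/topology.sh: map each host/IP to a rack
    for host in "$@"; do
      case "$host" in
        10.0.1.*) echo "/old-rack" ;;   # original platform
        10.0.2.*) echo "/new-rack" ;;   # new platform
        *)        echo "/default-rack" ;;
      esac
    done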
Starting the copy
● New nodes will have a DataNode role
● No computing yet (YARN, Impala, etc.)
● DataNode roles will be stopped at first
● When started:
– If the original platform has a single rack, the copy begins immediately: the block placement policy wants replicas on more than one rack, and the new rack is the only way to satisfy it
– If the original platform already has more than one rack, the policy is already satisfied, so manual intervention is needed to force the block movement (see the sketch below)
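In the multi-rack case, one hedged way to force the movement is to decommission the old DataNodes in batches so HDFS re-replicates their blocks (file locations follow common defaults and vary by distribution):

    # Add old nodes to the exclude file referenced by dfs.hosts.exclude
    echo "old-dn-01.example.com" >> /etc/hadoop/conf/dfs.exclude
    hdfs dfsadmin -refreshNodes
    # Watch the decommissioning progress
    hdfs dfsadmin -report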
Transfer speed
● Two parameters govern the data transfer speed (see the example below):
– dfs.datanode.balance.bandwidthPerSec
– dfs.namenode.replication.work.multiplier.per.iteration
● No jobs are launched on the new nodes yet
– The data flow is almost exclusively the copy traffic
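The DataNode bandwidth cap can be raised at runtime; the multiplier requires an hdfs-site.xml change and a NameNode restart (the values below are illustrative only):

    # Let each DataNode use up to 100 MB/s for replication/balancing
    hdfs dfsadmin -setBalancerBandwidth 104857600
    # hdfs-site.xml on the NameNode: schedule more replication work per heartbeat
    #   dfs.namenode.replication.work.multiplier.per.iteration = 10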
Moving master roles
● When possible, take advantage of HA:
– Zookeeper (just add two new nodes, keeping the ensemble size odd)
– NameNode
– ResourceManager
● Others need to be migrated manually:
– Hive metastore DB needs to be copied
– Having a DNS name for the DB helps
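A hedged sketch for a MySQL-backed metastore (host and database names are hypothetical; with a DNS alias, only the alias needs repointing afterwards):

    # Dump on the old host, load on the new one
    mysqldump --single-transaction metastore | mysql -h new-db-host metastore
    # Then repoint the DNS alias (e.g. metastore-db) to new-db-host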
Moving data I/O
● Once the data is copied (fully or mostly), the new computation roles are deployed:
– NodeManager
– Impalad
● Roles will be stopped at first
● Auxiliary nodes (front-end, app nodes, etc) need to
be deployed in the new platform
● A planned intervention (during a low-usage window) needs to take place
During the intervention
● The cluster is stopped
● If necessary, client configuration is redeployed
● Services are started and tested in this order (smoke tests below):
– Zookeeper
– HDFS
– YARN (only for the new platform)
– Impala (only for the new platform)
● Auxiliary services in the new platform are tested
● Green light? Change the DNS for the entry points
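A hedged set of smoke tests for the "started and tested" steps, all standard CLI checks:

    hdfs dfsadmin -report     # all DataNodes live, capacity as expected
    hdfs fsck /               # no missing or under-replicated blocks
    yarn node -list           # all NodeManagers registered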
Conclusions and afterthoughts
● Minimal downtime, similar to ordinary (non-Hadoop) planned maintenance
● Data and service are never at risk
● Hadoop tools are used to solve a Hadoop
problem
● No user impact: no change in data or access
● Kerberos is not an issue (same realm and KDC)