The way you operate your Big Data environment is not going to be the same anymore. This session is based on our experience managing on-premises environments
and on lessons learned from innovative, data-driven companies that have successfully migrated their multi-PB Hadoop clusters: where to start, and which decisions you have to make to gradually become cloud ready. The examples refer to Google Cloud Platform, yet the challenges are common to all providers.
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetInData
1. Elephants in The Cloud
or How to Become Cloud Ready
Krzysztof Adamski, GetInData
2. So You Say You Don’t Use Cloud?
HR System, Online Documents, Mobile Phone, Email Server
3. Trust as a Key Factor
Image source: https://www.forbes.com/sites/louiscolumbus/2017/04/23/2017-state-of-cloud-adoption-and-security
4. More Secure or Not
In the end, do you really think you can provide better infrastructure security than cloud providers?
5. Migration Questions?
1. How fast can you start/expand your analytics initiative?
2. How often is your cluster fully busy while your employees want more computing power right now?
3. How much time do you spend on maintaining your infrastructure?
4. How much time does it take you to gracefully apply all the security patches in your Hadoop cluster?
5. Do you need hardware that you don't have in your data center, e.g. GPUs or huge amounts of RAM?
7. Migration Goals
1. Transition from infrastructure engineering towards data engineering
2. Use the best possible technology stack in the world
3. Free your time
4. Attract the best engineers
5. Ultimate world domination ;)
9. Before You Start
● Be smart with which service you choose (technology choices)
● Avoid lock-in (yet another migration)
● Try to estimate the costs (hardware, engineering, legal)
● See what others are doing (Netflix, Spotify, Etsy)
15. Strong Global Consistency
Google Cloud Storage provides strong global consistency for the following
operations, including both data and metadata:
● Read-after-write
● Read-after-metadata-update
● Read-after-delete
● Bucket listing, Object listing
● Granting access to resources
16. Eventual Consistency
● Revoking access from resources
Revoking access typically takes about a minute to take effect; in some cases it may take longer.
Beware of caching, though.
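Because revocation is eventually consistent, code that depends on it should poll for the change rather than assume it is immediate. A minimal generic sketch; the predicate passed in is a hypothetical stand-in for "is access now actually denied?":

```python
import time

def wait_until(predicate, timeout_s=120.0, interval_s=5.0):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Useful when waiting for an eventually consistent change (such as a
    revoked ACL) to take effect before relying on it.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return False
```

In practice the predicate would attempt the revoked operation and report whether it now fails; the timeout should comfortably exceed the "about a minute" propagation window.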
17. Pricing
● Pay-per-second billing
Keep in mind that if you often run sub-10-minute analyses on VMs, serverless options may be a better fit: VMs are relatively slow to boot, while serverless functions are billed per 100 ms.
21. Baby Steps
1. Prepare your Hadoop cluster to interact with object storage.
2. Look for existing operators for popular tools like Apache Airflow.
3. Make a copy of your critical datasets to the cloud.
4. Use both BigQuery for fast analytics and GCS output for more advanced trials.
5. Audit costs per query.
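Copying critical datasets to the cloud is typically a DistCp run against the object store. A small helper that assembles such a command line; the paths and bucket name are hypothetical, and it assumes the GCS connector is installed so `gs://` URIs resolve from Hadoop:

```python
def distcp_command(src_hdfs_path, dest_gcs_uri, update=True):
    """Build a `hadoop distcp` command line for copying a dataset to GCS.

    Assumes the cluster has the GCS connector installed, so `gs://`
    URIs are resolvable from Hadoop. Paths below are illustrative.
    """
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")  # only copy files that changed
    cmd += [src_hdfs_path, dest_gcs_uri]
    return cmd

print(" ".join(distcp_command("hdfs:///data/events", "gs://my-backup-bucket/events")))
```

Running with `-update` makes repeated syncs cheap, which suits the "make a copy first, migrate gradually" approach of the slide.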
22. Networking
High bandwidth, low latency and consistent network connectivity are critical.
● Pay attention to things like choosing the right region, the number of cores, or even the TCP window size.
● Multiple VPN tunnels are a good starting point to increase bandwidth.
● But to get full speed, dedicated interconnect / direct peering is the way to go.
● Transfer appliances for offline data migration.
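The TCP window remark comes down to the bandwidth-delay product: to keep a long-haul link full, the window must cover bandwidth x round-trip time. A quick calculation with illustrative numbers:

```python
def min_tcp_window_bytes(bandwidth_bps, rtt_seconds):
    """Minimum TCP window needed to saturate a link: bandwidth * RTT,
    converted from bits to bytes."""
    return int(bandwidth_bps * rtt_seconds / 8)

# A 10 Gbit/s link with 20 ms RTT to the cloud region needs a window of
# 25,000,000 bytes (~24 MiB), far larger than typical OS defaults:
print(min_tcp_window_bytes(10e9, 0.020))
```

This is why tuning TCP window scaling (or running many parallel streams) matters before blaming the interconnect itself.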
24. Package Your Deployments
● Containers (docker) for tooling.
● Deployment artifacts (Spark / MR
jars).
● Tools like Spydra can help you execute your packages in both worlds
$ cat example.json
{
  "client_id": "simple-spydra-test",
  "cluster_type": "dataproc",
  "log_bucket": "spydra-test-logs",
  "region": "europe-west1",
  "cluster": {
    "options": {
      "project": "spydra-test"
    }
  },
  "submit": {
    "job_args": [
      "pi",
      "8",
      "100"
    ],
    "options": {
      "jar": "hadoop-mapreduce-examples.jar"
    }
  }
}
$ spydra submit --spydra-json example.json
25. Other Important Features
● Cluster pooling - using init actions to kill old clusters
● Autoscaling - based on the workload
● Preemptible instances:
○ A reasonable choice for your cluster
○ Keep in mind job resilience (idempotence)
○ Available also with GPUs
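Preemptible instances mean jobs can be killed and retried at any time, so output steps should be idempotent. A common pattern is a success marker that lets retries skip already-completed work; the helper name and marker handling here are an illustrative sketch, not a specific library API:

```python
import os

def run_idempotent(output_dir, produce):
    """Run `produce(output_dir)` only if this step hasn't already succeeded.

    A `_SUCCESS` marker (the convention Hadoop and Spark jobs use) makes
    retries after preemption safe: completed work is skipped, not redone.
    """
    marker = os.path.join(output_dir, "_SUCCESS")
    if os.path.exists(marker):
        return "skipped"
    os.makedirs(output_dir, exist_ok=True)
    produce(output_dir)        # write the actual output files
    open(marker, "w").close()  # commit: mark the step as done
    return "produced"
```

Writing the marker only after the output is complete is the key design choice: a preempted run leaves no marker, so the retry redoes the step from scratch instead of trusting partial output.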
26. No Long-Lived Services
● No patching! - YAY
● No wasting resources
● Latest security patches
applied automatically
27. Predictions
Forrester predicts:
"SaaS vendors will de-prioritize their platform efforts to attain global scale. They will compete more at the platform level by running portions of their services on AWS, Azure, GCP or Oracle Cloud in 2018."