Migration and Management Challenges
Challenge
• Oozie workflows can be challenging to build and debug
• Capacity planning and resource management in the shared Hadoop cluster is very complex
Recommendation
• Only use Oozie workflows for automating complex or long-running processes, or use a different orchestration platform
• Constantly reevaluate your capacity plan based on the current deployment
18 months ago our team kicked off an ambitious project, which we have since named Orion.
A group of us came to Hadoop Summit to learn as much as we could. That experience is the inspiration for this talk.
We want to share what we have learned over the last 18 months: what worked well, and what we would do differently.
Although the talk isn’t about the project itself, we have a few slides up front to set the context around what we are working on.
If you have been near technology at all in the last couple of years, you know that the world has become very connected.
The number of connected devices blows my mind. It’s not just phones anymore…
Amazon Dash buttons, coffee makers, propane tanks, garage doors. These devices are sending tens of billions of activities and user interactions every day...
Orion is our platform
Our marketing platform ingests user interactions and processes them into relevant marketing touchpoints
It enables marketers to create campaigns around these activities to build relationships with their customers
Become the fabric for marketers
It’s been a great experience building this
Here are a few of the requirements
Near real time processing
At least 1 billion activities per customer per day.
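(Back of the envelope: 1,000,000,000 activities / 86,400 seconds ≈ 11,600 events per second sustained, per customer, before accounting for traffic peaks.)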
Customer demands from the growing number of devices caused us to evaluate next-gen queueing and streaming...
Reduction in infrastructure COGS, primarily by moving off expensive enterprise-class filers...
Reduction in people COGS through efficiency gained by consolidating a tech stack that used too many similar technologies...
Multitenant… of course
Secure
Customer isolation and improved resource management
Architecture requirements driven from business requirements
Improve utilization over the existing system
Lots of customers in the same infrastructure, without starving any of them
Encryption from day 1 for safe data storage
Aim for horizontal scalability
Radically reduce processing latency
Eliminate backlogs
Brownout protection
Bakeoff to decide which platform to use
Build POCs to pick the best tech stack
Researched various technologies, Hadoop and non-Hadoop
Decided to take a day’s worth of web traffic and build POCs
Storm/Spark as our event-processing platforms
HBase/Cassandra for storage
And Kafka as the event queue
All combos worked, no clear winner
The amount of load we generated was not enough to separate them, so we decided based on other factors:
Community - Spark had much more active community than Storm
Features - Spark solved batch processing, something Storm couldn't do
Team Experience - HBase to leverage existing Hadoop expertise
History – Our team had poor experiences scaling up our existing Cassandra cluster
A few words about the architecture
Main goal is to ingest, process, and store marketing events
High level diagram of our event processor
Enhanced Lambda Architecture
Inbound activities written to the Ingestion Processor
HBase and then Kafka
High-volume (e.g. web) activities
First written to Kafka, then enriched
Spark Streaming applications consume events from Kafka (sketch below)
Solr Indexing
Email Reports
Campaign Processing
HBase is used for simple historical queries, and is system of record
Reiterates my points from the last slide. I included it in case you want to look at the slides later.
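To make the shape of these consumers concrete, here is a minimal sketch of a Spark Streaming job reading activities from Kafka and writing enriched events to HBase. The topic name, broker list, and the enrich/write functions are illustrative placeholders, not our actual code (this uses the direct-stream API from the spark-streaming-kafka 0.8 integration):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ActivityProcessor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("activity-processor")
    val ssc  = new StreamingContext(conf, Seconds(5)) // micro-batch interval

    // Direct stream: no receivers; offsets are tracked per batch
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("web-activities")) // illustrative topic name

    events
      .map { case (_, payload) => enrich(payload) } // enrichment step (placeholder)
      .foreachRDD { rdd =>
        // A real job would open one HBase connection per partition,
        // not per record
        rdd.foreachPartition(_.foreach(writeToHBase))
      }

    ssc.start()
    ssc.awaitTermination()
  }

  def enrich(raw: String): String = raw      // placeholder
  def writeToHBase(event: String): Unit = () // placeholder
}
```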
The next thing we are going to talk about is some key points from the implementation phase of the project
Lots of learnings around training
Getting first cluster running
And security
One of the first things was to build expertise
Grow knowledge in house
Tech talks led by the architecture team on the new infrastructure
Online courses (Coursera) for Scala and Hadoop
Onsite training for Scala (the preferred language for Spark Streaming)
Hortonworks Bootcamp to train operators
Training helped us kickstart the project by getting people in the right mindset
Helped people feel included in the project/process
got people thinking about new technologies
Created a nice foundation for design process
Early training was great
Groups who didn’t leverage knowledge immediately lost context
We would set up Hadoop environments early to let people get hands-on experience right away
Hands-on experience should have spanned all teams
For example, developers were developing in Spark standalone mode and made a rough transition to YARN cluster mode (sketch below)
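One hedged illustration of where that transition bites: in local development the master often gets hard-coded into the app, which then has to be undone for YARN. The app name is illustrative; the config calls and spark-submit flags are standard Spark:

```scala
import org.apache.spark.SparkConf

object Configs {
  // Local development: master hard-coded, runs in-process on the laptop
  val devConf = new SparkConf()
    .setAppName("activity-processor")
    .setMaster("local[*]")

  // Cluster deployment: leave the master unset in code and let spark-submit
  // supply it, e.g.  spark-submit --master yarn --deploy-mode cluster app.jar
  // A hard-coded "local[*]" silently keeps the job off the cluster.
  val prodConf = new SparkConf()
    .setAppName("activity-processor")
}
```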
Hadoop ecosystem is quite complex
Design possibilities are large, only a few right ways
Difference between right and wrong can be very subtle
Best way to navigate is to find experts: hire if possible, or get expertise from a partner like Hortonworks
You need experts!
Took a scientific approach: took the POC and ran load tests in AWS
We initially assumed the leading indicator would be disk I/O
Asked Hortonworks and HP for hardware recommendations
Our next task was to figure out how to build our first cluster, which was quite daunting
Built scale model in AWS
Talked to HP and Hortonworks to get best practices around server builds
The leading indicator turned out to be compute, not disk I/O
We can add either disk only or compute only nodes to scale
Do the initial sizing exercise, but don’t get too hung up on the cluster composition
You will end up resizing and tuning as you scale up anyway
We may add compute only nodes, don’t get too stuck on initial sizing
You can always scale up later
Don’t overscale from day one
ZK not in the path of direct user queries
ZK in VMs did not work well
We think it was disk I/O
Moved to physical boxes and life was much better
ZooKeeper, for those of you who are new to Hadoop, is the cluster coordination service
Why talk about security alongside capacity planning?
From the beginning, the infrastructure needed to meet enterprise security requirements
All applications are isolated
Restrict applications’ resource usage (disk I/O, etc.)
Hadoop has support for Kerberos (some parts better than others)
HDFS native disk encryption
Encrypted disks for Kafka because of its lack of native support
Isolated YARN queues (sketch below)
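As a flavor of the isolation piece, a minimal sketch of pinning an application to a tenant's YARN queue so one customer's jobs cannot starve another's (the queue name is illustrative; spark.yarn.queue is the standard Spark property):

```scala
import org.apache.spark.SparkConf

object QueuedSubmit {
  // Submit this application into its tenant's capacity-scheduler queue;
  // YARN then enforces that queue's resource limits
  val conf = new SparkConf()
    .setAppName("campaign-processor")
    .set("spark.yarn.queue", "tenant-a") // illustrative queue name
  // Equivalent at submit time: spark-submit --queue tenant-a ...
}
```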
Kerberos is really, really hard
Allow extra time for Kerberos
Training first
Find someone who has done it before (much easier than on your own)
Kerberos support varies by component; not so great in Kafka
We still have some bugs we are trying to work out
Kafka doesn’t support data encryption (and won’t, because of performance)
Disk encryption ended up not being a critical performance blocker
Ended up rolling back Kerberization for Spark and Kafka
Move Kafka and Spark out of Ambari and manage them yourself if you don’t need the features
More control over versions
Take patches faster
Only loosely integrated for now
Next phase was when we were ready to validate our newly built event ingestion system
Wanted to validate that the new system performed as a functional superset of the old one
Doing this on a running system is extremely difficult
We decided early on to require all components to implement a silent (passive) mode
This allows us to test for correctness with real data, in the wild
We had automated CI tests in Jenkins
Perf testing in AWS
Passive mode was one of the best moves we made; it found countless bugs and config issues
Real-world load testing
Super valuable, worth the cost of implementation
By design, it writes to both the legacy and new systems
This caused a performance issue due to slow writes
The cluster never went all the way down, but we did overload ZK
We recommend doing passive mode
Use short timeouts or write async (sketch below)
Make sure you have monitors in place even for passive mode
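A minimal sketch of that dual-write discipline, assuming hypothetical legacyWrite/newWrite functions: the shadow write to the passive system is fired asynchronously so a slow or overloaded new cluster can never stall the production path, and failures are logged so the passive side is still monitored:

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object PassiveIngest {
  def ingest(event: String): Unit = {
    legacyWrite(event) // legacy path stays synchronous: still the system of record

    // Shadow write to the new system: fire-and-forget, so the production
    // path never waits on it (the lesson from the slow-write incident)
    Future(newWrite(event)).failed.foreach { t =>
      log(s"passive write failed: ${t.getMessage}", event) // monitored even in passive mode
    }
  }

  // Hypothetical placeholders for the two pipelines
  def legacyWrite(e: String): Unit = ()
  def newWrite(e: String): Unit = ()
  def log(msg: String, e: String): Unit = println(s"$msg: $e")
}
```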
After we finished proving the service in passive mode with beta customers
Massive undertaking
Ready to migrate 6,000 subscriptions (customers) without any service interruption and no downtime
Non-trivial!
Marketo has a 24/7/365 commitment
Migrate customers a few subs at a time
Create management and migration tools
Delete data out of the relational database
In order to manage the migration, we created Sirius
Human factor: about 10 teams and 30 subcomponents
Whole team involved closely with the migration
Automated fallback to the legacy system if a problem arose
Daily standup to track rollout
This is a pic of our management console
All test data
Example
One big challenge: we built on top of Oozie
Oozie is powerful but very complex
Capacity planning was more complex than we thought
Ended up iterating: ramp up customers -> capacity plan -> ramp up more
Only use Oozie if you have to
Important to capacity plan in the wild -> one team ended up needing 10% of their original estimate
We have had several learnings already running this new infrastructure
It’s challenging to keep track of dozens of applications running across hundreds of servers
First, we needed to add monitors for all the new servers (~350)
We created a bunch of Spark Streaming applications, all needing metrics reporting and monitors
Metrics are used for capacity planning and for ensuring we meet the business metrics for the project
Didn’t want to overwhelm CTP
Built a new monitoring and metrics system
using OpenTSDB and Grafana
Allows us to do trend analysis
The Sirius console monitors the business-level metrics
In addition, added a comprehensive set of monitors to our existing system (Nagios)
Hadoop requires a lot of monitoring
Built a custom monitoring infrastructure using OpenTSDB and Grafana
Allows us to do trend analysis on Hadoop and other infrastructure
Instrumented all of our new applications to report metrics (sketch below)
Added comprehensive Hadoop monitors into our pre-existing production monitoring system (Nagios) to alert our operators of infrastructure issues
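As a flavor of that instrumentation, a hedged sketch of pushing one data point to OpenTSDB over its HTTP /api/put endpoint. The host, metric name, and tags are illustrative, and a production version would batch points and reuse connections:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object Metrics {
  // Push one data point to OpenTSDB; /api/put is the real endpoint,
  // while the host, metric names, and tags below are illustrative
  def report(metric: String, value: Double, tags: Map[String, String]): Unit = {
    val tagJson = tags.map { case (k, v) => s""""$k":"$v"""" }.mkString(",")
    val ts   = System.currentTimeMillis / 1000
    val body = s"""{"metric":"$metric","timestamp":$ts,"value":$value,"tags":{$tagJson}}"""

    val conn = new URL("http://opentsdb.example.com:4242/api/put")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
    conn.getResponseCode // 204 means the point was accepted
    conn.disconnect()
  }
}

// e.g. Metrics.report("spark.batch.processing_ms", 830.0, Map("app" -> "campaign-processor"))
```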
Big challenge to create all the monitors to make sure we knew the health of the systems
Constantly tuning monitors to make sure we aren’t over- or under-alerting
Creating “Goldilocks” alerts for the operators: not too noisy, not too quiet
A big challenge with Spark Streaming and YARN is that there isn’t any built-in facility for patching and upgrading with zero downtime
This is really true across all Hadoop components
Eventually we will have hundreds of Spark Streaming jobs running, and we need to upgrade them without interruption
Decided early on that we would build our own tooling for managing patches and upgrades
Allows us to deploy a new set of Spark Streaming applications without interruptions
Kafka consumers are coded to allow jobs to pick up where they left off (sketch below)
Integrated with CI system
Sirius uses the Oozie workflow engine to manage orchestration during patches/upgrades with minimal downtime
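The heart of the approach can be sketched as a graceful drain: Spark Streaming finishes in-flight batches before exiting, and because the direct Kafka stream tracks offsets, the replacement job resumes where the old one stopped. The HDFS marker-file trigger below is illustrative, one common pattern, not necessarily what Sirius does internally:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.StreamingContext

object GracefulStop {
  // Run until an external "please stop" signal appears (an HDFS marker file
  // here); then drain in-flight batches and exit cleanly so the upgraded
  // job can start and pick up from the committed Kafka offsets.
  def runUntilSignalled(ssc: StreamingContext, markerPath: String): Unit = {
    ssc.start()
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    var stopped = false
    while (!stopped) {
      stopped = ssc.awaitTerminationOrTimeout(10000) // poll every 10s
      if (!stopped && fs.exists(new Path(markerPath))) {
        // stopGracefully = true: finish processing received data first
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        stopped = true
      }
    }
  }
}
```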
One big challenge is that Ambari doesn’t always stop and start infrastructure in a way that avoids service interruption
We have been close, but not successful
Test under load! It makes a huge difference. You will hit timeouts, etc. that upset Ambari
Check out the community’s graceful restart scripts. They seem to be further along
Hortonworks has been very good about learning from our issues and improving the upgrade process