A presentation on the Netflix Cloud Architecture and NetflixOSS open source. For the All Things Open 2015 conference in Raleigh 2015/10/19. #ATO2015 #NetflixOSS
2. About Netflix
69M members
2000+ employees (1400 tech)
80+ countries
> 100M hours watched per day
> ⅓ NA internet download traffic
500+ Microservices
Many tens of thousands of VMs
3 regions across the world
3. About the Speaker
Cloud platform technologies
Distributed configuration, service discovery, RPC, application frameworks, non-Java sidecar
Container cloud
Resource management and scheduling, making Docker containers operational in Amazon EC2/ECS
Open Source
Organize @NetflixOSS meetups & internal group
Performance
Assist across Netflix, but focused mainly on cloud platform perf
With Netflix for ~1 year. Previously at IBM here in Raleigh/Durham (RTP)
@aspyker
ispyker.blogspot.com
5. Why does Netflix open source?
Allows engineers to gather feedback
Openly talk, through code, on our approach
Collaboration on key projects with the world
Happily use proven outside open source
And improve it for Netflix scale and availability
Netflix culture of freedom and responsibility
Want to open source?
Go for it, be responsible!
Recruiting and Retention
Candidates know exactly what they can work on
NetflixOSS engineers choose to stay at Netflix
6. NetflixOSS is widely used
The architecture has shaped public cloud usage
Immutability, Red/Black Deploys, Chaos,
Regional and worldwide high availability
Offerings
Pivotal Spring Cloud
Large usage
IBM Watson as a Service (on IBM Cloud)
Nike Digital is hiring NetflixOSS experts
Interesting usage
“To help locate new troves of data claiming to be the files stolen from
AshleyMadison, the company’s forensics team has been using a tool
that Netflix released last year called Scumblr”
8. Key aspects of NetflixOSS website
Show how the pieces fit together
Projects now discussed with each other in context
OSS categories mirror internal teams
No artificial categories, focal points for each area
Focus on projects that are core to Netflix
Projects mentioned are core and strategic
11. Elastic, Web and Hyper Scale
[Diagram: Load Balancers, Front end API, Recommendation Microservice, Another Microservice, Temporal caching, Durable Storage]
Strategy | Benefit
Make deployments automated | Without automation, impossible
Expose well designed API to users | Offloads presentation complexity to clients
Remove state for mid tier services | Allows easy elastic scale out
Push temporal state to client and caching tier | Leverage clients, avoids data tier overload
Use partitioned data storage | Data design and storage scales with HA
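The "use partitioned data storage" and "remove state for mid tier services" strategies can be illustrated with a minimal sketch. It is not Netflix's storage layer: the partition count and the name `partition_for` are hypothetical, chosen only for illustration. The idea shown is that a stable hash lets any stateless mid-tier instance route a key to the same partition without coordinating with its peers.

```python
import hashlib

NUM_PARTITIONS = 8  # hypothetical partition count, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    is randomized per process, and every stateless instance must agree
    on where a key lives.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    # Hypothetical member ids; any instance computes the same placement.
    for member in ("member-1", "member-2", "member-3"):
        print(member, "->", partition_for(member))
```

A simple modulo scheme like this reshuffles most keys when the partition count changes; real partitioned stores typically use consistent hashing or token ranges to limit that movement.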
13. Micro service
Highly Available Service Runtime Recipe
[Diagram: Microservice #1 (REST services on Karyon) contains an App Service whose implementation calls Microservice #2. The call is executed through Hystrix, which has a fallback implementation, using the Ribbon REST client with Eureka. Three Eureka Server(s) boxes are shown.]
Implementation Detail Benefits
Decompose into micro services
• Key user path always available
• Failure does not propagate across service boundaries
Karyon w/ automatic Eureka registration
• New instances are quickly found
• Failing individual instances disappear
Ribbon client with Eureka awareness
• Load balances & retries across instances with “smarts”
• Handles temporal instance failure
Hystrix as dependency circuit breaker
• Allows for fast failure
• Provides graceful cross service degradation/recovery
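The "dependency circuit breaker" idea above can be sketched in a few lines. This is not the Hystrix API: the class `CircuitBreaker`, its parameters, and the thresholds are hypothetical and exist only to show the pattern of failing fast to a fallback once a dependency keeps failing, then probing it again after a timeout.

```python
import time


class CircuitBreaker:
    """Tiny circuit breaker sketch (hypothetical, not the Hystrix API).

    Opens after a number of consecutive failures, then fails fast to the
    fallback until reset_timeout seconds have passed.
    """

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency entirely until the timeout elapses.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical function; the fallback keeps the key user path available while the dependency recovers.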
14. IaaS High Availability
[Diagram: Region (us-east-1) with availability zones us-east-1c, us-east-1d and us-east-1e; ELBs in front of Eureka, Web App, Service1 and Service2; Cluster Auto Recovery and Scaling Services (Auto Scaling Groups)]
Rule | Why?
Always > 2 of everything | 1 is a SPOF, 2 doesn’t web scale and has slow DR recovery
Including IaaS and cloud services | You’re only as strong as your weakest dependency
Use auto scaler/recovery monitoring | Clusters guarantee availability and service latency
Use application level health checks | Instance on the network != healthy
Worldwide availability | Data replication, global front-end routing, cross region traffic
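The rule "instance on the network != healthy" can be made concrete with a small sketch of a registry that only keeps instances whose application-level health check keeps passing. This is not the Eureka API: the `Registry` class, the lease length, and the method names are hypothetical.

```python
import time


class Registry:
    """In-memory registry sketch (hypothetical, not the Eureka API).

    Instances disappear once their healthy heartbeats stop arriving.
    """

    def __init__(self, lease_seconds=30.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._last_seen = {}  # (service, instance_id) -> last healthy heartbeat

    def heartbeat(self, service, instance_id, healthy=True):
        # Only a passing application-level health check renews the lease;
        # an instance that is merely reachable on the network does not count.
        if healthy:
            self._last_seen[(service, instance_id)] = self.clock()

    def instances(self, service):
        """Return the ids of instances whose lease has not expired."""
        now = self.clock()
        return sorted(
            instance_id
            for (svc, instance_id), seen in self._last_seen.items()
            if svc == service and now - seen <= self.lease_seconds
        )
```

Because an unhealthy heartbeat does not renew the lease, a failing instance drops out of `instances()` after `lease_seconds` even though it is still answering on the network.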
15. A truly global service
Replicate data across regions
Be able to redirect traffic from region to region
Be able to migrate regional traffic to other regions
Have automated control across regions
Flux Demo
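Redirecting traffic from region to region amounts to recomputing routing weights. The sketch below is one simple way to express that, not Netflix's traffic tooling: the function `shift_traffic` is hypothetical, and the region names other than us-east-1 are used only as examples.

```python
def shift_traffic(weights, failed_region):
    """Return new routing weights with failed_region drained.

    The drained region's share is redistributed to the remaining regions
    in proportion to their existing weights.
    """
    if failed_region not in weights:
        raise KeyError(failed_region)
    remaining = {r: w for r, w in weights.items() if r != failed_region}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no healthy region left to receive traffic")
    result = {r: w / total for r, w in remaining.items()}
    result[failed_region] = 0.0
    return result


if __name__ == "__main__":
    # Region names besides us-east-1 are illustrative assumptions.
    current = {"us-east-1": 0.5, "us-west-2": 0.25, "eu-west-1": 0.25}
    print(shift_traffic(current, "us-east-1"))
```

In practice the receiving regions must also be scaled up before the shift and have the replicated data available, which is why the slide pairs traffic redirection with data replication and automated control.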
16. Testing is the only way to prove HA
Chaos Monkey
Kills instances in production, runs regularly
Chaos Gorilla
Kills availability zones (single datacenter)
Testing for split brain is also important
Chaos Kong
Kills an entire region and shifts traffic globally
Run frequently but with prior scheduling
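The instance-level chaos idea can be sketched as a selection step: pick one instance per cluster to terminate, skipping clusters too small to survive the loss. This is not the Chaos Monkey implementation; `pick_victims`, the `min_size` guard, and the instance ids are hypothetical, and the actual termination call is deliberately left out.

```python
import random


def pick_victims(clusters, rng=None, min_size=2):
    """Pick one instance to terminate from each sufficiently large cluster.

    clusters maps a cluster name to a list of instance ids. Clusters with
    fewer than min_size instances are skipped so they keep some capacity.
    """
    if rng is None:
        rng = random.Random()
    victims = {}
    for name, instances in sorted(clusters.items()):
        if len(instances) >= min_size:
            victims[name] = rng.choice(sorted(instances))
    return victims


if __name__ == "__main__":
    # Hypothetical clusters and instance ids.
    fleet = {"api": ["i-1", "i-2", "i-3"], "batch": ["i-9"]}
    print(pick_victims(fleet, rng=random.Random(2015)))
```

Passing a seeded `random.Random` makes a run reproducible, which helps when a chaos exercise needs to be replayed during a post-incident review.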
18. Continuous Delivery
[Diagram: Continuous Build Server, baked to images (AMIs), deployed as Cluster v1, Canary v2 and Cluster v2]
Step | Technology
Developers test locally | Unit test frameworks
Continuous build | Continuous build server based on Gradle builds
Build “bakes” full instance image | Aminator and deployment pipeline bake images from build artifacts
Developers work across dev and test | Archaius allows for environment based context
Developers do canary tests, red/black deployments in prod | Asgard console provides app cluster common devops approach, security patterns, and visibility
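The canary step above comes down to comparing the new version against the running baseline before promoting it. The sketch below shows one simple decision rule, assuming error rate is the only signal; the function `canary_verdict` and its thresholds are hypothetical and are not taken from Asgard or any Netflix tool.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=1.5, min_requests=100):
    """Compare canary and baseline error rates and return a verdict string.

    Returns "wait" until both sides have seen min_requests, then "promote"
    if the canary error rate is within max_ratio of the baseline rate,
    otherwise "rollback".
    """
    if canary_requests < min_requests or baseline_requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # A small absolute floor keeps a zero-error baseline from failing
    # the canary on its very first error.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"
```

With red/black deployments, a "rollback" verdict is cheap because the old cluster (v1) is still running and traffic can simply be pointed back at it.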
19. From Asgard to Spinnaker
Spinnaker is our CI/CD solution
CI/CD solution including baking and Jenkins integration
Workflow engine for the continuous delivery
Pipeline based deployment including baking
Global visibility across all of our AWS regions
Provides an API first design
A microservices runtime HA architecture
More flexible cloud model so the community can contribute back improvements not related to AWS
Asgard continues to work side-by-side
Spinnaker is the new end to end CI/CD tool
22. Operational Visibility
[Diagram: Microservice #1 and Microservice #2 instrumented with Servo/Spectator and Hystrix/Turbine, feeding External Uptime Monitoring, Metric/Event Repositories, LogStash/Elasticsearch/Kibana, Incidents, Atlas and Vector]
Visibility Point | Technology
Basic IaaS instance monitoring | Not enough (not scalable, not app specific)
User-like external monitoring | SaaS offerings or OSS like Uptime
Targeted performance, sampling | Vector performance and app level metrics
Service to service interconnects | Hystrix streams ➔ Turbine aggregation ➔ Hystrix dashboard
Application centric metrics | Servo/Spectator gauges, counters, timers sent to metrics store like Atlas
Remote logging | Logstash/Kibana or similar log aggregation and analysis frameworks
Threshold monitoring and alerts | Services like Atlas and PagerDuty for incident management
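Application centric metrics are built from a few primitives such as counters and timers. The sketch below shows that shape in plain Python; it is not the Servo or Spectator API, and the `Metrics` class, its method names, and the snapshot format are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Minimal in-process metrics registry (hypothetical sketch)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock  # injectable so timings can be tested
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)  # name -> list of durations in seconds

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.timings[name].append(self.clock() - start)

    def snapshot(self):
        """Return a plain dict that could be shipped to a metrics store."""
        return {
            "counters": dict(self.counters),
            "timers": {
                name: {"count": len(values), "total_seconds": sum(values)}
                for name, values in self.timings.items()
            },
        }
```

A service would call `metrics.increment("requests")` per request and wrap dependency calls in `with metrics.timer("db"):`, then publish `snapshot()` periodically to whatever metrics store is in use.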
24. Dynamic, Web Scale & Simpler Security
Security Monkey
Monitors security policies, tracks changes, alerts on situations
Scumblr
Searches internet for security “nuggets” (credentials, hacking discussions)
Sketchy
A safe way to collect text and screenshots from websites
FIDO
Automated event detection, analysis, enrichment & enforcement
Sleepy Puppy
Delayed cross site scripting propagation testing framework
Lemur
X.509 certificate orchestration framework
25. What did we not cover?
Over 50 GitHub projects
NetflixOSS is “Technical indigestion as a service”
Big Data, Data Persistence and UI Engineering
Big Data tools used well beyond Netflix
Ephemeral, semi and fully persistent data systems
Recent addition of UI OSS and Falcor
27. How do I get started?
All of the previous slides show NetflixOSS components
Code: http://netflix.github.io
Announcements: http://techblog.netflix.com/
Want to get running a bit faster?
ZeroToCloud
Workshop for getting started with build/bake/deploy in Amazon EC2
ZeroToDocker
Docker images containing running Netflix technologies (not production ready, but easy to understand)
28. ZeroToDocker Demo
[Diagram: Mac OS X running VirtualBox running Ubuntu 14.04 (single kernel), which hosts Container #1 (filesystem + process), a Eureka container, a Zuul container, another container, ...]
Docker running instances
Single kernel
Contained processes
Zookeeper and Exhibitor
A Microservices app and surrounding NetflixOSS services (Zuul to Karyon with Eureka)