
Petabytes Scale Analytics Infrastructure @Netflix


Video and slides synchronized; mp3 and slide download available at http://bit.ly/2l2Wrs8.

Tom Gianos and Dan Weeks discuss Netflix's overall big data platform architecture, focusing on storage and orchestration, and how Netflix uses Parquet on AWS S3 as its data warehouse storage layer. Filmed at qconsf.com.

Daniel Weeks manages the Big Data Compute team at Netflix and is responsible for integrating and enhancing open source big data processing technologies, including Spark, Presto, and Hadoop. Tom Gianos is a Senior Software Engineer on the Big Data Platform team at Netflix. He leads development of Genie and has a passion for merging web and big data technologies to solve interesting distributed systems problems.



  1. Daniel C. Weeks and Tom Gianos. Netflix: Petabyte Scale Analytics Infrastructure in the Cloud
  2. InfoQ.com: News & Community Site
      • Over 1,000,000 software developers, architects, and CTOs read the site worldwide every month
      • 250,000 senior developers subscribe to our weekly newsletter
      • Published in 4 languages (English, Chinese, Japanese, and Brazilian Portuguese)
      • Post content from our QCon conferences
      • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on architecture, and The Engineering Culture Podcast, with a focus on building
      • 96 deep dives on innovative topics packed as downloadable emags and minibooks
      • Over 40 new content items per week
      Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/netflix-big-data-infrastructure
  3. Presented at QCon San Francisco (www.qconsf.com)
      • Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation
      • Strategy: practitioner-driven conference designed for YOU, influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators
      • Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide
  4. Overview
      ● Data at Netflix
      ● Netflix Scale
      ● Platform Architecture
      ● Data Warehouse
      ● Genie
      ● Q&A
  5. Data at Netflix
  6. Our Biggest Challenge is Scale
  7. Netflix Key Business Metrics
      • 86+ million members globally
      • 1000+ devices supported
      • 125+ million hours viewed per day
  8. Netflix Key Platform Metrics
      • 60 PB data warehouse
      • Read: 3 PB
      • Write: 500 TB
      • 500B events
  9. Big Data Platform Architecture
  10. Data Pipelines (diagram): event data flows from cloud apps through Kafka and Ursula into S3 at roughly 5-minute latency; dimension data flows from Cassandra SSTables through Aegisthus into S3 daily.
  11. Platform layers (diagram): Storage (S3, Parquet), Compute, Services (Metadata, Orchestration, Transport), and Tools (Big Data API, Big Data Portal, Visualization, Quality, Workflow Vis, Job/Cluster Vis, Interface).
  12. Cluster footprint (diagram): Production (~2300 d2.4xl), Ad-hoc (~1200 d2.4xl), plus other clusters.
  13. S3 Data Warehouse
  14. Why S3?
      • Lots of 9's
      • Features not available in HDFS
      • Decouple compute and storage
  15. Decoupled Scaling (chart): warehouse size vs. HDFS capacity across all clusters, at 3x replication with no buffer; hosting the 60 PB warehouse in HDFS would mean provisioning roughly 180 PB of raw disk before any headroom, while S3 lets storage grow independently of cluster size.
  16. Decouple Compute / Storage (diagram): the Production and Ad-hoc clusters both read from and write to the same S3 warehouse.
  17. Tradeoffs - Performance
      • Split calculation (latency): impacts job start time; executes off cluster
      • Table scan (latency + throughput): Parquet seeks add latency; read overhead and available throughput
      • Performance converges with volume and complexity
  18. Tradeoffs - Performance (chart)
  19. Metadata
      • Metacat: federated metadata service
      • Hive Thrift interface
      • Logical abstraction (a hedged lookup sketch follows below)
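A minimal lookup sketch, assuming PySpark and the table names from the slides: compute engines resolve databases, tables, and partitions through the standard Hive Thrift interface, which is how a federated service like Metacat can sit behind it transparently.

```python
# Minimal sketch, assuming PySpark and the table names from the slides.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("metadata-lookup")
    .enableHiveSupport()   # catalog calls go through the Hive metastore API
    .getOrCreate()
)

print(spark.catalog.listDatabases())   # logical view, regardless of storage
spark.sql("SHOW PARTITIONS data_science.playback_f").show(truncate=False)
```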
  20. Partitioning - Less is More (diagram): databases (ab_test, telemetry, etl, data_science) contain tables (country_d, catalog_d, playback_f, search_f), which contain partitions (date=20161101 … date=20161104).
  21. Partition Locations: logical partitions of data_science.playback_f (date=20161101, date=20161102) map to physical paths s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161101/… and …/dateint=20161102/… (see the registration sketch below).
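A hedged registration sketch showing how such a mapping is recorded: an external table whose partitions point at explicit S3 paths. This assumes PySpark with Hive support; the column schema is invented for illustration, and the <bucket> placeholder is kept from the slide.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Illustrative schema; the real playback_f columns are not shown in the talk.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS data_science.playback_f (
        member_id BIGINT,
        title_id BIGINT,
        seconds_viewed BIGINT
    )
    PARTITIONED BY (dateint INT)
    STORED AS PARQUET
    LOCATION 's3://<bucket>/hive/warehouse/data_science.db/playback_f'
""")

# The metastore records one S3 location per partition.
spark.sql("""
    ALTER TABLE data_science.playback_f ADD IF NOT EXISTS
    PARTITION (dateint=20161101)
    LOCATION 's3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161101'
""")
```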
  22. Parquet
  23. Parquet File Format
      Column oriented:
      ● Store column data contiguously
      ● Improve compression
      ● Column projection
      Strong community support:
      ● Spark, Presto, Hive, Pig, Drill, Impala, etc.
      ● Works well with S3
  24. Parquet file layout (diagram): RowGroups hold Column Chunks, each a dictionary page followed by data pages; the Footer holds file metadata (schema, version, row count, size) and per-column-chunk metadata (encoding, size, min, max). A hedged inspection sketch follows below.
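A hedged sketch using pyarrow to walk exactly these structures: the footer exposes file-level metadata and, per row group, per-column-chunk metadata including encodings, sizes, and min/max statistics. The file path is illustrative.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("playback_f.parquet")   # illustrative path
meta = pf.metadata                          # footer: file-level metadata
print(meta.num_rows, meta.num_row_groups)
print(meta.schema)

for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)       # column-chunk metadata
        stats = chunk.statistics
        lo, hi = ((stats.min, stats.max)
                  if stats is not None and stats.has_min_max else (None, None))
        print(chunk.path_in_schema, chunk.encodings,
              chunk.total_compressed_size, lo, hi)
```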
  25. Staging Data
      • Partition by low-cardinality fields
      • Sort by high-cardinality predicate fields
      (a hedged write sketch follows below)
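A hedged sketch of this staging recipe in PySpark: partition the output by a low-cardinality field and sort within each partition by a high-cardinality predicate field, so Parquet row-group min/max statistics stay tight and scans can skip data. Table and column names are illustrative; <bucket> is kept as a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
events = spark.table("etl.playback_staging")   # hypothetical source table

(events
    .repartition("dateint")                # group rows by partition value
    .sortWithinPartitions("member_id")     # tight min/max per row group
    .write
    .partitionBy("dateint")
    .mode("overwrite")
    .parquet("s3://<bucket>/hive/warehouse/data_science.db/playback_f"))
```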
  26. Staging Data (chart): sorted vs. original data layout.
  27. (chart, continued): filtered vs. processed vs. original data.
  28. Parquet Tuning Guide: http://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide
  29. A Nascent Data Platform (diagram): a single gateway in front of the cluster.
  30. Need Somewhere to Test (diagram): a Test gateway in front of the Test cluster, and a Prod gateway in front of the Prod cluster.
  31. More Users = More Resources (diagram): multiple Prod gateways in front of the Prod cluster and multiple Test gateways in front of the Test cluster.
  32. Clusters for Specific Purposes (diagram): Prod, Test, and Backfill clusters, each with its own set of gateways.
  33. User Base Matures (diagram): the same gateways, now with users asking: "R?", "I need Spark 2.0", "I want Spark 1.6.1", "There's a bug in Presto 0.149, need 0.150", "My job is slow", "I need more resources".
  34. No one is happy
  35. Genie to the Rescue (diagram): Genie fronts the Prod, Test, and Backfill clusters.
  36. Problems the Netflix Data Platform Faces
      • For administrators:
        – Coordination of many moving parts: ~15 clusters; ~45 different client executables and versions for those clusters
        – Heavy load: ~45-50k jobs per day
        – Hundreds of users with different problems
      • For users:
        – Don't want to know details
        – All clusters and client applications need to be available for use
        – Need tools that make doing their jobs easy
  37. Genie for the Platform Administrator
  38. An administrator wants a tool to…
      • Simplify configuration management and deployment
      • Minimize the impact of changes on users
      • Track and respond to problems with the system quickly
      • Scale client resources as load increases
  39. Genie Configuration Data Model (diagram: Cluster 1 to 0..* Command, Command 1 to 0..* Application)
      • Metadata about the cluster: [sched:sla, type:yarn, ver:2.7.1]
      • Executable(s): [type:spark-submit, ver:1.6.0]
      • Dependencies for an executable
      (a hedged registration sketch follows below)
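A registration sketch carrying the tags from this slide. Endpoint paths and payload fields follow the Genie 3 REST API as I understand it; the host, ids, and checkDelay value are hypothetical, so verify against your Genie version.

```python
import requests

GENIE = "http://genie.example.com:8080/api/v3"  # hypothetical host

cluster = {
    "id": "yarn-prod-h27",                      # client-supplied id (assumed optional)
    "name": "yarn-prod",
    "user": "admin",
    "version": "2.7.1",
    "status": "UP",
    "tags": ["sched:sla", "type:yarn", "ver:2.7.1"],
}
requests.post(f"{GENIE}/clusters", json=cluster).raise_for_status()

command = {
    "id": "spark-submit-160",
    "name": "spark-submit",
    "user": "admin",
    "version": "1.6.0",
    "status": "ACTIVE",
    "executable": "spark-submit",
    "checkDelay": 5000,                         # ms between job process checks
    "tags": ["type:spark-submit", "ver:1.6.0"],
}
requests.post(f"{GENIE}/commands", json=command).raise_for_status()

# Cluster 1 -> 0..* Command: attach the command to the cluster.
requests.post(f"{GENIE}/clusters/yarn-prod-h27/commands",
              json=["spark-submit-160"]).raise_for_status()
```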
  40. Search Resources
  41. Administration Use Cases
  42. Updating a Cluster
      • Start up a new cluster
      • Register the cluster with Genie
      • Run tests
      • Move tags from the old to the new cluster in Genie; the new cluster begins taking load immediately
      • Let old jobs finish on the old cluster
      • Shut down the old cluster
      • No downtime! (a hedged tag-swap sketch follows below)
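A hedged sketch of the zero-downtime swap: move the routing tags from the old cluster to the new one. This assumes Genie 3's tag endpoints, where PUT replaces a cluster's full tag set; the host and cluster ids are hypothetical.

```python
import requests

GENIE = "http://genie.example.com:8080/api/v3"  # hypothetical host
ROUTING_TAGS = ["sched:sla", "type:yarn", "ver:2.7.1"]

def set_tags(cluster_id: str, tags: list) -> None:
    """Replace a cluster's tag set; jobs route by tags, so this is the swap."""
    requests.put(f"{GENIE}/clusters/{cluster_id}/tags", json=tags).raise_for_status()

set_tags("yarn-prod-h27-new", ROUTING_TAGS)  # new cluster starts taking load now
set_tags("yarn-prod-h27", [])                # old cluster drains, then shuts down
```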
  43. Load Balance Between Clusters
      • Different loads at different times of day
      • Copy tags from one cluster to another to split load
      • Remove tags when done
      • Transparent to all clients!
  44. Update Application Binaries
      • Copy new binaries to the central download location
      • The Genie cache invalidates old binaries on the next invocation and downloads the new ones
      • Instant change across the entire Genie cluster
  45. Genie for Users
  46. A user wants a tool to…
      • Discover a cluster to run the job on
      • Run the job client
      • Handle all dependencies and configuration
      • Monitor the job
      • View job history
      • Get job results
  47. Submitting a Job: the job request names criteria rather than a specific cluster or command, e.g.
      { … "clusterCriteria": [ "type:yarn", "sched:sla" ], "commandCriteria": [ "type:spark", "ver:1.6.0" ] … }
      Genie matches these tags against its registered Clusters and Commands (a hedged submission sketch follows below). Image credit: https://analyticsforinsights.files.wordpress.com/2015/04/superman-data-scientist-graphic.jpg
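A hedged submission sketch using the criteria above: Genie selects a cluster whose tags satisfy the cluster criteria and a command matching the command criteria. The request shape follows the Genie 3 API as I understand it; the host, job name, and commandArgs are hypothetical.

```python
import requests

GENIE = "http://genie.example.com:8080/api/v3"  # hypothetical host

job_request = {
    "name": "daily-playback-agg",               # hypothetical job
    "user": "data_scientist",
    "version": "1.0",
    "clusterCriterias": [{"tags": ["type:yarn", "sched:sla"]}],
    "commandCriteria": ["type:spark", "ver:1.6.0"],
    "commandArgs": "--class com.example.PlaybackAgg agg.jar 20161101",
}
resp = requests.post(f"{GENIE}/jobs", json=job_request)
resp.raise_for_status()
job_id = resp.headers["Location"].rsplit("/", 1)[-1]  # id of the created job
print(requests.get(f"{GENIE}/jobs/{job_id}/status").json())
```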
  48. Genie Job Data Model (diagram): a Job is linked 1:1 to its Job Request, Job Execution, and Job Metadata, and references the Cluster, Command, and Applications (0..*) it resolved to.
  49. Job Request
  50. Python Client Example
  51. Job History
  52. Job Output
  53. Wrapping Up
  54. Data Warehouse
      • S3 for scale
      • Decoupled compute & storage
      • Parquet for speed
  55. Genie at Netflix
      • Runs the OSS code
      • Runs ~45k jobs per day in production
      • Runs on ~25 i2.4xl instances at any given time
      • Keeps ~3 months of jobs (~3.1 million) in history
  56. Resources
      • http://netflix.github.io/genie/ (work in progress for 3.0.0)
      • https://github.com/Netflix/genie (demo instructions in the README)
      • https://hub.docker.com/r/netflixoss/genie-app/ (Docker container)
  57. Questions?
  58. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/netflix-big-data-infrastructure
