
Letgo Data Platform: A global overview

How to develop a Big Data platform around Spark.



  1. Letgo Data Platform: A global overview
  2. Ricardo Fanjul, Data Engineer
  3. LETGO DATA PLATFORM IN NUMBERS: 500 GB of data daily, 530+ event types, 565M events processed daily, 180 TB of storage (S3), 8K events processed per second, < 1 sec NRT processing time
  4. OUR DATA JOURNEY
  5. OUR DATA JOURNEY
  6. MOVING TO EVENTS
  7. MOVING TO EVENTS: Domain Events, Tracking Events
  8. MOVING TO EVENTS: “A Domain Event captures the memory of something interesting (external stimulus) which affects the domain” (Martin Fowler). Domain Events, Tracking Events
  9. GO TO EVENTS { "data":{ "id":"1c4da9f0-605e-4708-a8e7-0f2c97dff16e", "type":"message_sent", "attributes":{ "id":"e0abcd6c-c027-489f-95bf-24796e421e8b", "conversation_id":"47596e76-0fb4-4155-9aeb-6a5ba14c9cef", "product_id":"a0ef64a5-0a4d-48b8-9124-dd57371128f5", "from_talker_id":"5mCK6K8VCc", "to_talker_ids":[ "def4442e-6df5-4385-938a-8180ddfb6c5e" ], "message_id":"e0abcd6c-c027-489f-95bf-24796e421e8b", "type":"offer", "message":"Is this item still available?", "belongs_to_spam_conversation":false, "sent_at":1509235499995 } }, "meta":{ "created_at":1509235500012, ... } }
  10. THE PATH WE CHOSE
  11. INGEST (Data Ingestion, Storage) | PROCESSING (Stream, Batch) | DISCOVER (Query, Data exploitation, Orchestration)
  12. INGEST (Data Ingestion, Storage) | PROCESSING (Stream, Batch) | DISCOVER (Query, Data exploitation, Orchestration)
  13. DATA INGESTION
  14. DATA INGESTION: Kafka Connect. Kafka Connect connectors
  15. DATA INGESTION: Event A, Event …
  16. INGEST (Data Ingestion, Storage) | PROCESSING (Stream, Batch) | DISCOVER (Query, Data exploitation, Orchestration)
  17. STORAGE: We want to store all the events coming from Kafka in S3
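     The deck does not show the job itself; a minimal sketch of this kind of Kafka-to-S3 pipeline with Spark Structured Streaming could look as follows (broker, topic, bucket and column choices are assumptions, not letgo's actual configuration):

        // Sketch: read raw events from Kafka and persist them to S3 as
        // date-partitioned Parquet using Spark Structured Streaming.
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions._

        object EventsToS3 {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder.appName("events-to-s3").getOrCreate()

            val raw = spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092") // hypothetical brokers
              .option("subscribe", "events")                    // hypothetical topic
              .load()

            // Keep the JSON payload as-is and derive a date column to partition by.
            val events = raw
              .select(col("value").cast("string").as("json"), col("timestamp"))
              .withColumn("dt", to_date(col("timestamp")))

            events.writeStream
              .format("parquet")
              .option("path", "s3a://events-bucket/raw/")       // hypothetical bucket
              .option("checkpointLocation", "s3a://events-bucket/checkpoints/raw/")
              .partitionBy("dt")
              .start()
              .awaitTermination()
          }
        }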
  18. STORAGE
  19.
  20. Duplicated events
  21. STORAGE: Deduplication (exactly-once)
  22. STORAGE: Deduplication
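     The slides do not show how deduplication is implemented; one common way to achieve it in a Spark streaming job is to drop duplicates on the event id within a watermark window (column names below are assumed from the event JSON shown earlier, not taken from letgo's code):

        // Sketch: remove duplicate events by id, remembering ids for up to 2 hours.
        // Assumes `events` is a streaming DataFrame with `id` and `event_time` columns.
        val deduped = events
          .withWatermark("event_time", "2 hours")
          .dropDuplicates("id", "event_time")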
  23. Late events
  24. STORAGE: Late events
  25. STORAGE: Late events, dirty buckets. 1. Read a batch of events from Kafka. 2. Write each event to Cassandra. 3. Write dirty “hours” to a compacted topic: key = (event_type, hour). 4. Read the dirty “hours” topic. 5. Read all events with dirty hours. 6. Store in S3.
  26. STORAGE: Late events. Some time later: count the events in Cassandra and S3 and check for inconsistencies.
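     A sketch of steps 4-6 of the dirty-bucket re-export above, assuming the Kafka and Spark Cassandra connectors (topic, keyspace, table and key layout are invented for illustration):

        // Sketch: re-export every (event_type, hour) bucket that received late data.
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder.appName("late-events-export").getOrCreate()

        // 4. Read the compacted topic of dirty buckets (hypothetical key format "event_type/hour").
        val dirtyBuckets = spark.read
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "dirty-hours")
          .load()
          .selectExpr("CAST(key AS STRING) AS bucket")
          .distinct()
          .collect()
          .map(_.getString(0))

        // 5. Read back the events for those buckets from Cassandra (hypothetical table).
        val events = spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "events", "table" -> "events_by_hour"))
          .load()

        // 6. Overwrite the matching S3 partitions.
        dirtyBuckets.foreach { bucket =>
          val Array(eventType, hour) = bucket.split("/")
          events
            .filter(s"event_type = '$eventType' AND hour = '$hour'")
            .write
            .mode("overwrite")
            .parquet(s"s3a://events-bucket/$eventType/hour=$hour")
        }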
  27. S3 Big data problems
  28. STORAGE: S3 Big data problems. Why? Some S3 Big Data problems: 1. Eventual consistency. 2. Very slow renames (rename = copy + delete).
  29. STORAGE: S3 Big data problems. 1. Eventual consistency
  30. STORAGE: S3 Big data problems. Solution: S3Guard. Available in: …
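     S3Guard keeps a consistent view of S3 listings in a DynamoDB table. Enabling it from Spark is mostly S3A configuration; a rough sketch (the table name and the choice to auto-create it are placeholders, not letgo's settings):

        // Sketch: enable S3Guard for the S3A filesystem from a Spark session.
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder
          .appName("with-s3guard")
          .config("spark.hadoop.fs.s3a.metadatastore.impl",
                  "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
          .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "s3guard-metadata")
          .config("spark.hadoop.fs.s3a.s3guard.ddb.table.create", "true")
          .getOrCreate()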
  31. STORAGE: S3 Big data problems. 2. Slow renames. Job freeze?
  32. STORAGE: S3 Big data problems. Solution: new S3A committers (Hadoop 3.1).
  33. STORAGE: S3 Big data problems. References: Committers Architecture; S3A Committers; “Spark and S3” with Ryan Blue (SlideShare, YouTube).
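     With Hadoop 3.1+ and Spark's hadoop-cloud module on the classpath, switching to the new S3A committers is mostly configuration. A rough sketch (the "directory" committer is just one of the available choices, not necessarily the one letgo picked):

        // Sketch: route Spark's output commit protocol through the S3A committers.
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder
          .appName("with-s3a-committers")
          .config("spark.hadoop.fs.s3a.committer.name", "directory")
          .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
                  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
          .config("spark.sql.sources.commitProtocolClass",
                  "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
          .config("spark.sql.parquet.output.committer.class",
                  "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
          .getOrCreate()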
  34. STORAGE: S3 Big data problems. Surprise!
  35. INGEST (Data Ingestion, Storage) | PROCESSING (Stream, Batch) | DISCOVER (Query, Data exploitation, Orchestration)
  36. REAL-TIME USER SEGMENTATION
  37. STREAMING: REAL-TIME USER SEGMENTATION. Stream, journal, “user bucket changed” (steps 1, 2).
  38. REAL-TIME USER STATISTICS
  39. STREAMING: REAL-TIME USER STATISTICS. What we do: • Sessions (last session) • Conversations (last conversation created, counts, …) • Products (last product approved, counts, …) • User classification (last bucket changed, counts, …)
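     The tool behind these statistics is only shown as a logo in the deck; as a generic illustration, per-user counters and “last seen” values can be maintained with a keyed streaming aggregation (column and event-type names below are assumptions):

        // Sketch: per-user statistics over a stream of events.
        // Assumes a streaming DataFrame `events` with user_id, type and event_time columns.
        import org.apache.spark.sql.functions._

        val userStats = events
          .withWatermark("event_time", "1 hour")
          .groupBy(col("user_id"))
          .agg(
            max("event_time").as("last_event_at"),
            count(when(col("type") === "message_sent", true)).as("messages_sent"),
            count(when(col("type") === "product_approved", true)).as("products_approved")
          )
        // Written with outputMode("update") to a store such as Cassandra or Redis.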
  40. REAL-TIME PATTERN DETECTION
  41. STREAMING: REAL-TIME PATTERN DETECTION. “Is it still available?” “Is the price negotiable?” “What condition is it in?” “I offer you …$” “Could we meet at …?”
  42. STREAMING: REAL-TIME PATTERN DETECTION. Event A + Event B = Complex Event C. Event A + nothing in 2 hours = Complex Event D. Some common use cases: • Fraud detection, scammer detection • Real-time recommendations
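     The deck does not say which CEP engine is used; as an example, the two patterns above could be expressed with Flink CEP roughly like this (event and field names are invented):

        // Sketch: detect "offer followed by a reply within 2 hours" per conversation.
        import org.apache.flink.cep.scala.CEP
        import org.apache.flink.cep.scala.pattern.Pattern
        import org.apache.flink.streaming.api.scala._
        import org.apache.flink.streaming.api.windowing.time.Time

        case class ChatEvent(conversationId: String, kind: String, ts: Long)

        object PatternDetection {
          def main(args: Array[String]): Unit = {
            val env = StreamExecutionEnvironment.getExecutionEnvironment
            val events = env.fromElements(            // in reality, a Kafka source
              ChatEvent("c1", "offer", 0L),
              ChatEvent("c1", "reply", 60000L))

            // Event A + Event B = Complex Event C.
            val offerThenReply = Pattern
              .begin[ChatEvent]("offer").where(_.kind == "offer")
              .followedBy("reply").where(_.kind == "reply")
              .within(Time.hours(2))

            // Timed-out partial matches (Event A + nothing in 2 hours = Complex Event D)
            // can be captured with the timeout variant of select/flatSelect.
            CEP.pattern(events.keyBy(_.conversationId), offerThenReply)
              .select(m => s"replied after ${m("reply").head.ts - m("offer").head.ts} ms")
              .print()

            env.execute("pattern-detection-sketch")
          }
        }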
  43. GO TO EVENTS: Tips. INGEST (Data Ingestion, Storage) | PROCESSING (Stream, Batch) | DISCOVER (Query, Data exploitation, Orchestration)
  44. GEODATA ENRICHMENT
  45. BATCH: Geodata enrichment { "data": { "id": "105dg3272-8e5f-426f-bca0-704e98552961", "type": "some_event", "attributes": { "latitude": 39.740028705904, "longitude": -104.97341156236 } }, "meta": { "created_at": 1522886400036 } }
  46. BATCH: Geodata enrichment. What we know: coordinates (longitude and latitude), e.g. POINT (77.3548351 28.6973627)
  47. BATCH: Geodata enrichment. Where is this point? City, state, zip code, DMA.
  48. BATCH: Geodata enrichment. What we wanted to know: states, cities, zip codes, DMAs, which we represent as polygons, e.g. POLYGON((-114.816294 32.508038,-114.814321 32.509023,-114.810159 32.508383,-114.807726 32.508726,-114.805239 32.509985…
  49. BATCH: Geodata enrichment. How we do it: • We load city, state, zip code, DMA… polygons from Well-Known Text (WKT) • Create indexes using the JTS Topology Suite • Custom Spark SQL UDF. SELECT geodata.dma_name, geodata.dma_number AS dma_number, geodata.city AS city, geodata.state AS state, geodata.zip_code AS zip_code FROM ( SELECT geodata(longitude, latitude) AS geodata FROM … )
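     A rough sketch of such a UDF: load region polygons from WKT, index them in a JTS STRtree, and register a point-in-polygon lookup with Spark SQL. The region data and return type below are illustrative; the slide's SQL suggests the real UDF returns a struct with dma, city, state and zip_code fields, while this sketch just returns a region name.

        // Sketch: point-in-polygon lookup as a Spark SQL UDF backed by a JTS STRtree.
        import org.apache.spark.sql.SparkSession
        import org.locationtech.jts.geom.{Coordinate, Geometry, GeometryFactory}
        import org.locationtech.jts.index.strtree.STRtree
        import org.locationtech.jts.io.WKTReader

        val spark = SparkSession.builder.appName("geodata-enrichment").getOrCreate()

        // (name, wkt) pairs for cities/states/zip codes/DMAs, loaded from wherever they live.
        val regions: Seq[(String, String)] = Seq(
          ("some_region", "POLYGON((-115 32, -114 32, -114 33, -115 32))")
        )

        val reader = new WKTReader()
        val index = new STRtree()
        regions.foreach { case (name, wkt) =>
          val geom = reader.read(wkt)
          index.insert(geom.getEnvelopeInternal, (name, geom))
        }
        val factory = new GeometryFactory()

        // Returns the first region whose polygon contains the point, or null.
        spark.udf.register("geodata", (lon: Double, lat: Double) => {
          val point = factory.createPoint(new Coordinate(lon, lat))
          index.query(point.getEnvelopeInternal).toArray
            .map(_.asInstanceOf[(String, Geometry)])
            .collectFirst { case (name, geom) if geom.contains(point) => name }
            .orNull
        })

     Registered this way, it can be called from SQL as on the slide: SELECT geodata(longitude, latitude) AS geodata FROM …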
  50. INGEST (Data Ingestion, Storage) | PROCESSING (Stream, Batch) | DISCOVER (Query, Data exploitation, Orchestration)
  51. QUERYING DATA
  52. QUERYING DATA
  53. QUERYING DATA: Metastore
  54. WHY WE CHOSE SPARK THRIFT SERVER
  55. QUERYING DATA: WHY WE CHOSE SPARK THRIFT SERVER
  56. QUERYING DATA: WHY WE CHOSE SPARK THRIFT SERVER
  57. QUERYING DATA: WHY WE CHOSE SPARK THRIFT SERVER. Thrift Server
  58. QUERYING DATA
  59. QUERYING DATA. CREATE EXTERNAL TABLE IF NOT EXISTS database_name.table_name( some_column STRING..., dt DATE ) PARTITIONED BY (`dt`) USING PARQUET LOCATION 's3a://bucket-name/database_name/table_name'; CREATE TABLE IF NOT EXISTS database_name.table_name( some_column STRING, ... dt DATE ) USING json PARTITIONED BY (`dt`); CREATE TABLE IF NOT EXISTS database_name.table_name USING com.databricks.spark.redshift OPTIONS ( dbtable 'schema.redshift_table_name', tempdir 's3a://redshift-temp/', url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo?user=xxx&password=xxx', forward_spark_s3_credentials 'true' ); CREATE TEMPORARY VIEW table_name USING org.apache.spark.sql.cassandra OPTIONS ( table "table_name", keyspace "keyspace_name" )
  60. QUERYING DATA: CREATE TABLE … USING [parquet, json, csv, …] vs CREATE TABLE … STORED AS …
  61. QUERYING DATA: CREATE TABLE … STORED AS … vs CREATE TABLE … USING [parquet, json, csv, …]: 70% higher performance!
  62. BATCHES WITH SQL
  63. QUERYING DATA: Batches with SQL. 1. Creating the table: CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name( user_id STRING, column_b STRING, ... ) USING PARQUET PARTITIONED BY (`dt` STRING) LOCATION 's3a://example/some_table'
  64. QUERYING DATA: Batches with SQL. 2. Inserting data: INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ...
  65. QUERYING DATA: Batches with SQL
  66. QUERYING DATA: Batches with SQL. 200 files, because of the default value of “spark.sql.shuffle.partitions”
  67. QUERYING DATA: Batches with SQL. Solution: INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ...
  68. QUERYING DATA: Batches with SQL. DISTRIBUTE BY (dt): only one file, not sorted. CLUSTER BY (dt, user_id, column_b): multiple files. DISTRIBUTE BY (dt) SORT BY (user_id, column_b): only one file, sorted by user_id, column_b; good for joins on these columns.
  69. QUERYING DATA: Batches with SQL. INSERT OVERWRITE TABLE database.some_name PARTITION(dt) SELECT user_id, column_b, dt FROM other_table ... DISTRIBUTE BY (dt) SORT BY (user_id)
  70. QUERYING DATA: Batches with SQL. • Hive Bucketing in Apache Spark: SlideShare, YouTube.
  71. INGEST (Data Ingestion, Storage) | PROCESSING (Stream, Batch) | DISCOVER (Query, Data exploitation, Orchestration)
  72. DATA EXPLOITATION: Thrift Server, Metastore
  73. DATA EXPLOITATION
  74. DATA EXPLOITATION: Superset
  75. DATA SCIENTISTS (WINTER IS COMING)
  76. DATA EXPLOITATION: Data Scientists
  77. DATA EXPLOITATION: Data Scientists
  78.
  79. DATA EXPLOITATION: Data Scientists. Too many small files!
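     The usual cure for “too many small files” is to control the number of output files at write time; a generic sketch (paths and file counts are illustrative, not letgo's jobs):

        // Sketch: compact a partition made of thousands of tiny files into a few larger ones.
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

        spark.read
          .parquet("s3a://warehouse/some_table/dt=2018-06-01")
          .repartition(8) // target roughly 8 output files for this partition
          .write
          .mode("overwrite")
          .parquet("s3a://warehouse/some_table_compacted/dt=2018-06-01") // write to a new location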
  80. DATA EXPLOITATION: Data Scientists. Huge query!
  81. DATA EXPLOITATION: Data Scientists. Too much shuffle!
  82. DATA EXPLOITATION: Data Scientists
  83. Some data scientist tips. Listen carefully or you will die!
  84. DATA EXPLOITATION: Data Scientists. Cross-team workshops: BI, data scientists, data engineers. Create a wiki with guidelines, tips, links... Learn from your mistakes! Well done!
  85. INGEST (Data Ingestion, Storage) | PROCESSING (Stream, Batch) | DISCOVER (Query, Data exploitation, Orchestration)
  86. ORCHESTRATION (Apache Airflow) • An open-source project written in Python by Airbnb. • Easy to schedule jobs, write new tasks and monitor them.
  87. ORCHESTRATION
  88. ORCHESTRATION I'm happy!!
  89. TIPS
  90. Bigger is better
  91. TIPS: Bigger is better. With 20-slot nodes and 3-core executors: 6 executors per node, so across 4 nodes total used = 3×6×4 = 72 slots, total wasted = 2×4 = 8. With one 80-slot node: 26 executors, total used = 3×26 = 78, total wasted = 2. We prefer big Hadoop instances: we use m4.16xlarge instances and are migrating to m5.24xlarge instances.
  92. Job Scheduling
  93. TIPS: Job Scheduling
  94. Be polite
  95. TIPS: Be polite. spark.executor.memory = 4G spark.executor.cores = 3 spark.driver.memory = 4G spark.driver.cores = 2 ## dynamicAllocation spark.shuffle.service.enabled=true spark.dynamicAllocation.enabled=true spark.executor.instances=1 spark.dynamicAllocation.minExecutors=1 spark.dynamicAllocation.initialExecutors=3 spark.dynamicAllocation.maxExecutors=20 spark.dynamicAllocation.cachedExecutorIdleTimeout=60s Others are using the same cluster; don't claim more resources than you need.
  96. But not too polite
  97. TIPS: But not too polite
  98. Metrics, metrics!
  99. TIPS: SPARK METRICS. What we did: • Custom InfluxDB Spark metrics sink • Custom Spark metrics using Dropwizard • Grafana dashboards and alerts
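     The InfluxDB sink itself is custom, but the Dropwizard side of application-level metrics looks roughly like this (registry wiring and metric names are illustrative; letgo's setup reports to InfluxDB and Grafana rather than the console):

        // Sketch: counting application-level events with Dropwizard metrics.
        import java.util.concurrent.TimeUnit
        import com.codahale.metrics.{ConsoleReporter, MetricRegistry}

        val registry = new MetricRegistry()
        val eventsProcessed = registry.counter("events.processed")
        val batchDuration = registry.histogram("batch.duration.ms")

        // Report every 30 seconds; a custom Spark metrics sink would push these to InfluxDB instead.
        ConsoleReporter.forRegistry(registry)
          .convertDurationsTo(TimeUnit.MILLISECONDS)
          .build()
          .start(30, TimeUnit.SECONDS)

        // Inside the job:
        eventsProcessed.inc(1000)
        batchDuration.update(542)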
  100. Mistakes are good
  101. TIPS: Mistakes are good
  102. Do you want to join us?
