
[AWS Builders] Effective AWS Glue

AWS offers a range of Analytics services for big data analysis and processing. This session looks at the internal architecture of AWS Glue, which is used to build a data lake catalog and to run ETL over data that keeps growing over time, and introduces ways to use it effectively.


[AWS Builders] Effective AWS Glue

  1. 1. (title slide)
  2. 2. (webinar intro slide)
  3. 3. Rookie · Pro · Master
  4. 4. Rookie · Pro · Master
  5. 5. Become a master of them all!
  6. 6. How to ask questions during the talk: AWS Builders / GoToWebinar. Your questions appear in the "Questions" pane. By default all questions are answered publicly; if you would like a private answer, mark your question as (private). Disclaimer: This content was produced separately for this online seminar to explain AWS services for customers' convenience. If there is any difference or inconsistency between this content and the AWS site, the AWS site (aws.amazon.com) takes precedence. Likewise, if there is any difference or inconsistency between the Korean translation and the English original on the AWS site (including delays in translation), the English original takes precedence. AWS assumes no liability for damages of any kind arising from the use of any information, content, materials, products (including software), or services included in or provided through this content, including but not limited to direct, indirect, incidental, punitive, and consequential damages.
  7. 7. § Introduction § Glue internals § Items § Item1: Processing lots of small files § Item2: Processing a few large files § Item3: Optimizing parallelism § Item4: JDBC partitions § Item5: Python UDFs & performance § Item6: Scheduler § Item7: Python shell § QnA
  8. 8. Introduction
  9. 9. Fully-managed, serverless extract-transform-load (ETL) service for developers, built by developers 1000s of customers and jobs A year ago …
  10. 10. AWS Glue Serverless data catalog & ETL service Data Catalog ETL Job authoring Discover data and extract schema Auto-generates customizable ETL code in Python and Scala Automatically discovers data and stores schema Data searchable, and available for ETL Generates customizable code Schedules and runs your ETL jobs Serverless, flexible, and built on open standards
  11. 11. Putting it together - data lake with AWS Glue Amazon S3 (Raw data) Amazon S3 (Staging data) Amazon S3 (Processed data) AWS Glue Data Catalog Crawlers Crawlers Crawlers
  12. 12. Select AWS Glue customers
  13. 13. AWS Glue Serverless data catalog & ETL service Data Catalog ETL Job authoring Discover data and extract schema Auto-generates customizable ETL code in Python and Scala Automatically discovers data and stores schema Data searchable, and available for ETL Generates customizable code Schedules and runs your ETL jobs Serverless, flexible, and built on open standards
  14. 14. Glue internal
  15. 15. Programming Environment • ETL in Python • Python 2.7 • Boto 3 • ETL in Scala • Scala 2.11 • Spark Cluster • Spark 2.2.1
  16. 16. Programming Environment • 1 DPU (Data Processing Unit) = 1 m4.xlarge node • 4 vCPU • 16 GB RAM • 2 executors • 1 executor = 5 GB RAM, 4 tasks (diagram: driver and executors)
  17. 17. Programming Environment • Glue Job • Minimum DPU: 2 • Default DPU: 10 • Ex) 10 DPU Job • 10 node cluster • 1 Master + 9 Core Nodes • 18 executors • 1 driver • 17 executors
  18. 18. Programming Environment • Internal arguments to AWS Glue • --conf • --debug • --mode • --JOB_NAME
  19. 19. Basics of ETL Job Programming 1. Initialize 2. Read 3. Transform data 4. Write ## Initialize glueContext = GlueContext(SparkContext.getOrCreate()) ## Create DynamicFrame and retrieve data from source ds0 = glueContext.create_dynamic_frame.from_catalog ( database = "mysql", table_name = "customer", transformation_ctx = "ds0") ## Implement data transformation here ds1 = ds0 ... ## Write DynamicFrame from Catalog ds2 = glueContext.write_dynamic_frame.from_catalog ( frame = ds1, database = "redshift", table_name = "customer_dim", redshift_tmp_dir = args["TempDir"], transformation_ctx = "ds2")
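Laid out as a runnable script, the snippet above looks roughly like the following sketch (the imports and TempDir argument handling are assumptions; the catalog databases and tables are the slide's own examples):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

## Resolve job arguments (TempDir is needed for the Redshift write)
args = getResolvedOptions(sys.argv, ["TempDir"])

## 1. Initialize
glueContext = GlueContext(SparkContext.getOrCreate())

## 2. Read: create a DynamicFrame from a Data Catalog table
ds0 = glueContext.create_dynamic_frame.from_catalog(
    database="mysql", table_name="customer", transformation_ctx="ds0")

## 3. Transform: implement data transformations here
ds1 = ds0

## 4. Write: write the DynamicFrame to a Data Catalog target
ds2 = glueContext.write_dynamic_frame.from_catalog(
    frame=ds1, database="redshift", table_name="customer_dim",
    redshift_tmp_dir=args["TempDir"], transformation_ctx="ds2")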
  20. 20. What is Apache Spark? Parallel, scale-out data processing engine Fault-tolerance built-in Flexible interface: Python scripting, SQL Rich eco-system: ML, Graph, analytics, … Apache Spark and AWS Glue ETL Spark core: RDDs SparkSQL Dataframes DynamicFrames AWS Glue ETL AWS Glue ETL libraries Integration: Data Catalog, job orchestration, code-generation, job bookmarks, S3, RDS ETL transforms, more connectors & formats New data structure: DynamicFrames
  21. 21. Dataframes Core data structure for SparkSQL Like structured tables Need schema up-front Each row has same structure Suited for SQL-like analytics Dataframes and Dynamic Frames Dynamic Frames Like dataframes for ETL Designed for processing semi-structured data, e.g. JSON, Avro, Apache logs ...
  22. 22. Public GitHub timeline is … • 35+ event types • semi-structured payload • structure and size varies by event type
  23. 23. Dynamic Frame internals • Schema per-record, no up-front schema needed • Easy to restructure, tag, modify • Can be more compact than dataframe rows • Many flows can be done in a single pass • Example dynamic records: {"id":"2489", "type":"CreateEvent", "payload":{"creator":…}, …} / {"id":4391, "type":"PullEvent", "payload":{"assets":…}, …} / {"id":"6510", "type":"PushEvent", "payload":{"pusher":…}, …} (diagram: per-record id/type fields rolled up into the Dynamic Frame schema)
  24. 24. Dynamic Frame transforms • ResolveChoice(): project, cast, or separate ambiguous columns into distinct columns • ApplyMapping(): remap source columns to target columns • 15+ transforms out-of-the-box
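A hedged sketch of how these two transforms appear in a job script (the frame dyf and the field names are assumptions; the id/type fields follow the GitHub timeline example from the earlier slides):

from awsglue.transforms import ResolveChoice, ApplyMapping

# Resolve ambiguous column types by casting every choice type to long
resolved = ResolveChoice.apply(frame=dyf, choice="cast:long",
                               transformation_ctx="resolved")

# Remap and rename columns: (source name, source type, target name, target type)
mapped = ApplyMapping.apply(
    frame=resolved,
    mappings=[("id", "long", "id", "long"),
              ("type", "string", "event_type", "string")],
    transformation_ctx="mapped")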
  25. 25. Relationalize() transform • Converts a semi-structured schema to a relational schema • Transforms and adds new columns, types, and tables on-the-fly • Tracks keys and foreign keys (PK/FK) across runs • SQL on the relational schema is orders of magnitude faster than JSON processing
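As a sketch (the frame, staging path, and flattened table name are assumptions), Relationalize() returns a collection of DynamicFrames, one per generated table:

from awsglue.transforms import Relationalize

# Flatten nested and array fields into separate relational tables
frames = Relationalize.apply(frame=dyf, staging_path=args["TempDir"],
                             name="root", transformation_ctx="relationalize")

root = frames.select("root")             # the top-level table
payload = frames.select("root_payload")  # a flattened nested field (name assumed)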
  26. 26. toDF(): Convert to a Dataframe fromDF(): Convert from a Dataframe Spigot(): Sample data of any Dynamic Frame to S3 Unbox(): Parse string column as given format into Dynamic Frame Filter(), Map(): Apply Python UDFs to Dynamic Frames Join(): Join two Dynamic Frames And more …. Useful AWS Glue transforms
  27. 27. Performance: AWS Glue ETL • GitHub Timeline ETL performance, DynamicFrames vs. DataFrames, time in sec (lower is better) • Configuration: 10 DPUs, Apache Spark 2.1.1 • Workload: JSON to CSV, filter for Pull events • Data size (# files): Day = 24, Month = 744, Year = 8,699 • On average: 2x performance improvement
  28. 28. Performance: Lots of small files • Lots of small files, e.g. Kinesis Firehose • Vanilla Apache Spark (2.1.1) overheads: must reconstruct partitions (2-pass), too many tasks (task per file), scheduling & memory overheads • AWS Glue Dynamic Frames: integration with Data Catalog, automatically group files per task, rely on crawler statistics • Chart: AWS Glue ETL small-file scalability, time (sec) vs. # partitions : # files (1:2K up to 640:1,280K); Spark goes out-of-memory at >= 320:640K files, while grouping scales to 1.2 million files
  29. 29. AWS Glue execution model: data partitions • Apache Spark and AWS Glue are data parallel. • Data is divided into partitions that are processed concurrently. • A stage is a set of parallel tasks – one task per partition Driver Executors Overall throughput is limited by the number of partitions
  30. 30. AWS Glue execution model: jobs and stages
  31. 31. AWS Glue execution model: jobs and stages Actions
  32. 32. AWS Glue execution model: jobs and stages Jobs
  33. 33. AWS Glue execution model: jobs and stages • Diagram: Job 1 (Stage 1, Stage 2) and Job 2 (Stage 1), composed of Read, Apply Mapping, Filter, Repartition, Drop Nulls, Write, and Show steps
  34. 34. • How is your dataset partitioned? • How is your application divided into jobs and stages? • Data is divided into partitions that are processed concurrently AWS Glue performance: key questions
  35. 35. Enabling job metrics
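Job metrics are turned on per job through the --enable-metrics special job parameter; a minimal boto3 sketch of setting it as a default argument when defining the job (job name, role, and script location are hypothetical):

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="json-to-parquet",            # hypothetical job name
    Role="GlueServiceRole",            # hypothetical IAM role
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://example-bucket/scripts/job.py"},
    DefaultArguments={"--enable-metrics": ""})  # turn on CloudWatch job metrics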
  36. 36. Item1: Processing lots of small files
  37. 37. Example: Processing lots of small files • Let's look at a straightforward JSON to Parquet conversion job • 1.28 million JSON files in 640 partitions:
  38. 38. Example: Processing lots of small files • First try: use a standard SparkSQL job
  39. 39. Example: Processing lots of small files
  40. 40. Example: Processing lots of small files
  41. 41. Example: Processing lots of small files • Driver memory use is growing fast and approaching the 5 GB maximum.
  42. 42. Example: Processing lots of small files • Case 2: Run using AWS Glue DynamicFrames.
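The job script itself appears only as a screenshot in the deck; a minimal sketch of the same JSON-to-Parquet conversion using DynamicFrames (the catalog database/table and output path are assumptions):

# Read the small JSON files through the Data Catalog as a DynamicFrame
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="github", table_name="events", transformation_ctx="datasource")

# Write the result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/parquet/"},
    format="parquet",
    transformation_ctx="sink")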
  43. 43. Example: Processing lots of small files
  44. 44. Example: Processing lots of small files Driver memory remains below 50% for the entire duration of execution.
  45. 45. Example: Processing lots of small files
  46. 46. Example: Processing lots of small files
  47. 47. Options for grouping files • groupFiles • inPartition: within a partition. • acrossPartition: from different partitions. • groupSize • Target size of each group.
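These options are passed as S3 connection options when reading; a hedged sketch (the input path and group size are assumptions):

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-bucket/small-json/"],  # hypothetical input prefix
        "recurse": True,
        "groupFiles": "inPartition",  # group small files within each partition
        "groupSize": "1048576",       # target ~1 MB per group
    },
    format="json",
    transformation_ctx="dyf")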
  48. 48. Example: Aggressively grouping files • Execution is much slower, but hasn't crashed. "groupFiles": "acrossPartition"
  49. 49. Example: Aggressively grouping files • Executor memory usage is higher than the driver's. Only one executor is active.
  50. 50. Item2: Processing a few large files
  51. 51. Example: Processing a few large files • Let's see how this looks on a sample dataset of 5 large CSV files. • Each file is • 12.5 GB uncompressed • 1.6 GB gzip • 1.3 GB bzip2 • The script converts the data to Parquet.
  52. 52. Example: Processing a few large gzip files • We only have 5 partitions – one for each file. • Job fails after 2 hours.
  53. 53. Example: Processing a few large bzip2 files • Bzip2 files can be split into blocks, so we see up to 104 tasks. • Job completes in 18 minutes.
  54. 54. Example: Processing a few large bzip2 files • With 15 DPU, the number of active executors closely tracks the maximum needed number of executors.
  55. 55. Example: Processing a few large uncompressed files • Uncompressed files can be split into lines, so we construct 64MB partitions. • Job completes in 12 minutes.
  56. 56. Example: Processing a few large files • If you have a choice of compression type, prefer bzip2. • If you are using gzip, make sure you have enough files to fully utilize your resources. • Bandwidth is rarely the bottleneck for AWS Glue jobs, so consider leaving files uncompressed.
  57. 57. Item3: Optimizing parallelism
  58. 58. Example: optimizing parallelism • Processing large, splittable bzip2 files. • With 10 DPU, the maximum needed executors metric shows room for scaling.
  59. 59. DPU sizing § 17 executors (maximum allocated executors): 10 DPU = 10-node cluster = 1 master + 9 core nodes; 9 core nodes = 18 executors = 1 driver + 17 executors § 27 executors (maximum needed executors): 1 driver + 27 executors = 28 executors = 14 core nodes; 14 core nodes + 1 master = 15-node cluster = 15 DPU
  60. 60. Example: optimizing parallelism With 15 DPU, active executors closely tracks maximum needed executors.
  61. 61. Item4: JDBC partitions
  62. 62. AWS Glue JDBC partitions • For JDBC sources, by default each table is read as a single partition. • AWS Glue automatically repartitions datasets with fewer than 10 partitions after the data has been loaded.
  63. 63. Reading JDBC partitions
  64. 64. Reading JDBC partitions
  65. 65. Reading JDBC partitions A single executor is used for the JDBC query Data is repartitioned for the rest of the job.
  66. 66. Options for reading database tables in parallel • hashexpression – Integer expression to use for distribution. • hashfield – Single column to use for distribution. • hashpartitions – Number of parallel queries to make. Default is 7. • The read turns into a collection of parallel queries, one per hash partition.
  67. 67. Options for reading database tables in parallel • Guidelines for picking distribution keys. • For hashexpression, choose a column that is evenly distributed across values. A primary key works well. • If no such field exists, use hashfield to define one. • Example: The taxi dataset does not have a primary key, so we set hashfield to partition based on day of the month: datasource0 = glueContext.create_dynamic_frame.from_catalog( database = "nyctaxi", table_name = "green-mysql-large", additional_options={'hashfield': 'day(lpep_pickup_datetime)', 'hashpartitions': 15})
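By contrast, when the table does have an evenly distributed integer key, hashexpression is the natural choice; a hedged sketch (the database, table, and column names are hypothetical):

datasource1 = glueContext.create_dynamic_frame.from_catalog(
    database="sales",                  # hypothetical catalog database
    table_name="orders-mysql",         # hypothetical catalog table
    additional_options={
        "hashexpression": "order_id",  # evenly distributed integer column, e.g. a primary key
        "hashpartitions": "15",        # number of parallel queries
    },
    transformation_ctx="datasource1")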
  68. 68. Options for reading database tables in parallel • Four executors can process 16 partitions concurrently.
  69. 69. Options for reading database tables in parallel • Make sure to understand impact to database engine.
  70. 70. Job Bookmarks for JDBC Queries • Job bookmarks only work when the source table has an ordered primary key. • Updates are not handled today.
  71. 71. Item5: Python performance
  72. 72. Python performance • Using map and filter in Python is expensive for large data sets. • All data is serialized and sent between the JVM and Python. • Alternatives • Use AWS Glue Scala SDK. • Convert to DataFrame and use Spark SQL expressions. Spark JVM Python VM
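A hedged sketch of the DataFrame alternative: convert the DynamicFrame, apply built-in Spark SQL expressions that run inside the JVM, and convert back (the frame and column names are assumptions):

from pyspark.sql import functions as F
from awsglue.dynamicframe import DynamicFrame

df = dyf.toDF()                                         # DynamicFrame -> Spark DataFrame
df = df.filter(F.col("type") == "PullEvent")            # instead of Filter() with a Python lambda
df = df.withColumn("day", F.dayofmonth("created_at"))   # instead of Map() with a Python UDF
dyf_out = DynamicFrame.fromDF(df, glueContext, "dyf_out")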
  73. 73. Item6: Scheduler
  74. 74. Glue
  75. 75. Glue
  76. 76. Boto3
  77. 77. Jenkins with boto3
  78. 78. Oozie with boto3
  79. 79. Airflow with boto3
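The slides above show Glue's built-in triggers and external schedulers (Jenkins, Oozie, Airflow) driving jobs through boto3; a minimal sketch of what such a call looks like (the job name and argument are hypothetical):

import boto3

glue = boto3.client("glue", region_name="ap-northeast-2")

# Start the job and remember the run id
run = glue.start_job_run(
    JobName="json-to-parquet",                    # hypothetical Glue job
    Arguments={"--source_prefix": "2019/01/01"})  # hypothetical job argument

# Poll the run state from the scheduler
status = glue.get_job_run(JobName="json-to-parquet", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])            # e.g. RUNNING, SUCCEEDED, FAILED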
  80. 80. Item7: Python shell
  81. 81. Announcing a new job type: Python shell A new cost-effective ETL primitive for small to medium tasks Python shell 3rd party service
  82. 82. AWS Glue Python shell specs Python 2.7 environment with boto3, awscli, numpy, scipy, pandas, scikit-learn, PyGreSQL, … cold spin-up: < 20 sec, support for VPCs, no runtime limit sizes: 1 DPU (includes 16GB), and 1/16 DPU (includes 1GB) pricing: $0.44 per DPU-hour, 1-min minimum, per-second billing
  83. 83. Python shell collaborative filtering example • Amazon customer reviews dataset (s3://amazon-reviews-pds), Video category • Compute a low-rank approximation of the (customer x product) ratings matrix using SVD, with scipy sparse matrices and its SVD library • Step times (sec): Redshift COPY 13, Extract ratings 5, Generate matrix 1,552, SVD (k=1000) 2,575, Total 4,145 (69 min, $0.60)
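A hedged sketch of the matrix and SVD steps with SciPy's sparse library (the ratings DataFrame and its column names are assumptions; k=1000 follows the slide):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Build a sparse (customer x product) ratings matrix from integer indices
rows = ratings["customer_idx"].values    # hypothetical pandas DataFrame of ratings
cols = ratings["product_idx"].values
vals = ratings["star_rating"].values.astype(np.float64)
matrix = csr_matrix((vals, (rows, cols)))

# Rank-k truncated SVD as the low-rank approximation
u, s, vt = svds(matrix, k=1000)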
  84. 84. Please leave your feedback to help us improve future seminars! ▶ We answer the questions you asked. ▶ Slides and session recordings are provided. http://bit.ly/awskr-webinar
