
Creare e gestire Data Lake e Data Warehouses (Creating and Managing Data Lakes and Data Warehouses)


AWS Summit Milano 2019 - Creare e gestire Data Lake e Data Warehouses - Giorgio Nobile, Solutions Architect, AWS | Francesco Marelli, Solutions Architect, AWS | Customer: THRON


  1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data lakes and analytics. Giorgio Nobile, AWS Solutions Architect | Francesco Marelli, AWS Solutions Architect | Dario De Agostini, CTO, THRON. AWS Summit 2019 - Milan
  2. https://bit.ly/AWSDataLakeMilan
  3. Defining the AWS Data Lake. A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous datasets. Key data lake attributes: decoupled storage and compute; rapid ingest and transformation; secure multi-tenancy; query in place; schema on read
  4. Data lakes help you cost-effectively scale. Store exabytes of data. Stage data from landing dock to transformed to curated, and make it available at each stage. Load, transform, and catalog once; make data available to many tools. Open formats and interfaces support innovation. Services: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Kinesis Video Streams, Amazon S3, Amazon Redshift, Amazon EMR, Athena, Amazon Kinesis, Amazon Elasticsearch Service, AI Services, Amazon QuickSight
  5. How it works: Data Lakes and analytics on AWS. Build Data Lakes quickly: identify, crawl, and catalog sources; ingest and clean data; transform into optimal formats. Simplify security management: enforce encryption; define access policies; implement audit logging. Enable self-service and combined analytics: analysts discover all data available for analysis from a single data catalog, and use multiple analytics tools over the same data. Sources: OLTP, ERP, CRM, LOB, devices, web, sensors, social. Services: Amazon S3, IAM, KMS, Kinesis, Data Catalog, Athena, Amazon Redshift, AI Services, Amazon EMR, Amazon QuickSight
  6. Why Amazon S3 for the Data Lake? High performance, secure, durable, available, easy to use, scalable and affordable, integrated
  7. Amazon Kinesis: real time. Easily collect, process, and analyze video and data streams in real time. Kinesis Video Streams: capture, process, and store video streams for analytics. Kinesis Data Streams: build custom applications that analyze data streams. Kinesis Data Firehose: load data streams into AWS data stores. Kinesis Data Analytics: analyze data streams with SQL
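Record routing in Kinesis Data Streams is driven by a partition key: the 128-bit MD5 hash of the key selects the shard whose hash-key range contains it. A minimal sketch of that routing model, assuming evenly split hash-key ranges (as in a freshly created stream); this is an illustration, not the service's internal code:

```python
import hashlib

def assign_shard(partition_key: str, num_shards: int) -> int:
    # Kinesis hashes the partition key with MD5 into a 128-bit integer
    # and routes the record to the shard owning that hash-key range.
    # Here the ranges are assumed to be evenly split across shards.
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = (2 ** 128) // num_shards
    return min(h // range_size, num_shards - 1)

# The same key always lands on the same shard, which is what preserves
# per-key ordering within a stream.
shard = assign_shard("sensor-42", 4)
```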
  8. Processing and querying in place. Fully managed process and query: catalog, transform, and query data in Amazon S3; no physical instances to manage. User-defined functions: bring your own functions and code as a Lambda function, and execute without provisioning servers
  9. Amazon S3 Select and Amazon Glacier Select: select a subset of data from an object based on a SQL expression
  10. Motivation behind Amazon S3 Select. Without it: "GET all the data from S3 objects, and my application will filter the data that I need." Redshift Spectrum example from a customer running 50,000 queries: 6 PB of data fetched from S3, but only 650 TB used in Amazon Redshift; roughly 10% of the data fetched from S3 was actually needed
  11. Amazon S3 Select: serverless MapReduce. Before (200 seconds, 11.2 cents):

        # Download every object and filter client-side
        for key in src_keys:
            response = s3_client.get_object(Bucket=src_bucket, Key=key)
            contents = response['Body'].read()
            for line in contents.split('\n')[:-1]:
                line_count += 1
                try:
                    data = line.split(',')
                    srcIp = data[0][:8]
                    ...

      After (95 seconds, 2.8 cents):

        # Push the projection down to S3 with S3 Select
        for key in src_keys:
            response = s3_client.select_object_content(
                Bucket=src_bucket, Key=key,
                expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj")
            contents = response['Body'].read()
            for line in contents:
                line_count += 1
                try:
                    ...

      Result: 2x faster at 1/5 of the cost
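The projection in the slide's query, SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object, can be simulated locally to see exactly what S3 Select hands back to the client. This toy filter only illustrates the query semantics; it is not the boto3 API:

```python
def s3_select_sim(csv_text: str) -> list:
    # Emulates SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object:
    # for each CSV row, keep the first 8 characters of column 1
    # plus all of column 2, and drop everything else.
    rows = []
    for line in csv_text.strip().split("\n"):
        cols = line.split(",")
        rows.append((cols[0][:8], cols[1]))
    return rows

# Only the projected columns cross the network, which is where the
# latency and cost savings on the slide come from.
records = s3_select_sim("192.168.10.25,GET,/index.html\n10.0.0.1,POST,/api")
```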
  12. Amazon Athena: interactive analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage, and no data to load. Supports multiple data formats; define schema on demand. Query instantly, pay per query, open, easy
  13. Choosing the right data formats. There is no such thing as the "best" data format; all involve tradeoffs depending on workload and tools. CSV, TSV, and JSON are easy, but not efficient; compress and store/archive them as raw input. Columnar compressed formats (Parquet or ORC) are generally preferred: smaller storage footprint means lower cost, and scans and queries are more efficient. Row-oriented formats (Avro) are good for full data scans. Key considerations are cost, performance, and support
  14. Choosing the right data formats (cont.). Athena charges by the amount of data scanned per query, so use compressed columnar formats (Parquet, ORC), which are easy to integrate with a wide variety of tools. Example: logs stored as text files take 1 TB on Amazon S3, with a 237-second query that scans 1.15 TB and costs $5.75. The same logs stored in Apache Parquet format take 130 GB (87% less), with a 5.13-second query (34x faster) that scans 2.69 GB (99% less data) and costs $0.013 (99.7% cheaper)
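The savings in the table follow directly from pay-per-data-scanned pricing; the figures are consistent with the $5-per-TB list price Athena had at the time of this talk. A quick check of the slide's numbers:

```python
def athena_cost_usd(gb_scanned: float, price_per_tb: float = 5.0) -> float:
    # Athena bills by data scanned; $5/TB was the list price in 2019.
    return gb_scanned / 1024 * price_per_tb

text_cost = athena_cost_usd(1.15 * 1024)   # 1.15 TB scanned -> $5.75
parquet_cost = athena_cost_usd(2.69)       # 2.69 GB scanned -> ~$0.013
savings = 1 - parquet_cost / text_cost     # ~99.7% cheaper, as on the slide
```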
  15. Data prep is ~80% of data lake work: collecting data sets, cleaning and organizing data, building training sets, mining data for patterns, refining algorithms, other
  16. AWS Glue: serverless data catalog and ETL. Data Catalog: automatically discovers data and extracts its schema; data becomes searchable and available for ETL. ETL job authoring: auto-generates customizable ETL code in Python and Spark, and schedules and runs your ETL jobs. Serverless
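What a Glue crawler does conceptually (scan records and infer a column-to-type schema for the catalog) can be sketched for JSON records. This toy inference is only an illustration of the idea, not the crawler's actual algorithm; the widening-to-string fallback on type conflicts is an assumption:

```python
def infer_schema(records: list) -> dict:
    # Walk every record and map each top-level field to a catalog-style
    # type name; conflicting types widen to "string" as a safe fallback.
    type_names = {bool: "boolean", int: "bigint", float: "double", str: "string"}
    schema = {}
    for rec in records:
        for field, value in rec.items():
            t = type_names.get(type(value), "string")
            if schema.get(field, t) != t:
                t = "string"  # widen on conflict
            schema[field] = t
    return schema

schema = infer_schema([
    {"id": 1, "region": "EMEA", "isMobile": False},
    {"id": 2, "region": "APAC", "isMobile": True},
])
# schema == {"id": "bigint", "region": "string", "isMobile": "boolean"}
```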
  17. AWS Lake Formation (join the preview): build, secure, and manage a data lake in days. Build a data lake in days, not months: build and deploy a fully managed data lake with a few clicks. Enforce security policies across multiple services: centrally define security, governance, and auditing policies in one place and enforce them for all users and all applications. Combine different analytics approaches: empower analyst and data scientist productivity with self-service discovery and safe access to all data from a single catalog
  18. Traditionally, analytics looked like this: OLTP, ERP, CRM, and LOB sources feeding a data warehouse for business intelligence. Expensive: large initial capex plus $10k-$50k/TB/year. GB-TB scale, not designed for PB/EB. Relational data only. 90% of data was thrown away because of cost
  19. Data lakes evolve the traditional approach: the data warehouse and business intelligence now sit alongside a data lake fed by OLTP, ERP, CRM, and LOB systems plus devices, web, sensors, and social sources, with a catalog, machine learning, DW queries, big data processing, and interactive and real-time analytics on top. Relational and non-relational data. TB-EB scale. Schema defined during analysis. Diverse analytical engines to gain insights. Designed for low-cost storage and analytics
  20. What does data warehouse modernization mean? Easy to use: don't waste time on menial administrative tasks and maintenance. Extends to your data lake: directly analyze data stored in your data lake in open formats. Any scale of data, workloads, and users: dynamically scale up to guarantee performance even with unpredictable demands and data volumes. Faster time-to-insights: consistently fast performance, even with thousands of concurrent queries and users
  21. Amazon Redshift: fast, simple, cost-effective data warehouse that can extend queries to your data lake. Fastest: faster time-to-insight for all types of analytics workloads, powered by machine learning, columnar storage, and MPP. Unlimited scale: dynamically scale up to guarantee performance even with unpredictable analytical demands and data volumes. Extends your data lake: analyze data in the Amazon S3 data lake in place and in open formats such as Parquet, ORC, and JSON using SQL tools, together with data loaded into Redshift's high-performance SSDs. 1/10th the cost: start at $0.25 per hour, save costs with automated administration tasks, and eliminate business impact due to downtime; as low as $1,000 per terabyte per year
  22. Amazon Redshift architecture. Leader node: simple SQL endpoint (JDBC/ODBC); stores metadata; optimizes the query plan; coordinates query execution. Compute nodes: local columnar storage; parallel/distributed execution of all queries, loads, backups, restores, and resizes; interconnected over 10 GigE (HPC). Ingestion, backup, and restore go through the compute nodes. Start at just $0.25/hour. DC1: SSD, scales from 160 GB to 326 TB. DS2: HDD, scales from 2 TB to 2 PB
  23. Security is built in: network isolation (compute nodes live in an internal VPC; the leader node is reached from the customer VPC over JDBC/ODBC), end-to-end encryption, integration with AWS Key Management Service, and a selection of compliance certifications
  24. Concurrency Scaling for bursts of user activity (preview). Automatically creates more clusters on demand, backed by Redshift Managed S3 and a caching layer. Consistently fast performance even with thousands of concurrent queries. No advance hydration required. Quickly scales to serve a changing query workload
  25. Amazon Redshift Elastic Resize (GA). Adds additional nodes to a Redshift cluster and redistributes data across the new configuration in minutes, with minimal transition time. Scale compute and storage on demand; scale up and down in minutes
  26. Amazon Redshift intelligent administration (coming soon). Automates data distribution in tables for improved performance and disk space utilization, and provides intelligent recommendations for tuning based on continuous workload analysis. Distribution styles: ALL (full copy on each node), EVEN (rows spread round-robin across slices), and KEY (rows placed by a recommended distribution key). No more messing with distkeys!
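The EVEN and KEY distribution styles on this slide can be sketched with a toy distributor. Redshift's real hash function is internal, so crc32 stands in here; the point of the sketch is that KEY distribution co-locates all rows sharing a distribution-key value on one slice, so joins on that key need no data movement:

```python
import zlib
from collections import defaultdict

def distribute(rows, num_slices, dist_key=None):
    # dist_key=None models EVEN (round-robin placement);
    # a column name models KEY (hash of that column picks the slice).
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        if dist_key is None:
            target = i % num_slices                      # EVEN
        else:
            h = zlib.crc32(str(row[dist_key]).encode())  # KEY
            target = h % num_slices
        slices[target].append(row)
    return slices

orders = [{"customer": c, "amount": a} for c, a in
          [("keyA", 10), ("keyB", 20), ("keyA", 30), ("keyC", 40)]]
by_key = distribute(orders, num_slices=4, dist_key="customer")
# All "keyA" rows land on the same slice.
```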
  27. Amazon Redshift intelligent maintenance (coming soon). Maintenance processes like vacuum and analyze will automatically run in the background, and Redshift will automatically adjust the WLM concurrency setting to deliver optimal throughput. Moving towards zero maintenance
  28. Run stored procedures in Amazon Redshift (coming soon). Amazon Redshift will support stored procedures in PL/pgSQL format, so you can bring your existing stored procedures to Amazon Redshift and run them where the data is, to efficiently handle ETL, data validation, and custom business logic. Migrating to Amazon Redshift is even easier!
  29. THE INTELLIGENT DAM PLATFORM. Speaker: Dario De Agostini, CTO & Co-Founder, THRON. https://www.linkedin.com/in/dariodeagostini/
  30. With the support of Artificial Intelligence, THRON lets you reduce the management costs of all the human activities tied to the entire content lifecycle
  31. Success stories
  32. Content workflow: THRON lets you control the entire content lifecycle. "THRON was included in Forrester's Landscape as an emerging vendor of a cutting-edge DAM for marketers, as it demonstrates advanced analytics and intelligence capabilities." - Nick Barber, Senior Analyst
  33. The need
  34. Data volume: events processed grew from roughly 1.2 billion to 2.2 billion (chart). 100 million new events per month; retention makes the data volume keep growing
  35. Unpredictable load: up to 4x spikes
  36. Architecture 1/4
  37. Architecture 2/4
  38. Architecture 3/4
  39. Architecture 4/4
  40. Benefits achieved. Efficient resource usage: the ES cluster goes from 4 I3.2xlarge instances for data load to 3 I3.large instances for serving; Spot instances used for EMR. Drastic reduction in development time: Data Pipeline abstracts data-flow management and makes evolution very easy, with excellent timeout and retry handling. Reduction of "ancillary" costs: alarms via SNS and centralized logging. Scalability: Kinesis and Lambda provide great scalability for real-time data processing. Accessible data exploration: Athena saves us about one person-day per month for less than $50/month of spend. Resilience and high availability: thanks to the use of containers on ECS. Built in fewer than 7 person-days
  41. Follow us on Medium: https://medium.com/thron-tech. Join us, we're hiring: https://www.thron.com/en/about/careers
  42. https://bit.ly/AWSDataLakeMilan
  43. JSON payload example for each event sent:

        {
          "r": 255, "g": 0, "b": 0, "c": "Red",
          "device": {
            "id": "4992157",
            "browser": "Chrome",
            "browserVersion": "72.0.3626.109",
            "os": "Mac OS",
            "isMobile": false,
            "isMobileIOS": false,
            "isMobileAndroid": false
          },
          "dt": {
            "year": 2019, "month": 2, "day": 25,
            "hour": 18, "minutes": 43, "seconds": 47, "millis": 725
          },
          "id": 1551116627725,
          "region": "Outside Italy",
          "awsExperience": "1-3 Years",
          "awsServiceArea": "Management Tools"
        }
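Nested payloads like this one are typically flattened into column-friendly records before being queried with Athena or loaded into Redshift. A minimal flattener for events of this shape (the underscore naming convention is an assumption, not part of the demo):

```python
def flatten(event: dict, prefix: str = "") -> dict:
    # Recursively turn nested objects into flat, column-friendly keys,
    # e.g. {"device": {"os": "Mac OS"}} -> {"device_os": "Mac OS"}.
    flat = {}
    for key, value in event.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

row = flatten({"c": "Red", "device": {"os": "Mac OS", "isMobile": False},
               "dt": {"year": 2019, "month": 2}})
# row == {"c": "Red", "device_os": "Mac OS", "device_isMobile": False,
#         "dt_year": 2019, "dt_month": 2}
```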
  44. Demo application architecture: users reach a static web application served from an S3 bucket via Amazon CloudFront and authenticate with Amazon Cognito; the browser and mobile clients use the AWS Browser JS SDK to send events to Amazon Kinesis Data Firehose, which delivers them to Amazon S3; AWS Glue catalogs the data, Amazon Athena queries it, and Amazon QuickSight visualizes it
  45. (closing slide)
