Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data Lake Storage

Cloud Storage is evolving rapidly, and our Azure Storage portfolio has added a ton of new industry leading capabilities. In this session you will learn the do's and don'ts of building data lakes on Azure Data Lake Storage. You will learn about the commonly used patterns, how to set up your accounts and pipelines to maximize performance, how to organize your data and various options to secure access to your data. We will also cover customer use cases and highlight planned enhancements and upcoming features.

  1. Designing performant and scalable data lakes using Azure Data Lake Storage. Rukmani Gopalan (@RukmaniGopalan)
  2. Agenda • Data Lake Concepts and Patterns • Designing your data lake • Set up • Organize data • Secure data • Manage cost • Optimizing your data lake • Achieve the best performance and scale
  3. Traditional on-prem analytics pipeline (diagram): operational databases and business/custom apps feed an enterprise data warehouse through ETL, which in turn loads data marts for reporting, analytics, and data mining.
  4. Modern data warehouse (diagram): structured and unstructured sources (logs, media, files, business/custom apps) are ingested with Azure Data Factory, stored in Azure Data Lake Storage, prepped and trained with Azure Databricks and Azure Synapse Analytics, modeled and served with Azure Synapse Analytics, and surfaced in Power BI.
  5. Advanced Analytics (diagram): the same ingest / store / prep & train / model & serve pipeline (Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics), with Cosmos DB serving results to apps alongside Power BI.
  6. Realtime Analytics (diagram): adds sensors and IoT as unstructured sources and a message broker on the ingest path, alongside the same pipeline of Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, Cosmos DB, Power BI, and apps.
  7. Azure Data Lake Storage: a “no-compromises” data lake, i.e. secure, performant, massively scalable data lake storage that brings the cost and scale profile of object storage together with the performance and analytics feature set of data lake storage. • Scalable: no limits on data store size; global footprint (50 regions) • Fast: optimized for Spark and Hadoop analytic engines; atomic directory operations mean jobs complete faster • Secure: support for fine-grained ACLs protecting data at the file and folder level; multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration • Manageable: automated lifecycle policy management; object-level tiering • Cost effective: object store pricing levels; file system operations minimize the transactions required for job completion • Integration ready: tightly integrated with Azure end-to-end analytics solutions
  8. Azure Data Lake Storage: a cloud storage platform with first-class file/folder semantics and support for multiple protocols and cost/performance tiers, built on object storage with a common Blob Storage foundation. • Blob API (object data): server backups, archive storage, semi-structured data • ADLS API (analytics data): Hadoop file system, file and folder hierarchy, granular ACLs, atomic file transactions • NFS v3, preview (file data): HPC data and applications using NFS v3 against large, sequentially read data sets • Common foundation: object tiering and lifecycle policy management, AAD integration, RBAC, storage account security, HA/DR support through ZRS and RA-GRS
  9. Data Lake Architecture - Summary • Store large volumes of multi-structured data in their native format • Defer the work to ‘schematize’ until value and requirements are known (schema-on-read) • Extract high-value insights from the multi-structured data • Build intelligent business scenarios based on the insights
  10. Designing Your Data Lake • How do I set up my data lake? • How do I organize my data? • How do I secure my data? • How do I manage cost?
  11. How do I set up my data lake? • Centralized vs federated implementation • Data management and administration done by a central team vs by business units/domains • Blueprint approach to federated data lakes with centralized governance • Flexible: single or multiple storage accounts
  12. Recommendations ✓ Isolate development from pre-production and production data lakes ✓ Identify logical datasets, resources, and management needs; this drives the centralized vs federated approach • Business unit boundaries • Regional boundaries ✓ Promote sharing data/insights across business units; beware of data silos
  13. How do I organize my data? Azure Data Lake Storage hierarchy: • Storage account: the Azure resource that contains data objects • Container: organizes data within a storage account; contains a set of files/folders • Folder/directory: organizes data within a container; contains a set of files/folders; Hadoop file system friendly • File: holds data that can be read or written
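The hierarchy above can be scripted. Below is an illustrative sketch (not part of the deck) that creates zone containers and a nested folder layout with the azure-storage-file-datalake Python SDK; the account name, zone names, and folder path are made-up examples.

```python
# Illustrative sketch: lay out zone containers and date-based folders in ADLS.
# The account name, zones, and paths below are placeholders, not from the session.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",  # hypothetical account
    credential=DefaultAzureCredential(),                         # AAD auth, as the deck recommends
)

# One container per zone: raw (as ingested), curated (cleansed, Parquet), workspace (ad hoc).
for zone in ("raw", "curated", "workspace"):
    fs = service.create_file_system(file_system=zone)
    # Folders mirror the semantic structure, e.g. source/dataset/year/month/day.
    fs.create_directory("sales/orders/2020/06/15")
```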
  14. Recommendations ✓ Organize data based on semantic structure as well as the desired access control ✓ Separate the different zones into different accounts, containers, or folders depending on business need
  15. How do I secure my data? • Perimeter/network: service endpoints, private endpoints • Authentication: Azure Active Directory (recommended), shared keys, SAS tokens • Authorization: RBAC (coarse grained), POSIX ACLs (fine grained), shared key • Data protection: encryption on the wire with HTTPS; encryption at rest with service- and customer-managed keys; diagnostic logs
  16. A Little More on Authorization ✓ RBACs and ACLs integrated with AAD • RBACs: storage account and container • ACLs: files and folders ✓ Other access mechanisms (not recommended) • Shared keys: disable if not needed (preview) • SAS tokens: short-lived access
  17. Recommendations ✓ Use service or private endpoints for network security ✓ Use Azure Active Directory authentication to manage access ✓ Use RBACs for coarse-grained access (at the storage account or container level) and ACLs for fine-grained access control (at the file or folder level) ✓ AAD groups greatly simplify access management
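One way to apply the RBAC-plus-ACL recommendation in practice, sketched with the azure-storage-file-datalake SDK. This is illustrative only: the account, container, folder, and AAD group object ID are placeholders, and coarse-grained RBAC assignments would still be made at the account or container scope (for example through the portal or CLI).

```python
# Illustrative sketch: grant an AAD group read+execute on a folder via POSIX ACLs.
# All names and the group object ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
folder = service.get_file_system_client("curated").get_directory_client("sales/orders")

readers_group = "00000000-0000-0000-0000-000000000000"  # AAD group object ID (placeholder)
acl = (
    "user::rwx,group::r-x,other::---,"
    f"group:{readers_group}:r-x,"             # access ACL on the existing folder
    f"default:group:{readers_group}:r-x"      # default ACL inherited by new children
)
folder.set_access_control(acl=acl)
```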
  18. How do I manage cost? • Choose the right set of features for your business: cost vs benefit • E.g. the redundancy option: the criticality of geo-redundancy differs for production vs dev environments; options range from single-region (LRS, ZRS) to dual-region (GRS, RA-GRS, GZRS)
  19. How do I manage cost? (Continued…) • Control data growth to minimize the risk of a data swamp • Workspace data management • Leverage lifecycle management policies • Tiering • Retention
  20. Recommendations ✓ Choose the features of data lake storage based on business need; pre-prod and development environment needs might vary from production needs ✓ Leverage lifecycle management policies for better data management ✓ Move data to a cooler tier if it is not actively used, but be aware of higher transaction costs and minimum retention policies ✓ Use retention policies to delete data that is no longer needed
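As a rough illustration of the lifecycle-management recommendation, the dict below mirrors the JSON shape of an Azure Storage lifecycle policy rule that tiers cold data and deletes it after a retention period. The rule name, prefix, and day counts are invented for the example; the policy would be applied through the portal, CLI, or management SDK rather than this snippet.

```python
# Illustrative lifecycle management rule, expressed as a Python dict in the same
# shape as the policy JSON. Prefixes and day counts are example assumptions.
lifecycle_policy = {
    "rules": [
        {
            "name": "age-out-raw-zone",
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["raw/sales/"],  # hypothetical raw-zone prefix
                },
                "actions": {
                    "baseBlob": {
                        # Tier data to cool once it has not been modified for 90 days...
                        "tierToCool": {"daysAfterModificationGreaterThan": 90},
                        # ...and delete it after a year (retention).
                        "delete": {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }
    ]
}
```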
  21. How do I optimize my data lake? Goal: optimize for performance AND scale as the data and applications on the data lake continue to grow. The basic considerations are: • Optimize for high throughput: target at least a few MBs per transaction (the higher the better) • Optimize data access patterns: reduce unnecessary scanning of files, read only the data you need to read, and write efficiently so that downstream applications that read the data benefit
  22. File size and format • Too many small files adversely impact performance • Choosing the right format brings better performance AND lower cost • Parquet: integrated optimizations with Azure Synapse Analytics and Azure Databricks • Recommendations: ✓ Modify the source to ingest larger files into the data lake ✓ Coalesce and convert to the right format (e.g. Parquet) in the curation phase of your analytics pipelines ✓ For real-time analytics pipelines (e.g. sensor data in an IoT application), microbatch for larger writes
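A minimal PySpark sketch of the curation step described above: read many small raw files, coalesce them into a handful of larger files, and write Parquet. The paths, source format, and target file count are assumptions, not values from the session.

```python
# Illustrative curation step: small raw JSON files in, fewer large Parquet files out.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curate-orders").getOrCreate()

raw = spark.read.json(
    "abfss://raw@contosodatalake.dfs.core.windows.net/sales/orders/2020/06/")

# Fewer, larger files: aim for output files in the hundreds of MBs rather than KBs.
(raw.coalesce(8)
    .write.mode("overwrite")
    .parquet("abfss://curated@contosodatalake.dfs.core.windows.net/sales/orders/2020/06/"))
```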
  23. Partition your data for optimized access • Partition based on consumption patterns for optimized performance • Example telemetry schema from the slide: Sensor ID, Year, Temperature, Humidity, Pressure
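A small PySpark sketch of that partitioning recommendation, assuming a telemetry dataset with the columns shown on the slide; the paths and exact column names are placeholders.

```python
# Illustrative sketch: write telemetry partitioned by the columns queries filter on,
# so engines can prune partitions instead of scanning every file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-telemetry").getOrCreate()

telemetry = spark.read.parquet(
    "abfss://curated@contosodatalake.dfs.core.windows.net/telemetry-staging/")

(telemetry.write.mode("overwrite")
    .partitionBy("sensorId", "year")   # folder layout: .../sensorId=.../year=.../
    .parquet("abfss://curated@contosodatalake.dfs.core.windows.net/telemetry/"))
```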
  24. Query Acceleration (Preview) ✓ Optimize access to structured data by filtering data directly in the storage service ✓ Single-file predicate evaluation and column projection to optimize analytics engines ✓ E.g.: SELECT _1, _3 FROM BlobStorage WHERE _14 < 250 AND _16 > '2019-07-01'
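The same query can be pushed down to the storage service from Python through the Blob SDK's quick-query API; the sketch below is illustrative, with placeholder account, container, blob name, and CSV settings.

```python
# Illustrative sketch: the filter runs inside the storage service, so only the
# matching rows and columns come back over the wire. Names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient, DelimitedTextDialect

blob = BlobClient(
    account_url="https://contosodatalake.blob.core.windows.net",
    container_name="curated",
    blob_name="telemetry/2019/07/readings.csv",
    credential=DefaultAzureCredential(),
)

# No header row, so columns are referenced positionally (_1, _3, _14, _16).
csv_format = DelimitedTextDialect(delimiter=",", has_header=False)
reader = blob.query_blob(
    "SELECT _1, _3 FROM BlobStorage WHERE _14 < 250 AND _16 > '2019-07-01'",
    blob_format=csv_format,
    output_format=csv_format,
)
print(reader.readall())
```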
  25. Guidance from experts • Microsoft Docs: explore overviews, tutorials, code samples, and more • Azure Data Lake Storage: https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction • Azure Data Lake Storage Guidance Document: https://aka.ms/adls/guidancedoc • Azure Synapse Analytics: https://docs.microsoft.com/azure/synapse-analytics
  26. © Copyright Microsoft Corporation. All rights reserved.

Editor's Notes

  • An Azure Virtual Network (VNet) is a representation of your own network in the cloud. It is a logical isolation of the Azure cloud dedicated to your subscription. ... When you create a VNet, your services and VMs within your VNet can communicate directly and securely with each other in the cloud.
  • Symptom: job latencies
    Investigation: storage request throttling
    Root cause: too many read operations against storage; a large number of row groups in the Databricks Delta Parquet files resulted in a large number of read operations
    Solution: adjusted the parquet.block.size config value to reduce the number of row groups per Parquet file; job runtimes were reduced by 3x
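A hedged sketch of what that fix might look like in PySpark: rewriting the dataset with a larger parquet.block.size (the Parquet row-group size, in bytes) so each file contains fewer, larger row groups. The 256 MB value and paths are assumptions, not the customer's actual settings.

```python
# Illustrative sketch: rewrite a dataset with larger Parquet row groups so reads
# issue fewer storage transactions. Value and paths are example assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rewrite-row-groups").getOrCreate()

df = spark.read.parquet("abfss://curated@contosodatalake.dfs.core.windows.net/orders/")

(df.write.mode("overwrite")
   .option("parquet.block.size", 256 * 1024 * 1024)   # ~256 MB row groups
   .parquet("abfss://curated@contosodatalake.dfs.core.windows.net/orders-compacted/"))
```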
  • Symptom: job timeouts
    Investigation: transaction and throughput peaks with a bursty load pattern; storage request throttling
    Root cause: data cleanup running during SLA job execution, and a large number of partitions (tens of thousands)
    Solution: reduced the number of partitions to 250 and reduced the number of delete operations while the SLA job is running
    Best practice: the partitioning strategy must align with your query pattern
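To illustrate the best practice called out above, a small PySpark sketch of why the partition layout should match the query pattern: when data was written partitioned by the filtered columns, the read below prunes to just the matching folders. Column names and paths are placeholders, not the customer's actual schema.

```python
# Illustrative sketch: assuming the dataset was written with
# .partitionBy("region", "date"), filters on those columns let Spark skip
# every non-matching region=.../date=... folder instead of scanning them all.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

orders = spark.read.parquet(
    "abfss://curated@contosodatalake.dfs.core.windows.net/orders/")

recent_emea = orders.where("region = 'emea' AND date >= '2020-06-01'")
recent_emea.show()
```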
