
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Training Day)


From theory to implementation - follow the steps of implementing an end-to-end analytics solution illustrated with some best practices and examples in Azure Data Lake.
During this full training day we will share the architecture patterns, tooling, learnings, and tips and tricks for building such solutions on Azure Data Lake. We take you through some anti-patterns and best practices on data loading and organization, give you hands-on time to develop some of your own U-SQL scripts to process your data, and discuss the pros and cons of files versus tables.

These are the slides presented at the SQLBits 2018 Training Day on Feb 21, 2018.


  1. Data sources — non-relational data. Traditional warehousing is DESIGNED FOR THE QUESTIONS YOU KNOW!
  2. The Data Lake Approach
     • Ingest all data regardless of requirements
     • Store all data in native format without schema definition
     • Do analysis: Hadoop, Spark, R, Azure Data Lake Analytics (ADLA)
     • Interactive queries, batch queries, machine learning, data warehouse, real-time analytics, devices
  3. Microsoft’s Big Data Journey
     We needed to better leverage data and analytics to do more experimentation. So, we built a Data Lake for Microsoft:
     • A data lake for everyone to put their data
     • Tools approachable by any developer
     • Batch, interactive, streaming, ML
     By the numbers:
     • Exabytes of data under management
     • 100Ks of physical servers
     • 100Ks of batch jobs, millions of interactive queries
     • Huge streaming pipelines
     • 10K+ developers running diverse workloads and scenarios
     Data stored grew 2010–2017 as teams onboarded: Windows, SMSG, Live, Bing, CRM/Dynamics, Xbox Live, Office365, Malware Protection, Microsoft Stores, Commerce Risk, Skype, LCA, Exchange, Yammer
  4. Culture Changes
     • Engineering: How is the system performing? What is the experience my customers are having? How does that correlate to other actions? Is my feature successful?
     • Marketing: What can we observe from our customers to increase revenues?
     • Management: How do I drive my business based on the data?
     • Field: Where are there new opportunities? How can I connect with my customers more deeply?
     • Support: How does this customer’s experience compare with others?
  5. Azure Data Lake (HDFS-compatible REST API): ADL Store underneath multiple analytics engines — ADL Analytics (.NET, SQL, Python, R scaled out by U-SQL), open-source Apache Hadoop via the ADL client, Azure Databricks, and HDInsight (Hive)
     • Performance at scale
     • Optimized for analytics
     • Multiple analytics engines
     • Single repository sharing
  6. ADL Store (HDFS-compatible REST API) — Storage
     • Architected and built for very high throughput at scale for Big Data workloads
     • No limits to file size, account size or number of files
     • Single repository for sharing
     • Cloud-scale distributed filesystem with file/folder ACLs and RBAC
     • Encryption-at-rest by default with Azure Key Vault
     • Authenticated access with Azure Active Directory integration
     • Formal certifications incl. ISO, SOC, PCI, HIPAA
  7. ADL Store (HDFS-compatible REST API) — Analytics and Storage partners: Cloudera CDH, Hortonworks HDP, Qubole QDS
     • Open-source Apache® ADL client for commercial and custom Hadoop
     • Cloud IaaS and hybrid
  8. AZURE DATABRICKS — A FAST, EASY, AND COLLABORATIVE APACHE SPARK-BASED ANALYTICS PLATFORM
     Best of Databricks:
     • Designed in collaboration with the founders of Apache Spark
     • One-click set up; streamlined workflows
     • Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts
     Best of Microsoft:
     • Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
     • Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
  9. HDInsight (Hive) on ADL Store — Analytics and Storage
     • 63% lower TCO than on-premises*
     • SLA-managed, monitored and supported by Microsoft
     • Fully managed Hadoop, Spark and R
     • Clusters deployed in minutes
     *IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
  10. ADL Analytics (.NET, SQL, Python, R scaled out by U-SQL) on ADL Store
     • Serverless. Pay per job. Starts in seconds. Scales instantly.
     • Develop massively parallel programs with simplicity
     • Federated query from multiple data sources
  11. Ingress: Event Hubs • IoT Hub • Kafka
      Analytics: Stream Analytics • Spark Streaming • Storm
      Sinks: Data Lake Store • Blob Store • SQL Database • SQL Data Warehouse • Event Hub • Power BI • Table Storage • Service Bus Queues • Service Bus Topics • Cosmos DB • Azure Functions • …
  12. Azure Data Lake Store
      1. Create small files
      2. Copy small files
      3. Concat + copy file
      4. ASA
      5. Event Hub Capture
  13. • Copy
      • SDK
      • Tools (Storage Explorer, Visual Studio, 3rd party)
      • Data Factory
      • SQL Server Integration Services
      • Streaming from external sources
      • Generated by cloud analytics
  14. U-SQL — A framework for Big Data
      • Scales out your custom code in .NET, Python, R over your Data Lake
      • Familiar syntax to millions of SQL & .NET developers
      • Unifies:
      • The declarative nature of SQL with the imperative power of your language of choice (e.g., C#, Python)
      • Processing of structured, semi-structured and unstructured data
      • Querying multiple Azure data sources (federated query)
  15. Develop massively parallel programs with simplicity
      A simple U-SQL script can scale from gigabytes to petabytes without learning complex big data programming techniques. U-SQL automatically generates a scaled-out and optimized execution plan to handle any amount of data. Execution nodes are rapidly allocated to run the program. Error handling, network issues, and runtime optimization are handled automatically.
      @searchlog =
          EXTRACT UserId int, Start DateTime, Region string, Query string,
                  Duration int, Urls string, ClickedUrls string
          FROM @"/Samples/Data/SearchLog.tsv"
          USING Extractors.Tsv();
      OUTPUT @searchlog
      TO @"/Samples/Output/SearchLog_output.tsv"
      USING Outputters.Tsv();
  16. • Automatic "in-lining", optimized out-of-the-box
      • Per-job parallelization visibility into execution
      • Heatmap to identify bottlenecks
  17. “Unstructured” Files
      • Schema on read
      • Write to file
      • Built-in and custom Extractors and Outputters
      • ADL Storage and Azure Blob Storage
      EXTRACT expression:
      @s = EXTRACT a string, b int
           FROM "filepath/file.csv"
           USING Extractors.Csv(encoding: Encoding.Unicode);
      • Built-in Extractors: Csv, Tsv, Text with lots of options, Parquet
      • Custom Extractors: e.g., JSON, XML, etc. (see http://usql.io)
      OUTPUT expression:
      OUTPUT @s
      TO "filepath/file.csv"
      USING Outputters.Csv();
      • Built-in Outputters: Csv, Tsv, Text, Parquet
      • Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
      Filepath URIs:
      • Relative URI to default ADL Storage account: "filepath/file.csv"
      • Absolute URIs:
      • ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
      • WASB: "wasb://container@account/filepath/file.csv"
  18. File Sets
      • Simple patterns
      • Virtual columns
      • Only EXTRACT is GA for now; OUTPUT is in private preview
      Simple pattern language on filename and path:
      DECLARE @pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
      • Binds two columns, date and suffix
      • Wildcards the filename
      • Limits on number of files and file sizes can be improved with
      SET @@FeaturePreviews = "FileSetV2Dot5:on,InputFileGrouping:on,AsyncCompilerStoreAccess:on";
      (Will become the default between now and the middle of the year.)
      Virtual columns:
      EXTRACT name string
            , suffix string // virtual column
            , date DateTime // virtual column
      FROM @pattern
      USING Extractors.Csv();
      • Refer to virtual columns in predicates to get partition elimination
      • A warning is raised if no partition elimination was found
  19. @rows =
          SELECT Domain, SUM(Clicks) AS TotalClicks
          FROM @ClickData
          GROUP BY Domain;
  20. Execution plan comparison for the aggregation above:
      • Over raw extents (EXTENT 1–3 each holding a mix of CNN, FB, WH rows), each vertex reads and partially aggregates its extent, then the data must be partitioned on Domain before full aggregation and write — an expensive repartitioning step.
      • Over a U-SQL table distributed by Domain (FB, WH and CNN each localized to one extent), each vertex can read, fully aggregate and write locally — no repartitioning needed.
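The cheaper plan on the right depends on the table's DDL; here is a sketch of how such a Domain-distributed table might be declared (the table name ClickTable is illustrative, @ClickData is the rowset from the previous slide):

```
// Hypothetical click table hash-distributed on Domain, so each distribution
// bucket holds all rows for its domains; a GROUP BY Domain can then be
// computed fully within each bucket without a repartitioning step.
CREATE TABLE ClickTable
(
    Domain string,
    Clicks int,
    INDEX idx CLUSTERED (Domain ASC)
    DISTRIBUTED BY HASH (Domain)
);

INSERT INTO ClickTable
SELECT Domain, Clicks FROM @ClickData;
```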
  21. ADLA metadata object model: an ADLA Account/Catalog contains [1,n] databases; a database contains [1,n] schemas; a schema contains [0,n] user objects — tables (with clustered index and partitions), views, TVFs, table types, procedures, statistics, external tables, data sources, credentials, packages, and C# assemblies. Assemblies implement and name the C# objects the metadata refers to: C# functions, UDAggs, UDTs, extractors, outputters, processors, reducers, combiners, and appliers. (Legend: contains / refers to / implemented and named by; MD name vs. C# name.)
  22. U-SQL Catalog
      • Naming • Discovery • Sharing • Securing
      Naming:
      • Default database and schema context: master.dbo
      • Quote identifiers with []: [my table]
      • Stores data in the ADL Storage /catalog folder
      Discovery:
      • Visual Studio Server Explorer
      • Azure Data Lake Analytics portal
      • SDKs and Azure PowerShell commands
      • Catalog views: usql.databases, usql.tables etc.
      Sharing:
      • Within an Azure Data Lake Analytics account
      • Across ADLA accounts that share the same Azure Active Directory:
      • Referencing assemblies
      • Calling TVFs, procedures and referencing tables and views
      • Inserting into tables
      Securing:
      • Secured with AAD principals at catalog and database level
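A small sketch of the naming and discovery points above (the database name MyDB and the bracketed table name are hypothetical; the catalog-view query assumes the usql views are addressable with a database qualifier):

```
// Create and switch to a database; unquoted names then resolve
// against MyDB.dbo instead of the default master.dbo context.
CREATE DATABASE IF NOT EXISTS MyDB;
USE DATABASE MyDB;

// Identifiers with special characters are quoted with [] (hypothetical table).
@r = SELECT * FROM master.dbo.[my table];

// Discovery via catalog views.
@tables =
    SELECT * FROM MyDB.usql.tables;

OUTPUT @tables TO "/output/tables.tsv" USING Outputters.Tsv();
```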
  23. CREATE TABLE T (col1 int
          , col2 string
          , col3 SQL.MAP<string,string>
          , INDEX idx CLUSTERED (col2 ASC)
            PARTITIONED BY (col1)
            DISTRIBUTED BY HASH (col2)
      );
      • Structured data, built-in data types only (no UDTs)
      • Clustered index (needs to be specified): row-oriented
      • Fine-grained distribution (needs to be specified): HASH, DIRECT HASH, RANGE, ROUND ROBIN
      • Addressable partitions (optional)
      CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
      CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT …;
      CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
      • Infers the schema from the query
      • Still requires index and distribution (does not support partitioning)
  24. Data Partitioning in Tables
      Distribution scheme | When to use?
      HASH(keys)          | Automatic hash for fast item lookup
      DIRECT HASH(id)     | Exact control of hash bucket value
      RANGE(keys)         | Keeps ranges together
      ROUND ROBIN         | To get equal distribution (if others give skew)
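In DDL, the scheme is picked in the DISTRIBUTED BY clause of CREATE TABLE; a sketch with one hypothetical table per scheme:

```
CREATE TABLE ByHash   (id int, v string, INDEX i CLUSTERED (id ASC)
                       DISTRIBUTED BY HASH (id));            // fast item lookup on id
CREATE TABLE ByDirect (bucket int, v string, INDEX i CLUSTERED (bucket ASC)
                       DISTRIBUTED BY DIRECT HASH (bucket)); // you control the bucket value
CREATE TABLE ByRange  (ts DateTime, v string, INDEX i CLUSTERED (ts ASC)
                       DISTRIBUTED BY RANGE (ts));           // keeps ts ranges together
CREATE TABLE ByRobin  (id int, v string, INDEX i CLUSTERED (id ASC)
                       DISTRIBUTED BY ROUND ROBIN);          // equal-sized buckets, avoids skew
```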
  25. Partitions, Distributions and Clusters
      TABLE T (id …, C …, date DateTime, …
          , INDEX i CLUSTERED (id, C)
            PARTITIONED BY (date)
            DISTRIBUTED BY HASH(id) INTO 4)
      Logically, the table is split into one partition per date value (PARTITION (@date1), (@date2), (@date3)); each partition is hash-distributed into up to 4 distribution buckets, and rows (C1, C2, …) are clustered by (id, C) within each bucket. Physically, each partition is stored under /catalog/…/tables/Guid(T)/ as its own file: Guid(T.p1).ss, Guid(T.p2).ss, Guid(T.p3).ss.
  26. Benefits of table clustering and distribution
      • Faster lookup of data provided by distribution and clustering when the right distribution/cluster is chosen
      • Data distribution provides better localized scale-out
      • Used for filters, joins and grouping
      Benefits of table partitioning
      • Provides data life cycle management (“expire” old partitions): partition on a date/time dimension
      • Partial re-computation of data at partition level
      • Query predicates can provide partition elimination
      Do not use when…
      • No filters, joins or grouping
      • No reuse of the data for future queries
      If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test.
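The sampling advice at the end can be sketched like this (file paths and schema are hypothetical):

```
// Pull an arbitrary 1000-row sample to prototype and test a query cheaply
// before committing to a table clustering/partitioning design.
@rows =
    EXTRACT Domain string, Clicks int
    FROM "/input/clicks.tsv"
    USING Extractors.Tsv();

@sample =
    SELECT * FROM @rows
    SAMPLE ANY (1000);

OUTPUT @sample TO "/output/sample.tsv" USING Outputters.Tsv();
```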
  27. Benefits of Distribution in Tables
      • Design for most frequent/costly queries
      • Manage data skew in partition/table
      • Manage parallelism in querying (by number of distributions)
      • Minimize data movement in joins
      • Provide distribution seeks and range scans for query predicates (distribution bucket elimination)
      Distribution in tables is mandatory; choose according to the desired benefits.
  28. Benefits of Clustered Index in Distribution
      • Design for most frequent/costly queries
      • Manage data skew in the distribution bucket
      • Provide locality of same data values
      • Provide seeks and range scans for query predicates (index lookup)
      Clustered index in tables is mandatory; choose according to the desired benefits.
      Pro tip: Distribution keys should be a prefix of the clustered index keys, especially for RANGE distribution, as the optimizer will then make use of the global ordering: if you make the RANGE distribution key a prefix of the index key, U-SQL will repartition on demand to align any UNION ALLed or JOINed tables or partitions. Split points of table distribution partitions are chosen independently, so any partitioned table can do UNION ALL in this manner if the data is to be processed subsequently on the distribution key.
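A DDL sketch of this pro tip (table and column names are hypothetical): the RANGE distribution key ts is a prefix of the clustered index key (ts, id), so the optimizer can exploit the global ordering on ts:

```
CREATE TABLE EventsByTime
(
    ts DateTime,
    id int,
    payload string,
    INDEX i CLUSTERED (ts ASC, id ASC) // distribution key ts is a prefix of the index key
    DISTRIBUTED BY RANGE (ts)          // global ordering on ts; aligned on demand for UNION ALL/JOIN
);
```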
  29. Benefits of Partitioned Tables
      • Partitions are addressable
      • Enables finer-grained data lifecycle management at partition level
      • Manage parallelism in querying by the number of partitions
      • Query predicates provide partition elimination (the predicate has to be constant-foldable)
      Use partitioned tables for:
      • Managing large amounts of incrementally growing structured data
      • Queries with strong locality predicates (point in time, for a specific market, etc.)
      • Managing windows of data (provide data for the last x months for processing)
  30. Partitioned tables
      • Use partitioned tables for querying parts of large amounts of incrementally growing structured data
      • Get partition elimination optimizations with the right query predicates
      Creating a partitioned table:
      CREATE TABLE vehiclesP(vehicle_id int, event_date DateTime, lat float, long float
          , INDEX idx CLUSTERED (vehicle_id ASC)
            PARTITIONED BY (event_date)
            DISTRIBUTED BY HASH (vehicle_id) INTO 4);
      Creating partitions:
      DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 00, 00, 00, 00, DateTimeKind.Utc);
      DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00, 00, 00, 00, DateTimeKind.Utc);
      ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@pdate2);
      Loading data into partitions dynamically:
      DECLARE @date1 DateTime = DateTime.Parse("2014-09-14");
      DECLARE @date2 DateTime = DateTime.Parse("2014-09-16");
      INSERT INTO vehiclesP ON INTEGRITY VIOLATION IGNORE
      SELECT vehicle_id, event_date, lat, long FROM @data
      WHERE event_date >= @date1 AND event_date <= @date2;
      • Filters and inserts clean data only; ignores “dirty” data
      Loading data into partitions statically:
      ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@baddate);
      INSERT INTO vehiclesP ON INTEGRITY VIOLATION MOVE TO PARTITION (@baddate)
      SELECT vehicle_id, event_date, lat, long FROM @data
      WHERE event_date >= @date1 AND event_date <= @date2;
      • Filters and inserts clean data only; puts “dirty” data into a special partition
  31. What is table fragmentation?
      • ADLS is an append-only store!
      • Every INSERT statement creates a new file (an INSERT fragment)
      Why is it bad?
      • Every INSERT fragment contains data in its own distribution buckets, so query processing loses the ability to get “localized” fast access
      • Query generation has to read from many files, causing a slow preparation phase that may time out
      • Reading from too many files is disallowed — current limit: 3000 table partitions and INSERT fragments per job!
      What if I have to add data incrementally?
      • Batch inserts into the table
      • Use ALTER TABLE REBUILD / ALTER TABLE REBUILD PARTITION regularly to reduce fragmentation and keep performance
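The defragmentation advice above, sketched against the vehiclesP table from the previous slide:

```
// Rewrite all INSERT fragments of the table into fresh, compacted files.
ALTER TABLE vehiclesP REBUILD;

// Or rebuild only the partition that has been receiving incremental inserts.
DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 0, 0, 0, 0, DateTimeKind.Utc);
ALTER TABLE vehiclesP REBUILD PARTITION (@pdate1);
```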
  32. (Job execution graph: parallelism dips down to 1 active vertex at these times.)
  33. High-level Roadmap
      • Worldwide region availability (currently US and EU)
      • Interactive access with T-SQL query
      • Scale out your custom code in the language of choice (.NET, Java, Python, etc.)
      • Process the data formats of your choice (incl. Parquet, ORC; larger string values)
      • Continued ADF, AAS, ADC, SQL DW, EventHub, SSIS integration
      • Administrative policies to control usage/cost for storage & compute
      • Secure data sharing between common AAD and public read-only sharing, fine-grained ACLing
      • Intense focus on developer productivity for authoring, debugging, and optimization
      • General customer feedback: http://aka.ms/adlfeedback
  34. Resources
      http://usql.io
      http://blogs.msdn.microsoft.com/azuredatalake/
      http://blogs.msdn.microsoft.com/mrys/
      https://channel9.msdn.com/Search?term=U-SQL#ch9Search
      http://aka.ms/usql_reference
      https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
      https://docs.microsoft.com/en-us/azure/data-lake-analytics/
      https://msdn.microsoft.com/en-us/magazine/mt614251
      https://msdn.microsoft.com/magazine/mt790200
      http://www.slideshare.net/MichaelRys
      Getting Started with R in U-SQL
      https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
      https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
      http://stackoverflow.com/questions/tagged/u-sql
      http://aka.ms/adlfeedback
      Continue your education at Microsoft Virtual Academy online.