O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

U-SQL - Azure Data Lake Analytics for Developers

5.485 visualizações

Publicada em

U-SQL Presentation for SQL Konferenz in Darmstadt, Feb 2016

Publicada em: Dados e análise
  • Entre para ver os comentários

U-SQL - Azure Data Lake Analytics for Developers

  1. 1. U-SQL - Azure Data Lake Analytics for Developers Michael Rys, Microsoft @MikeDoesBigData, {mrys, usql}@microsoft.com
  2. 2. Why U-SQL?
  3. 3. Characteristics of Big Data analytics •Sample Use Cases • Digital Crime Forensics – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks, by analyzing log records with complex custom algorithms • Image Processing – Large-scale image feature extraction and classification using custom code • Shopping Recommendations – Complex pattern analysis and prediction over shopping records using proprietary algorithms Requires processing of any type of data Allows use of custom algorithms Scales efficiently to any size
  4. 4. Status quo: SQL for Big Data  Declarativity does scaling and parallelization for you  Extensibility is bolted on and not “native”  Difficult to work with anything other than structured data  Difficult to extend with custom code
  5. 5. Status quo: Programming languages for Big Data  Extensibility through custom code is “native”  Declarativity is bolted on and not “native”  User often has to care about scale and performance  SQL is second class within string  Often no code reuse/sharing across queries
  6. 6. Why U-SQL? Get benefits of both! • Makes it easy for you by unifying: • Unstructured and structured data processing • Declarative SQL and custom imperative code (C#) • Local and remote queries • Increase productivity and agility from Day 1 and at Day 100 for YOU!  Declarativity and extensibility are equally native to the language
  7. 7. The origins of U-SQL SCOPE – Microsoft’s internal Big Data language • SQL and C# integration model • Optimization and scaling model • Runs 100,000s of jobs daily Hive • Complex data types (Maps, Arrays) • Data format alignment for text files T-SQL/ANSI SQL • Many of the SQL capabilities (windowing functions, meta data model etc.)
  8. 8. Show me U-SQL!
  9. 9. U-SQL language philosophy Declarative query and transformation language • Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL analytics functions • Optimizable, scalable Expression-flow programming style • Easy-to-use functional lambda composition • Composable, globally optimizable Operates on unstructured and structured data • Schema on read over files • Relational metadata objects (e.g. database, table) Extensible from ground up • Type system is based on C# • Expression language IS C# • User-defined functions (U-SQL and C#) • User-defined aggregators (C#) • User-defined operators (UDO) (C#) U-SQL provides the Parallelization and Scale-out Framework for Usercode • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER Federated query across distributed data sources REFERENCE MyDB.MyAssembly; CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float ); @o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt" USING Extractors.Csv(); @c = EXTRACT cid int, name string, city string FROM "/input/customers.txt" USING Extractors.Csv(); @j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt , AGG<MyAgg.MySum>(c.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid; OUTPUT @j TO "/output/result.txt" USING new MyData.Write(); INSERT INTO T SELECT * FROM @j;
  10. 10. Expression-flow programming style • Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model. • Execution plan that is optimized out-of-the-box and without user intervention. • Per-job and user-driven level of parallelization. • Detailed visibility into execution steps, for debugging. • Heatmap-like functionality to identify performance bottlenecks.
  11. 11. Query data where it lives • Easily query data in multiple Azure data stores without moving it to a single store • Avoid moving large amounts of data across the network between stores • Single view of data irrespective of physical location • Minimize data proliferation issues caused by maintaining multiple copies • Single query language for all data • Each data store maintains its own sovereignty • Design choices based on the need • Push SQL expressions to remote SQL sources • Filters • Joins U-SQL Query Query Azure Storage Blobs Azure SQL in VMs Azure SQL DB Azure Data Lake Analytics Azure SQL Data Warehouse Azure Data Lake Storage
  12. 12. Unstructured files • Schema on read • Write to file • Built-in and custom extractors and outputters • ADL Storage and Azure Blob Storage EXTRACT Expression @s = EXTRACT a string, b int FROM "filepath/file.csv" USING Extractors.Csv(encoding: Encoding.Unicode); • Built-in extractors: Csv, Tsv, Text with lots of options • Custom extractors: e.g., JSON, XML, and so on OUTPUT Expression OUTPUT @s TO "filepath/file.csv" USING Outputters.Csv(); • Built-in outputters: Csv, Tsv, Text • Custom outputters: JSON, XML, and so on Filepath URIs • Relative URI to default ADL Storage account: "/filepath/file.csv" • Absolute URIs: • ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv" • WASB: "wasb://container@account/filepath/file.csv"
  13. 13. U-SQL extensibility Extend U-SQL with C#/.NET Built-in operators, function, aggregates C# expressions (in SELECT expressions) User-defined aggregates (UDAGGs) User-defined functions (UDFs) User-defined operators (UDOs)
  14. 14. Extending U-SQL
  15. 15. Managing Assemblies Create assemblies Reference assemblies Enumerate assemblies Drop assemblies CREATE ASSEMBLY db.assembly FROM @path; CREATE ASSEMBLY db.assembly FROM byte[]; • Can also include additional resource files REFERENCE ASSEMBLY db.assembly; • Referencing .Net Framework Assemblies • Always accessible system namespaces: • U-SQL specific (e.g., for SQL.MAP) • All provided by system.dll system.core.dll system.data.dll, System.Runtime.Serialization.dll, mscorelib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq) • Add all other .Net Framework Assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML]; • Enumerating Assemblies • Powershell command • U-SQL Studio Server Explorer DROP ASSEMBLY db.assembly;
  16. 16. Assembly Dependencies • Assembly must be registered to be referenced • All Assemblies needed for compilation must be referenced in script • All Assemblies needed at runtime either • Need to be referenced in script, or • Need to be registered with the assembly as additional files • Metadata Service does NOT enforce dependencies • Visual Studio Extension provides support for dependency management
  17. 17. Show Me File Sets!
  18. 18. File sets • Simple patterns • Virtual columns • Only on EXTRACT for now Simple pattern language on filename and path @pattern string = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"; • Binds two columns date and suffix • Wildcards the filename • Limits on number of files (between 800 and 3000) Virtual columns EXTRACT name string , suffix string // virtual column , date DateTime // virtual column FROM @pattern USING Extractors.Csv(); • Refer to virtual columns in query to get partition elimination • Virtual columns need to be referenced for DateTime columns and if no wildcard has been given
  19. 19. Let’s do some SQL with U-SQL
  20. 20. @m CROSS APPLY EXPLODE(refs) AS Refs(r); @m(refs) @me, @you @him, @her Refs(r) @me @you @him @her @me, @you @me @you
  21. 21. U-SQL Joins Join operators • INNER JOIN • LEFT or RIGHT or FULL OUTER JOIN • CROSS JOIN • SEMIJOIN • equivalent to IN subquery • ANTISEMIJOIN • Equivalent to NOT IN subquery Notes • ON clause comparisons need to be of the simple form: rowset.column == rowset.column or AND conjunctions of the simple equality comparison • If a comparand is not a column, wrap it into a column in a previous SELECT • If the comparison operation is not ==, put it into the WHERE clause • turn the join into a CROSS JOIN if no equality comparison Reason: Syntax calls out which joins are efficient
  22. 22. U-SQL Analytics Windowing Expression Window_Function_Call 'OVER' '(' [ Over_Partition_By_Clause ] [ Order_By_Clause ] [ Row _Clause ] ')'. Window_Function_Call := Aggregate_Function_Call | Analytic_Function_Call | Ranking_Function_Call. Windowing Aggregate Functions ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP Analytics Functions CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK, LEAD, LAG Ranking Functions DENSE_RANK, NTILE, RANK, ROW_NUMBER
  23. 23. “Top 5”s Surprises for SQL Users AS is not as • C# keywords and SQL keywords overlap • Costly to make case-insensitive -> Better build capabilities than tinker with syntax = != == • Remember: C# expression language null IS NOT NULL • C# nulls are two-valued PROCEDURES but no WHILE No UPDATE nor MERGE • Transform/Recook instead
  24. 24. Meta Data Object Model ADLA Catalog Database Schema [1,n] [1,n] [0,n] tables views TVFs C# Fns C# UDAgg Clustered Index partitions C# Assemblies C# Extractors Data Source C# Reducers C# Processors C# Combiners C# Outputters Ext. tables Procedures Creden- tials C# Applier Table Types Statistics C# UDTs Abstract objects User objects Refers toContains Implemented and named by MD Name C# Name Legend
  25. 25. U-SQL Catalog • Naming • Default database and schema context: master.dbo • Quote identifiers with []: [my table] • Stores data in ADL Storage /catalog folder • Discovery • Visual Studio Server Explorer • Azure Data Lake Analytics Portal • SDKs and Azure PowerShell commands • Sharing • Within an Azure Data Lake Analytics account • Securing • Secured with AAD principals at catalog level (inherited from ADL Storage) • Naming • Discovery • Sharing • Securing
  26. 26. Create shareable data and code
  27. 27. Views and TVFs • Views for simple cases • TVFs for parameterization and most cases Views CREATE VIEW V AS EXTRACT… CREATE VIEW V AS SELECT … • Cannot contain user-defined objects (such as UDFs or UDOs) • Will be inlined Table-Valued Functions (TVFs) CREATE FUNCTION F (@arg string = "default") RETURNS @res [TABLE ( … )] AS BEGIN … @res = … END; • Provides parameterization • One or more results • Can contain multiple statements • Can contain user-code (needs assembly reference) • Will always be inlined • Infers schema or checks against specified return schema
  28. 28. Procedures Allows encapsulation of non-DDL scripts CREATE PROCEDURE P (@arg string = "default“) AS BEGIN …; OUTPUT @res TO …; INSERT INTO T …; END; • Provides parameterization • No result but writes into file or table • Can contain multiple statements • Can contain user code (needs assembly reference) • Will always be inlined • Cannot contain DDL (no CREATE, DROP)
  29. 29. Tables • CREATE TABLE • CREATE TABLE AS SELECT CREATE TABLE T (col1 int , col2 string , col3 SQL.MAP<string,string> , INDEX idx CLUSTERED (col1 ASC) PARTITIONED BY HASH (driver_id) ); • Structured Data • Built-in Data types only (no UDTs) • Clustered index (must be specified): row-oriented • Fine-grained partitioning (must be specified): • HASH, DIRECT HASH, RANGE, ROUND ROBIN CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …; CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…; CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT); • Infer the schema from the query • Still requires index and partitioning
  30. 30. Inserting New Data
  31. 31. INSERT • INSERT constant values • INSERT from queries • Multiple INSERTs INSERT constant values INSERT INTO T VALUES (1, "text", new SQL.MAP<string,string>("key","value")); INSERT from queries INSERT INTO T SELECT col1, col2, col3 FROM @rowset; Multiple INSERTs into same table • Is supported • Generates separate file per insert in physical storage: • Can lead to performance degradation • Recommendations: • Try to avoid small inserts • Rebuild table after frequent insertions with: ALTER TABLE T REBUILD;
  32. 32. Additional capabilities and resources • Tools: http://aka.ms/adltoolsVS • Blogs and community page: • http://usql.io • http://blogs.msdn.com/b/visualstudio/ • http://azure.microsoft.com/en-us/blog/topics/big-data/ • https://channel9.msdn.com/Search?term=U-SQL#ch9Search • Documentation and articles and slides: • http://aka.ms/usql_reference • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/ • https://msdn.microsoft.com/en-us/magazine/mt614251 • http://www.slideshare.net/MichaelRys • ADL forums and feedback • http://aka.ms/adlfeedback • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake • http://stackoverflow.com/questions/tagged/u-sql
  33. 33. Unifies SQL declarativity and C# extensibility Unifies querying structured and unstructured data Unifies local and remote queries Increase productivity and agility from Day 1 forward for YOU! Sign up for an Azure Data Lake account and join the Public Preview http://www.azure.com/datalake, download the VS tools, and give us feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey! This is why U-SQL!
  34. 34. Friendly Competition Win ADL and U-SQL SWAG 1. Contribute a cool U-SQL project/sample to the Azure/usql Github repo (via http://usql.io) by Apr 30, 2016 2. Tweet your submission to @MikeDoesBigData with #USQLComp 3. We will review the submissions and send some cool swag (U-SQL T-Shirts, ADL Poloshirts etc) to the top 5 submissions