
ADL/U-SQL Introduction (SQLBits 2016)

Deck of the Friday presentation on U-SQL at SQLBits 2016

Published in: Data & Analytics

  1. SQLBits 2016: Azure Data Lake & U-SQL
     • Michael Rys, @MikeDoesBigData, {mrys, usql}@microsoft.com
     • http://www.azure.com/datalake
     • Slides: http://www.slideshare.net/MichaelRys
  2. The Data Lake Approach
  3. Traditional data warehousing approach
     • Pipeline: Data sources, ETL, Data warehouse, BI and analytics
     • Diagram labels: Understand Corporate Strategy; Gather Requirements (Business Requirements, Technical Requirements); Dimension Modelling; Physical Design; ETL Design; Reporting & Analytics Design; Setup Infrastructure; Install and Tune; Implement Data Warehouse; ETL Development; Reporting & Analytics Development
  4. The Data Lake approach
     • Ingest all data regardless of requirements
     • Store all data in native format without schema definition
     • Do analysis using analytic engines like Hadoop
     • Diagram labels: Devices; Interactive queries; Batch queries; Machine Learning; Data warehouse; Real-time analytics
  5. How Microsoft has used Big Data
     • Chart "Microsoft doubles search share", by year: 2009 9%, 2010 11%, 2011 15%, 2012 16%, 2013 18%, 2014 19%, 2015 20% (Source: ComScore 2009-2015 Search Report US)
     • We needed to better leverage data and analytics to win in search
     • We changed our approach: more experiments by more people!
     • So we:
        • Built an Exabyte-scale data lake for everyone to put their data
        • Built tools approachable by any developer
        • Built machine learning tools for collaborating across large experiment models
  6. Introducing Azure Data Lake: Big Data Made Easy
  7. Azure Data Lake
     • Analytics: HDInsight (“managed clusters”), Azure Data Lake Analytics
     • Storage: Azure Data Lake Storage
  8. Azure Data Lake Storage Service
  9. Azure Data Lake Store (in preview): a hyper scale repository for big data analytics workloads
     • No limits to SCALE
     • Store ANY DATA in its native format
     • HADOOP FILE SYSTEM (HDFS) for the cloud
     • ENTERPRISE GRADE access control, encryption at rest
     • Optimized for analytic workload PERFORMANCE
  10. Azure Data Lake Analytics Service
  11. Azure Data Lake (Store, HDInsight, Analytics)
     • Diagram labels: Analytics; Storage; ADL Analytics; HDInsight; ADL Store; U-SQL; Hive; YARN; WebHDFS
  12. ADLA complements HDInsight: target the same scenarios, tools, and customers
     • HDInsight: for developers familiar with the Open Source: Java, Eclipse, Hive, etc. Clusters offer customization, control, and flexibility in a managed Hadoop cluster
     • ADLA: enables customers to leverage existing experience with C#, SQL & PowerShell. Offers convenience, efficiency, automatic scale, and management in a “job service” form factor
  13. Azure Data Lake Analytics (in preview): a distributed analytics service built on Apache YARN that dynamically scales to your needs
     • No limits to SCALE
     • Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
     • Optimized to work with ADL STORE
     • FEDERATED QUERY across Azure data sources
     • ENTERPRISE GRADE role-based access control and auditing
     • Pay PER QUERY and scale PER QUERY
  14. ADL and SQLDW
  15. Query data where it lives: easily query data in multiple Azure data stores without moving it to a single store
     • Benefits:
        • Avoid moving large amounts of data across the network between stores
        • Single view of data irrespective of physical location
        • Minimize data proliferation issues caused by maintaining multiple copies
        • Single query language for all data
        • Each data store maintains its own sovereignty
        • Design choices based on the need
        • Push SQL expressions to remote SQL sources: filters, joins
     • Diagram: a U-SQL query in Azure Data Lake Analytics queries Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage
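As a rough sketch of querying data where it lives, the script below joins rows from a remote Azure SQL Database with a file in the Data Lake store. The data source name `CustomerSource`, the remote table `dbo.Customers` with its `cid` and `city` columns, and the output path are assumptions for illustration. The sketch also assumes the data source was registered beforehand with `CREATE DATA SOURCE` and a credential.

```usql
// Hypothetical: CustomerSource is an external data source that was
// registered earlier with CREATE DATA SOURCE ... FROM AZURESQLDB.
// The rows stay in the remote database until the job runs.
@customers =
    SELECT *
    FROM EXTERNAL CustomerSource LOCATION "dbo.Customers";

// Orders are read from a file in the Data Lake store.
@orders =
    EXTRACT oid int, cid int, odate DateTime, amount float
    FROM "/input/orders.txt"
    USING Extractors.Csv();

// One query language over both sources: join remote rows with local rows.
@perCity =
    SELECT c.city, SUM(o.amount) AS total_amount
    FROM @customers AS c
         INNER JOIN @orders AS o ON c.cid == o.cid
    GROUP BY c.city;

OUTPUT @perCity TO "/output/amount_by_city.csv" USING Outputters.Csv();
```

Whether filters or joins are actually pushed down to the remote source depends on the optimizer and on the types declared as remotable for the data source, so treat this as an illustration of the query shape rather than of a guaranteed execution plan.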
  16. Azure Data Lake U-SQL
  17. Some sample use cases
     • Digital Crime Unit: analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms
     • Image Processing: large-scale image feature extraction and classification using custom code
     • Shopping Recommendation: complex pattern analysis and prediction over shopping records using proprietary algorithms
     • Characteristics of Big Data Analytics:
        • Requires processing of any type of data
        • Allows use of custom algorithms
        • Scales to any size and is efficient
  18. Status Quo: SQL for Big Data
     • Declarativity does scaling and parallelization for you
     • Extensibility is bolted on and not “native”:
        • Hard to work with anything other than structured data
        • Difficult to extend with custom code
  19. Status Quo: Programming Languages for Big Data
     • Extensibility through custom code is “native”
     • Declarativity is bolted on and not “native”:
        • User often has to care about scale and performance
        • SQL is 2nd class within string
        • Often no code reuse/sharing across queries
  20. Why U-SQL?
     • Declarativity and Extensibility are equally native to the language! Get benefits of both!
     • Makes it easy for you by unifying:
        • Unstructured and structured data processing
        • Declarative SQL and custom imperative code
        • Local and remote queries
     • Increase productivity and agility from Day 1 and at Day 100 for YOU!
  21. The origins of U-SQL
     • SCOPE, Microsoft’s internal Big Data language:
        • SQL and C# integration model
        • Optimization and Scaling model
        • Runs 100’000s of jobs daily
     • Hive:
        • Complex data types (Maps, Arrays)
        • Data format alignment for text files
     • T-SQL/ANSI SQL:
        • Many of the SQL capabilities (windowing functions, meta data model etc.)
  22. U-SQL building blocks
     • Built-in operators, functions, aggregates
     • C# expressions (in SELECT expressions)
     • User-defined aggregates (UDAGGs)
     • User-defined functions (UDFs)
     • User-defined operators (UDOs)
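The first two building blocks above, built-in aggregates and C# expressions inside SELECT, can be sketched in a few lines of U-SQL. The input file `/input/searchlog.tsv`, its five columns, and the output path are assumptions made for this illustration and do not come from the deck.

```usql
// Hypothetical tab-separated input with exactly these five columns.
@searchlog =
    EXTRACT UserId int, Start DateTime, Region string, Query string, Duration int
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// C# expressions in SELECT and WHERE: ordinary .NET string and DateTime
// members are evaluated per row (this sketch assumes no null strings).
@clean =
    SELECT Region.ToUpperInvariant() AS Region,
           Query.Length AS QueryLength,
           Duration
    FROM @searchlog
    WHERE Start >= DateTime.Parse("2016-01-01");

// Built-in aggregates combined with GROUP BY.
@stats =
    SELECT Region,
           COUNT(*) AS Queries,
           SUM(Duration) AS TotalDuration,
           MAX(QueryLength) AS LongestQuery
    FROM @clean
    GROUP BY Region;

OUTPUT @stats TO "/output/searchlog_stats.csv" USING Outputters.Csv();
```

The computed column is materialized in `@clean` first so that the GROUP BY in `@stats` can refer to a plain column name; user-defined functions, aggregates, and operators plug into the same positions once their assembly is referenced.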
  23. U-SQL Language Philosophy
     • Declarative Query and Transformation Language:
        • Uses SQL’s SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions
        • Optimizable, Scalable
     • Expression-flow programming style:
        • Easy to use functional lambda composition
        • Composable, globally optimizable
     • Operates on Unstructured & Structured Data:
        • Schema on read over files
        • Relational metadata objects (e.g. database, table)
     • Extensible from ground up:
        • Type system is based on C#
        • Expression language IS C#
        • User-defined functions (U-SQL and C#)
        • User-defined Aggregators (C#)
        • User-defined Operators (UDO) (C#)
     • U-SQL provides the Parallelization and Scale-out Framework for user code:
        • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
     • Federated query across distributed data sources
     • Example script from the slide:

       REFERENCE ASSEMBLY MyDB.MyAssembly;

       CREATE TABLE T( cid int, first_order DateTime, last_order DateTime,
                       order_count int, order_amount float );

       // Schema on read: column names and types are applied at extraction time.
       @o = EXTRACT oid int, cid int, odate DateTime, amount float
            FROM "/input/orders.txt"
            USING Extractors.Csv();

       @c = EXTRACT cid int, name string, city string
            FROM "/input/customers.txt"
            USING Extractors.Csv();

       // Built-in aggregates, a user-defined aggregate (AGG<...>), C# expressions
       // and a user-defined function in one declarative query.
       @j = SELECT c.cid, MIN(o.odate) AS firstorder, MAX(o.odate) AS lastorder,
                   COUNT(o.oid) AS ordercnt,
                   AGG<MyAgg.MySum>(o.amount) AS totalamount
            FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
            WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10
            GROUP BY c.cid;

       // User-defined outputter.
       OUTPUT @j TO "/output/result.txt" USING new MyData.Write();

       INSERT INTO T SELECT * FROM @j;
  24. Expression-flow Programming Style
     • Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model
     • Execution plan that is optimized out-of-the-box and without user intervention
     • Per-job and user-driven level of parallelization
     • Detailed visibility into execution steps, for debugging
     • Heatmap-like functionality to identify performance bottlenecks
  25. Summary
     • Unifies natively SQL’s declarativity and C#’s extensibility
     • Unifies querying structured and unstructured data
     • Unifies local and remote queries
     • Increase productivity and agility from Day 1 forward for YOU!
     • Sign up for an Azure Data Lake account and join the Public Preview at http://www.azure.com/datalake, and give us your feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey
  26. Additional resources
     • Tools:
        • http://aka.ms/adltoolsVS
     • Blogs, videos and community page:
        • http://usql.io (link to GitHub with code samples)
        • http://blogs.msdn.com/b/visualstudio/
        • http://azure.microsoft.com/en-us/blog/topics/big-data/
        • https://channel9.msdn.com/Search?term=U-SQL#ch9Search
     • Documentation, articles and slides:
        • http://aka.ms/usql_reference
        • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
        • https://msdn.microsoft.com/en-us/magazine/mt614251
        • http://www.slideshare.net/MichaelRys
     • ADL forums and feedback:
        • http://aka.ms/adlfeedback
        • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
        • http://stackoverflow.com/questions/tagged/u-sql
  27. http://aka.ms/AzureDataLake