SQL Server Live! Orlando 2012




                                                           Microsoft's Big Play
                                                                  for Big Data
                                                                 Andrew J. Brust
                                                                    CEO and Founder
                                                                  Blue Badge Insights


                                                                      Level: Intermediate




                        Meet Andrew

                            •   CEO and Founder, Blue Badge Insights
                            •   Big Data blogger for ZDNet
                            •   Microsoft Regional Director, MVP
                            •   Co-chair of VSLive!; 17 years as a speaker
                            •   Founder, Microsoft BI User Group of NYC
                                – http://www.msbinyc.com
                            •   Co-moderator, NYC .NET Developers Group
                                – http://www.nycdotnetdev.com
                            •   “Redmond Review” columnist for
                                Visual Studio Magazine and Redmond
                                Developer News
                            •   brustblog.com, Twitter: @andrewbrust




SQTH8 - Microsoft's Big Play for Big Data - Andrew Brust © 2012 SQL Server Live! All rights reserved.   1




                            My New Blog (bit.ly/bigondata)




                            Read all about it!








                            What is Big Data?
                            •   100s of TB into PB and higher
                            •   Involving data from: financial data,
                                sensors, web logs, social media, etc.
                            •   Parallel processing often involved
                                – Hadoop is emblematic, but other technologies are Big
                                  Data too
                            •   Processing of data sets too large for
                                transactional databases
                                – Analyzing interactions, rather than transactions
                                – The three V’s: Volume, Velocity, Variety
                            •   Big Data tech sometimes imposed on
                                small data problems




                            What’s MapReduce?
                            •   “Big” input data as key-value pair series
                            •   Partition the data and send to mappers
                                (nodes in cluster)
                            •   Mappers pre-aggregate by key, then all
                                output for (a) given key(s) goes to a
                                reducer
                            •   Reducer completes aggregations; one
                                output per key, with value
                            •   Map and Reduce code natively written as
                                Java functions
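The flow above can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop code: the function names (mapper, shuffle, reducer, run_job) and the word-count task are assumptions chosen for illustration, and the "nodes" are just list slices.

```python
from collections import defaultdict
from itertools import chain


def mapper(partition):
    """Pre-aggregate one partition: count words locally, emit (word, count) pairs."""
    counts = defaultdict(int)
    for line in partition:
        for word in line.lower().split():
            counts[word] += 1
    return list(counts.items())


def shuffle(mapper_outputs):
    """Group all mapper output by key so each key reaches exactly one reducer call."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapper_outputs):
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Complete the aggregation: one output per key, with its value."""
    return key, sum(values)


def run_job(lines, num_mappers=2):
    # Partition the input; each slice stands in for the data sent to one mapper node.
    partitions = [lines[i::num_mappers] for i in range(num_mappers)]
    mapped = [mapper(p) for p in partitions]
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())


result = run_job(["big data big play", "data for big data"])
print(result)
```

On a real cluster the partitions live on different machines and the shuffle moves data over the network; the logic per key is the same.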








                            MapReduce, in a Diagram


                            [Diagram: the input is partitioned across six mappers;
                            mapper output is grouped by key (K1, K2, K3), each key
                            feeding one of three reducers; the reducer outputs
                            combine into the final output.]




                            What’s a Distributed File System?
                            •    One where data gets distributed over
                                 commodity drives on commodity servers
                            •    Data is replicated
                            •    If one box goes down, no data lost
                                 – “Shared Nothing”
                            •    BUT: Effectively immutable
                                 – Files can only be written once; existing contents
                                   can’t be edited in place
                                 – So updates require drop + re-write (slow)
                                 – You can append, though
                                 – Like a DVD/CD-ROM
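Why replication means "no data lost" can be shown with a toy placement model. This is a sketch under loud assumptions: the round-robin placement, the node names and the replication factor of 3 are illustrative only, and real distributed file systems use more elaborate placement policies.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
placement = place_blocks(num_blocks=6, nodes=nodes, replication=3)
survivors = readable_blocks(placement, failed_nodes=["node2"])
print(len(survivors), "of", len(placement), "blocks still readable")
```

With three replicas per block, data only becomes unreadable once every node holding a given block is down at the same time.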








                            Hadoop = MapReduce + HDFS
                            •   Modeled after Google MapReduce + GFS
                            •   Have more data? Just add more nodes to
                                cluster.
                                – Mappers execute in parallel
                                – Hardware is commodity
                                – “Scaling out”
                            •   Use of HDFS means data may well be local
                                to mapper processing
                            •   So, not just parallel, but minimal data
                                movement, which avoids network
                                bottlenecks
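The data-locality point can be illustrated with a greedy toy scheduler that prefers a node already holding the block. This is not Hadoop's scheduler: the function name, block IDs and node names are hypothetical, and one task per free node is assumed.

```python
def schedule_map_tasks(block_locations, free_nodes):
    """Assign each block's map task to a free node, preferring a replica holder.

    Returns {block: (node, is_local)}; is_local=False means the block would
    have to travel over the network to reach its mapper.
    """
    available = list(free_nodes)
    assignments = {}
    for block, holders in block_locations.items():
        if not available:
            raise ValueError("more blocks than free nodes in this toy model")
        local = [node for node in available if node in holders]
        # Prefer a data-local node; otherwise fall back to any free node.
        node = local[0] if local else available[0]
        assignments[block] = (node, node in holders)
        available.remove(node)
    return assignments


block_locations = {
    "b0": ["node1", "node2"],
    "b1": ["node2", "node3"],
    "b2": ["node1", "node3"],
}
plan = schedule_map_tasks(block_locations, ["node1", "node2", "node3"])
print(plan)
```

Here every task lands on a node that already stores its block, so no input data crosses the network before the map phase.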




                            What’s NoSQL?
                            •   Databases that are non-relational (don’t let
                                name fool you, some actually use SQL)
                            •   Four kinds:
                                – Key-Value Store
                                   Schema-free
                                   FYI: Azure Table Storage is an example
                                – Document Store
                                   Data stored as documents, typically JSON objects
                                – Wide-Column Store
                                   Define column families, but not columns
                                – Graph database
                                   Manage relationships between objects
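To make the four models concrete, here is a sketch of each using nothing but Python's built-in data structures. It mimics the shape of the data only, not any product's API; all keys, names and values are made up for illustration.

```python
import json

# Key-value store: opaque values looked up by key; no schema at all.
key_value = {"session:42": b"opaque-bytes"}

# Document store: each record is a self-describing JSON object.
document = json.loads('{"_id": 1, "name": "Ada", "tags": ["bi", "hadoop"]}')

# Wide-column store: the column families ("profile", "metrics") are defined
# up front, but the columns inside each family can differ from row to row.
wide_column = {
    "row-1": {"profile": {"name": "Ada"}, "metrics": {"2012-12-10": 7}},
    "row-2": {"profile": {"name": "Bob", "city": "NYC"}, "metrics": {}},
}

# Graph database: nodes plus explicit, queryable relationships between them.
edges = [("Ada", "follows", "Bob"), ("Bob", "follows", "Cy")]


def neighbors(person, relation):
    """Follow one kind of relationship outward from a node."""
    return [dst for src, rel, dst in edges if src == person and rel == relation]


print(neighbors("Ada", "follows"))
```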








                            What’s HBase?
                            •    A Wide-Column Store
                            •    Modeled after Google BigTable
                            •    Uses HDFS
                                  – Therefore, Hadoop-compatible
                            •    Hadoop often used with HBase
                                  – But you can use either without the other




                            The Hadoop Stack
                                Log file integration



                                Machine Learning/Data Mining

                                RDBMS Import/Export

                                Query: HiveQL and Pig Latin

                                Database

                                MapReduce, HDFS








                            What’s Hive?
                            •   Began as Hadoop sub-project
                                – Now top-level Apache project
                            •   Provides a SQL-like (“HiveQL”)
                                abstraction over MapReduce
                            •   Has its own HDFS table file format (and it’s
                                fully schema-bound)
                            •   Can also work over HBase
                            •   Acts as a bridge to many BI products
                                which expect tabular data
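What a "SQL-like abstraction" buys you is easiest to see with a query. The sketch below runs a GROUP BY that uses only constructs common to standard SQL and HiveQL. Loud assumption: it executes against Python's built-in sqlite3 purely as a stand-in so the example runs without a cluster; the table name and data are made up, and nothing here talks to Hive.

```python
import sqlite3

# One declarative statement replaces hand-written map and reduce functions.
QUERY = "SELECT word, COUNT(*) AS n FROM words GROUP BY word ORDER BY word"


def run_query(words):
    conn = sqlite3.connect(":memory:")  # stand-in engine, NOT Hive
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in words])
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


rows = run_query("big data big play".split())
print(rows)
```

In Hive, a query of this shape would be compiled into MapReduce work: the GROUP BY column plays the role of the key and COUNT(*) the role of the reducer's aggregation.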




                            Hadoop Distributions
                            •   Cloudera
                            •   Hortonworks
                                – HCatalog: Hive/Pig/MR Interop
                            •   MapR
                                – Its own NFS-accessible file system replaces HDFS
                            •   IBM InfoSphere BigInsights
                                – HDFS<->DB2 integration
                            •   And now Microsoft…








                            Microsoft HDInsight
                            •   Developed with Hortonworks and
                                incorporates Hortonworks Data Platform
                                (HDP) for Windows
                            •   Windows Azure HDInsight and Microsoft
                                HDInsight (for Windows Server)
                                – Single node preview runs on Windows client
                            •   Includes ODBC Driver for Hive
                                – And Excel Add-In that uses it
                            •   JavaScript MapReduce framework
                            •   Contribute it all back to open source
                                Apache Project




                            Azure HDInsight Provisioning
                            •   Give cluster a name
                                – Hostname will be name.cloudapp.net
                            •   Create credentials
                                – Used for ODBC connections and RDP sessions
                            •   Choose whether to use SQL Azure for the Hive
                                metastore
                            •   [Choose number of nodes and storage
                                size in cluster]
                            •   Wait for cluster to provision
                            •   Click link to go to portal








                            Submitting, Running and
                            Monitoring Jobs
                            •   Upload a JAR
                            •   Use Streaming
                                – Use other languages (i.e., languages other than Java)
                                  to write MapReduce code
                                – Python is a popular option
                                – Any executable works, even C# console apps
                                – On HDInsight, JavaScript works too
                                – Still uses a JAR file: streaming.jar
                            •   Run at command line (passing JAR name
                                and params) or use GUI
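Streaming works because the contract is just text over standard input and output: the mapper emits tab-separated key/value lines, and the reducer receives them sorted by key. Below is a minimal word-count sketch of both sides in Python. The function names are assumptions, and the in-memory sorted() call stands in for the sort Hadoop performs between the two phases.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(step, stdin=sys.stdin, stdout=sys.stdout):
    """Wire one step to stdin/stdout, as a streaming script would."""
    for record in step(stdin):
        stdout.write(record + "\n")


# Local dry run without a cluster.
mapped = sorted(map_lines(["Big data", "big play"]))
reduced = list(reduce_lines(mapped))
print(reduced)
```

To use this with Hadoop Streaming, each step would live in its own script that calls run(map_lines) or run(reduce_lines), and the scripts would be named in the streaming job's -mapper and -reducer options. The location of the streaming JAR varies by installation.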








                            Amenities for
                            Visual Studio/.NET

                            •   Hortonworks Data Platform for Windows (at the center)
                            •   MRLib (NuGet Package)
                            •   MR code in C#: HadoopJob, MapperBase, ReducerBase
                            •   LINQ to Hive
                            •   OdbcClient + Hive ODBC Driver
                            •   Debugging
                            •   Deployment




                            Running MapReduce
                            Jobs








                            HDInsight Data Sources
                            •   Files in HDFS
                            •   Azure Blob Storage (Azure HDInsight only)
                            •   Hive Tables
                            •   HBase?




                            Review: ODBC Connection Types
                            •   Registry-based
                                – User Data Source Name (DSN)
                                – System DSN
                            •   File-based
                                – File DSN
                            •   String-based
                                – DSN-less connection
                            •   We need file-based
                            •   Wizard obscures how to do this
                            •   Don’t forget to open the ODBC port!
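The three connection types differ mainly in what the connection string points at. The sketch below only builds the strings, using the standard ODBC keywords DSN, FILEDSN, DRIVER, UID and PWD. Everything else is an assumption for illustration: the driver name, the DSN name, the file path, the credentials, and the HOST/PORT attributes, which are driver-specific and may be spelled differently by a real Hive driver.

```python
def dsn_connection(dsn, uid, pwd):
    """Registry-based: refer to a User or System DSN by name."""
    return f"DSN={dsn};UID={uid};PWD={pwd}"


def file_dsn_connection(path, uid, pwd):
    """File-based: refer to a .dsn file on disk."""
    return f"FILEDSN={path};UID={uid};PWD={pwd}"


def dsnless_connection(driver, attributes):
    """String-based: name the driver and pass every setting inline."""
    parts = [f"DRIVER={{{driver}}}"]
    parts += [f"{key}={value}" for key, value in attributes.items()]
    return ";".join(parts)


# All values below are placeholders, not real settings.
print(dsn_connection("HiveDemo", "user", "secret"))
print(file_dsn_connection(r"C:\dsn\hive.dsn", "user", "secret"))
print(dsnless_connection(
    "Example Hive ODBC Driver",                       # hypothetical driver name
    {"HOST": "name.cloudapp.net", "PORT": "10000"},   # hypothetical attributes
))
```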








                            Hive ODBC Setup,
                            Excel Add-In




                            ODBC Driver’s Untold Story
                            •   Works with any Hive install/Hadoop
                                cluster, not just Windows-based ones.
                            •   Simba driver available too








                            How Does SQL Server Fit In?
                            •   RDBMS + PDW: Sqoop connectors
                            •   RDBMS: Columnstore Indexes
                                – Enterprise Edition only
                            •   Analysis Services: Tabular Mode
                                – Compatible with ODBC Driver
                                   Multidimensional mode is not
                            •   RDBMS + SSAS Tabular: DirectQuery
                            •   PowerPivot (as with SSAS Tabular)
                            •   Power View
                                – Works against PowerPivot and SSAS Tabular




                            Querying Hadoop from
                            SQL Server BI








                            The “Data-Refinery” Idea
                            •   Use Hadoop to “on-board” unstructured
                                data, then extract manageable subsets
                            •   Load the subsets into conventional DW/BI
                                servers and use familiar analytics tools to
                                examine them
                            •   This is the current rationalization of
                                Hadoop + BI tools’ coexistence
                            •   Will it stay this way?
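The refinery pattern can be sketched end to end on a laptop. Loud assumptions: the log format and table name are invented, the refine() function stands in for the Hadoop job, and Python's built-in sqlite3 stands in for the data warehouse.

```python
import sqlite3
from collections import Counter

RAW_LOG = [
    "2012-12-10 GET /home 200",
    "2012-12-10 GET /cart 500",
    "2012-12-11 GET /home 200",
    "garbled line",
    "2012-12-11 GET /home 200",
]


def refine(lines):
    """Stand-in for the Hadoop step: skip unparseable lines, count hits per (day, page)."""
    hits = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 4:
            day, _method, page, _status = parts
            hits[(day, page)] += 1
    return [(day, page, n) for (day, page), n in sorted(hits.items())]


def load(rows):
    """Stand-in for the warehouse load: a small table that BI tools can query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_hits (day TEXT, page TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO page_hits VALUES (?, ?, ?)", rows)
    return conn


subset = refine(RAW_LOG)
conn = load(subset)
top = conn.execute(
    "SELECT page, SUM(hits) FROM page_hits GROUP BY page ORDER BY 2 DESC"
).fetchall()
print(top)
```

The raw input can be arbitrarily large and messy; only the compact, structured subset ever reaches the relational side.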




                            Usability Impact
                            •   PowerPivot makes analysis much easier,
                                self-service
                            •   Power View is great for discovery and
                                visualization; also self-service
                            •   Combine with the Hive ODBC driver and
                                suddenly Hadoop is accessible to
                                business users
                            •   Caveats
                                – Someone has to write the HiveQL
                                – Can query Big Data, but the result set must be small








                            Other Relevant MS Technologies
                            •   SQL Server Components:
                                – SQL Server Parallel Data Warehouse
                                – StreamInsight
                            •   Azure Components:
                                – Data Explorer
                                – DataMarket
                            •   Deprecated MSR Project
                                – Dryad




                            Resources
                            •   Big On Data blog
                                – http://www.zdnet.com/blog/big-data
                            •   Apache Hadoop home page
                                – http://hadoop.apache.org/
                            •   Hive & Pig home pages
                                – http://hive.apache.org/
                                – http://pig.apache.org/
                            •   Hadoop on Azure home page
                                – https://www.hadooponazure.com/
                            •   SQL Server 2012 Big Data
                                – http://bit.ly/sql2012bigdata








                            Thank you



                            •   andrew.brust@bluebadgeinsights.com
                            •   @andrewbrust on twitter
                            •   Want to get the free “Redmond Roundup
                                Plus”?
                                – Text “bluebadge” to 22828




