Microsoft's Big Play for Big Data
Andrew J. Brust
CEO and Founder
Blue Badge Insights
Level: Intermediate
Meet Andrew
 •   CEO and Founder, Blue Badge Insights
 •   Big Data blogger for ZDNet
 •   Microsoft Regional Director, MVP
 •   Co-chair of VSLive!; 17 years as a speaker
 •   Founder, Microsoft BI User Group of NYC
     – http://www.msbinyc.com
 •   Co-moderator, NYC .NET Developers Group
     – http://www.nycdotnetdev.com
 •   “Redmond Review” columnist for
     Visual Studio Magazine and Redmond Developer
     News
 •   brustblog.com, Twitter: @andrewbrust
My New Blog (bit.ly/bigondata)
Read All About It!
What is Big Data?
•   100s of TB into PB and higher
•   Involving data from: financial data,
    sensors, web logs, social media, etc.
•   Parallel processing often involved
    – Hadoop is emblematic, but other technologies are Big
      Data too
•   Processing of data sets too large for
    transactional databases
    – Analyzing interactions, rather than transactions
    – The three V’s: Volume, Velocity, Variety
•   Big Data tech sometimes imposed on
    small data problems
What’s MapReduce?
•   “Big” input data as a series of key-value pairs
•   Partition the data and send to mappers
    (nodes in cluster)
•   Mappers pre-aggregate by key, then all
    output for a given key goes to the same
    reducer
•   Reducer completes aggregations; one
    output per key, with value
•   Map and Reduce code natively written as
    Java functions
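
The flow above can be sketched in a few lines of Python. This is only an in-memory illustration, not Hadoop code: the function names (mapper, shuffle, reducer, map_reduce), the sample keys, and the choice of summing as the aggregation are all assumptions made for the example.

```python
from collections import defaultdict

def mapper(partition):
    """Pre-aggregate one partition: sum the values of each key locally."""
    counts = defaultdict(int)
    for key, value in partition:
        counts[key] += value
    return list(counts.items())

def shuffle(mapper_outputs):
    """Route all mapper output for a given key to the same reducer input."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Complete the aggregation: one output per key, with its value."""
    return key, sum(values)

def map_reduce(pairs, num_mappers=3):
    # Partition the key-value input and hand each slice to a mapper.
    partitions = [pairs[i::num_mappers] for i in range(num_mappers)]
    mapper_outputs = [mapper(p) for p in partitions]
    grouped = shuffle(mapper_outputs)
    return dict(reducer(key, values) for key, values in grouped.items())

# Hypothetical input: six (key, 1) pairs spread over three keys.
pairs = [("K1", 1), ("K2", 1), ("K1", 1), ("K3", 1), ("K2", 1), ("K1", 1)]
result = map_reduce(pairs)
```

On a real cluster the mappers run on separate nodes and the shuffle moves data over the network; here everything runs in one process so the data flow is easy to follow.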
MapReduce, in a Diagram
[Diagram: the input is split across six mappers; mapper output is grouped by key (K1, K2, K3) and routed to three reducers, whose outputs are combined into the final output.]
What’s a Distributed File System?
•   One where data gets distributed over
    commodity drives on commodity servers
•   Data is replicated
•   If one box goes down, no data lost
    – Except the name node = SPOF!
•   BUT: HDFS files are write-once
    – Existing file contents cannot be modified in place
    – So updates require drop + re-write (slow)
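
A toy model of replication, assuming a simple round-robin placement and a replication factor of 3. The node names, block names, and both helper functions are hypothetical; real HDFS placement is rack-aware and more involved.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {
            nodes[(i + r) % len(nodes)] for r in range(replication)
        }
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}

# Hypothetical four-node cluster holding five blocks.
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3", "blk_4"]

# The placement map plays the role of the name node's metadata.
placement = place_blocks(blocks, nodes)
survivors = readable_blocks(placement, failed_nodes=["node2"])
```

With one box down, every block is still readable. The placement map itself exists in only one place in this sketch, which is the same weakness the name node bullet points out.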
Hadoop = MapReduce + HDFS
•   Modeled after Google MapReduce + GFS
•   Have more data? Just add more nodes to
    cluster.
    – Mappers execute in parallel
    – Hardware is commodity
    – “Scaling out”
•   Use of HDFS means data may well be local
    to the mapper processing it
•   So, not just parallel, but minimal data
    movement, which avoids network
    bottlenecks
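
The data-locality point can be illustrated with a small scheduling sketch. It assumes a simplified rule (run each mapper on a free node that already stores the block, otherwise fall back to any free node); the function and all block and node names are made up for the example and do not reflect Hadoop's actual scheduler.

```python
def assign_mappers(block_locations, free_nodes):
    """Prefer running each block's mapper on a node that already stores it."""
    assignments = {}
    for block, holders in block_locations.items():
        local = [node for node in free_nodes if node in holders]
        # Fall back to the first free node when no replica is local.
        node = local[0] if local else free_nodes[0]
        assignments[block] = (node, node in holders)
    return assignments

# Hypothetical replica locations for three blocks.
block_locations = {
    "blk_0": {"node1", "node2"},
    "blk_1": {"node2", "node3"},
    "blk_2": {"node4"},
}
assignments = assign_mappers(
    block_locations, free_nodes=["node1", "node2", "node3"]
)
local_count = sum(1 for _, is_local in assignments.values() if is_local)
```

Only the block whose sole replica sits on a busy node has to cross the network; the other two mappers read from local disk.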
What’s NoSQL?
•   Databases that are non-relational (don’t let
    name fool you, some actually use SQL)
•   Four kinds:
    – Key-Value Store
      Schema-free
      FYI: Azure Table Storage is an example
    – Document Store
      All data stored in JSON objects
    – Wide-Column Store
      Define column families, but not columns
    – Graph database
      Manage relationships between objects
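
As a rough illustration of the four kinds, here is how the same sort of data might be shaped in each, using plain Python structures. All keys, field names, and the followers_of helper are invented for the example; no real database API is involved.

```python
import json

# Key-value store: opaque values looked up by key, no schema.
key_value = {"user:42": b"\x00\x01serialized-blob"}

# Document store: each record is a self-describing JSON object.
document = json.loads('{"_id": 42, "name": "Ada", "tags": ["admin", "beta"]}')

# Wide-column store: column families are defined up front,
# but the columns inside each family are not.
wide_column = {
    "row-42": {
        "profile": {"name": "Ada"},
        "activity": {"2012-05-01": "login", "2012-05-02": "upload"},
    }
}

# Graph database: nodes plus explicit relationships between them.
graph = {
    "nodes": {"ada": {}, "bob": {}},
    "edges": [("ada", "FOLLOWS", "bob")],
}

def followers_of(graph, person):
    """Walk the relationships to find who follows `person`."""
    return [
        src for src, rel, dst in graph["edges"]
        if rel == "FOLLOWS" and dst == person
    ]
```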
What’s HBase?
•   A Wide-Column Store
•   Modeled after Google BigTable
•   Born at Powerset in 2007
    – Powerset acquired by Microsoft in 2008
    – Adopted in 2010 by Facebook for messaging platform
•   Uses HDFS
    – Therefore, Hadoop-compatible
•   Hadoop often used with HBase
    – But you can use either without the other
The Hadoop Stack
•   Hadoop
    – MapReduce, HDFS
•   HBase
    – To a lesser extent: Cassandra, Hypertable
•   Hive, Pig
    – Hive: SQL-like “data warehouse” system
    – Pig: data transformation language
•   Sqoop
    – Import/export between HDFS, HBase,
      Hive and relational data warehouses
•   Flume
    – Log file integration
•   Mahout
    – Data Mining
What’s Hive?
•   Began as Hadoop sub-project
    – Now top-level Apache project
•   Provides a SQL-like (“HiveQL”)
    abstraction over MapReduce
•   Has its own HDFS table file format (and it’s
    fully schema-bound)
•   Can also work over HBase
•   Acts as a bridge to many BI products
    which expect tabular data
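
To make the "SQL-like abstraction over MapReduce" idea concrete, the sketch below runs one aggregate query and then computes the same answer with hand-written map and reduce steps. SQLite is used purely as a stand-in so the example runs anywhere; it is not Hive, and the table, columns, and data are invented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical page-hit rows: (page, response time in ms).
rows = [("home", 120), ("cart", 45), ("home", 80), ("cart", 30), ("help", 10)]

# Declarative version: the kind of statement a HiveQL user would write.
QUERY = "SELECT page, SUM(ms) FROM hits GROUP BY page"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT, ms INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
sql_result = dict(conn.execute(QUERY).fetchall())
conn.close()

# Hand-written version: the map and reduce steps the query stands for.
grouped = defaultdict(list)
for page, ms in rows:  # map: emit (page, ms), grouped by key
    grouped[page].append(ms)
mr_result = {page: sum(values) for page, values in grouped.items()}  # reduce
```

Both paths give the same totals; the difference is that the query author never has to think about mappers or reducers.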
Hadoop Distributions
•   Cloudera
•   Hortonworks
    – HCatalog: Hive/Pig/MR Interop
•   MapR
    – Own file system with NFS access replaces HDFS
•   IBM InfoSphere BigInsights
    – HDFS<->DB2 integration
•   And now Microsoft…
Project “Isotope”
•   Work with Hortonworks to create “distro”
    of Hadoop that runs on Windows Server
    and Windows Azure
    – Hortonworks are ex-Yahoo FTEs who are Hadoop
      pioneers
•   Create ODBC Driver for Hive
    – And Excel Add-In that uses it
•   Build JavaScript command line and
    MapReduce framework
•   Contribute it all back to open source
    Apache project
Hadoop on Azure
•   Install onto your own Azure VMs and build
    a cluster, or…
•   Provision a cluster in one step
    – Give it a name
    – Choose number of nodes and storage size in cluster
    – Wait for it to provision
    – Go!
Provisioning a Cluster
Submitting, Running and Monitoring Jobs
•   Upload a JAR
•   Use .NET
•   Use the JavaScript Console
•   Use the Hive Console
Running MapReduce Jobs
Hadoop on Azure Data Sources
•   Files in HDFS
•   Azure Blob Storage
•   Amazon S3 Storage
•   Hive Tables
Review: ODBC Connection Types
•   Registry-based
    – User Data Source Name (DSN)
    – System DSN
•   File-based
    – File DSN
•   String-based
    – DSN-less connection
•   We need file-based
•   Wizard obscures how to do this
•   Don’t forget to open the ODBC port!
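
The three connection types differ mainly in what the connection string carries. The sketch below only builds the strings, using the standard ODBC keywords DSN, FILEDSN, and DRIVER. The helper names, the driver name, the host, the port value, and the file path are placeholders, not the real values for the Hive ODBC driver.

```python
def dsn_connection(dsn_name):
    """Registry-based: refers to a User or System DSN by name."""
    return f"DSN={dsn_name};"

def file_dsn_connection(path):
    """File-based: points at a .dsn file on disk."""
    return f"FILEDSN={path};"

def dsn_less_connection(driver, **attributes):
    """String-based: names the driver and every attribute inline."""
    parts = [f"DRIVER={{{driver}}}"]
    parts += [f"{key}={value}" for key, value in attributes.items()]
    return ";".join(parts) + ";"

# All names and values below are hypothetical placeholders.
registry_based = dsn_connection("HiveSales")
file_based = file_dsn_connection(r"C:\dsn\hive.dsn")
string_based = dsn_less_connection(
    "Example Hive ODBC Driver", HOST="example.invalid", PORT=10000
)
```

A file DSN keeps the settings in a portable file rather than in the registry of one machine, which is one reason a file-based connection can be the practical choice.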
Hive ODBC Setup, Excel Add-In
ODBC Driver’s Untold Story
•   Works with any Hive install/Hadoop
    cluster, not just Windows-based ones.
How Does SQL Server Fit In?
•   RDBMS + PDW: Sqoop connectors
•   RDBMS: Columnstore Indexes
    – Enterprise Edition only
•   Analysis Services: Tabular Mode
    – Compatible with ODBC Driver; Multidimensional mode is not
•   RDBMS + SSAS Tabular: DirectQuery
•   PowerPivot (as with SSAS Tabular)
•   Power View
    – Works against PowerPivot and SSAS Tabular
Querying Hadoop from SQL Server BI
The “Data-Refinery” Idea
•   Use Hadoop to “on-board” unstructured
    data, then extract manageable subsets
•   Load the subsets into conventional DW/BI
    servers and use familiar analytics tools to
    examine
•   This is the current rationalization of
    Hadoop + BI tools’ coexistence
•   Will it stay this way?
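
A miniature version of the refinery pattern, under heavy assumptions: a Python list stands in for raw files in Hadoop, an in-memory SQLite table stands in for the DW/BI server, and the log format, table name, and both functions are invented for illustration.

```python
import sqlite3
from collections import Counter

# Hypothetical raw web-log lines, including one that does not parse.
raw_log = [
    "2012-05-01 10:00:01 GET /home 200",
    "2012-05-01 10:00:02 GET /cart 500",
    "malformed line",
    "2012-05-01 10:00:03 GET /home 200",
    "2012-05-02 09:15:00 GET /home 404",
]

def refine(lines):
    """Parse raw lines and keep only a small (day, status) -> hits summary."""
    summary = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip records that do not parse
        day, _time, _method, _path, status = fields
        summary[(day, int(status))] += 1
    return summary

def load(summary):
    """Load the refined subset into a relational table for BI tools."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_status (day TEXT, status INTEGER, hits INTEGER)"
    )
    conn.executemany(
        "INSERT INTO daily_status VALUES (?, ?, ?)",
        [(day, status, hits) for (day, status), hits in summary.items()],
    )
    return conn

conn = load(refine(raw_log))
ok_hits = conn.execute(
    "SELECT SUM(hits) FROM daily_status WHERE status = 200"
).fetchone()[0]
```

The raw lines never reach the relational side; only the much smaller summary does, and familiar SQL tooling takes over from there.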
Usability Impact
•   PowerPivot makes analysis much easier,
    self-service
•   Power View is great for discovery and
    visualization; also self-service
•   Combine with the Hive ODBC driver and
    suddenly Hadoop is accessible to
    business users
•   Caveats
    – Someone has to write the HiveQL
    – Can query Big Data, but the result set must be much smaller
Other Relevant MS Technologies
•   SQL Server Components:
    – SQL Server Parallel Data Warehouse
    – StreamInsight
•   Azure Components:
    – Data Explorer
    – DataMarket
•   Deprecated MSR Project
    – Dryad
Resources
•   Big On Data blog
    – http://www.zdnet.com/blog/big-data
•   Apache Hadoop home page
    – http://hadoop.apache.org/
•   Hive & Pig home pages
    – http://hive.apache.org/
    – http://pig.apache.org/
•   Hadoop on Azure home page
    – https://www.hadooponazure.com/
•   SQL Server 2012 Big Data
    – http://bit.ly/sql2012bigdata
Thank you



•   andrew.brust@bluebadgeinsights.com
•   @andrewbrust on twitter
•   Want to get the free “Redmond Roundup
    Plus”?
    – Text “bluebadge” to 22828

  • 2. Meet Andrew • CEO and Founder, Blue Badge Insights • Big Data blogger for ZDNet • Microsoft Regional Director, MVP • Co-chair VSLive! and 17 years as a speaker • Founder, Microsoft BI User Group of NYC – http://www.msbinyc.com • Co-moderator, NYC .NET Developers Group – http://www.nycdotnetdev.com • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News • brustblog.com, Twitter: @andrewbrust
  • 3. My New Blog (bit.ly/bigondata)
  • 5. What is Big Data? • 100s of TB into PB and higher • Involving data from: financial data, sensors, web logs, social media, etc. • Parallel processing often involved – Hadoop is emblematic, but other technologies are Big Data too • Processing of data sets too large for transactional databases – Analyzing interactions, rather than transactions – The three V’s: Volume, Velocity, Variety • Big Data tech sometimes imposed on small data problems
  • 6. What’s MapReduce? • “Big” input data as key-value pair series • Partition the data and send to mappers (nodes in cluster) • Mappers pre-aggregate by key, then all output for (a) given key(s) goes to a reducer • Reducer completes aggregations; one output per key, with value • Map and Reduce code natively written as Java functions
  • 7. MapReduce, in a Diagram [Diagram: the input is partitioned across mappers; each mapper's output is grouped by key (K1, K2, K3) and routed to a reducer; the reducers' outputs are combined into the final output]
  • 8. What’s a Distributed File System? • One where data gets distributed over commodity drives on commodity servers • Data is replicated • If one box goes down, no data lost – Except the name node = SPOF! • BUT: HDFS is immutable – Files can only be written to once – So updates require drop + re-write (slow)
  • 9. Hadoop = MapReduce + HDFS • Modeled after Google MapReduce + GFS • Have more data? Just add more nodes to cluster. – Mappers execute in parallel – Hardware is commodity – “Scaling out” • Use of HDFS means data may well be local to mapper processing • So, not just parallel, but minimal data movement, which avoids network bottlenecks
  • 10. What’s NoSQL? • Databases that are non-relational (don’t let name fool you, some actually use SQL) • Four kinds: – Key-Value Store: schema-free (FYI: Azure Table Storage is an example) – Document Store: all data stored in JSON objects – Wide-Column Store: define column families, but not columns – Graph database: manage relationships between objects
  • 11. What’s HBase? • A Wide-Column Store • Modeled after Google BigTable • Born at Powerset in 2007 – Powerset acquired by Microsoft in 2008 – Adopted in 2010 by Facebook for messaging platform • Uses HDFS – Therefore, Hadoop-compatible • Hadoop often used with HBase – But you can use either without the other
  • 12. The Hadoop Stack • Hadoop – MapReduce, HDFS • HBase – Lesser extent: Cassandra, HyperTable • Hive, Pig – SQL-like “data warehouse” system – Data transformation language • Sqoop – Import/export between HDFS, HBase, Hive and relational data warehouses • Flume – Log file integration • Mahout – Data Mining
  • 13. What’s Hive? • Began as Hadoop sub-project – Now top-level Apache project • Provides a SQL-like (“HiveQL”) abstraction over MapReduce • Has its own HDFS table file format (and it’s fully schema-bound) • Can also work over HBase • Acts as a bridge to many BI products which expect tabular data
  • 14. Hadoop Distributions • Cloudera • Hortonworks – HCatalog: Hive/Pig/MR Interop • MapR – Network File System replaces HDFS • IBM InfoSphere BigInsights – HDFS<->DB2 integration • And now Microsoft…
  • 15. Project “Isotope” • Work with Hortonworks to create “distro” of Hadoop that runs on Windows Server and Windows Azure – Hortonworks are ex-Yahoo FTEs who are Hadoop pioneers • Create ODBC Driver for Hive – And Excel Add-In that uses it • Build JavaScript command line and MapReduce framework • Contribute it all back to open source Apache project
  • 16. Hadoop on Azure • Install onto your own Azure VMs and build a cluster, or… • Provision a cluster in one step – Give it a name – Choose number of nodes and storage size in cluster – Wait for it to provision – Go!
  • 18. Submitting, Running and Monitoring Jobs • Upload a JAR • Use .NET • Use the JavaScript Console • Use the Hive Console
  • 20. Hadoop on Azure Data Sources • Files in HDFS • Azure Blob Storage • Amazon S3 Storage • Hive Tables
  • 21. Review: ODBC Connection Types • Registry-based – User Data Source Name (DSN) – System DSN • File-based – File DSN • String-based – DSN-less connection • We need file-based • Wizard obfuscates how to do this • Don’t forget to open the ODBC port!
  • 23. ODBC Driver’s Untold Story • Works with any Hive install/Hadoop cluster, not just Windows-based ones.
  • 24. How Does SQL Server Fit In? • RDBMS + PDW: Sqoop connectors • RDBMS: Columnstore Indexes – Enterprise Edition only • Analysis Services: Tabular Mode – Compatible with ODBC Driver; Multidimensional mode is not • RDBMS + SSAS Tabular: DirectQuery • PowerPivot (as with SSAS Tabular) • Power View – Works against PowerPivot and SSAS Tabular
  • 26. The “Data-Refinery” Idea • Use Hadoop to “on-board” unstructured data, then extract manageable subsets • Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine • This is the current rationalization of Hadoop + BI tools’ coexistence • Will it stay this way?
  • 27. Usability Impact • PowerPivot makes analysis much easier, self-service • Power View is great for discovery and visualization; also self-service • Combine with the Hive ODBC driver and suddenly Hadoop is accessible to business users • Caveats – Someone has to write the HiveQL – Can query Big Data, but must have smaller result
  • 28. Other Relevant MS Technologies • SQL Server Components: – SQL Server Parallel Data Warehouse – StreamInsight • Azure Components: – Data Explorer – DataMarket • Deprecated MSR Project – Dryad
  • 29. Resources • Big On Data blog – http://www.zdnet.com/blog/big-data • Apache Hadoop home page – http://hadoop.apache.org/ • Hive & Pig home pages – http://hive.apache.org/ – http://pig.apache.org/ • Hadoop on Azure home page – https://www.hadooponazure.com/ • SQL Server 2012 Big Data – http://bit.ly/sql2012bigdata
  • 30. Thank you • andrew.brust@bluebadgeinsights.com • @andrewbrust on twitter • Want to get the free “Redmond Roundup Plus?” – Text “bluebadge” to 22828
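
The MapReduce flow from slide 6 (partition the input, pre-aggregate by key in the mappers, route each key to a reducer, emit one output per key) can be sketched as a single-process word count. This is only an illustration in plain Python, not Hadoop code: the slides note that Map and Reduce functions are natively written in Java, and the function names here (`partition`, `mapper`, `shuffle`, `reducer`, `map_reduce`) are hypothetical and not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def partition(records, n):
    """Split the input records into n slices, one per (simulated) mapper."""
    return [records[i::n] for i in range(n)]


def mapper(lines):
    """Pre-aggregate word counts by key within one slice of the input."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
    return list(counts.items())


def shuffle(mapper_outputs):
    """Group all mapper output by key so each key reaches a single reducer."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapper_outputs):
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Complete the aggregation: one output per key, with its value."""
    return key, sum(values)


def map_reduce(records, num_mappers=3):
    """Run the mappers one after another here; a real cluster runs them in parallel."""
    outputs = [mapper(part) for part in partition(records, num_mappers)]
    grouped = shuffle(outputs)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(map_reduce(["big data", "big play", "data data"]))
```

In a real cluster the mappers would run on separate nodes, ideally the ones that already hold the relevant HDFS blocks, and the shuffle step would move data over the network; the sketch keeps everything in memory to show only the shape of the computation.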