Microsoft's Big Play for Big Data
- 1. SQL Server Live! Orlando 2012
Microsoft's Big Play
for Big Data
Andrew J. Brust
CEO and Founder
Blue Badge Insights
Level: Intermediate
Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC
– http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for
Visual Studio Magazine and Redmond
Developer News
• brustblog.com, Twitter: @andrewbrust
SQTH8 - Microsoft's Big Play for Big Data - Andrew Brust © 2012 SQL Server Live! All rights reserved. 1
- 2. SQL Server Live! Orlando 2012
My New Blog (bit.ly/bigondata)
Read all about it!
- 3. SQL Server Live! Orlando 2012
What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data,
sensors, web logs, social media, etc.
• Parallel processing often involved
– Hadoop is emblematic, but other technologies are Big
Data too
• Processing of data sets too large for
transactional databases
– Analyzing interactions, rather than transactions
– The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on
small data problems
What’s MapReduce?
• “Big” input data as key-value pair series
• Partition the data and send to mappers
(nodes in cluster)
• Mappers pre-aggregate by key, then all
output for (a) given key(s) goes to a
reducer
• Reducer completes aggregations; one
output per key, with value
• Map and Reduce code natively written as
Java functions
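The flow in the bullets above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's Java API: the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def mapper(partition):
    """Pre-aggregate one partition: return (word, count) pairs for its lines."""
    counts = defaultdict(int)
    for line in partition:
        for word in line.split():
            counts[word.lower()] += 1
    return list(counts.items())

def shuffle(mapper_outputs):
    """Group all mapper output by key so each key's values reach one reducer call."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Complete the aggregation: one output per key, with its value."""
    return key, sum(values)

def run_job(lines, num_mappers=2):
    # Partition the input; on a real cluster each partition goes to a different node.
    partitions = [lines[i::num_mappers] for i in range(num_mappers)]
    mapper_outputs = [mapper(p) for p in partitions]
    grouped = shuffle(mapper_outputs)
    return dict(reducer(key, values) for key, values in grouped.items())

result = run_job(["big data big play", "data for big data"])
print(result)
```

On a cluster the mappers run in parallel on separate nodes; the sequential list comprehension here only stands in for that step.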
- 4. SQL Server Live! Orlando 2012
MapReduce, in a Diagram
[Diagram: the input is split across several mappers; each mapper's output is grouped by key (K1, K2, K3), and the values for each key are routed as input to a reducer, which produces the final output.]
What’s a Distributed File System?
• One where data gets distributed over
commodity drives on commodity servers
• Data is replicated
• If one box goes down, no data lost
– “Shared Nothing”
• BUT: Immutable
– Files can only be written to once
– So updates require drop + re-write (slow)
– You can append though
– Like a DVD/CD-ROM
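The replication and write-once behavior described above can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not HDFS code: the class `ToyDFS`, its method names, the round-robin placement, and the replication factor used in the example are all hypothetical.

```python
class ToyDFS:
    """In-memory stand-in for a replicated, write-once file system."""

    def __init__(self, node_names, replication=3):
        self.nodes = {name: {} for name in node_names}  # live nodes: name -> {path: data}
        self.placement = {}                             # path -> nodes holding a replica
        self.replication = replication

    def create(self, path, data):
        if path in self.placement:
            raise FileExistsError(f"{path} already exists; files are write-once")
        names = sorted(self.nodes)
        copies = min(self.replication, len(names))
        start = len(self.placement) % len(names)
        # Simplistic round-robin placement (an assumption for illustration).
        targets = [names[(start + i) % len(names)] for i in range(copies)]
        for name in targets:
            self.nodes[name][path] = data
        self.placement[path] = targets

    def append(self, path, data):
        """Appending is allowed; existing bytes are never modified."""
        for name in self.placement[path]:
            if name in self.nodes:
                self.nodes[name][path] += data

    def delete(self, path):
        for name in self.placement.pop(path):
            self.nodes.get(name, {}).pop(path, None)

    def rewrite(self, path, data):
        """An 'update' is a drop followed by a full re-write."""
        self.delete(path)
        self.create(path, data)

    def fail_node(self, name):
        self.nodes.pop(name, None)

    def read(self, path):
        for name in self.placement[path]:
            if name in self.nodes:  # any surviving replica will do
                return self.nodes[name][path]
        raise IOError(f"all replicas of {path} are lost")

dfs = ToyDFS(["n1", "n2", "n3"], replication=2)
dfs.create("/logs/day1", "a")
dfs.append("/logs/day1", "b")
dfs.fail_node("n1")                # one box goes down
print(dfs.read("/logs/day1"))      # still readable from the other replica
```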
- 5. SQL Server Live! Orlando 2012
Hadoop = MapReduce + HDFS
• Modeled after Google MapReduce + GFS
• Have more data? Just add more nodes to
cluster.
– Mappers execute in parallel
– Hardware is commodity
– “Scaling out”
• Use of HDFS means data may well be local
to mapper processing
• So, not just parallel, but minimal data
movement, which avoids network
bottlenecks
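To make the data-locality point concrete, here is a hedged sketch of a scheduler that prefers a node already holding a replica of the block to be processed. It is a greedy toy, not Hadoop's actual scheduler; `schedule`, the block names, and the slot counts are assumptions for the example.

```python
def schedule(block_locations, free_slots):
    """Assign each block to a node, preferring nodes that hold a local replica.

    block_locations: block id -> set of node names holding a replica
    free_slots:      node name -> number of mapper slots available
    Returns block id -> (node name, is_local).
    """
    slots = dict(free_slots)
    assignments = {}
    for block, holders in block_locations.items():
        local = [n for n in sorted(holders) if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No local slot left: the block must be moved over the network.
            remote = [n for n in sorted(slots) if slots[n] > 0]
            if not remote:
                raise RuntimeError("no free mapper slots")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignments[block] = (node, is_local)
    return assignments

plan = schedule(
    {"b1": {"n1", "n2"}, "b2": {"n1"}, "b3": {"n3"}, "b4": {"n1"}},
    {"n1": 2, "n2": 1, "n3": 1},
)
print(plan)
```

In this run only `b4` has to be processed away from its data, because `n1` has run out of slots; the other three blocks are read locally.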
What’s NoSQL?
• Databases that are non-relational (don’t let
name fool you, some actually use SQL)
• Four kinds:
– Key-Value Store
Schema-free
FYI: Azure Table Storage is an example
– Document Store
All data stored in JSON objects
– Wide-Column Store
Define column families, but not columns
– Graph database
Manage relationships between objects
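The four models can be contrasted by the shape of the data each one holds. The Python structures below are illustrative stand-ins only; the keys, field names, and the `WideColumnTable` class are hypothetical and do not correspond to any particular product's API.

```python
import json

# 1. Key-value store: opaque values looked up by key; no schema at all.
key_value = {"session:42": b"opaque-bytes"}

# 2. Document store: each record is a self-describing JSON object.
document = json.dumps({"_id": 42, "name": "Ada", "tags": ["bi", "hadoop"]})

# 3. Wide-column store: column families are declared up front,
#    but the columns inside a family can differ from row to row.
class WideColumnTable:
    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row key -> family -> column -> value

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

table = WideColumnTable(["profile"])
table.put("u1", "profile", "email", "ada@example.com")
table.put("u2", "profile", "twitter", "@grace")  # different column, same family

# 4. Graph database: the relationships themselves are the data.
edges = [("ada", "FOLLOWS", "grace"), ("grace", "FOLLOWS", "ada"), ("ada", "WROTE", "post:1")]

def neighbors(node, relation):
    return [dst for src, rel, dst in edges if src == node and rel == relation]

print(neighbors("ada", "FOLLOWS"))
```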
- 6. SQL Server Live! Orlando 2012
What’s HBase?
• A Wide-Column Store
• Modeled after Google BigTable
• Uses HDFS
– Therefore, Hadoop-compatible
• Hadoop often used with HBase
– But you can use either without the other
The Hadoop Stack
Log file integration
Machine Learning/Data Mining
RDBMS Import/Export
Query: HiveQL and Pig Latin
Database
MapReduce, HDFS
- 7. SQL Server Live! Orlando 2012
What’s Hive?
• Began as Hadoop sub-project
– Now top-level Apache project
• Provides a SQL-like (“HiveQL”)
abstraction over MapReduce
• Has its own HDFS table file format (and it’s
fully schema-bound)
• Can also work over HBase
• Acts as a bridge to many BI products
which expect tabular data
Hadoop Distributions
• Cloudera
• Hortonworks
– HCatalog: Hive/Pig/MR Interop
• MapR
– Network File System replaces HDFS
• IBM InfoSphere BigInsights
– HDFS<->DB2 integration
• And now Microsoft…
- 8. SQL Server Live! Orlando 2012
Microsoft HDInsight
• Developed with Hortonworks and
incorporates Hortonworks Data Platform
(HDP) for Windows
• Windows Azure HDInsight and Microsoft
HDInsight (for Windows Server)
– Single node preview runs on Windows client
• Includes ODBC Driver for Hive
– And Excel Add-In that uses it
• JavaScript MapReduce framework
• Contribute it all back to open source
Apache Project
Azure HDInsight Provisioning
• Give cluster a name
– Hostname will be name.cloudapp.net
• Create credentials
– Used for ODBC connections and RDP sessions
• Elect whether to use SQL Azure for Hive
metabase
• [Choose number of nodes and storage
size in cluster]
• Wait for cluster to provision
• Click link to go to portal
- 9. SQL Server Live! Orlando 2012
Submitting, Running and
Monitoring Jobs
• Upload a JAR
• Use Streaming
– Use other languages (i.e. other than Java) to write
MapReduce code
– Python is popular option
– Any executable works, even C# console apps
– On HDInsight, JavaScript works too
– Still uses a JAR file: streaming.jar
• Run at command line (passing JAR name
and params) or use GUI
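As a sketch of what a Streaming job in Python might look like: Streaming passes input lines to the mapper on stdin, expects tab-separated key/value lines on stdout, and feeds the reducer the mapper output sorted by key. The script name `wordcount.py`, the `map`/`reduce` argument convention, and the word-count task are assumptions for this example.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

# Run as "python wordcount.py map" or "python wordcount.py reduce".
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

At the command line, a job using this script would be submitted roughly as `hadoop jar streaming.jar -input <in> -output <out> -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" -file wordcount.py`, where the input and output paths are placeholders and the exact location of the streaming JAR depends on the installation.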
- 10. SQL Server Live! Orlando 2012
Amenities for
Visual Studio/.NET
[Diagram labels: MRLib (NuGet Package); MR code in C# (HadoopJob, MapperBase, ReducerBase); LINQ to Hive; OdbcClient + Hive ODBC Driver; Hortonworks Data Platform for Windows; Debugging; Deployment; Running MapReduce Jobs]
- 11. SQL Server Live! Orlando 2012
HDInsight Data Sources
• Files in HDFS
• Azure Blob Storage (Azure HDInsight only)
• Hive Tables
• HBase?
Review: ODBC Connection Types
• Registry-based
– User Data Source Name (DSN)
– System DSN
• File-based
– File DSN
• String-based
– DSN-less connection
• We need file-based
• Wizard obfuscates how to do this
• Don’t forget to open the ODBC port!
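The connection types differ mainly in where the driver settings live. The sketch below builds the connection string for each type using the standard ODBC keywords `DSN`, `FILEDSN`, and `DRIVER`, and writes out the INI-style text of a File DSN. The driver name, the `HOST`/`PORT` attribute names, the port value 10000, and the file path are assumptions for illustration; the real attribute names depend on the driver.

```python
def registry_dsn(name):
    """User or System DSN: settings are stored in the registry under this name."""
    return f"DSN={name}"

def file_dsn(path):
    """File DSN: settings are stored in a .dsn text file at this path."""
    return f"FILEDSN={path}"

def dsn_less(driver, **attrs):
    """DSN-less: every setting is spelled out in the connection string itself."""
    parts = [("DRIVER", "{" + driver + "}"), *attrs.items()]
    return ";".join(f"{key}={value}" for key, value in parts)

def file_dsn_text(driver, **attrs):
    """Contents of a File DSN: an [ODBC] section with one KEY=value per line."""
    lines = ["[ODBC]", f"DRIVER={driver}"]
    lines += [f"{key}={value}" for key, value in attrs.items()]
    return "\n".join(lines) + "\n"

# Hypothetical driver name and attributes, for illustration only.
DRIVER = "Example Hive ODBC Driver"
settings = {"HOST": "name.cloudapp.net", "PORT": 10000}

print(registry_dsn("HiveCluster"))
print(file_dsn(r"C:\dsn\hive.dsn"))
print(dsn_less(DRIVER, **settings))
print(file_dsn_text(DRIVER, **settings))
```

Writing the File DSN text by hand like this is one way around a wizard that hides the file-based option.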
- 12. SQL Server Live! Orlando 2012
Hive ODBC Setup,
Excel Add-In
ODBC Driver’s Untold Story
• Works with any Hive install/Hadoop
cluster, not just Windows-based ones.
• Simba driver available too
- 13. SQL Server Live! Orlando 2012
How Does SQL Server Fit In?
• RDBMS + PDW: Sqoop connectors
• RDBMS: Columnstore Indexes
– Enterprise Edition only
• Analysis Services: Tabular Mode
– Compatible with the Hive ODBC driver (Multidimensional mode is not)
• RDBMS + SSAS Tabular: DirectQuery
• PowerPivot (as with SSAS Tabular)
• Power View
– Works against PowerPivot and SSAS Tabular
Querying Hadoop from
SQL Server BI
- 14. SQL Server Live! Orlando 2012
The “Data-Refinery” Idea
• Use Hadoop to “on-board” unstructured
data, then extract manageable subsets
• Load the subsets into conventional DW/BI
servers and use familiar analytics tools to
examine them
• This is the current rationalization of
Hadoop + BI tools’ coexistence
• Will it stay this way?
Usability Impact
• PowerPivot makes analysis much easier,
self-service
• Power View is great for discovery and
visualization; also self-service
• Combine with the Hive ODBC driver and
suddenly Hadoop is accessible to
business users
• Caveats
– Someone has to write the HiveQL
– Can query Big Data, but the result set must be much smaller
- 15. SQL Server Live! Orlando 2012
Other Relevant MS Technologies
• SQL Server Components:
– SQL Server Parallel Data Warehouse
– StreamInsight
• Azure Components:
– Data Explorer
– DataMarket
• Deprecated MSR Project
– Dryad
Resources
• Big On Data blog
– http://www.zdnet.com/blog/big-data
• Apache Hadoop home page
– http://hadoop.apache.org/
• Hive & Pig home pages
– http://hive.apache.org/
– http://pig.apache.org/
• Hadoop on Azure home page
– https://www.hadooponazure.com/
• SQL Server 2012 Big Data
– http://bit.ly/sql2012bigdata
- 16. SQL Server Live! Orlando 2012
Thank you
• andrew.brust@bluebadgeinsights.com
• @andrewbrust on twitter
• Want to get the free “Redmond Roundup
Plus”?
– Text “bluebadge” to 22828