Big Data with SQL Server


Philly Code Camp
November 2012



Mark Kromer
BI & Big Data Technology Director
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
What we’ll (try to) cover today

‣ What is Big Data?
‣ The Big Data and Apache Hadoop environment
‣ Big Data Analytics
‣ SQL Server in the Big Data world
‣ How we utilize Big Data @ Razorfish




Big Data 101

‣ 3 V’s
   ‣ Volume – Terabytes of records, transactions, tables, files
   ‣ Velocity – Batch, near-real-time, real-time (analytics), streams
   ‣ Variety – Structured, unstructured, semi-structured, and any mix of the above
‣ Text Processing
   ‣ Techniques for processing and analyzing LARGE unstructured (and structured) files
‣ Analytics & Insights
‣ Distributed File System & Programming
‣   Batch Processing
‣   Commodity Hardware
‣   Data Locality, no shared storage
‣   Scales linearly
‣   Great for large text file processing, not so great on small files
‣   Distributed programming paradigm
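
A rough way to see why large files suit this model better than small ones: with default file-based input, each file is typically split per block, and every file gets at least one map task. The Python sketch below estimates task counts under that assumption. The 64 MB block size and the function name `estimated_map_tasks` are illustrative assumptions, not figures from the talk; real clusters configure block size and input formats differently.

```python
import math

# Assumed HDFS block size (64 MB); this is configurable per cluster.
BLOCK_SIZE = 64 * 1024 * 1024


def estimated_map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Estimate map tasks as one per block, with at least one per file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


one_large_file = [10 * 1024**3]          # a single 10 GB log file
many_small_files = [1024**2] * 10_240    # the same 10 GB as 1 MB files

print(estimated_map_tasks(one_large_file))    # 160
print(estimated_map_tasks(many_small_files))  # 10240
```

The same 10 GB of input needs 64 times as many tasks when stored as 1 MB files, and each task carries scheduling and startup overhead.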
MapReduce Framework (Map)
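using Microsoft.Hadoop.MapReduce;
using System.Text.RegularExpressions;

public class TotalHitsForPageMap : MapperBase
{
    // expected, pagePos and hit are members defined elsewhere in the sample; the slide does not show them.
    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        // Split the log line on runs of whitespace.
        var parts = Regex.Split(inputLine, @"\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        // Emit the page as the key and the hit value for the reducer to sum.
        context.EmitKeyValue(parts[pagePos], hit);
    }
}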
MapReduce Framework (Reduce & Job)
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum the hit counts emitted for this page.
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        // Input and output locations come from environment variables.
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}
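
To see what the mapper and reducer above compute without a cluster, the data flow can be imitated in a single process. This is a minimal Python sketch of map, group-by-key (the shuffle) and reduce, not the .NET SDK itself. The three-field log format and the constants `EXPECTED_FIELDS` and `PAGE_POS` are hypothetical stand-ins for the values the slide leaves out.

```python
import re
from collections import defaultdict

EXPECTED_FIELDS = 3  # hypothetical: fields per line in this toy log format
PAGE_POS = 1         # hypothetical: index of the page field


def map_line(line):
    """Map step: emit (page, 1) for well-formed lines, nothing otherwise."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED_FIELDS:
        return []
    return [(parts[PAGE_POS], 1)]


def reduce_hits(key, values):
    """Reduce step: sum the hit counts for one page."""
    return key, sum(values)


def run_job(lines):
    """Run map, group values by key, then reduce, all in one process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_hits(k, v) for k, v in groups.items())


log = [
    "2012-11-17 /index.html 200",
    "2012-11-17 /about.html 200",
    "2012-11-17 /index.html 304",
    "malformed line",
]
print(run_job(log))  # {'/index.html': 2, '/about.html': 1}
```

On a real cluster the grouping step is done by the framework across machines; only the map and reduce functions are written by the developer.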
Mark’s Big Data Myths

‣ Big Data ≠ NoSQL
    ‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!,
      Google, Facebook, et al.) but is not the same thing
    ‣ Facebook, for example, uses HBase from the Hadoop stack
‣ Big Data ≠ Real Time
    ‣ Big Data is primarily about batch processing huge files in a distributed manner
      and analyzing data that was otherwise too complex to provide value
    ‣ Use in-memory analytics for real time insights
‣ Big Data ≠ Data Warehouse
    ‣ I still refer to large multi-TB DWs as “VLDB”
    ‣ Big Data is about crunching stats in text files for discovery of new patterns and
      insights
    ‣ Use the DW to aggregate and store the summaries of those calculations for
      reporting
Razorfish & Big Data

‣   Web Analytics
‣   Big Data Analytics
‣   Digital Marketing – Ad Server Analytics
‣   Multiple TBs of online data per client per year
‣   Elastic Web-scale MapReduce & Hadoop
‣   Increase ROI of digital marketing campaigns
Big Data Analytics Web Platform
In-Database Analytics (Teradata Aster)
•   Because of built-in analytics functions and big data performance, Aster becomes
    the data scientist’s sandbox and BI’s big data analytics processor.




[Diagram label: Prepackaged Analytics Functions (including Attribution)]
SQL Server Big Data – Data Loading




[Diagram labels: Amazon S3 Bucket; Amazon HDFS & EMR; Data Loading]
Sqoop
Data transfer to & from Hadoop & SQL Server

‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password
  password --table customers -m 1

‣ > hadoop fs -cat /user/mark/customers/part-m-00000

‣ > 5,Bob Smith

‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password
  password -m 1 --table customers --export-dir /user/mark/data/employees3
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in
  32.6364 seconds (6.1588 bytes/sec)
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
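
When these transfers are scripted, it can help to assemble the Sqoop arguments in one place. The Python sketch below only builds the argument list and runs nothing. The helper name `sqoop_argv` is hypothetical, and it swaps the inline password from the slide for Sqoop's `-P` option, which prompts for the password at run time.

```python
import shlex


def sqoop_argv(tool, connect, username, table, mappers=1, export_dir=None):
    """Build the argument list for a Sqoop import or export (nothing is executed)."""
    if tool not in ("import", "export"):
        raise ValueError("tool must be 'import' or 'export'")
    if tool == "export" and export_dir is None:
        raise ValueError("export needs an HDFS directory to read from")
    argv = [
        "sqoop", tool,
        "--connect", connect,
        "--username", username,
        "-P",  # prompt for the password instead of passing it inline
        "--table", table,
        "-m", str(mappers),
    ]
    if tool == "export":
        argv += ["--export-dir", export_dir]
    return argv


# Print the import command from the slide in copy-and-paste form.
print(shlex.join(sqoop_argv("import", "jdbc:sqlserver://localhost",
                            "sqoop", "customers")))
```

The resulting list can be passed to `subprocess.run` on a machine where Sqoop and the SQL Server JDBC driver are installed.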
SQL Server Big Data Environment

‣ SQL Server Database
   ‣   SQL Server 2008 R2 or 2012 Enterprise Edition
   ‣   Page Compression
   ‣   2012 Columnar Compression on Fact Tables
   ‣   Clustered Index on all tables
   ‣   Auto-update statistics asynchronously
   ‣   Partition Fact Tables by month and archive data with sliding window technique
   ‣   Drop all indexes before nightly ETL load jobs
   ‣   Rebuild all indexes when ETL completes
‣ SQL Server Analysis Services
   ‣   SSAS 2008 R2 or 2012 Enterprise Edition
   ‣   2008 R2 OLAP cubes partition-aligned with DW
   ‣   2012 cubes are in-memory tabular models
   ‣   All access through MSMDPUMP or SharePoint
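
The monthly sliding window mentioned above comes down to date arithmetic: each month one new boundary is added at the front and the oldest month is retired to the archive. The Python sketch below computes those dates only. In SQL Server the matching steps are usually a partition function SPLIT RANGE, a partition SWITCH to an archive table, and a MERGE RANGE. The function names and the three-month retention in the example are illustrative assumptions, not settings from the talk.

```python
from datetime import date


def add_months(d, n):
    """Return the first day of the month n months after d (n may be negative)."""
    index = d.year * 12 + (d.month - 1) + n
    return date(index // 12, index % 12 + 1, 1)


def sliding_window(current_month, months_kept):
    """Return (boundary_to_add, month_to_retire, months_kept_online)."""
    boundary_to_add = add_months(current_month, 1)            # next month's partition
    oldest_kept = add_months(current_month, -(months_kept - 1))
    month_to_retire = add_months(oldest_kept, -1)             # switch out and archive
    kept = [add_months(oldest_kept, i) for i in range(months_kept)]
    return boundary_to_add, month_to_retire, kept


new_boundary, retired, kept = sliding_window(date(2012, 11, 1), months_kept=3)
print(new_boundary)  # 2012-12-01
print(retired)       # 2012-08-01
print(kept[0])       # 2012-09-01
```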
SQL Server Big Data Analytics Features

‣ Columnstore
‣ Sqoop adapter
‣ PolyBase
‣ Hive
‣ In-memory analytics
‣ Scale-out MPP
Wrap-up

‣ What is a Big Data approach to Analytics?
   ‣ Massive scale
   ‣ Data discovery & research
   ‣ Self-service
   ‣ Reporting & BI
‣ Why did we take this Big Data Analytics approach?
   ‣ Each Web client produces an average of 6 TB of ICA data per year
   ‣ The data in the sources is variable and unstructured
   ‣ SSIS ETL alone couldn’t keep up or handle the complexity
   ‣ SQL Server 2012 columnstore and tabular SSAS 2012 were key to using SQL
     Server for Big Data
   ‣ With the configs mentioned previously, SQL Server is working great
‣ Analytics on Big Data also requires Big Data Analytics tools
   ‣ Aster, Tableau, PowerPivot, SAS
