Windows Azure HDInsight Service


   Hadoop on Windows Azure


        NEIL MACKENZIE
Who Am I?

 Neil Mackenzie
 Windows Azure Architect @ Satory Global


 Windows Azure MVP
 Blog: http://convective.wordpress.com/
 Twitter: @mknz


 Book:
 Microsoft Windows Azure Development Cookbook
Goals and Agenda

 Goals
   Introduce Windows Azure HDInsight Service to the Windows
    Azure developer
   Introduce Windows Azure to the Hadoop user

   Not a tutorial on how to use Hadoop features

 Agenda
   Big Data

   Windows Azure

   Windows Azure HDInsight Service
Big Data

 Problem:
   How do we create value from enormous amounts of low-value
    data?


 Solution:
   Analyze it using a lot of commodity hardware.
Three Vs of Big Data

 Volume
   How much data is there?

 Variety
   What are the sources of the data?

 Velocity
   How fast is the data being generated?
MapReduce

 Distributed computational model for data analysis.
   Map function:
        Processes a key-value pair to generate a set of intermediate key-value pairs.
   Reduce function:
        Merges all intermediate values with the same intermediate key.


 Map and reduce functions are distributed across many
  compute nodes, each working on data stored locally.
 Raw MapReduce functions are typically written in Java.
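
As a rough illustration of the model above, the following single-process Python sketch runs a word count through map, shuffle, and reduce phases. The names map_fn, reduce_fn, and run_mapreduce are invented for this sketch and are not part of Hadoop; a real job would spread these phases across many compute nodes.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (line number, line text) pair and emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical input: (line number, line text) pairs.
lines = [(0, "big data on azure"), (1, "big clusters big data")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

The grouping step in run_mapreduce stands in for the shuffle that Hadoop performs between the map and reduce phases.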
Apache Hadoop

 Modules:
   Hadoop Distributed File System (HDFS)

   MapReduce

 Related projects:
   HBase – scalable, distributed database

   Hive – data warehouse infrastructure

   Mahout – scalable machine learning library

   Pig – high-level data-flow language

 Other:
   Sqoop – import from and export to relational databases
Windows Azure

 Compute
   PaaS: Cloud Services, Windows Azure Web Sites
   IaaS: Virtual Machines

 Storage
   Windows Azure Storage Service: blobs, tables, queues
   Windows Azure SQL Database
   IaaS: Microsoft SQL Server, MongoDB, Cassandra, etc.

 Connectivity
   HTTP, TCP, UDP, Site-to-Site VPN

 Administration
   Portal, Service Management API
Windows Azure HDInsight Service

 Components:
   HadoopCore – v1.0.1

   HDFS & ASV

   Pig – v0.9.3

   Hive – v0.8.1

   Sqoop – v1.4.2

   Excel/Hive



 Note: this was formerly known as Hadoop on Azure.
Hadoop Administration

 Portal
   http://www.hadooponazure.com

   Apply to join preview

   Create and manage Hadoop cluster
       3 nodes for 5 days
   Access the Interactive console
       Hive
         Invoke Hive statements

       JavaScript
         Invoke HDFS commands

         Invoke Hive & Pig statements
Distributed File Systems

 HDFS
   Contents deleted when cluster deleted

 ASV
   Azure Storage Vault

   Data stored in Windows Azure Blob Storage

   Configured on Hadoop on Azure portal

   Contents survive deletion of Hadoop cluster

   Supports multi-level structure, e.g.:
       containername/input/file1
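
Blob storage has a flat namespace, so the multi-level structure comes from "/" characters inside the blob name. A minimal Python sketch of how such a path breaks down, assuming the asv://container/path form used in the Pig example later in this deck (split_asv_path is a hypothetical helper, not part of any SDK):

```python
def split_asv_path(uri):
    """Split an asv:// URI into (container, blob path).

    Assumes the form asv://container/path; other ASV URI forms are not
    handled by this sketch.
    """
    prefix = "asv://"
    if not uri.startswith(prefix):
        raise ValueError("not an asv:// URI: " + uri)
    container, _, blob_path = uri[len(prefix):].partition("/")
    return container, blob_path


container, blob_path = split_asv_path("asv://flightdata/input/flightdata.txt")
print(container)
print(blob_path)
```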
Pig

 Hadoop feature to perform data-flow operations:
   Execution environment

   Language: Pig Latin

 Execution Environment
   Local (in a single JVM) or distributed (on a Hadoop cluster)

 Pig Latin
   High-level language

   Describes data-flow operations

   Automatically invokes MapReduce jobs

   Much simpler than using MapReduce directly
Pig Example


records = LOAD 'asv://flightdata/input/flightdata.txt'
    AS (year:int, month:int, day:int, carrier:chararray,
        origin:chararray, dest:chararray, depdelay:int, arrdelay:int);

modified_records = FOREACH records
    GENERATE origin, depdelay;

STORE modified_records
    INTO 'my_output' USING PigStorage(',');
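
For readers new to Pig, the same data flow can be sketched in plain Python on a couple of in-memory rows. The sample rows are made up for illustration, and the functions load, project, and store are hypothetical stand-ins for the three Pig statements; on a cluster Pig would compile the script into MapReduce jobs instead.

```python
import csv
import io

# Field names follow the schema declared in the LOAD statement.
FIELDS = ["year", "month", "day", "carrier", "origin", "dest", "depdelay", "arrdelay"]

# Made-up sample rows, tab-delimited (the default for Pig's LOAD).
SAMPLE = "2012\t4\t1\tAA\tSEA\tSFO\t12\t5\n2012\t4\t1\tUA\tSFO\tLAX\t-3\t0\n"


def load(text):
    """LOAD ... AS (...): parse each line into a record keyed by field name."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(FIELDS, row)) for row in reader]


def project(records):
    """FOREACH ... GENERATE origin, depdelay: keep two fields per record."""
    return [(r["origin"], int(r["depdelay"])) for r in records]


def store(rows):
    """STORE ... USING PigStorage(','): write comma-delimited lines."""
    return "".join("{},{}\n".format(origin, delay) for origin, delay in rows)


output = store(project(load(SAMPLE)))
print(output, end="")
```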
Hive

 Hadoop feature to perform data warehouse
  operations
 HiveQL
     High-level, SQL-like language
     Supports equi-joins
     Schema on read NOT schema on write
     Automatically invokes MapReduce jobs
     Much simpler than using MapReduce directly
 Metadata store
   Contains descriptions of tables
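
Schema on read means the stored data is left untouched and the table description is applied only when a query reads it. A minimal Python sketch of the idea, using made-up rows and a hypothetical read_with_schema helper (this is not Hive's implementation):

```python
# Raw lines are stored untouched; nothing is validated when they are written.
raw_lines = ["SEA,12", "SFO,-3", "LAX,not_a_number"]

# The schema is only a description that is applied when the data is read.
schema = [("origin", str), ("depdelay", int)]


def read_with_schema(line, schema):
    """Apply the schema to one raw line; unparseable fields become None."""
    record = {}
    for (name, cast), text in zip(schema, line.split(",")):
        try:
            record[name] = cast(text)
        except ValueError:
            record[name] = None
    return record


rows = [read_with_schema(line, schema) for line in raw_lines]
print(rows)
```

A field that does not fit the declared type surfaces as a missing value at read time rather than as a rejected write, which is the trade-off the slide contrasts with schema on write.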
Hive Example

FROM flightdata_asv

INSERT OVERWRITE TABLE origin_counts
SELECT origin, COUNT(*)
GROUP BY origin

INSERT OVERWRITE TABLE dest_counts
SELECT dest, COUNT(*)
GROUP BY dest
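
The multi-insert form above lets one read of the source table feed both aggregations. A small Python sketch of the same idea, using made-up (origin, dest) pairs in place of the flightdata_asv rows:

```python
from collections import Counter

# Made-up (origin, dest) pairs standing in for rows of flightdata_asv.
flights = [("SEA", "SFO"), ("SFO", "LAX"), ("SEA", "LAX")]

origin_counts = Counter()
dest_counts = Counter()

# One pass over the source feeds both aggregations, mirroring the
# FROM ... INSERT ... INSERT ... structure of the HiveQL statement.
for origin, dest in flights:
    origin_counts[origin] += 1
    dest_counts[dest] += 1

print(dict(origin_counts))
print(dict(dest_counts))
```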
Sqoop

 Feature allowing import from and export to SQL
 databases
    Uses JDBC connector
    Works with Windows Azure SQL Database
    Table must exist before export
Sqoop Example

 Exporting a table:
sqoop.cmd export
--connect "jdbc:sqlserver://sql_database_server.database.windows.net:1433;database=sql_database_instance;user=sqoop_login@sql_database_server;password=sqoop_login_password"
--table sql_database_table
--export-dir "/user/hive/warehouse/hive_table"
--input-fields-terminated-by "\001"
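
Because the connection string is long and easy to mistype, it can help to assemble the command from its parts. The Python sketch below only builds the argument list and does not run Sqoop; build_sqoop_export_args is a hypothetical helper, the values are the placeholders from the slide, and the delimiter is assumed to be Hive's default field separator, \001 (Ctrl-A).

```python
def build_sqoop_export_args(server, database, login, password, table, export_dir):
    """Assemble the arguments for a Sqoop export like the one on the slide."""
    connect = (
        "jdbc:sqlserver://{server}.database.windows.net:1433;"
        "database={database};user={login}@{server};password={password}"
    ).format(server=server, database=database, login=login, password=password)
    return [
        "sqoop.cmd", "export",
        "--connect", connect,
        "--table", table,
        "--export-dir", export_dir,
        # Assumption: the Hive table uses the default \001 field delimiter.
        "--input-fields-terminated-by", "\\001",
    ]


args = build_sqoop_export_args(
    server="sql_database_server",
    database="sql_database_instance",
    login="sqoop_login",
    password="sqoop_login_password",
    table="sql_database_table",
    export_dir="/user/hive/warehouse/hive_table",
)
print(" ".join(args))
```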
Excel and Hadoop on Azure

 Example of Microsoft business intelligence strategy
   Expose Hadoop to existing tools

 HiveODBC connector for Excel
   Create Hive queries from Excel

   Invoke them from Excel
More Information

 Sign up for preview:
  http://www.hadooponazure.com
 Support:
  http://social.msdn.microsoft.com/Forums/en-US/hdinsight
 Avkash Chauhan’s blog:
  http://blogs.msdn.com/b/avkashchauhan/archive/tags/hadoop
 Roger Jennings’ blog:
  http://oakleafblog.blogspot.com/2012/04/using-data-in-windows-azure-blobs-with.html
Summary

 Hadoop:
   De facto solution to the Big Data problem

 Windows Azure HDInsight Service
   Native Hadoop implementation

   Managed Hadoop service for Windows Azure

   Currently in preview