Henk van der Valk
Technical Sales Professional
Jan Pieter Posthuma
Microsoft BI Consultant
7/11/2013

Hadoop
JOIN THE PASS COMMUNITY
Become a PASS member for free and join the world's biggest SQL Server Community.
• Access to online training content
• Join Local Chapters
• Personalize your PASS website experience
• Access to events at discounted rates
• Join Virtual Chapters

Agenda
• Introduction
• Hadoop
• HDFS
• Data access to HDFS
• Map/Reduce
• Hive
• Data access from HDFS
• SQL PDW PolyBase
• Wrap up

Introduction Henk
• 10 years of Unisys-EMEA Performance Center
• 2002 – Largest SQL DWH in the world (SQL 2000)
• Project Real (SQL 2005)
• ETL WR – loading 1TB within 30 mins (SQL 2008)
• Contributed to various SQL whitepapers

• Schuberg Philis – 100% uptime for mission critical applications
• Since April 1st, 2011 – Microsoft SQL PDW – Western Europe
• SQLPass speaker & volunteer since 2005

Introduction
[Diagram: Microsoft big data architecture. Labels recovered from the slide:
Source Systems: ERP, CRM, LOB, APPS
Big Data Sources (Raw, Unstructured): Sensors, Devices, Bots, Crawlers
Processing: Data & Compute Intensive Application, HDInsight on Windows Azure, HDInsight on Windows Server, SQL Server StreamInsight (Alerts, Notifications), Enterprise ETL with SSIS, DQS, MDS
Storage: SQL Server Parallel Data Warehouse, SQL Server FTDW Data Marts, Historical Data (Beyond Active Window), Azure Market Place
Flows: Fast Load, Summarize & Load, Integrate/Enrich
Business Insights: SQL Server Reporting Services, SQL Server Analysis Server, Interactive Reports, Performance Scorecards]

Introduction Jan Pieter Posthuma
• Technical Lead Microsoft BI and Big Data consultant
• Inter Access, local consultancy firm in the Netherlands
• Architect role at multiple projects
• Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
jan.pieter.posthuma@interaccess.nl

Hadoop
• Hadoop is a collection of software for building a data-intensive distributed cluster that runs on commodity hardware.
• Original idea by Google (2003).
• Widely accepted by database vendors as a solution for unstructured data
• Microsoft partners with Hortonworks and delivers the Hortonworks Data Platform as Microsoft HDInsight
• Available as an Azure service and on-premises
• Hortonworks Data Platform (HDP) is 100% open source!

Hadoop
• HDFS – distributed, fault-tolerant file system
• MapReduce – framework for writing/executing distributed, fault-tolerant algorithms
• Hive & Pig – SQL-like declarative languages
• Sqoop/PolyBase – package for moving data between HDFS and relational DB systems
• + Others…
• Hadoop 2.0
[Diagram: Hadoop stack. Labels: HDFS, HBase, Map/Reduce, Hive & Pig, Sqoop / PolyBase, Avro (Serialization), Zookeeper; on the consuming side: BI, ETL, RDBMS, Reporting Tools]

HDFS
[Figure: a large file of 6440MB, shown as a stream of bits and color-coded into blocks. With a block size of e.g. 64MB, Block 1 to Block 100 are 64MB each and Block 101 holds the remaining 40MB.]

Files are composed of a set of blocks
• Typically 64MB in size
• Each block is stored as a separate file in the local file system (e.g. NTFS)
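The block arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the calculation; the function name `split_into_blocks` is made up for illustration and is not part of any Hadoop API.

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int = 64) -> list[int]:
    """Return the size in MB of each block needed to hold a file."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds what is left; it is not padded to full size.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(6440)
print(len(blocks), blocks[0], blocks[-1])
```

For the 6440MB file from the slide this gives 101 blocks, the last one holding 40MB.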

HDFS
HDFS was designed with the expectation that failures (both hardware and software) would occur frequently

Hadoop 2.0 is more decentralized
• Interaction between DataNodes
• Less dependent on primary NameNode

[Diagram: labels are NameNode, BackupNode (namespace backups), five DataNodes (nodes write to local disk), and the annotation "(heartbeat, balancing, replication, etc.)"]

Data access to HDFS
• FTP – Upload your data files
• Streaming – Via Avro (RPC) or Flume
• Hadoop command – hadoop fs -copyFromLocal
• Windows Azure BLOB storage – HDInsight Service (Azure) uses BLOB storage instead of local VM storage. Data can be uploaded without a provisioned Hadoop cluster
• PolyBase – Feature of PDW 2012. Direct read/write data access to the DataNodes.
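As an illustration of the "Hadoop command" option, the sketch below wraps `hadoop fs -copyFromLocal` in Python. The helper names and both paths are hypothetical, and the upload only runs when a `hadoop` client is actually on the PATH.

```python
import shutil
import subprocess


def build_copy_from_local(local_path: str, hdfs_path: str) -> list[str]:
    """Build the argument list for `hadoop fs -copyFromLocal`."""
    return ["hadoop", "fs", "-copyFromLocal", local_path, hdfs_path]


def upload(local_path: str, hdfs_path: str) -> bool:
    """Run the upload; return False if no hadoop client is installed or it fails."""
    if shutil.which("hadoop") is None:
        return False
    result = subprocess.run(build_copy_from_local(local_path, hdfs_path), check=False)
    return result.returncode == 0


# Hypothetical paths, for illustration only.
command = build_copy_from_local("weblogs/2013-11-07.log", "/data/weblogs/")
print(" ".join(command))
```

Building the argument list separately from running it keeps the command easy to inspect before anything is sent to the cluster.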

Data access
Hadoop command Demo

Map/Reduce
• MR: all functions in a batch-oriented architecture
  • Map: Applies the logic to the data, e.g. page hits count.
  • Reduce: Reduces (aggregates) the results of the Mappers to one.
• YARN: splits the JobTracker's work between a ResourceManager (cluster resources) and a per-application ApplicationMaster (job scheduling and monitoring); a NodeManager runs on each node. MR in Hadoop 2.0 runs on YARN instead of the JobTracker.

Map/Reduce
Total page hits
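The page hits example can be imitated in plain Python to show what the Map and Reduce steps each do. This is a single-process sketch, not Hadoop code: the log format `<client> <url>` and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_hit(log_line):
    """Map: emit (url, 1) for one log line of the assumed form '<client> <url>'."""
    fields = log_line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def shuffle(pairs):
    """Group mapper output by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(url, counts):
    """Reduce: aggregate all counts for one URL into a single total."""
    return url, sum(counts)


log_lines = [
    "10.0.0.1 /home",
    "10.0.0.2 /home",
    "10.0.0.1 /products",
    "10.0.0.3 /home",
]
mapped = [pair for line in log_lines for pair in map_hit(line)]
page_hits = dict(reduce_hits(url, counts) for url, counts in shuffle(mapped).items())
print(page_hits)
```

On a real cluster the mappers run in parallel on the nodes that hold the blocks, and the grouping step moves data between nodes; here everything happens in one list.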

Hive
• Built for easy data retrieval
• Uses Map/Reduce
• Created by Facebook
• HiveQL: SQL-like language
• Stores data in tables, which are stored as HDFS file(s)
• Only initial INSERT supported, no UPDATE or DELETE
• External tables possible on existing (CSV) file(s)
• Extra language options to use benefits of Hadoop
• Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12). Improve Hive 100x

Hive
Star schema join – (Based on TPC-DS Query 27)
SELECT
  col5, avg(col6)
FROM
  store_sales_fact ssf                                          -- 41 GB
  join item_dim on (ssf.col1 = item_dim.col1)                   -- 58 MB
  join date_dim on (ssf.col2 = date_dim.col2)                   -- 11 MB
  join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3)   -- 80 MB
  join store_dim on (ssf.col4 = store_dim.col4)                 -- 106 KB
GROUP BY col5
ORDER BY col5
LIMIT 100;

Cluster: 6 Nodes (2 Name, 4 Compute – dual core, 14GB)

Hive
File Type                                  # MR jobs   Input Size   # Mappers   Time
Text / Hive 0.10                           5           43.1 GB      179         21:00 min
Text / Hive 0.11                           1           38.0 GB      151         4:06 min
RC / Hive 0.11                             1           8.21 GB      76          2:16 min
ORC / Hive 0.11                            1           2.83 GB      38          1:44 min
RC / Hive 0.11 / Partitioned / Bucketed    1           1.73 GB      19          1:44 min
ORC / Hive 0.11 / Partitioned / Bucketed   1           687 MB       27          1:19 min

Data: ~64x less data
Time: ~16x faster
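The two summary figures follow from the first and last rows of the table. The short calculation below reproduces them, assuming 1 GB = 1024 MB (the slide does not say which convention was used).

```python
MB_PER_GB = 1024  # assumption; the slide does not state the convention

baseline_input_mb = 43.1 * MB_PER_GB   # Text / Hive 0.10
best_input_mb = 687                    # ORC / Hive 0.11 / Partitioned / Bucketed

baseline_seconds = 21 * 60             # 21:00 min
best_seconds = 1 * 60 + 19             # 1:19 min

data_ratio = baseline_input_mb / best_input_mb
time_ratio = baseline_seconds / best_seconds

print(f"data: ~{data_ratio:.0f}x less, time: ~{time_ratio:.0f}x faster")
```

Both ratios compare the slowest configuration (text files on Hive 0.10) with the fastest one (partitioned and bucketed ORC on Hive 0.11).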
Data access from Hadoop
• Excel
• FTP
• Hadoop command – hadoop fs -copyToLocal
• ODBC[1] – Via Hive (HiveQL) data can be extracted.
• Power Query[2] – Is capable of extracting data directly from HDFS or Azure BLOB storage
• PolyBase – Feature of PDW 2012. Direct read/write data access to the DataNodes.

[1] http://www.microsoft.com/en-us/download/details.aspx?id=40886
[2] Power BI Excel add-in – http://www.powerbi.com

Data access
Excel 2013 Demo

PDW – PolyBase
[Diagram: labels are several SQL Server nodes ("This is PDW!"), Sqoop, and a Hadoop Cluster of twelve DataNodes (DN)]

PDW – External Tables
• An external table is PDW's representation of data residing in HDFS
• The "table" (metadata) lives in the context of a SQL Server database
• The actual table data resides in HDFS
• No support for DML operations
• No concurrency control or isolation level guarantees
CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ])
{WITH (LOCATION = '<URI>', [FORMAT_OPTIONS = (<VALUES>)])}
[;]

LOCATION: required to indicate location of Hadoop cluster
FORMAT_OPTIONS: optional format options associated with parsing of data from HDFS (e.g. field delimiters & reject-related thresholds)

PDW – Hadoop use cases & examples

[1] Retrieve data from HDFS with a PDW query
• Seamlessly join structured and semi-structured data
SELECT Username FROM ClickStream c, User u WHERE c.UserID = u.ID
AND c.URL = 'www.bing.com';

[2] Import data from HDFS to PDW
• Parallelized CREATE TABLE AS SELECT (CTAS)
• External tables as the source
• PDW table, either replicated or distributed, as destination
CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL))
AS SELECT URL, EventDate, UserID FROM ClickStream;

[3] Export data from PDW to HDFS
• Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
• External table as the destination; creates a set of HDFS files
CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS = (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;

SQL Server 2012 PDW
PolyBase demo

Wrap up
Hadoop 'just another data source' @ your fingertips!
Batch processing large datasets before loading into your DWH
Offloading DWH data, but still accessible for analysis/reporting

Integrate Hadoop via Sqoop, ODBC (Hive) or PolyBase
Near future: deeper integration between Hadoop and SQL PDW
Try Hadoop / HDInsight yourself:
Azure: http://www.windowsazure.com/en-us/pricing/free-trial/
Web PI: http://www.microsoft.com/web/downloads/platform.aspx

Q&A

References
Microsoft Big Data
http://www.microsoft.com/bigdata
Windows Azure HDInsight Service (3 months free trial)
http://www.windowsazure.com/en-us/services/hdinsight/
SQL Server Parallel Data Warehouse (PDW) Landing Page
http://www.microsoft.com/PDW
http://www.upgradetopdw.com
Introduction to PolyBase
http://www.microsoft.com/en-us/sqlserver/solutionstechnologies/data-warehousing/polybase.aspx

Thanks!

 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

SQLRally Amsterdam 2013 - Hadoop

  • 1. Henk van der Valk Technical Sales Professional Jan Pieter Posthuma Microsoft BI Consultant 7/11/2013 Hadoop
  • 2. JOIN THE PASS COMMUNITY: Become a PASS member for free and join the world's biggest SQL Server Community. • Access to online training content • Join Local Chapters • Personalize your PASS website experience • Access to events at discounted rates • Join Virtual Chapters
  • 3. Agenda • Introduction • Hadoop • HDFS • Data access to HDFS • Map/Reduce • Hive • Data access from HDFS • SQL PDW PolyBase • Wrap up
  • 4. Introduction Henk • 10 years of Unisys-EMEA Performance Center • 2002 – Largest SQL DWH in the world (SQL 2000) • Project Real – (SQL 2005) • ETL WR – loading 1TB within 30 mins (SQL 2008) • Contributed to various SQL whitepapers • Schuberg Philis – 100% uptime for mission critical applications • Since April 1st, 2011 – Microsoft SQL PDW – Western Europe • SQLPass speaker & volunteer since 2005
  • 5. Introduction [architecture diagram; labels: Big Data Sources (Raw, Unstructured): Sensors, Devices, Bots, Crawlers; Source Systems: ERP, CRM, LOB, APPS; Data & Compute Intensive Application; SQL Server StreamInsight; HDInsight on Windows Azure; HDInsight on Windows Server; Enterprise ETL with SSIS, DQS, MDS; Azure Market Place; SQL Server Parallel Data Warehouse; SQL Server FTDW Data Marts; SQL Server Analysis Server; SQL Server Reporting Services; Historical Data (Beyond Active Window); Fast Load; Summarize & Load; Integrate/Enrich; Alerts, Notifications; Interactive Reports; Performance Scorecards; Business Insights]
  • 6. Introduction Jan Pieter Posthuma • Technical Lead Microsoft BI and Big Data consultant • Inter Access, local consultancy firm in the Netherlands • Architect role at multiple projects • Analysis Service, Reporting Service, PerformancePoint Service, Big Data, HDInsight, Cloud BI • http://twitter.com/jppp • http://linkedin.com/jpposthuma • jan.pieter.posthuma@interaccess.nl
  • 7. Hadoop • Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware. • Original idea by Google (2003). • Widely accepted by database vendors as a solution for unstructured data • Microsoft partners with HortonWorks and delivers their Hadoop Data Platform as Microsoft HDInsight • Available as an Azure service and on premises • HortonWorks Data Platform (HDP) is 100% Open Source!
  • 8. Hadoop • HDFS – distributed, fault tolerant file system • MapReduce – framework for writing/executing distributed, fault tolerant algorithms • Hive & Pig – SQL-like declarative languages • Sqoop/PolyBase – package for moving data between HDFS and relational DB systems • + Others… • Hadoop 2.0 [diagram labels: Map/Reduce, HBase, HDFS, Hive & Pig, Sqoop / PolyBase, Avro (Serialization), Zookeeper, BI Reporting Tools, ETL, RDBMS]
  • 9. HDFS • Files are composed of a set of blocks • Typically 64MB in size • Each block is stored as a separate file in the local file system (e.g. NTFS) [diagram: a large file of 6440MB, color-coded into blocks with Block Size = 64MB: Block 1 through Block 100 are 64MB each and Block 101 is 40MB]
  • 10. HDFS • HDFS was designed with the expectation that failures (both hardware and software) would occur frequently • Hadoop 2.0 is more decentralized • Interaction between DataNodes • Less dependent on primary NameNode (heartbeat, balancing, replication, etc.) [diagram labels: NameNode; BackupNode (namespace backups); DataNodes (nodes write to local disk)]
  • 11. Data access to HDFS • FTP – Upload your data files • Streaming – Via AVRO (RPC) or Flume • Hadoop command – hadoop fs -copyFromLocal • Windows Azure BLOB storage – HDInsight Service (Azure) uses BLOB storage instead of local VM storage. Data can be uploaded without a provisioned Hadoop cluster • PolyBase – Feature of PDW 2012. Direct read/write data access to the datanodes.
  • 14. Map/Reduce • MR: all functions in a batch-oriented architecture • Map: applies the logic to the data, e.g. counting page hits. • Reduce: reduces (aggregates) the results of the Mappers to one result. • YARN: splits the JobTracker's responsibilities into a ResourceManager and per-application ApplicationMasters, with a NodeManager on each node. MR in Hadoop 2.0 runs on YARN instead of the JobTracker
  • 16. Hive • Built for easy data retrieval • Uses Map/Reduce • Created by Facebook • HiveQL: SQL-like language • Stores data in tables, which are stored as HDFS file(s) • Only initial INSERT supported, no UPDATE or DELETE • External tables possible on existing (CSV) file(s) • Extra language options to use benefits of Hadoop • Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12). Goal: improve Hive performance 100x
  • 17. Hive Star schema join – (Based on TPC-DS Query 27) SELECT col5, avg(col6) FROM store_sales_fact ssf join item_dim on (ssf.col1 = item_dim.col1) join date_dim on (ssf.col2 = date_dim.col2) join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3) join store_dim on (ssf.col4 = store_dim.col4) GROUP BY col5 ORDER BY col5 LIMIT 100; Table sizes: store_sales_fact 41 GB, item_dim 58 MB, date_dim 11 MB, custdmgrphcs_dim 80 MB, store_dim 106 KB. Cluster: 6 Nodes (2 Name, 4 Compute – dual core, 14GB)
  • 18. Hive
      File Type | # MR jobs | Input Size | # Mappers | Time
      Text / Hive 0.10 | 5 | 43.1 GB | 179 | 21:00 min
      Text / Hive 0.11 | 1 | 38.0 GB | 151 | 4:06 min
      RC / Hive 0.11 | 1 | 8.21 GB | 76 | 2:16 min
      ORC / Hive 0.11 | 1 | 2.83 GB | 38 | 1:44 min
      RC / Hive 0.11 / Partitioned / Bucketed | 1 | 1.73 GB | 19 | 1:44 min
      ORC / Hive 0.11 / Partitioned / Bucketed | 1 | 687 MB | 27 | 1:19 min
      Data: ~64x less data; Time: ~16x faster
  • 19. Data access from Hadoop • Excel • FTP • Hadoop command – hadoop fs -copyToLocal • ODBC[1] – Via Hive (HiveQL) data can be extracted. • Power Query – Is capable of extracting data directly from HDFS or Azure BLOB storage • PolyBase – Feature of PDW 2012. Direct read/write data access to the datanodes. [1] http://www.microsoft.com/en-us/download/details.aspx?id=40886 [2] Power BI Excel add-in – http://www.powerbi.com
  • 22. PDW – Polybase [diagram labels: SQL Server nodes ("This is PDW!"); Sqoop; DataNodes (DN); Hadoop Cluster]
  • 23. PDW – External Tables • An external table is PDW's representation of data residing in HDFS • The "table" (metadata) lives in the context of a SQL Server database • The actual table data resides in HDFS • No support for DML operations • No concurrency control or isolation level guarantees. Syntax: CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ]) {WITH (LOCATION = '<URI>', [FORMAT_OPTIONS = (<VALUES>)])} [;] LOCATION is required and indicates the location of the Hadoop cluster. FORMAT_OPTIONS is optional and holds format options associated with parsing of data from HDFS (e.g. field delimiters & reject-related thresholds)
  • 24. PDW – Hadoop use cases & examples [1] Retrieve data from HDFS with a PDW query • Seamlessly join structured and semi-structured data: SELECT Username FROM ClickStream c, User u WHERE c.UserID = u.ID AND c.URL = 'www.bing.com'; [2] Import data from HDFS to PDW • Parallelized CREATE TABLE AS SELECT (CTAS) • External tables as the source • PDW table, either replicated or distributed, as destination: CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL)) AS SELECT URL, EventDate, UserID FROM ClickStream; [3] Export data from PDW to HDFS • Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS) • External table as the destination; creates a set of HDFS files: CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID) WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...)) AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
  • 26. Wrap up • Hadoop: 'just another data source' @ your fingertips! • Batch processing large datasets before loading into your DWH • Offloading DWH data, but still accessible for analysis/reporting • Integrate Hadoop via SQOOP, ODBC (Hive) or PolyBase • Near future: deeper integration between Hadoop and SQL PDW • Try Hadoop / HDInsight yourself: Azure: http://www.windowsazure.com/en-us/pricing/free-trial/ Web PI: http://www.microsoft.com/web/downloads/platform.aspx
  • 28. References • Microsoft Big Data: http://www.microsoft.com/bigdata • Windows Azure HDInsight Service (3 months free trial): http://www.windowsazure.com/en-us/services/hdinsight/ • SQL Server Parallel Data Warehouse (PDW) Landing Page: http://www.microsoft.com/PDW and http://www.upgradetopdw.com • Introduction to Polybase: http://www.microsoft.com/en-us/sqlserver/solutionstechnologies/data-warehousing/polybase.aspx
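
The block arithmetic on the HDFS slide (a 6440MB file stored with a 64MB block size) can be checked with a small sketch. This is plain Python, not a Hadoop API; the function name `split_into_blocks` is hypothetical and only illustrates how a file size maps onto block sizes.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the list of block sizes (in MB) for a file of the given size.

    Illustrative only: real HDFS splits on bytes and stores each block
    as a separate file on the DataNode's local file system.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative / positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds what is left of the file.
        blocks.append(remainder)
    return blocks


blocks = split_into_blocks(6440)
print(len(blocks), blocks[0], blocks[-1])  # 101 64 40
```

This matches the slide: 100 full blocks of 64MB plus a final block of 40MB.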

Editor's Notes

  1. DEMO: Upload a local file with hadoop fs -copyFromLocal
  2. - Hadoop command - CloudXplorer
  3. DEMO: Total hit count W3C logs
  4. - TotalHits MR job
  5. 'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest are HiveQL.'
  6. Hive <0.11: stores data in plain text files; no join optimization; a typical DWH query (star schema join) results in 6 MR jobs. Hive 0.11: introduces (O)RC files, loosely based on column store indexes; join optimization; a typical DWH query results in 1 MR job. Hive 0.12: uses YARN and Tez, optimized for DWH queries and with less overhead than MR.
  7. DEMO: Retrieving data via ODBC and Power Query in Excel
  8. Via Excel Data Explorer (Azure BLOB storage)
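
The "Total hit count W3C logs" demo itself is not part of the transcript. As a rough illustration of the Map and Reduce steps described on the Map/Reduce slide, the following plain-Python sketch counts hits per URL. It does not use Hadoop; the sample log lines, the position of the URL field, and the function names `map_line` and `reduce_counts` are all assumptions made for illustration.

```python
from collections import defaultdict


def map_line(line, url_field=4):
    """Map step: emit (url, 1) for one log line; skip directives and short lines."""
    if line.startswith("#"):
        return []
    fields = line.split()
    if len(fields) <= url_field:
        return []
    return [(fields[url_field], 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Hypothetical W3C-style lines: date time c-ip cs-method cs-uri-stem sc-status
sample_log = [
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2013-11-07 10:00:01 10.0.0.1 GET /index.html 200",
    "2013-11-07 10:00:02 10.0.0.2 GET /about.html 200",
    "2013-11-07 10:00:03 10.0.0.1 GET /index.html 304",
]

pairs = [pair for line in sample_log for pair in map_line(line)]
hits = reduce_counts(pairs)
print(hits)  # {'/index.html': 2, '/about.html': 1}
```

In a real cluster the map calls run in parallel on the nodes holding the HDFS blocks, and the framework groups the emitted pairs by key before the reduce step; here both steps run in a single process.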