SQLRally Amsterdam 2013 - Hadoop
1. Henk van der Valk
Technical Sales Professional
Jan Pieter Posthuma
Microsoft BI Consultant
7/11/2013
Hadoop
2. Join the PASS Community
Become a PASS member for free and join the world's biggest SQL Server community.
• Access to online training content
• Join Local Chapters
• Join Virtual Chapters
• Personalize your PASS website experience
• Access to events at discounted rates
4. Introduction Henk
• 10 years of Unisys-EMEA Performance Center
• 2002 – Largest SQL DWH in the world (SQL 2000)
• Project REAL (SQL 2005)
• ETL World Record (WR) – loading 1 TB within 30 mins (SQL 2008)
• Contributed to various SQL whitepapers
• Schuberg Philis – 100% uptime for mission critical applications
• Since April 1st, 2011 – Microsoft SQL PDW, Western Europe
• SQLPass speaker & volunteer since 2005
5. Introduction
[Architecture diagram] Big data sources (raw, unstructured) – sensors, devices, bots, crawlers – are loaded into HDInsight on Windows Azure or HDInsight on Windows Server for the data- and compute-intensive application. Results are summarized & loaded into SQL Server Parallel Data Warehouse (holding historical data beyond the active window) and SQL Server FTDW data marts. Source systems (ERP, CRM, LOB apps) and the Azure Marketplace are integrated/enriched via enterprise ETL with SSIS, DQS and MDS. Business insights are delivered through SQL Server StreamInsight (alerts, notifications), SQL Server Reporting Services (interactive reports) and SQL Server Analysis Services (performance scorecards).
6. Introduction Jan Pieter Posthuma
• Technical Lead Microsoft BI and Big Data consultant
• Inter Access, local consultancy firm in the Netherlands
• Architect role at multiple projects
• Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
jan.pieter.posthuma@interaccess.nl
7. Hadoop
• Hadoop is a collection of software for building a data-intensive, distributed cluster running on commodity hardware
• Original idea by Google (2003)
• Widely accepted by database vendors as a solution for unstructured data
• Microsoft partners with Hortonworks and delivers their Hadoop Data Platform as Microsoft HDInsight
• Available as an Azure service and on-premises
• Hortonworks Data Platform (HDP) is 100% open source!
8. Hadoop
[Stack diagram, Hadoop 2.0: HDFS at the base; Map/Reduce, HBase, Hive & Pig and Sqoop/PolyBase layered on top; Avro (serialization) and ZooKeeper alongside; BI, ETL, RDBMS and reporting tools connecting from outside]
• HDFS – distributed, fault-tolerant file system
• MapReduce – framework for writing/executing distributed, fault-tolerant algorithms
• Hive & Pig – SQL-like declarative languages
• Sqoop/PolyBase – packages for moving data between HDFS and relational DB systems
• + Others…
10. HDFS
[Diagram: one NameNode, a BackupNode (namespace backups) and multiple DataNodes writing to local disk]
HDFS was designed with the expectation that failures (both hardware and software) would occur frequently.
Hadoop 2.0 is more decentralized:
• Interaction between DataNodes
• Less dependent on the primary NameNode (heartbeat, balancing, replication, etc.)
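The fault-tolerance idea can be sketched in a few lines of Python. This is a toy model, not the real Hadoop API: every class and method name here is hypothetical. A toy NameNode tracks which DataNodes hold each block and re-replicates a block when a node's heartbeat stops.

```python
import itertools

class ToyNameNode:
    """Toy model of HDFS block replication (hypothetical, not the Hadoop API)."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = set(datanodes)
        self.replication = replication
        self.blocks = {}  # block id -> set of datanodes holding a replica

    def write(self, block):
        # Place `replication` copies on distinct nodes.
        targets = set(itertools.islice(sorted(self.datanodes), self.replication))
        self.blocks[block] = targets

    def node_failed(self, node):
        # Heartbeat lost: drop the node, then re-replicate under-replicated blocks.
        self.datanodes.discard(node)
        for block, holders in self.blocks.items():
            holders.discard(node)
            spare = sorted(self.datanodes - holders)
            while len(holders) < self.replication and spare:
                holders.add(spare.pop())

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.write("block-0")
nn.node_failed("dn1")
print(len(nn.blocks["block-0"]))  # back at 3 replicas after re-replication
```

The real NameNode does far more (namespace, rack awareness, lease management), but the "expect failure, keep extra copies" loop is the core of the design.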
11. Data access to HDFS
• FTP – upload your data files
• Streaming – via Avro (RPC) or Flume
• Hadoop command – hadoop fs -copyFromLocal
• Windows Azure Blob storage – HDInsight Service (Azure) uses Blob storage instead of local VM storage; data can be uploaded without a provisioned Hadoop cluster
• PolyBase – feature of PDW 2012; direct read/write data access to the DataNodes
14. Map/Reduce
• MR: all functions in a batch-oriented architecture
• Map: apply the logic to the data, e.g. a page-hit count
• Reduce: aggregate the results of the mappers into one result
• YARN: splits the JobTracker into a Resource Manager and Node Managers; MR in Hadoop 2.0 uses YARN as its JobTracker
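As a mental model for the page-hit example above, the two phases can be written out in plain Python (this simulates the idea only; Hadoop's actual MapReduce API and its shuffle/sort machinery are very different):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (url, 1) pair for every page hit in the input split.
    for line in lines:
        url = line.split()[0]
        yield (url, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key (done by the framework in Hadoop).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate each key's list of values into one result.
    return {url: sum(counts) for url, counts in grouped.items()}

log = ["/home 10.0.0.1", "/about 10.0.0.2", "/home 10.0.0.3"]
hits = reduce_phase(shuffle(map_phase(log)))
print(hits)  # {'/home': 2, '/about': 1}
```

In a real cluster, many mappers run in parallel over HDFS splits and many reducers each receive one partition of the keys; the batch-oriented nature comes from materializing each phase before the next one starts.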
16. Hive
• Built for easy data retrieval
• Uses Map/Reduce
• Created by Facebook
• HiveQL: SQL-like language
• Stores data in tables, which are stored as HDFS file(s)
• Only initial INSERT supported, no UPDATE or DELETE
• External tables possible on existing (CSV) file(s)
• Extra language options to use the benefits of Hadoop
• Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12) – improve Hive 100x
17. Hive
Star schema join – based on TPC-DS Query 27:

SELECT
  col5, avg(col6)
FROM
  store_sales_fact ssf                                        -- 41 GB
  join item_dim on (ssf.col1 = item_dim.col1)                 -- 58 MB
  join date_dim on (ssf.col2 = date_dim.col2)                 -- 11 MB
  join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3) -- 80 MB
  join store_dim on (ssf.col4 = store_dim.col4)               -- 106 KB
GROUP BY col5
ORDER BY col5
LIMIT 100;

Cluster: 6 nodes (2 name, 4 compute – dual core, 14 GB)
18. Hive

File Type                                | # MR jobs | Input Size | # Mappers | Time
-----------------------------------------|-----------|------------|-----------|----------
Text / Hive 0.10                         | 5         | 43.1 GB    | 179       | 21:00 min
Text / Hive 0.11                         | 1         | 38.0 GB    | 151       | 4:06 min
RC / Hive 0.11                           | 1         | 8.21 GB    | 76        | 2:16 min
ORC / Hive 0.11                          | 1         | 2.83 GB    | 38        | 1:44 min
RC / Hive 0.11 / Partitioned / Bucketed  | 1         | 1.73 GB    | 19        | 1:44 min
ORC / Hive 0.11 / Partitioned / Bucketed | 1         | 687 MB     | 27        | 1:19 min

Data: ~64x less data
Time: ~16x faster
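The two summary ratios follow directly from the first and last rows of the table; a quick back-of-the-envelope check (sizes converted to MB, times to seconds):

```python
# First row: Text / Hive 0.10 - 43.1 GB read in 21:00 min
# Last row:  ORC / Hive 0.11 / partitioned / bucketed - 687 MB in 1:19 min
size_ratio = (43.1 * 1024) / 687        # GB -> MB
time_ratio = (21 * 60) / (1 * 60 + 19)  # min:sec -> seconds

print(round(size_ratio))  # ~64x less data read
print(round(time_ratio))  # ~16x faster
```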
19. Data access from Hadoop
• Excel
• FTP
• Hadoop command – hadoop fs -copyToLocal
• ODBC[1] – via Hive (HiveQL) data can be extracted
• Power Query – capable of extracting data directly from HDFS or Azure Blob storage
• PolyBase – feature of PDW 2012; direct read/write data access to the DataNodes
[1] http://www.microsoft.com/en-us/download/details.aspx?id=40886
[2] Power BI Excel add-in – http://www.powerbi.com
22. PDW – PolyBase
[Diagram: a PDW appliance of multiple SQL Server nodes ("This is PDW!") connected via PolyBase/Sqoop to a Hadoop cluster of DataNodes (DN)]
23. PDW – External Tables
• An external table is PDW's representation of data residing in HDFS
• The "table" (metadata) lives in the context of a SQL Server database
• The actual table data resides in HDFS
• No support for DML operations
• No concurrency control or isolation level guarantees

CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ])
{WITH (LOCATION = '<URI>', [FORMAT_OPTIONS = (<VALUES>)])}
[;]

LOCATION (required) indicates the location of the Hadoop cluster. FORMAT_OPTIONS (optional) are options associated with parsing of data from HDFS, e.g. field delimiters & reject-related thresholds.
24. PDW – Hadoop use cases & examples
[1] Retrieve data from HDFS with a PDW query
• Seamlessly join structured and semi-structured data

SELECT Username FROM ClickStream c, User u
WHERE c.UserID = u.ID AND c.URL = 'www.bing.com';

[2] Import data from HDFS to PDW
• Parallelized CREATE TABLE AS SELECT (CTAS)
• External tables as the source
• PDW table, either replicated or distributed, as destination

CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL))
AS SELECT URL, EventDate, UserID FROM ClickStream;

[3] Export data from PDW to HDFS
• Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
• External table as the destination; creates a set of HDFS files

CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
26. Wrap up
• Hadoop: 'just another data source' @ your fingertips!
• Batch processing large datasets before loading into your DWH
• Offloading DWH data, while keeping it accessible for analysis/reporting
• Integrate Hadoop via Sqoop, ODBC (Hive) or PolyBase
• Near future: deeper integration between Hadoop and SQL PDW
Try Hadoop / HDInsight yourself:
• Azure: http://www.windowsazure.com/en-us/pricing/free-trial/
• Web PI: http://www.microsoft.com/web/downloads/platform.aspx
28. References
• Microsoft Big Data – http://www.microsoft.com/bigdata
• Windows Azure HDInsight Service (3 months free trial) – http://www.windowsazure.com/en-us/services/hdinsight/
• SQL Server Parallel Data Warehouse (PDW) landing page – http://www.microsoft.com/PDW and http://www.upgradetopdw.com
• Introduction to PolyBase – http://www.microsoft.com/en-us/sqlserver/solutionstechnologies/data-warehousing/polybase.aspx
DEMO: Upload a local file with hadoop fs -copyFromLocal
- Hadoop command
- CloudXplorer
DEMO: Total hit count W3C logs
- TotalHits MR job
'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest is HiveQL.'
Hive <0.11:
- stores data in plain text files
- no join optimization
- typical DWH query (star schema join) results in 6 MR jobs
Hive 0.11:
- introduces (O)RC files, loosely based on column store indexes
- join optimization
- typical DWH query results in 1 MR job
Hive 0.12:
- uses YARN and Tez, optimized for DWH queries and less overhead than MR
DEMO: Retrieving data via ODBC and Power Query in Excel