SlideShare uma empresa Scribd logo
1 de 38
MapReduce in der Praxis
MapReduce Entwurfsmuster
Sascha Dittmann
Blog: http://www.sascha-dittmann.de
Twitter: @SaschaDittmann
Organizer
SQLSaturday Rheinland 201428.06.2014
You Rock! Sponsor
SQLSaturday Rheinland 201428.06.2014
Gold Sponsor
SQLSaturday Rheinland 201428.06.2014
Silver Sponsor
SQLSaturday Rheinland 201428.06.2014
Bronze Sponsor and Media Partner
SQLSaturday Rheinland 201428.06.2014
WAS IST BIG DATA?
Wo liegt das Problem?
28.06.2014 SQLSaturday Rheinland 2014
Wo liegt das Problem?
28.06.2014 SQLSaturday Rheinland 2014
"eins, zwei, zwei, drei, drei, drei, vier, vier, vier, vier"
.Split (',')
.GroupBy (x => x)
.ToList ()
.ForEach (g =>
Console.WriteLine (
“{0} wurde {1} mal gefunden.",
g.Key.Trim(),
g.Count()
));
Die 3 V’s
28.06.2014 SQLSaturday Rheinland 2014
Variety
Velocity
• Variety (Vielfalt)
• Relational, XML, Video, Text, ...
• Velocity (Geschwindigkeit)
• Batch, Intervall, Echtzeit, ...
• Volume (Menge)
• KB, MB, GB, TB, EB, PB, ZB,
YB, ...
Volume
Skalierung
28.06.2014 SQLSaturday Rheinland 2014
Vertikale Skalierung Horizontale Skalierung
WAS IST HADOOP / HDINSIGHT
Auf der Suche nach Lösungen
28.06.2014 SQLSaturday Rheinland 2014
Apache Hadoop / Microsoft HDInsight
28.06.2014 SQLSaturday Rheinland 2014
+
Apache Hadoop 1.x Ecosystem
28.06.2014 SQLSaturday Rheinland 2014
MapReduce (Job Scheduling/Execution System)
HDFS
(Hadoop Distributed File System)
HBase (Column DB)
Pig (Data
Flow)
Hive
(Warehouse
and Data
Access)
Oozie
(Workflow)
Sqoop
Traditional BI Tools
HBase / Cassandra
(Columnar NoSQL Databases)
Avro(Serialization)
Zookeeper(Coordination)
Apache
Mahout
Cascading
(programming
model)
Hadoop = MapReduce + HDFS
Flume
Microsoft HDInsight
28.06.2014 SQLSaturday Rheinland 2014
MapReduce (Job Scheduling/Execution System)
HDFS
(Hadoop Distributed File System)
HBase (Column DB)
Pig
(Data
Flow)
Hive
(Warehous
e and Data
Access)
Oozie
(Workflow)
Sqoop
Traditional BI Tools
HBase / Cassandra
(Columnar NoSQL Databases)
Avro(Serialization)
Zookeeper(Coordination)
Apache
Mahout
Cascading
(programming
model)
Hadoop = MapReduce + HDFS
Flume
Windows
SystemCenter
ActiveDirectory
Visual Studio
THE HADOOP CORE
HDFS + MapReduce
28.06.2014 SQLSaturday Rheinland 2014
Hadoop Distributed File System (HDFS)
28.06.2014 SQLSaturday Rheinland 2014
 Portable Operating System Interface (POSIX)
 Replikation auf mehrere Datenknoten
js> #ls /user/Sascha/input/ncdc
Found 9 items
drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:09 /user/Sascha/input/ncdc/all
drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:01 /user/Sascha/input/ncdc/all2
drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/metadata
drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro
drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro-tab
-rw-r--r-- 3 Sascha supergroup 529 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt
-rw-r--r-- 3 Sascha supergroup 168 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt.gz
Map/Reduce am Beispiel von Messdaten
28.06.2014 SQLSaturday Rheinland 2014
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Jahr Lufttemperatur
Map/Reduce am Beispiel von Messdaten
28.06.2014 SQLSaturday Rheinland 2014
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Messqualität
Map/Reduce
28.06.2014 SQLSaturday Rheinland 2014
Map
Sort
Shuffle
DataNode
Map
Sort
Shuffle
DataNode
Map
Sort
Shuffle
DataNode
Reduce
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
1949,0
1950,22
1950,55
1952,-11
1950,33
1949,0
1950,[22,33,55]
1952,-11
1949,0
1950,55
1952,-11
Map/Reduce mit Combiner-Funktion
28.06.2014 SQLSaturday Rheinland 2014
Map
Combine
Sort
Shuffle
DataNode
Map
Combine
Sort
Shuffle
DataNode
Map
Combine
Sort
Shuffle
DataNode
Reduce
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
1949,0
1950,22
1950,55
1952,-11
1950,33
1949,0
1950,55
1952,-11
1950,33
1949,0
1950,[33,55]
1952,-11
1949,0
1950,55
1952,-11
Map/Reduce mit Combiner-Funktion
28.06.2014 SQLSaturday Rheinland 2014
Map
Combine
Sort
Shuffle
DataNode
Map
Combine
Sort
Shuffle
DataNode
Map
Combine
Sort
Shuffle
DataNode
Reduce
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
1949,0
1950,22
1950,55
1952,-11
1950,33
1949,0
1950,55
1952,-11
1950,33
1949,0
1950,[33,55]
1952,-11
1949,0
1950,55
1952,-11
Partitioner (Key % Anzahl der Reducer)
NUMERISCHE AGGREGATION
MapReduce Entwurfsmuster
28.06.2014 SQLSaturday Rheinland 2014
Numerische Aggregation (T-SQL)
SELECT MIN(CreationDate)
, MAX(CreationDate)
, COUNT(UserId)
FROM dbo.Comments
GROUP BY UserId
28.06.2014 SQLSaturday Rheinland 2014
Min. / Max. / Count (MapReduce)
28.06.2014 SQLSaturday Rheinland 2014
Min. / Max. / Count (Datenfluß)
28.06.2014 SQLSaturday Rheinland 2014
UserId Min Max Count
12345 14.06.2001 14.06.2001 1
12345 11.01.1995 11.01.1995 1
12345 23.05.2009 23.05.2009 1
56789 23.06.1996 23.06.1996 1
56789 05.07.1997 05.07.1997 1
52737 06.12.2000 06.12.2000 1
UserId Min Max Count
12345 11.01.1995 23.05.2009 3
52737 06.12.2000 06.12.2000 1
56789 23.06.1996 05.07.1997 2
Mapper
Combiner / Reducer
Standardabweichung / Median
28.06.2014 SQLSaturday Rheinland 2014
Standardabweichung / Median (Datenfluß)
28.06.2014 SQLSaturday Rheinland 2014
UserId Length
4 18
4 14
4 18
3 4
3 4
9 19
Hour Length / Count Pairs
3 4:2,
4 18:2, 14:1
9 19:1
Mapper
Combiner
Reducer
Schnelles Zählen mit Countern
28.06.2014 SQLSaturday Rheinland 2014
FILTERN
MapReduce Entwurfsmuster
28.06.2014 SQLSaturday Rheinland 2014
Filtern (T-SQL)
SELECT *
FROM dbo.Table
WHERE Value < 42
28.06.2014 SQLSaturday Rheinland 2014
TOP / DISTINCT
28.06.2014 SQLSaturday Rheinland 2014
DATEN ORGANISIEREN
MapReduce Entwurfsmuster
28.06.2014 SQLSaturday Rheinland 2014
Struktur ändern / Partitionen / Sortierung
28.06.2014 SQLSaturday Rheinland 2014
JOINS
MapReduce Entwurfsmuster
28.06.2014 SQLSaturday Rheinland 2014
Filtern (T-SQL)
SELECT *
FROM dbo.Table1 T1
JOIN dbo.Table2 T2
ON T1.ID = T2.ID
28.06.2014 SQLSaturday Rheinland 2014
JOINS
28.06.2014 SQLSaturday Rheinland 2014
Code Beispiele
 C#
https://github.com/SaschaDittmann/MapReduceSamples
 Java
https://github.com/SaschaDittmann/MapReduceSamplesJava
 Blog Posts
http://www.sascha-dittmann.de/?tag=/MapReduce
28.06.2014 SQLSaturday Rheinland 2014
Thank you!
for sponsorship
for volunteering
for participation
for a great
SQLSaturday #313
SQLSaturday Rheinland 201428.06.2014

Mais conteúdo relacionado

Destaque

Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)Enrique Catala Bañuls
 
Microsoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse OverviewMicrosoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse OverviewJustin Munsters
 
SQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu NiculitaSQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu NiculitaITCamp
 
Azure doc db (slideshare)
Azure doc db (slideshare)Azure doc db (slideshare)
Azure doc db (slideshare)David Green
 
NoSQL with Microsoft Azure
NoSQL with Microsoft AzureNoSQL with Microsoft Azure
NoSQL with Microsoft AzureKhalid Salama
 
Azure DocumentDB for Healthcare Integration - Part 2
Azure DocumentDB for Healthcare Integration - Part 2Azure DocumentDB for Healthcare Integration - Part 2
Azure DocumentDB for Healthcare Integration - Part 2BizTalk360
 
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSONSQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSONSascha Dittmann
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with SparkKhalid Salama
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseJames Serra
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL DatabaseJames Serra
 
Intorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft AzureIntorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft AzureKhalid Salama
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeMSAdvAnalytics
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...NoSQLmatters
 
Building IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on AzureBuilding IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on AzureIdo Flatow
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudJames Serra
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureMark Kromer
 

Destaque (20)

Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)
 
Microsoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse OverviewMicrosoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse Overview
 
SQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu NiculitaSQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu Niculita
 
NoSQL, which way to go?
NoSQL, which way to go?NoSQL, which way to go?
NoSQL, which way to go?
 
Azure doc db (slideshare)
Azure doc db (slideshare)Azure doc db (slideshare)
Azure doc db (slideshare)
 
NoSQL with Microsoft Azure
NoSQL with Microsoft AzureNoSQL with Microsoft Azure
NoSQL with Microsoft Azure
 
Azure DocumentDB for Healthcare Integration - Part 2
Azure DocumentDB for Healthcare Integration - Part 2Azure DocumentDB for Healthcare Integration - Part 2
Azure DocumentDB for Healthcare Integration - Part 2
 
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSONSQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
 
Azure DocumentDB
Azure DocumentDBAzure DocumentDB
Azure DocumentDB
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Azure DocumentDb
Azure DocumentDbAzure DocumentDb
Azure DocumentDb
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL Database
 
Intorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft AzureIntorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft Azure
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
 
Building IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on AzureBuilding IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on Azure
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Big Data Application Architectures - IoT
Big Data Application Architectures - IoTBig Data Application Architectures - IoT
Big Data Application Architectures - IoT
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 

Semelhante a SQL Saturday #313 Rheinland - MapReduce in der Praxis

SQLSaturday Rheinland 2014 - Power query vs. ssis
SQLSaturday Rheinland 2014 - Power query vs. ssisSQLSaturday Rheinland 2014 - Power query vs. ssis
SQLSaturday Rheinland 2014 - Power query vs. ssisJean-Pierre Riehl
 
RRDTool + RUBY DSL = RRD-FFI
RRDTool + RUBY DSL = RRD-FFIRRDTool + RUBY DSL = RRD-FFI
RRDTool + RUBY DSL = RRD-FFIKamil Grabowski
 
VirtaThon 2011 - Mining the AWR
VirtaThon 2011 - Mining the AWRVirtaThon 2011 - Mining the AWR
VirtaThon 2011 - Mining the AWRKristofferson A
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud confamarsri
 
Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Eva Tse
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Spark Summit
 
Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...
Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...
Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...Alicja Folga
 
Megadata With Python and Hadoop
Megadata With Python and HadoopMegadata With Python and Hadoop
Megadata With Python and Hadoopryancox
 
Azure Machine Learning using R
Azure Machine Learning using RAzure Machine Learning using R
Azure Machine Learning using RHerman Wu
 
Synchronise your data between MySQL and MongoDB
Synchronise your data between MySQL and MongoDBSynchronise your data between MySQL and MongoDB
Synchronise your data between MySQL and MongoDBGiuseppe Maxia
 
Big data and data science study
Big data and data science studyBig data and data science study
Big data and data science studydspadawan
 
useR!2010 matome
useR!2010 matomeuseR!2010 matome
useR!2010 matomeybenjo
 
Design & Performance - Steve Souders at Fastly Altitude 2015
Design & Performance - Steve Souders at Fastly Altitude 2015Design & Performance - Steve Souders at Fastly Altitude 2015
Design & Performance - Steve Souders at Fastly Altitude 2015Fastly
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsMark Kromer
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsMark Kromer
 
Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014John Ternent
 
Migrating to Oracle Database 12c: 300 DBs in 300 days.
Migrating to Oracle Database 12c: 300 DBs in 300 days.Migrating to Oracle Database 12c: 300 DBs in 300 days.
Migrating to Oracle Database 12c: 300 DBs in 300 days.Ludovico Caldara
 
Redshift Introduction
Redshift IntroductionRedshift Introduction
Redshift IntroductionDataKitchen
 

Semelhante a SQL Saturday #313 Rheinland - MapReduce in der Praxis (20)

SQLSaturday Rheinland 2014 - Power query vs. ssis
SQLSaturday Rheinland 2014 - Power query vs. ssisSQLSaturday Rheinland 2014 - Power query vs. ssis
SQLSaturday Rheinland 2014 - Power query vs. ssis
 
RRDTool + RUBY DSL = RRD-FFI
RRDTool + RUBY DSL = RRD-FFIRRDTool + RUBY DSL = RRD-FFI
RRDTool + RUBY DSL = RRD-FFI
 
Visitation time scheduling
Visitation time schedulingVisitation time scheduling
Visitation time scheduling
 
VirtaThon 2011 - Mining the AWR
VirtaThon 2011 - Mining the AWRVirtaThon 2011 - Mining the AWR
VirtaThon 2011 - Mining the AWR
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud conf
 
Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
 
Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...
Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...
Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...
 
Megadata With Python and Hadoop
Megadata With Python and HadoopMegadata With Python and Hadoop
Megadata With Python and Hadoop
 
Azure Machine Learning using R
Azure Machine Learning using RAzure Machine Learning using R
Azure Machine Learning using R
 
Synchronise your data between MySQL and MongoDB
Synchronise your data between MySQL and MongoDBSynchronise your data between MySQL and MongoDB
Synchronise your data between MySQL and MongoDB
 
Big data and data science study
Big data and data science studyBig data and data science study
Big data and data science study
 
useR!2010 matome
useR!2010 matomeuseR!2010 matome
useR!2010 matome
 
Design & Performance - Steve Souders at Fastly Altitude 2015
Design & Performance - Steve Souders at Fastly Altitude 2015Design & Performance - Steve Souders at Fastly Altitude 2015
Design & Performance - Steve Souders at Fastly Altitude 2015
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flows
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flows
 
Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014
 
Migrating to Oracle Database 12c: 300 DBs in 300 days.
Migrating to Oracle Database 12c: 300 DBs in 300 days.Migrating to Oracle Database 12c: 300 DBs in 300 days.
Migrating to Oracle Database 12c: 300 DBs in 300 days.
 
Redshift Introduction
Redshift IntroductionRedshift Introduction
Redshift Introduction
 

Mais de Sascha Dittmann

Hochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft AzureHochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft AzureSascha Dittmann
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at ScaleSascha Dittmann
 
dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric Sascha Dittmann
 
Hadoop 2.0 - The Next Level
Hadoop 2.0 - The Next LevelHadoop 2.0 - The Next Level
Hadoop 2.0 - The Next LevelSascha Dittmann
 
Microsoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsightMicrosoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsightSascha Dittmann
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)Sascha Dittmann
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)Sascha Dittmann
 
dotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile Servicesdotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile ServicesSascha Dittmann
 
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwicklerdotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
dotnet Cologne 2013 - Microsoft HD Insight für .NET EntwicklerSascha Dittmann
 
Developer Open Space 2012 - Cloud Computing Workshop
Developer Open Space 2012 - Cloud Computing WorkshopDeveloper Open Space 2012 - Cloud Computing Workshop
Developer Open Space 2012 - Cloud Computing WorkshopSascha Dittmann
 
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)Sascha Dittmann
 
CloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die CloudCloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die CloudSascha Dittmann
 
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv....NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...Sascha Dittmann
 
NoSQL mit RavenDB und Azure
NoSQL mit RavenDB und AzureNoSQL mit RavenDB und Azure
NoSQL mit RavenDB und AzureSascha Dittmann
 
Windows Azure für Entwickler V1
Windows Azure für Entwickler V1Windows Azure für Entwickler V1
Windows Azure für Entwickler V1Sascha Dittmann
 

Mais de Sascha Dittmann (17)

C# + SQL = Big Data
C# + SQL = Big DataC# + SQL = Big Data
C# + SQL = Big Data
 
Hochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft AzureHochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft Azure
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at Scale
 
dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric
 
Hadoop 2.0 - The Next Level
Hadoop 2.0 - The Next LevelHadoop 2.0 - The Next Level
Hadoop 2.0 - The Next Level
 
Microsoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsightMicrosoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsight
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
 
dotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile Servicesdotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile Services
 
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwicklerdotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
 
Developer Open Space 2012 - Cloud Computing Workshop
Developer Open Space 2012 - Cloud Computing WorkshopDeveloper Open Space 2012 - Cloud Computing Workshop
Developer Open Space 2012 - Cloud Computing Workshop
 
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
 
CloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die CloudCloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die Cloud
 
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv....NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
 
Big Data & NoSQL
Big Data & NoSQLBig Data & NoSQL
Big Data & NoSQL
 
NoSQL mit RavenDB und Azure
NoSQL mit RavenDB und AzureNoSQL mit RavenDB und Azure
NoSQL mit RavenDB und Azure
 
Windows Azure für Entwickler V1
Windows Azure für Entwickler V1Windows Azure für Entwickler V1
Windows Azure für Entwickler V1
 

Último

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Último (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

SQL Saturday #313 Rheinland - MapReduce in der Praxis

  • 1. MapReduce in der Praxis MapReduce Entwurfsmuster Sascha Dittmann Blog: http://www.sascha-dittmann.de Twitter: @SaschaDittmann
  • 3. You Rock! Sponsor SQLSaturday Rheinland 201428.06.2014
  • 6. Bronze Sponsor and Media Partner SQLSaturday Rheinland 201428.06.2014
  • 7. WAS IST BIG DATA? Wo liegt das Problem? 28.06.2014 SQLSaturday Rheinland 2014
  • 8. Wo liegt das Problem? 28.06.2014 SQLSaturday Rheinland 2014 "eins, zwei, zwei, drei, drei, drei, vier, vier, vier, vier" .Split (',') .GroupBy (x => x) .ToList () .ForEach (g => Console.WriteLine ( “{0} wurde {1} mal gefunden.", g.Key.Trim(), g.Count() ));
  • 9. Die 3 V’s 28.06.2014 SQLSaturday Rheinland 2014 Variety Velocity • Variety (Vielfalt) • Relational, XML, Video, Text, ... • Velocity (Geschwindigkeit) • Batch, Intervall, Echtzeit, ... • Volume (Menge) • KB, MB, GB, TB, EB, PB, ZB, YB, ... Volume
  • 10. Skalierung 28.06.2014 SQLSaturday Rheinland 2014 Vertikale Skalierung Horizontale Skalierung
  • 11. WAS IST HADOOP / HDINSIGHT Auf der Suche nach Lösungen 28.06.2014 SQLSaturday Rheinland 2014
  • 12. Apache Hadoop / Microsoft HDInsight 28.06.2014 SQLSaturday Rheinland 2014 +
  • 13. Apache Hadoop 1.x Ecosystem 28.06.2014 SQLSaturday Rheinland 2014 MapReduce (Job Scheduling/Execution System) HDFS (Hadoop Distributed File System) HBase (Column DB) Pig (Data Flow) Hive (Warehouse and Data Access) Oozie (Workflow) Sqoop Traditional BI Tools HBase / Cassandra (Columnar NoSQL Databases) Avro(Serialization) Zookeeper(Coordination) Apache Mahout Cascading (programming model) Hadoop = MapReduce + HDFS Flume
  • 14. Microsoft HDInsight 28.06.2014 SQLSaturday Rheinland 2014 MapReduce (Job Scheduling/Execution System) HDFS (Hadoop Distributed File System) HBase (Column DB) Pig (Data Flow) Hive (Warehous e and Data Access) Oozie (Workflow) Sqoop Traditional BI Tools HBase / Cassandra (Columnar NoSQL Databases) Avro(Serialization) Zookeeper(Coordination) Apache Mahout Cascading (programming model) Hadoop = MapReduce + HDFS Flume Windows SystemCenter ActiveDirectory Visual Studio
  • 15. THE HADOOP CORE HDFS + MapReduce 28.06.2014 SQLSaturday Rheinland 2014
  • 16. Hadoop Distributed File System (HDFS) 28.06.2014 SQLSaturday Rheinland 2014  Portable Operating System Interface (POSIX)  Replikation auf mehrere Datenknoten js> #ls /user/Sascha/input/ncdc Found 9 items drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:09 /user/Sascha/input/ncdc/all drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:01 /user/Sascha/input/ncdc/all2 drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/metadata drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro-tab -rw-r--r-- 3 Sascha supergroup 529 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt -rw-r--r-- 3 Sascha supergroup 168 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt.gz
  • 17. Map/Reduce am Beispiel von Messdaten 28.06.2014 SQLSaturday Rheinland 2014 0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999 0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999 0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999 0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999 0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999 Jahr Lufttemperatur
  • 18. Map/Reduce am Beispiel von Messdaten 28.06.2014 SQLSaturday Rheinland 2014 0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999 0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999 0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999 0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999 0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999 Messqualität
  • 19. Map/Reduce 28.06.2014 SQLSaturday Rheinland 2014 Map Sort Shuffle DataNode Map Sort Shuffle DataNode Map Sort Shuffle DataNode Reduce 0067011990999991950051507004+68750 0043011990999991950051512004+68750 0043011990999991950051518004+68750 0043012650999991949032412004+62300 0043012650999991949032418004+62300 1949,0 1950,22 1950,55 1952,-11 1950,33 1949,0 1950,[22,33,55] 1952,-11 1949,0 1950,55 1952,-11
  • 20. Map/Reduce mit Combiner-Funktion 28.06.2014 SQLSaturday Rheinland 2014 Map Combine Sort Shuffle DataNode Map Combine Sort Shuffle DataNode Map Combine Sort Shuffle DataNode Reduce 0067011990999991950051507004+68750 0043011990999991950051512004+68750 0043011990999991950051518004+68750 0043012650999991949032412004+62300 0043012650999991949032418004+62300 1949,0 1950,22 1950,55 1952,-11 1950,33 1949,0 1950,55 1952,-11 1950,33 1949,0 1950,[33,55] 1952,-11 1949,0 1950,55 1952,-11
  • 21. Map/Reduce mit Combiner-Funktion 28.06.2014 SQLSaturday Rheinland 2014 Map Combine Sort Shuffle DataNode Map Combine Sort Shuffle DataNode Map Combine Sort Shuffle DataNode Reduce 0067011990999991950051507004+68750 0043011990999991950051512004+68750 0043011990999991950051518004+68750 0043012650999991949032412004+62300 0043012650999991949032418004+62300 1949,0 1950,22 1950,55 1952,-11 1950,33 1949,0 1950,55 1952,-11 1950,33 1949,0 1950,[33,55] 1952,-11 1949,0 1950,55 1952,-11 Partitioner (Key % Anzahl der Reducer)
  • 23. Numerische Aggregation (T-SQL) SELECT MIN(CreationDate) , MAX(CreationDate) , COUNT(UserId) FROM dbo.Comments GROUP BY UserId 28.06.2014 SQLSaturday Rheinland 2014
  • 24. Min. / Max. / Count (MapReduce) 28.06.2014 SQLSaturday Rheinland 2014
  • 25. Min. / Max. / Count (Datenfluß) 28.06.2014 SQLSaturday Rheinland 2014 UserId Min Max Count 12345 14.06.2001 14.06.2001 1 12345 11.01.1995 11.01.1995 1 12345 23.05.2009 23.05.2009 1 56789 23.06.1996 23.06.1996 1 56789 05.07.1997 05.07.1997 1 52737 06.12.2000 06.12.2000 1 UserId Min Max Count 12345 11.01.1995 23.05.2009 3 52737 06.12.2000 06.12.2000 1 56789 23.06.1996 05.07.1997 2 Mapper Combiner / Reducer
  • 26. Standardabweichung / Median 28.06.2014 SQLSaturday Rheinland 2014
  • 27. Standardabweichung / Median (Datenfluß) 28.06.2014 SQLSaturday Rheinland 2014 UserId Length 4 18 4 14 4 18 3 4 3 4 9 19 Hour Length / Count Pairs 3 4:2, 4 18:2, 14:1 9 19:1 Mapper Combiner Reducer
  • 28. Schnelles Zählen mit Countern 28.06.2014 SQLSaturday Rheinland 2014
  • 30. Filtern (T-SQL) SELECT * FROM dbo.Table WHERE Value < 42 28.06.2014 SQLSaturday Rheinland 2014
  • 31. TOP / DISTINCT 28.06.2014 SQLSaturday Rheinland 2014
  • 33. Struktur ändern / Partitionen / Sortierung 28.06.2014 SQLSaturday Rheinland 2014
  • 35. Filtern (T-SQL) SELECT * FROM dbo.Table1 T1 JOIN dbo.Table2 T2 ON T1.ID = T2.ID 28.06.2014 SQLSaturday Rheinland 2014
  • 37. Code Beispiele  C# https://github.com/SaschaDittmann/MapReduceSamples  Java https://github.com/SaschaDittmann/MapReduceSamplesJava  Blog Posts http://www.sascha-dittmann.de/?tag=/MapReduce 28.06.2014 SQLSaturday Rheinland 2014
  • 38. Thank you! for sponsorship for volunteering for participation for a great SQLSaturday #313 SQLSaturday Rheinland 201428.06.2014