SlideShare uma empresa Scribd logo
1 de 38
MapReduce in der Praxis
MapReduce Entwurfsmuster
Sascha Dittmann
Blog: http://www.sascha-dittmann.de
Twitter: @SaschaDittmann
Organizer
SQLSaturday Rheinland 201428.06.2014
You Rock! Sponsor
SQLSaturday Rheinland 201428.06.2014
Gold Sponsor
SQLSaturday Rheinland 201428.06.2014
Silver Sponsor
SQLSaturday Rheinland 201428.06.2014
Bronze Sponsor and Media Partner
SQLSaturday Rheinland 201428.06.2014
WAS IST BIG DATA?
Wo liegt das Problem?
28.06.2014 SQLSaturday Rheinland 2014
Wo liegt das Problem?
28.06.2014 SQLSaturday Rheinland 2014
"eins, zwei, zwei, drei, drei, drei, vier, vier, vier, vier"
.Split (',')
.GroupBy (x => x)
.ToList ()
.ForEach (g =>
Console.WriteLine (
“{0} wurde {1} mal gefunden.",
g.Key.Trim(),
g.Count()
));
Die 3 V’s
28.06.2014 SQLSaturday Rheinland 2014
Variety
Velocity
• Variety (Vielfalt)
• Relational, XML, Video, Text, ...
• Velocity (Geschwindigkeit)
• Batch, Intervall, Echtzeit, ...
• Volume (Menge)
• KB, MB, GB, TB, EB, PB, ZB,
YB, ...
Volume
Skalierung
28.06.2014 SQLSaturday Rheinland 2014
Vertikale Skalierung Horizontale Skalierung
WAS IST HADOOP / HDINSIGHT
Auf der Suche nach Lösungen
28.06.2014 SQLSaturday Rheinland 2014
Apache Hadoop / Microsoft HDInsight
28.06.2014 SQLSaturday Rheinland 2014
+
Apache Hadoop 1.x Ecosystem
28.06.2014 SQLSaturday Rheinland 2014
MapReduce (Job Scheduling/Execution System)
HDFS
(Hadoop Distributed File System)
HBase (Column DB)
Pig (Data
Flow)
Hive
(Warehouse
and Data
Access)
Oozie
(Workflow)
Sqoop
Traditional BI Tools
HBase / Cassandra
(Columnar NoSQL Databases)
Avro(Serialization)
Zookeeper(Coordination)
Apache
Mahout
Cascading
(programming
model)
Hadoop = MapReduce + HDFS
Flume
Microsoft HDInsight
28.06.2014 SQLSaturday Rheinland 2014
MapReduce (Job Scheduling/Execution System)
HDFS
(Hadoop Distributed File System)
HBase (Column DB)
Pig
(Data
Flow)
Hive
(Warehous
e and Data
Access)
Oozie
(Workflow)
Sqoop
Traditional BI Tools
HBase / Cassandra
(Columnar NoSQL Databases)
Avro(Serialization)
Zookeeper(Coordination)
Apache
Mahout
Cascading
(programming
model)
Hadoop = MapReduce + HDFS
Flume
Windows
SystemCenter
ActiveDirectory
Visual Studio
THE HADOOP CORE
HDFS + MapReduce
28.06.2014 SQLSaturday Rheinland 2014
Hadoop Distributed File System (HDFS)
28.06.2014 SQLSaturday Rheinland 2014
 Portable Operating System Interface (POSIX)
 Replikation auf mehrere Datenknoten
js> #ls /user/Sascha/input/ncdc
Found 9 items
drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:09 /user/Sascha/input/ncdc/all
drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:01 /user/Sascha/input/ncdc/all2
drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/metadata
drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro
drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro-tab
-rw-r--r-- 3 Sascha supergroup 529 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt
-rw-r--r-- 3 Sascha supergroup 168 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt.gz
Map/Reduce am Beispiel von Messdaten
28.06.2014 SQLSaturday Rheinland 2014
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Jahr Lufttemperatur
Map/Reduce am Beispiel von Messdaten
28.06.2014 SQLSaturday Rheinland 2014
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Messqualität
Map/Reduce
28.06.2014 SQLSaturday Rheinland 2014
Map
Sort
Shuffle
DataNode
Map
Sort
Shuffle
DataNode
Map
Sort
Shuffle
DataNode
Reduce
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
1949,0
1950,22
1950,55
1952,-11
1950,33
1949,0
1950,[22,33,55]
1952,-11
1949,0
1950,55
1952,-11
Map/Reduce mit Combiner-Funktion
28.06.2014 SQLSaturday Rheinland 2014
Map
Combine
Sort
Shuffle
DataNode
Map
Combine
Sort
Shuffle
DataNode
Map
Combine
Sort
Shuffle
DataNode
Reduce
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
1949,0
1950,22
1950,55
1952,-11
1950,33
1949,0
1950,55
1952,-11
1950,33
1949,0
1950,[33,55]
1952,-11
1949,0
1950,55
1952,-11
Map/Reduce mit Combiner-Funktion
28.06.2014 SQLSaturday Rheinland 2014
Map
Combine
Sort
Shuffle
DataNode
Map
Combine
Sort
Shuffle
DataNode
Map
Combine
Sort
Shuffle
DataNode
Reduce
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
1949,0
1950,22
1950,55
1952,-11
1950,33
1949,0
1950,55
1952,-11
1950,33
1949,0
1950,[33,55]
1952,-11
1949,0
1950,55
1952,-11
Partitioner (Key % Anzahl der Reducer)
NUMERISCHE AGGREGATION
MapReduce Entwurfsmuster
28.06.2014 SQLSaturday Rheinland 2014
Numerische Aggregation (T-SQL)
SELECT MIN(CreationDate)
, MAX(CreationDate)
, COUNT(UserId)
FROM dbo.Comments
GROUP BY UserId
28.06.2014 SQLSaturday Rheinland 2014
Min. / Max. / Count (MapReduce)
28.06.2014 SQLSaturday Rheinland 2014
Min. / Max. / Count (Datenfluß)
28.06.2014 SQLSaturday Rheinland 2014
UserId Min Max Count
12345 14.06.2001 14.06.2001 1
12345 11.01.1995 11.01.1995 1
12345 23.05.2009 23.05.2009 1
56789 23.06.1996 23.06.1996 1
56789 05.07.1997 05.07.1997 1
52737 06.12.2000 06.12.2000 1
UserId Min Max Count
12345 11.01.1995 23.05.2009 3
52737 06.12.2000 06.12.2000 1
56789 23.06.1996 05.07.1997 2
Mapper
Combiner / Reducer
Standardabweichung / Median
28.06.2014 SQLSaturday Rheinland 2014
Standardabweichung / Median (Datenfluß)
28.06.2014 SQLSaturday Rheinland 2014
UserId Length
4 18
4 14
4 18
3 4
3 4
9 19
Hour Length / Count Pairs
3 4:2,
4 18:2, 14:1
9 19:1
Mapper
Combiner
Reducer
Schnelles Zählen mit Countern
28.06.2014 SQLSaturday Rheinland 2014
FILTERN
MapReduce Entwurfsmuster
28.06.2014 SQLSaturday Rheinland 2014
Filtern (T-SQL)
SELECT *
FROM dbo.Table
WHERE Value < 42
28.06.2014 SQLSaturday Rheinland 2014
TOP / DISTINCT
28.06.2014 SQLSaturday Rheinland 2014
DATEN ORGANISIEREN
MapReduce Entwurfsmuster
28.06.2014 SQLSaturday Rheinland 2014
Struktur ändern / Partitionen / Sortierung
28.06.2014 SQLSaturday Rheinland 2014
JOINS
MapReduce Entwurfsmuster
28.06.2014 SQLSaturday Rheinland 2014
Filtern (T-SQL)
SELECT *
FROM dbo.Table1 T1
JOIN dbo.Table2 T2
ON T1.ID = T2.ID
28.06.2014 SQLSaturday Rheinland 2014
JOINS
28.06.2014 SQLSaturday Rheinland 2014
Code Beispiele
 C#
https://github.com/SaschaDittmann/MapReduceSamples
 Java
https://github.com/SaschaDittmann/MapReduceSamplesJava
 Blog Posts
http://www.sascha-dittmann.de/?tag=/MapReduce
28.06.2014 SQLSaturday Rheinland 2014
Thank you!
for sponsorship
for volunteering
for participation
for a great
SQLSaturday #313
SQLSaturday Rheinland 201428.06.2014

Mais conteúdo relacionado

Destaque

Destaque (20)

Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)Datawarehouse como servicio en azure (sqldw)
Datawarehouse como servicio en azure (sqldw)
 
Microsoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse OverviewMicrosoft Azure Data Warehouse Overview
Microsoft Azure Data Warehouse Overview
 
SQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu NiculitaSQL Azure Data Warehouse - Silviu Niculita
SQL Azure Data Warehouse - Silviu Niculita
 
NoSQL, which way to go?
NoSQL, which way to go?NoSQL, which way to go?
NoSQL, which way to go?
 
Azure doc db (slideshare)
Azure doc db (slideshare)Azure doc db (slideshare)
Azure doc db (slideshare)
 
NoSQL with Microsoft Azure
NoSQL with Microsoft AzureNoSQL with Microsoft Azure
NoSQL with Microsoft Azure
 
Azure DocumentDB for Healthcare Integration - Part 2
Azure DocumentDB for Healthcare Integration - Part 2Azure DocumentDB for Healthcare Integration - Part 2
Azure DocumentDB for Healthcare Integration - Part 2
 
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSONSQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
 
Azure DocumentDB
Azure DocumentDBAzure DocumentDB
Azure DocumentDB
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Azure DocumentDb
Azure DocumentDbAzure DocumentDb
Azure DocumentDb
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL Database
 
Intorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft AzureIntorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft Azure
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
 
Building IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on AzureBuilding IoT and Big Data Solutions on Azure
Building IoT and Big Data Solutions on Azure
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Big Data Application Architectures - IoT
Big Data Application Architectures - IoTBig Data Application Architectures - IoT
Big Data Application Architectures - IoT
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 

Semelhante a SQL Saturday #313 Rheinland - MapReduce in der Praxis

VirtaThon 2011 - Mining the AWR
VirtaThon 2011 - Mining the AWRVirtaThon 2011 - Mining the AWR
VirtaThon 2011 - Mining the AWR
Kristofferson A
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud conf
amarsri
 
Big data and data science study
Big data and data science studyBig data and data science study
Big data and data science study
dspadawan
 

Semelhante a SQL Saturday #313 Rheinland - MapReduce in der Praxis (20)

SQLSaturday Rheinland 2014 - Power query vs. ssis
SQLSaturday Rheinland 2014 - Power query vs. ssisSQLSaturday Rheinland 2014 - Power query vs. ssis
SQLSaturday Rheinland 2014 - Power query vs. ssis
 
RRDTool + RUBY DSL = RRD-FFI
RRDTool + RUBY DSL = RRD-FFIRRDTool + RUBY DSL = RRD-FFI
RRDTool + RUBY DSL = RRD-FFI
 
Visitation time scheduling
Visitation time schedulingVisitation time scheduling
Visitation time scheduling
 
VirtaThon 2011 - Mining the AWR
VirtaThon 2011 - Mining the AWRVirtaThon 2011 - Mining the AWR
VirtaThon 2011 - Mining the AWR
 
Grill at bigdata-cloud conf
Grill at bigdata-cloud confGrill at bigdata-cloud conf
Grill at bigdata-cloud conf
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
 
Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...
Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...
Hurtownia danych jako serwis w chmurze, na przykładzie azure sql data warehou...
 
Megadata With Python and Hadoop
Megadata With Python and HadoopMegadata With Python and Hadoop
Megadata With Python and Hadoop
 
Azure Machine Learning using R
Azure Machine Learning using RAzure Machine Learning using R
Azure Machine Learning using R
 
Synchronise your data between MySQL and MongoDB
Synchronise your data between MySQL and MongoDBSynchronise your data between MySQL and MongoDB
Synchronise your data between MySQL and MongoDB
 
Big data and data science study
Big data and data science studyBig data and data science study
Big data and data science study
 
useR!2010 matome
useR!2010 matomeuseR!2010 matome
useR!2010 matome
 
A short introduction to Vertica
A short introduction to VerticaA short introduction to Vertica
A short introduction to Vertica
 
Design & Performance - Steve Souders at Fastly Altitude 2015
Design & Performance - Steve Souders at Fastly Altitude 2015Design & Performance - Steve Souders at Fastly Altitude 2015
Design & Performance - Steve Souders at Fastly Altitude 2015
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flows
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flows
 
Migrating to Oracle Database 12c: 300 DBs in 300 days.
Migrating to Oracle Database 12c: 300 DBs in 300 days.Migrating to Oracle Database 12c: 300 DBs in 300 days.
Migrating to Oracle Database 12c: 300 DBs in 300 days.
 

Mais de Sascha Dittmann

dotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile Servicesdotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile Services
Sascha Dittmann
 
CloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die CloudCloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die Cloud
Sascha Dittmann
 
NoSQL mit RavenDB und Azure
NoSQL mit RavenDB und AzureNoSQL mit RavenDB und Azure
NoSQL mit RavenDB und Azure
Sascha Dittmann
 
Windows Azure für Entwickler V1
Windows Azure für Entwickler V1Windows Azure für Entwickler V1
Windows Azure für Entwickler V1
Sascha Dittmann
 

Mais de Sascha Dittmann (17)

C# + SQL = Big Data
C# + SQL = Big DataC# + SQL = Big Data
C# + SQL = Big Data
 
Hochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft AzureHochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft Azure
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at Scale
 
dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric
 
Hadoop 2.0 - The Next Level
Hadoop 2.0 - The Next LevelHadoop 2.0 - The Next Level
Hadoop 2.0 - The Next Level
 
Microsoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsightMicrosoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsight
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
 
dotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile Servicesdotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile Services
 
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwicklerdotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
 
Developer Open Space 2012 - Cloud Computing Workshop
Developer Open Space 2012 - Cloud Computing WorkshopDeveloper Open Space 2012 - Cloud Computing Workshop
Developer Open Space 2012 - Cloud Computing Workshop
 
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
 
CloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die CloudCloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die Cloud
 
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv....NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
 
Big Data & NoSQL
Big Data & NoSQLBig Data & NoSQL
Big Data & NoSQL
 
NoSQL mit RavenDB und Azure
NoSQL mit RavenDB und AzureNoSQL mit RavenDB und Azure
NoSQL mit RavenDB und Azure
 
Windows Azure für Entwickler V1
Windows Azure für Entwickler V1Windows Azure für Entwickler V1
Windows Azure für Entwickler V1
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Último (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

SQL Saturday #313 Rheinland - MapReduce in der Praxis

  • 1. MapReduce in der Praxis MapReduce Entwurfsmuster Sascha Dittmann Blog: http://www.sascha-dittmann.de Twitter: @SaschaDittmann
  • 3. You Rock! Sponsor SQLSaturday Rheinland 201428.06.2014
  • 6. Bronze Sponsor and Media Partner SQLSaturday Rheinland 201428.06.2014
  • 7. WAS IST BIG DATA? Wo liegt das Problem? 28.06.2014 SQLSaturday Rheinland 2014
  • 8. Wo liegt das Problem? 28.06.2014 SQLSaturday Rheinland 2014 "eins, zwei, zwei, drei, drei, drei, vier, vier, vier, vier" .Split (',') .GroupBy (x => x) .ToList () .ForEach (g => Console.WriteLine ( “{0} wurde {1} mal gefunden.", g.Key.Trim(), g.Count() ));
  • 9. Die 3 V’s 28.06.2014 SQLSaturday Rheinland 2014 Variety Velocity • Variety (Vielfalt) • Relational, XML, Video, Text, ... • Velocity (Geschwindigkeit) • Batch, Intervall, Echtzeit, ... • Volume (Menge) • KB, MB, GB, TB, EB, PB, ZB, YB, ... Volume
  • 10. Skalierung 28.06.2014 SQLSaturday Rheinland 2014 Vertikale Skalierung Horizontale Skalierung
  • 11. WAS IST HADOOP / HDINSIGHT Auf der Suche nach Lösungen 28.06.2014 SQLSaturday Rheinland 2014
  • 12. Apache Hadoop / Microsoft HDInsight 28.06.2014 SQLSaturday Rheinland 2014 +
  • 13. Apache Hadoop 1.x Ecosystem 28.06.2014 SQLSaturday Rheinland 2014 MapReduce (Job Scheduling/Execution System) HDFS (Hadoop Distributed File System) HBase (Column DB) Pig (Data Flow) Hive (Warehouse and Data Access) Oozie (Workflow) Sqoop Traditional BI Tools HBase / Cassandra (Columnar NoSQL Databases) Avro(Serialization) Zookeeper(Coordination) Apache Mahout Cascading (programming model) Hadoop = MapReduce + HDFS Flume
  • 14. Microsoft HDInsight 28.06.2014 SQLSaturday Rheinland 2014 MapReduce (Job Scheduling/Execution System) HDFS (Hadoop Distributed File System) HBase (Column DB) Pig (Data Flow) Hive (Warehous e and Data Access) Oozie (Workflow) Sqoop Traditional BI Tools HBase / Cassandra (Columnar NoSQL Databases) Avro(Serialization) Zookeeper(Coordination) Apache Mahout Cascading (programming model) Hadoop = MapReduce + HDFS Flume Windows SystemCenter ActiveDirectory Visual Studio
  • 15. THE HADOOP CORE HDFS + MapReduce 28.06.2014 SQLSaturday Rheinland 2014
  • 16. Hadoop Distributed File System (HDFS) 28.06.2014 SQLSaturday Rheinland 2014  Portable Operating System Interface (POSIX)  Replikation auf mehrere Datenknoten js> #ls /user/Sascha/input/ncdc Found 9 items drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:09 /user/Sascha/input/ncdc/all drwxr-xr-x - Sascha supergroup 0 2013-04-24 13:01 /user/Sascha/input/ncdc/all2 drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/metadata drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro drwxr-xr-x - Sascha supergroup 0 2013-04-23 13:06 /user/Sascha/input/ncdc/micro-tab -rw-r--r-- 3 Sascha supergroup 529 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt -rw-r--r-- 3 Sascha supergroup 168 2013-04-23 13:06 /user/Sascha/input/ncdc/sample.txt.gz
  • 17. Map/Reduce am Beispiel von Messdaten 28.06.2014 SQLSaturday Rheinland 2014 0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999 0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999 0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999 0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999 0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999 Jahr Lufttemperatur
  • 18. Map/Reduce am Beispiel von Messdaten 28.06.2014 SQLSaturday Rheinland 2014 0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999 0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999 0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999 0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999 0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999 Messqualität
  • 19. Map/Reduce 28.06.2014 SQLSaturday Rheinland 2014 Map Sort Shuffle DataNode Map Sort Shuffle DataNode Map Sort Shuffle DataNode Reduce 0067011990999991950051507004+68750 0043011990999991950051512004+68750 0043011990999991950051518004+68750 0043012650999991949032412004+62300 0043012650999991949032418004+62300 1949,0 1950,22 1950,55 1952,-11 1950,33 1949,0 1950,[22,33,55] 1952,-11 1949,0 1950,55 1952,-11
  • 20. Map/Reduce mit Combiner-Funktion 28.06.2014 SQLSaturday Rheinland 2014 Map Combine Sort Shuffle DataNode Map Combine Sort Shuffle DataNode Map Combine Sort Shuffle DataNode Reduce 0067011990999991950051507004+68750 0043011990999991950051512004+68750 0043011990999991950051518004+68750 0043012650999991949032412004+62300 0043012650999991949032418004+62300 1949,0 1950,22 1950,55 1952,-11 1950,33 1949,0 1950,55 1952,-11 1950,33 1949,0 1950,[33,55] 1952,-11 1949,0 1950,55 1952,-11
  • 21. Map/Reduce mit Combiner-Funktion 28.06.2014 SQLSaturday Rheinland 2014 Map Combine Sort Shuffle DataNode Map Combine Sort Shuffle DataNode Map Combine Sort Shuffle DataNode Reduce 0067011990999991950051507004+68750 0043011990999991950051512004+68750 0043011990999991950051518004+68750 0043012650999991949032412004+62300 0043012650999991949032418004+62300 1949,0 1950,22 1950,55 1952,-11 1950,33 1949,0 1950,55 1952,-11 1950,33 1949,0 1950,[33,55] 1952,-11 1949,0 1950,55 1952,-11 Partitioner (Key % Anzahl der Reducer)
  • 23. Numerische Aggregation (T-SQL) SELECT MIN(CreationDate) , MAX(CreationDate) , COUNT(UserId) FROM dbo.Comments GROUP BY UserId 28.06.2014 SQLSaturday Rheinland 2014
  • 24. Min. / Max. / Count (MapReduce) 28.06.2014 SQLSaturday Rheinland 2014
  • 25. Min. / Max. / Count (Datenfluß) 28.06.2014 SQLSaturday Rheinland 2014 UserId Min Max Count 12345 14.06.2001 14.06.2001 1 12345 11.01.1995 11.01.1995 1 12345 23.05.2009 23.05.2009 1 56789 23.06.1996 23.06.1996 1 56789 05.07.1997 05.07.1997 1 52737 06.12.2000 06.12.2000 1 UserId Min Max Count 12345 11.01.1995 23.05.2009 3 52737 06.12.2000 06.12.2000 1 56789 23.06.1996 05.07.1997 2 Mapper Combiner / Reducer
  • 26. Standardabweichung / Median 28.06.2014 SQLSaturday Rheinland 2014
  • 27. Standardabweichung / Median (Datenfluß) 28.06.2014 SQLSaturday Rheinland 2014 UserId Length 4 18 4 14 4 18 3 4 3 4 9 19 Hour Length / Count Pairs 3 4:2, 4 18:2, 14:1 9 19:1 Mapper Combiner Reducer
  • 28. Schnelles Zählen mit Countern 28.06.2014 SQLSaturday Rheinland 2014
  • 30. Filtern (T-SQL) SELECT * FROM dbo.Table WHERE Value < 42 28.06.2014 SQLSaturday Rheinland 2014
  • 31. TOP / DISTINCT 28.06.2014 SQLSaturday Rheinland 2014
  • 33. Struktur ändern / Partitionen / Sortierung 28.06.2014 SQLSaturday Rheinland 2014
  • 35. Filtern (T-SQL) SELECT * FROM dbo.Table1 T1 JOIN dbo.Table2 T2 ON T1.ID = T2.ID 28.06.2014 SQLSaturday Rheinland 2014
  • 37. Code Beispiele  C# https://github.com/SaschaDittmann/MapReduceSamples  Java https://github.com/SaschaDittmann/MapReduceSamplesJava  Blog Posts http://www.sascha-dittmann.de/?tag=/MapReduce 28.06.2014 SQLSaturday Rheinland 2014
  • 38. Thank you! for sponsorship for volunteering for participation for a great SQLSaturday #313 SQLSaturday Rheinland 201428.06.2014