Big Data Infrastructure workshop 
A hands-on introduction 
Saturday, December 6, 2014
Agenda 
08:30 AM Breakfast 
09:00 AM Introduction and Strengths of Technologies 
10:00 AM Start an EMR Cluster 
10:15 AM break + set up query tool 
10:30 AM Hadoop hands-on 
10:55 AM break 
11:10 AM Redshift hands-on 
11:40 AM Operationalizing your code 
12:00 PM adjourn 
Background on your presenters
DataKitchen Leadership 
Chris Bergh (Executive Chef)
Gil Benghiat (VP Product)
Eric Estabrooks (VP Cloud and Data Services)
Software development origins and executive experience delivering enterprise software focused on the Marketing and Health Care sectors.
Deep Analytic Experience: spent the past decade solving the analytic data preparation problem
New Approach To Data Preparation and Production: focused on the Analysts
Analysts And Their Teams Are Spending 60-80% Of Their Time On Data Preparation And Production
This creates an expectation gap 
[Figure: how time is split across Prepare Data, Analyze, and Communicate, comparing the Business Customer's Expectation with the Analyst's Reality]
The business does not think that Analysts are preparing data. (Analysts don’t want to prepare data.)
What Analysts Really Want: 
An Integrated Data Set Ready For Analysis 
With: Autonomy & Agility 
Without: All the Work & Anxiety
DataKitchen solves this problem.
We are on a mission to prepare data to make analysts successful.
Experience of Audience 
• Who considers themselves:
  • An analyst
  • A data scientist
  • A programmer / scripter
  • On the business side
• Who knows SQL – can write a simple select? 
• Who had an AWS account before today? 
Hadoop & Redshift
What Is Apache Hadoop? 
• Software framework 
• Large scale processing 
• Network of commodity hardware 
• Handles hardware failures 
http://hadoop.apache.org/
What is Hadoop good for? 
• Problems that are huge (batch), but not hard, and can be run in parallel over immutable data
• NOT OLTP (e.g. backend to e-commerce site)
• Providing a Map Reduce framework 
Map Reduce 
http://www.cs.berkeley.edu/~matei/talks/2010/amp_mapreduce.pdf 
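To make the three phases concrete, here is a minimal single-machine sketch of map reduce in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made for illustration. On a real cluster the framework runs many mappers and reducers in parallel and does the shuffle for you.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Chain the three phases over a list of input records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input: two tiny "documents".
counts = word_count(["big data is big", "data lake"])
```

Each call to `map_phase` looks at one record only, and each call to `reduce_phase` looks at one key only, which is what lets the real framework spread the work across machines.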
You can write map reduce jobs in your favorite language 
Streaming Interface 
• Lets you specify the mapper and the reducer 
• Supports 
  • Java 
  • Python 
  • Ruby 
  • Unix Shell 
  • R 
  • Any executable 
Map Reduce “generators” 
• Produce map reduce jobs for you 
  • Pig 
  • Hive 
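As a sketch of what a streaming mapper and reducer might look like in Python: the streaming interface passes records as lines of text, with key and value conventionally separated by a tab, and delivers the mapper output to the reducer sorted by key. The log format, the sample lines, and the function names below are assumptions for illustration, and `sorted()` stands in for the sort that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'status<TAB>1' per log line; the status is assumed to be the last field."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[-1]}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Hypothetical log lines: method, path, HTTP status.
log_lines = ["GET /index.html 200", "GET /missing 404", "POST /login 200"]

# Local stand-in for the cluster: map, sort by key, then reduce.
result = list(reducer(sorted(mapper(log_lines))))
```

In a real streaming job each function would live in its own script that reads standard input and prints to standard output, which is why any executable can play either role.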
Applications that lend themselves to map reduce 
• Word Count 
• PDF Generation (NY Times 11,000,000 articles) 
• Analysis of stock market historical data (ROI and standard deviation) 
• Geographical Data (Finding intersections, rendering map files) 
• Log file querying and analysis 
• Statistical machine translation 
• Spam detection 
• Analyzing Tweets 
Would you use an excavator to plant a tomato? 
Another use … 
Some people use a Hadoop cluster for a “data lake” 
• Store all your raw data 
• Cook it on demand 
[Figure: Hadoop ecosystem diagram, including Impala]
http://pixgood.com/hadoop-ecosystem-diagram.html
Pig 
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009 
• Pig Latin - the scripting language 
• Grunt – Shell for executing Pig Commands 
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009 
This is what it would be in Java 
Hive 
You write SQL! Well, almost, it is HiveQL 
SELECT user.* 
FROM user 
WHERE 
user.active = 1; 
[Diagram: SQL Workbench connecting over JDBC]
The first hands on session will focus on this.
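To try the query above without a cluster, the sketch below runs the same SELECT against an in-memory SQLite database. SQLite is only a stand-in here, not Hive: the table layout and the three sample rows are assumptions, and HiveQL differs from SQLite's SQL in other areas.

```python
import sqlite3

# In-memory database used as a local stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER, name TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [(1, "ana", 1), (2, "bo", 0), (3, "cy", 1)],  # hypothetical rows
)

# The query from the slide, unchanged apart from layout.
active_users = conn.execute(
    "SELECT user.* FROM user WHERE user.active = 1"
).fetchall()
conn.close()
```

Only the rows whose `active` column equals 1 come back, which is the same filter the query would apply to a Hive table.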
On AWS, the common workflow for batch processing starts and ends with S3.
[Diagram: a Hive script that reads its input from S3 and writes its results back to S3]
Impala 
• Uses SQL very similar to HiveQL 
• Runs 10-100x faster than Hive 
• Runs in memory, so it does not scale to data sets as large as Hive can handle 
• Great for developing your code on a small data set 
• Can use interactively with Tableau and other BI tools 
• Some batch jobs run faster on Impala than Hive 
What is EMR? 
• Hadoop offered by Amazon 
• EMR = Elastic Map Reduce 
• Amazon does almost all of the work to create a cluster 
Three ways to pay for EMR 
• On Demand - highest price, by the hour, no commitment 
  • m1.small $0.055 per hour 
  • i2.8xlarge $7.09 per hour 
  • (29 different machine options) 
• Reservation - 1 and 3 year terms (No, All, & Partial Upfront) 
• Spot - lowest price, machine can be taken away 
Do I leave my cluster up all the time? 
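A rough way to answer that question is to do the arithmetic. The sketch below uses the m1.small on-demand price quoted above. The cluster size of 10 machines, the 2-hour daily job, and the 30-day month are assumptions for illustration, and any charges beyond the quoted hourly price are ignored.

```python
def monthly_cost(price_per_hour, machines, hours_per_day, days=30):
    """On-demand cost of a cluster that runs hours_per_day on each of `days` days."""
    return price_per_hour * machines * hours_per_day * days


M1_SMALL_PRICE = 0.055  # dollars per hour, from the slide

# Hypothetical 10-machine cluster, left running around the clock.
always_on = monthly_cost(M1_SMALL_PRICE, machines=10, hours_per_day=24)

# Same cluster, started for a 2-hour job each day and then shut down.
daily_job = monthly_cost(M1_SMALL_PRICE, machines=10, hours_per_day=2)

savings = always_on - daily_job
```

Under these assumptions the always-on cluster costs about $396 a month against about $33 for the daily job, so shutting down between runs is where most of the saving comes from.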
Adding machines: Time down, Cost up 
[Chart: run time and cost (in ECU) as machines are added]
What Is Redshift? 
• Columnar database 
• Great for reads 
• Scale by adding machines 
• Two ways to pay 
  • On Demand 
  • Reservation 
• Good for SQL-based ETL too 
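Why a columnar layout is great for reads can be sketched with plain Python structures. This is a toy model, not how Redshift stores data internally: the `rows` sample and its column names are assumptions for illustration.

```python
# Hypothetical order records.
rows = [
    {"order_id": 1, "region": "east", "amount": 20.0},
    {"order_id": 2, "region": "west", "amount": 35.5},
    {"order_id": 3, "region": "east", "amount": 12.5},
]

# Row layout: each record is stored together, so a scan over "amount"
# still has to walk through every whole record.
row_total = sum(row["amount"] for row in rows)

# Column layout: each column is stored on its own, so an aggregate
# over "amount" reads only that one column.
columns = {name: [row[name] for row in rows] for name in rows[0]}
column_total = sum(columns["amount"])
```

Both totals are equal; the difference is how much data has to be read to get there, which matters when a table has many columns and an analytic query touches only a few.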
http://hadoop.apache.org/
Redshift Machine Options (on demand prices) 
Petabyte scale 
Remember: Amazon charges for S3 storage too
Redshift usage pattern 
• Load data to S3 first 
• Use BI tools to send in SQL 
• Amazon Redshift is based on PostgreSQL 
The second hands on session will focus on this. 
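One reason to load data to S3 first is that Redshift's COPY command bulk-loads a table directly from S3. The sketch below only builds the statement text. The table name, bucket path, and role ARN are placeholders, the helper `build_copy_statement` is hypothetical, and the options shown are one possible combination rather than the ones used in the workshop.

```python
def build_copy_statement(table, s3_path, iam_role):
    """Return a Redshift COPY statement that bulk-loads CSV files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV;"
    )


statement = build_copy_statement(
    table="sales",  # hypothetical table
    s3_path="s3://example-bucket/sales/",  # hypothetical bucket and prefix
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # placeholder
)
```

The resulting text can be sent to the cluster from a SQL client like any other statement. String formatting is used here only for readability; it is not a safe way to build SQL from untrusted input.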
[Diagram: SQL Workbench connecting over JDBC]
Should I use Redshift or EMR? 
Redshift for 
  • Structured data 
  • Interactive queries 
  • Speed 
Hadoop for 
  • Data format flexibility 
  • Computation flexibility 
  • Super Big Data 
• Try both 
• Compare costs 
• If it works in Redshift, start there 
Performance comparison (3. Join Query) 
https://amplab.cs.berkeley.edu/benchmark/
Recap 
• Started a Hadoop cluster via the AWS Console (Web UI) 
• Loaded Data 
• Wrote some queries 
• Same for Redshift 
Eventually, you will do this for real and have a script that has value. 
Now what? 
To run your data job you need to … 
• Wait for the new data to arrive 
• Move it to S3 
• Start a cluster 
• Load the data 
• Run your SQL scripts 
• Wait for it to finish 
• Shut down your cluster 
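The steps above can be sketched as a small runner that always shuts the cluster down, even when a step fails. Everything here is hypothetical: the step names, the lambdas that only append to a log, and `run_data_job` itself stand in for real calls to AWS and to your SQL scripts.

```python
def run_data_job(steps, shut_down_cluster):
    """Run each (name, step) pair in order; always call shut_down_cluster at the end."""
    completed = []
    try:
        for name, step in steps:
            step()
            completed.append(name)
    finally:
        # Runs on success and on failure, so it must tolerate a cluster
        # that was never started.
        shut_down_cluster()
    return completed


log = []
steps = [
    ("wait_for_data", lambda: log.append("new data arrived")),
    ("move_to_s3", lambda: log.append("copied to S3")),
    ("start_cluster", lambda: log.append("cluster started")),
    ("load_data", lambda: log.append("data loaded")),
    ("run_sql_scripts", lambda: log.append("SQL scripts submitted")),
    ("wait_for_finish", lambda: log.append("job finished")),
]

completed = run_data_job(
    steps, shut_down_cluster=lambda: log.append("cluster shut down")
)
```

Putting the shutdown in a `finally` block is the design choice that matters: a failed load or a broken script should not leave a cluster running and billing by the hour.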
And hope … 
• The new data is in the right format 
• Assumptions you made during development are still true 
• Someone did not mess up your code with an “easy change” 
• The new data transfers run successfully 
• A table you depend on has been updated correctly 
• The new data has not been truncated by the source 
• The source data has no data quality issues 
Wouldn’t it be great to turn your hopes into tests? 
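As a minimal sketch of turning a few of those hopes into tests, the function below checks a batch of incoming rows for the expected columns, for missing values, and for a suspiciously low row count. The function name `check_new_data`, the thresholds, and the sample rows are assumptions for illustration, not DataKitchen's implementation.

```python
def check_new_data(rows, expected_columns, min_rows):
    """Return a list of failed checks for a batch of rows (each row is a dict)."""
    failures = []
    # Hope: the new data has not been truncated by the source.
    if len(rows) < min_rows:
        failures.append(
            f"possible truncation: {len(rows)} rows, expected at least {min_rows}"
        )
    for index, row in enumerate(rows):
        # Hope: the new data is in the right format.
        if set(row) != set(expected_columns):
            failures.append(f"row {index}: unexpected columns {sorted(row)}")
        # Hope: no data quality issues with the source data.
        elif any(value in (None, "") for value in row.values()):
            failures.append(f"row {index}: missing value")
    return failures


good_batch = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 3.0}]
bad_batch = [{"id": 1, "amount": None}]

good_failures = check_new_data(good_batch, ["id", "amount"], min_rows=2)
bad_failures = check_new_data(bad_batch, ["id", "amount"], min_rows=2)
```

An empty list means the batch passed; a non-empty list can stop the job before the SQL scripts run on bad input.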
DataKitchen: We produce the data 
SQL, tests and the check list go into a Recipe 
Your data are Ingredients 
The results are Servings
DataKitchen brings reality in line with expectations 
[Figure: time split across Prepare Data, Analyze, and Communicate for three cases: Business Customer Expectation, Analyst Reality, and With DataKitchen]
The story of our first Recipe 
With DataKitchen, we got 75% of our time back! 
… and we don’t have to remember to shut down our cluster. 
Remember to shut down your clusters
Thank you! 
Send us an email to receive our newsletter or to give us feedback. 
info@datakitchen.io

How To Use Server-Side Rendering with Nuxt.js
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 

Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift

  • 1. Big Data Infrastructure workshop A hands-on introduction Saturday, December 6, 2014
  • 2. Agenda 08:30 AM Breakfast 09:00 AM Introduction and Strengths of Technologies 10:00 AM Start an EMR Cluster 10:15 AM break + set up query tool 10:30 AM Hadoop hands-on 10:55 AM break 11:10 AM Redshift hands-on 11:40 AM Operationalizing your code 12:00 PM adjourn
  • 3. Background on your presenters
  • 4. DataKitchen Leadership: Chris Bergh (Executive Chef), Gil Benghiat (VP Product), Eric Estabrooks (VP Cloud and Data Services). Software development origins and executive experience delivering enterprise software focused on Marketing and Health Care sectors. Deep Analytic Experience: Spent past decade solving the analytic data preparation problem. New Approach To Data Preparation and Production: focused on the Analysts
  • 5. Analysts And Their Teams Are Spending 60-80% Of Their Time On Data Preparation And Production
  • 6. This creates an expectation gap. [Diagram: Business Customer Expectation vs. Analyst Reality, each split into Analyze, Prepare Data, and Communicate] The business does not think that Analysts are preparing data (Analysts don’t want to prepare data)
  • 7. What Analysts Really Want: An Integrated Data Set Ready For Analysis With: Autonomy & Agility Without: All the Work & Anxiety
  • 8. DataKitchen solves this problem. We are on a mission to prepare data to make analysts successful.
  • 9. Agenda 08:30 AM Breakfast 09:00 AM Introduction and Strengths of Technologies 10:00 AM Start an EMR Cluster 10:15 AM break + set up query tool 10:30 AM Hadoop hands-on 10:55 AM break 11:10 AM Redshift hands-on 11:40 AM Operationalizing your code 12:00 PM adjourn
  • 10. Experience of Audience • Who considers themselves • Analyst • Data scientist • Programmer / Scripter • On the Business side • Who knows SQL – can write a simple select? • Who had an AWS account before today?
  • 12. What Is Apache Hadoop? • Software framework • Large scale processing • Network of commodity hardware • Handles hardware failures http://hadoop.apache.org/
  • 13. What is Hadoop good for? • Problems that are huge (batch), but not hard, and can be run in parallel over immutable data • NOT OLTP (e.g. backend to e-commerce site) • Providing a Map Reduce framework
  • 16. You can write map reduce jobs in your favorite language. Streaming Interface • Lets you specify mappers and reducers • Supports • Java • Python • Ruby • Unix Shell • R • Any executable. Map Reduce “generators” • Result in map reduce jobs • PIG • Hive
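
To make the streaming idea on slide 16 concrete, here is a minimal word-count sketch in Python. It is an illustration, not code from the workshop: the function names `mapper`, `reducer`, and `run_locally` are hypothetical, and the tab-separated `key<TAB>value` record format is the convention commonly used with streaming jobs. The local runner only imitates the sort that happens between the map and reduce phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; the input must already be sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process (hypothetical helper)."""
    return list(reducer(sorted(mapper(lines))))


word_counts = run_locally(["big data big cluster", "Data lake"])
```

On a real cluster the mapper and reducer would typically be two separate executables that read standard input and write standard output; keeping them as plain generator functions here makes the logic easy to try on a laptop first.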
  • 17. Applications that lend themselves to map reduce • Word Count • PDF Generation (NY Times 11,000,000 articles) • Analysis of stock market historical data (ROI and standard deviation) • Geographical Data (Finding intersections, rendering map files) • Log file querying and analysis • Statistical machine translation • Spam detection • Analyzing Tweets
  • 18. Would you use an excavator to plant a tomato?
  • 19. Another use … Some people use a Hadoop cluster for a “data lake” • Store all your raw data • Cook it on demand
  • 21. Pig http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009 • Pig Latin - the scripting language • Grunt – Shell for executing Pig Commands
  • 23. Hive: You write SQL! Well, almost, it is HiveQL. SELECT user.* FROM user WHERE user.active = 1; [Diagram: JDBC, SQL Workbench] The first hands-on session will focus on this.
  • 24. In Amazon, the common workflow for batch processing starts and ends with s3. [Diagram: Hive Script]
  • 25. Impala • Uses SQL very similar to HiveQL • Runs 10-100x faster • Runs in memory so it does not scale up as well • Great for developing your code on a small data set • Can use interactively with Tableau and other BI tools • Some batch jobs run faster on Impala than Hive
  • 26. What is EMR? • Hadoop offered by Amazon • EMR = Elastic Map Reduce • Amazon does almost all of the work to create a cluster
  • 27. Three ways to pay for EMR • On Demand - highest price, by the hour, no commitment • m1.small $0.055 per hour • i2.8xlarge $7.09 per hour • (29 different machine options) • Reservation - 1 and 3 year terms (No, All, & Partial Upfront) • Spot - lowest price, machine can be taken away. Do I leave my cluster up all the time?
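
The closing question on slide 27 can be explored with a little arithmetic. The sketch below uses only the two on-demand prices quoted on the slide (December 2014 figures, which will have changed since). The cluster size of 10 machines, the 2-hour nightly job, and the 30-day month are assumptions chosen for illustration, and the calculation ignores storage, data transfer, and any rounding of partial hours.

```python
# Hourly on-demand prices as quoted on the slide (December 2014).
ON_DEMAND_PRICE_PER_HOUR = {"m1.small": 0.055, "i2.8xlarge": 7.09}

DAYS_PER_MONTH = 30  # assumption: a 30-day month


def cluster_cost(instance_type, machine_count, hours):
    """Cost in dollars of running machine_count machines for the given hours."""
    return ON_DEMAND_PRICE_PER_HOUR[instance_type] * machine_count * hours


# Hypothetical scenario: a 10-machine m1.small cluster.
always_on = cluster_cost("m1.small", 10, 24 * DAYS_PER_MONTH)
nightly_job = cluster_cost("m1.small", 10, 2 * DAYS_PER_MONTH)

print(f"Always on:   ${always_on:,.2f} per month")
print(f"Nightly job: ${nightly_job:,.2f} per month")
```

Under these assumptions the always-on cluster costs twelve times as much as one that is started for a 2-hour job each night, simply because it runs 24 hours a day instead of 2.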
  • 28. Adding machines: Time down, Cost up [Chart: cost in ECU]
  • 29. What Is Redshift? • Columnar database • Great for reads • Scale by adding machines • Two ways to pay • On Demand • Reservation • Good for SQL-based ETL too http://hadoop.apache.org/
  • 30. Redshift Machine Options (on demand prices) • Petabyte scale • Remember: Amazon charges for s3 storage too
  • 31. Redshift usage pattern • Load data to s3 first • Use BI tools to send in SQL • Amazon Redshift is based on PostgreSQL [Diagram: JDBC, SQL Workbench] The second hands-on session will focus on this.
  • 32. Agenda 08:30 AM Breakfast 09:00 AM Introduction and Strengths of Technologies 10:00 AM Start an EMR Cluster 10:15 AM break + set up query tool 10:30 AM Hadoop hands-on 10:55 AM break 11:10 AM Redshift hands-on 11:40 AM Operationalizing your code 12:00 PM adjourn
  • 33. Should I use Redshift or EMR? Redshift for • Structured data • Interactive queries • Speed. Hadoop for • Data format flexibility • Computation flexibility • Super Big Data. Advice • Try both • Compare costs • If it works in Redshift, start there
  • 34. Performance comparison (3. Join Query) https://amplab.cs.berkeley.edu/benchmark/
  • 35. Recap • Started a Hadoop cluster via the AWS Console (Web UI) • Loaded Data • Wrote some queries • Same for Redshift. Eventually, you will do this for real and have a script that has value. Now what?
  • 36. To run your data job you need to … • Wait for the new data to arrive • Move it to s3 • Start a cluster • Load the data • Run your SQL scripts • Wait for it to finish • Shut down your cluster
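
The checklist on slide 36 could be scripted roughly as follows. This is a sketch under stated assumptions, not DataKitchen's implementation: `run_data_job` and every step passed to it are hypothetical stand-ins that only append to a log, where a real job would call AWS. The one design point it shows is the `try`/`finally`, which shuts the cluster down even when a step fails.

```python
def run_data_job(prepare_steps, start_cluster, cluster_steps, shut_down_cluster):
    """Run the checklist in order; the cluster is shut down even on failure."""
    for step in prepare_steps:        # wait for the new data, move it to s3
        step()
    cluster = start_cluster()
    try:
        for step in cluster_steps:    # load the data, run the SQL scripts
            step(cluster)
    finally:
        shut_down_cluster(cluster)    # runs whether or not a step raised


# Hypothetical stand-ins that record what happened instead of calling AWS.
log = []


def start_demo_cluster():
    log.append("start cluster")
    return "demo-cluster"


def failing_sql_step(cluster):
    raise RuntimeError("SQL script failed")


try:
    run_data_job(
        prepare_steps=[lambda: log.append("wait for data"),
                       lambda: log.append("move to s3")],
        start_cluster=start_demo_cluster,
        cluster_steps=[lambda cluster: log.append("load data"), failing_sql_step],
        shut_down_cluster=lambda cluster: log.append(f"shut down {cluster}"),
    )
except RuntimeError:
    log.append("job failed")
```

In this run the SQL step raises, yet the log still records the shutdown before the failure is reported, so a broken script does not leave a cluster running and billing by the hour.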
  • 37. And hope … • The new data is in the right format • Assumptions you made during development are still true • Someone did not mess up your code with an “easy change” • The new data transfers run successfully • A table you depend on has been updated correctly • The new data has not been truncated by the source • No data quality issues with the source data. Wouldn’t it be great to turn your hopes into tests?
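
A few of the hopes on slide 37 translate directly into automated checks. The sketch below is one possible shape for them, assuming rows arrive as Python dictionaries; the function name `check_new_data`, the column names, and the 50% truncation threshold are all hypothetical choices for illustration.

```python
def check_new_data(rows, expected_columns, previous_row_count, min_ratio=0.5):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []

    # Hope: the new data is in the right format.
    for index, row in enumerate(rows):
        if set(row) != set(expected_columns):
            failures.append(f"row {index}: unexpected columns {sorted(row)}")
            break  # one report is enough to flag a format change

    # Hope: the new data has not been truncated by the source.
    if len(rows) < previous_row_count * min_ratio:
        failures.append(
            f"possible truncation: {len(rows)} rows, previous load had {previous_row_count}"
        )

    # Hope: no data quality issues with the source data.
    incomplete = sum(
        1 for row in rows if any(value in (None, "") for value in row.values())
    )
    if incomplete:
        failures.append(f"{incomplete} rows have empty values")

    return failures


good_load = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
bad_load = [{"id": 1, "amount": None}]

good_failures = check_new_data(good_load, ["id", "amount"], previous_row_count=2)
bad_failures = check_new_data(bad_load, ["id", "amount"], previous_row_count=10)
```

Returning messages rather than raising on the first problem lets a job report every failed check at once, which tends to be more useful when a source system changes several things in one release.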
  • 38. DataKitchen: We produce the data. SQL, tests and the check list go into a Recipe. Your data are Ingredients. The results are Servings.
  • 39. DataKitchen brings reality in line with expectations. [Diagram: Business Customer Expectation vs. Analyst Reality vs. With DataKitchen, each split into Analyze, Prepare Data, and Communicate]
  • 40. The story of our first Recipe
  • 41. The story of our first Recipe. With DataKitchen, we got 75% of our time back! … and we don’t have to remember to shut down our cluster.
  • 42. Remember to shut down your clusters
  • 43. Thank you! Send us an email to receive our newsletter or to give us feedback. info@datakitchen.io