The Elephant in the Room
A DBA’s Guide to Hadoop & Big Data
Purpose
Rosetta Stone presentation
High level overview of Hadoop & Big Data
NOT a deep dive
NOT a demo session
Mostly theory & vocabulary
Where to learn more
About Me
Manage DBAs for a financial services company
Former Data Architect, DBA, developer
Linchpin People TeamMate
AtlantaMDF Chapter Leader
Infrequent blogger: http://codegumbo.com
About You
Assume that
● mostly developers
● SQL experience
● exposure to database admin &
architecture
● little to no experience with Big Data
“Big” Data
Big Data is like teenage sex...
Everyone talks about it,
Nobody really knows how to do it,
Everyone thinks everyone else is doing it,
So everyone claims they are doing it…
-Dan Ariely
The Four V’s of Big Data
Volume - data is too big to scale up
Velocity - decision window is small
Variety - multiple formats challenge integration
Variability - same data, different interpretations
http://goo.gl/6icouZ
RDBMS versus Big Data
RDBMS
Primarily Scale-Up
Strong Typing
Normalization
Default Mutable
Mature
Big Data
Primarily Scale-Out
Schemaless
Default Immutable
Evolving
Big Data Use Cases
Massive Size
PB of info
Data Warehouse
Large clusters
High Cost
Complex Analytics
Schemaless
Investigational
Single-node
Low Cost
Foundations
“Gentlemen, this is a football…”
- Vince Lombardi
Hadoop Ecosystem (Hortonworks)
Hortonworks
Hadoop
Scalable, distributed processing framework
open-source
Hortonworks*
Cloudera
proprietary components
Facebook
Yahoo
HDFS
Hadoop Distributed File System
Inspired by the Google File System (2002-2003)
Cluster storage of large files across servers
Yahoo - 10,000 core Hadoop cluster(s)
Facebook - 100 PB+ (June, 2012)
http://goo.gl/SpSN
HDFS
File permissions and authentication.
Rack aware
fsck: find missing files or blocks.
Scheduled Rebalancing
Redundancy & Replication
Built around MapReduce
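To make "Redundancy & Replication" and "Rack aware" concrete, here is a minimal Python sketch of how a large file might be cut into blocks and how replicas might be spread across racks. The 128 MB block size and replication factor of 3 are assumed values (common defaults, not stated on the slide), the rack and node names are invented, and place_replicas is a simplified, hypothetical stand-in for a real placement policy.

```python
import math

BLOCK_SIZE_MB = 128       # assumption: a commonly used HDFS block size
REPLICATION_FACTOR = 3    # assumption: a commonly used replication factor


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed to hold a file of the given size."""
    return max(1, math.ceil(file_size_mb / block_size_mb))


def raw_storage_mb(file_size_mb: float, replication: int = REPLICATION_FACTOR) -> float:
    """Raw cluster storage consumed once every block is replicated."""
    return file_size_mb * replication


def place_replicas(block_id: int, racks: dict) -> list:
    """Simplified rack-aware placement: one replica on a 'local' rack and
    two replicas on different nodes of another rack, so losing a single
    rack never loses every copy of the block."""
    rack_names = sorted(racks)
    if len(rack_names) < 2:
        raise ValueError("need at least two racks")
    local = rack_names[block_id % len(rack_names)]
    remote = rack_names[(block_id + 1) % len(rack_names)]
    remote_nodes = racks[remote]
    if len(remote_nodes) < 2:
        raise ValueError("remote rack needs at least two nodes")
    first = racks[local][block_id % len(racks[local])]
    i = block_id % len(remote_nodes)
    return [first, remote_nodes[i], remote_nodes[(i + 1) % len(remote_nodes)]]


if __name__ == "__main__":
    # Hypothetical two-rack cluster, invented for illustration.
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    size_mb = 1000
    print(block_count(size_mb), "blocks,", raw_storage_mb(size_mb), "MB raw")
    for block in range(block_count(size_mb)):
        print(block, place_replicas(block, cluster))
```

The point for a DBA is the trade: usable capacity is roughly raw capacity divided by the replication factor, in exchange for surviving the loss of a node or a rack.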
MapReduce
“Developed” by Google; paper published in 2004 (patent granted in 2010)
Map - filtering and sorting
Reduce - summarization
Inherently distributed
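The Map and Reduce phases can be sketched in a few lines of single-process Python. This is only an illustration of the idea, not Hadoop's API: the trade records are invented, and the explicit shuffle step (grouping values by key between the two phases) is added here to make the hand-off visible.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical input records: (symbol, volume) pairs, invented for illustration.
TRADES = [("IBM", 100), ("MSFT", 250), ("IBM", 50), ("ORCL", 75), ("MSFT", 25)]


def map_phase(records: Iterable[tuple[str, int]]) -> Iterator[tuple[str, int]]:
    """Map: filter out unwanted records and emit (key, value) pairs."""
    for symbol, volume in records:
        if volume > 0:  # the filtering step
            yield symbol, volume


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: collect all values for each key, with keys in sorted order."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: summarize each key's values (here, a sum)."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records: Iterable[tuple[str, int]]) -> dict[str, int]:
    """Run map, shuffle, and reduce back to back."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(run_job(TRADES))  # {'IBM': 150, 'MSFT': 275, 'ORCL': 75}
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be spread across many machines, which is what "inherently distributed" refers to.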
Hive
HiveQL - SQL-like syntax
DDL scripts define tables
Query transformed into MapReduce jobs
Performance improves as the cluster scales out
Stinger initiative - Microsoft/Hortonworks
Hive
create external table price_data (
    stock_exchange string,
    symbol string,
    trade_date string,
    open float,
    high float,
    low float,
    close float,
    volume int,
    adj_close float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/hue/nyse/nyse_prices';

select * from price_data where symbol = 'IBM';
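An external table like this applies its column definitions when the file is read rather than when it is written. The Python sketch below mimics that idea for the price_data columns: raw comma-delimited text is typed on the way in, then filtered by symbol. The sample rows and the helper names (read_table, select_symbol) are invented for illustration and are not part of Hive.

```python
import csv
import io
from typing import Iterator, List, NamedTuple


class PriceRow(NamedTuple):
    """Column names and types copied from the price_data DDL."""
    stock_exchange: str
    symbol: str
    trade_date: str
    open: float
    high: float
    low: float
    close: float
    volume: int
    adj_close: float


# Hypothetical file contents; the numbers are made up.
SAMPLE = "\n".join([
    "NYSE,IBM,2010-01-04,10.0,12.0,9.0,11.0,1000,11.0",
    "NYSE,GE,2010-01-04,20.0,22.0,19.0,21.0,2000,21.0",
    "NYSE,IBM,2010-01-05,11.0,13.0,10.0,12.0,1500,12.0",
])


def read_table(text: str) -> Iterator[PriceRow]:
    """Apply the schema while reading: the stored file stays plain text."""
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        yield PriceRow(
            fields[0], fields[1], fields[2],
            float(fields[3]), float(fields[4]), float(fields[5]),
            float(fields[6]), int(fields[7]), float(fields[8]),
        )


def select_symbol(text: str, symbol: str) -> List[PriceRow]:
    """Rough equivalent of: select * from price_data where symbol = ..."""
    return [row for row in read_table(text) if row.symbol == symbol]


if __name__ == "__main__":
    for row in select_symbol(SAMPLE, "IBM"):
        print(row)
```

In Hive the same filter would be compiled into MapReduce jobs and run across the cluster; the sketch only shows the schema-on-read behaviour.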
HCatalog
Tight integration with Hive, but supports all
Hadoop data access protocols
Define relational view into data (DDL)
“Tables” can be reused by Hive, Pig, Storm...
Tutorial
Pig
Data abstraction language; Yahoo (2006)
Based on Java; supports Python & Ruby
Procedural (SQL is declarative)
Allows for ETL
Lazy evaluation
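Lazy evaluation means that each statement only adds a step to a plan, and nothing runs until output is requested. The small Python class below is a hypothetical illustration of that behaviour (LazyRelation and its methods are invented names, not Pig's API): foreach and filter record work, and store finally executes it.

```python
from typing import Any, Callable, Iterable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyRelation:
    """Records transformation steps and runs them only when store() is called."""

    def __init__(self, source: Iterable[Any], steps: Tuple[Step, ...] = ()) -> None:
        self._source = source
        self._steps = tuple(steps)

    def foreach(self, fn: Callable[[Any], Any]) -> "LazyRelation":
        """Plan a per-row transformation; nothing is executed yet."""
        return LazyRelation(self._source, self._steps + (("foreach", fn),))

    def filter(self, predicate: Callable[[Any], Any]) -> "LazyRelation":
        """Plan a row filter; nothing is executed yet."""
        return LazyRelation(self._source, self._steps + (("filter", predicate),))

    def describe_plan(self) -> List[str]:
        """List the planned step kinds in order."""
        return [kind for kind, _ in self._steps]

    def store(self) -> List[Any]:
        """Execute the whole plan and return the resulting rows."""
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "foreach" else filter(fn, rows)
        return list(rows)


if __name__ == "__main__":
    calls = []

    def scrub(value: str) -> str:
        calls.append(value)  # track when the work actually happens
        return value.strip().lower()

    relation = LazyRelation([" IBM ", "msft", "  "]).foreach(scrub).filter(bool)
    print(relation.describe_plan(), "calls so far:", len(calls))
    print(relation.store(), "calls after store:", len(calls))
```

Deferring execution lets the engine see the whole script before running it, which is what makes it possible to plan the work as a set of jobs.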
Pig
ETL service; useful as “duct tape”
Typical scenario:
Load data into HDFS
Use Pig to scrub data, and
Pump to another “db” (e.g., MongoDB)
Web service reads from destination
Hadoop                  SQL Server
HDFS                    Windows Cluster, Database
MapReduce               Query Optimizer
Master Web Interface    SQL Server Management Studio
Hive                    SQL
HCatalog                Views
Pig                     PowerShell, SSIS
Big Data Administration
The possession of facts is knowledge, the use of them is wisdom.
– Thomas Jefferson
Big Data Use Cases
Massive Size
PB of info
Data Warehouse
Large clusters
High Cost
Complex Analytics
Schemaless
Investigational
Single-node
Low Cost
[Figure: performance vs. application growth, shown for RDBMS and for Big Data]
Scale-Up Costs (SQL Server)
Single Server
Maximum RAM
SAN
Licenses
Windows
SQL Server
Microsoft Support
Personnel
Developers
DBA
SAN Admin
Network Admin
Facilities
Minimum Footprint
Scale-Out Costs (Hortonworks HDP)
Multiple Servers
Commodity
Licenses
Windows ($$$)
Linux ($)
HDP Support
Personnel
Developer
HDP Admin
Network Admin
Facilities
Power
Space
Air
Performance Tuning
[Figure: system vs. code performance tuning, shown for RDBMS and for Hadoop]
Performance Tuning Tips
Performance Architecture
Nathan Marz - Twitter, Storm
Lambda Architecture
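The Lambda Architecture pairs a slow, complete batch view with a fast, incremental speed view and merges the two at query time. The Python sketch below shows that merge for a simple per-key counter; the class and method names are invented for illustration, and a real deployment would use separate systems for each layer rather than one in-memory object.

```python
from collections import Counter
from typing import List


class LambdaCounter:
    """Toy per-key event counter with batch, speed, and serving roles."""

    def __init__(self) -> None:
        self.master: List[str] = []       # append-only master dataset
        self.batch_view: Counter = Counter()   # recomputed from scratch
        self.speed_view: Counter = Counter()   # covers events since last batch

    def ingest(self, key: str) -> None:
        """New data goes to the master dataset and to the speed layer."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self) -> None:
        """Batch layer: rebuild the view from all data, then reset the
        speed view, since the batch view now covers everything."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key: str) -> int:
        """Serving layer: merge the batch and speed views."""
        return self.batch_view[key] + self.speed_view[key]


if __name__ == "__main__":
    counter = LambdaCounter()
    for event in ["a", "b", "a"]:
        counter.ingest(event)
    print(counter.query("a"))  # answered from the speed view only
    counter.run_batch()
    counter.ingest("a")
    print(counter.query("a"))  # batch view plus one recent event
```

The batch view can always be rebuilt from the immutable master data, so a bug in the speed layer only affects results until the next batch run.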
Getting Started (Massive Size)
1. Lab Environment (Virtualized)
2. Set up OS (Windows or Linux)
3. Download (MSI or RPM)
4. Deploy Prereqs (Python, Java, C++)
5. Set up Master Node(s)
6. Set up Data Node(s)
Windows Installation Tutorial
Big Data Use Cases
Massive Size
PB of info
Data Warehouse
Large clusters
High Cost
Complex Analytics
Schemaless
Investigational
Single-node
Low Cost
Word Count
Problem: count the number of times a word
appears in a specific record.
e.g. “Lorem ipsum dolor sit amet, consectetur
adipiscing elit.”...
Word Count
SQL Server
Create UDF to
parse strings
Hadoop
Pig script to parse
strings
Word Count - SQL Server
-- Counts non-overlapping occurrences of @TargetWord in @SourceString.
CREATE FUNCTION WordRepeatedNumTimes
    (@SourceString varchar(max), @TargetWord varchar(8000))
RETURNS int
AS
BEGIN
    DECLARE @NumTimesRepeated int
           ,@CurrentStringPosition int
           ,@LengthOfString int
           ,@PatternStartsAtPosition int
           ,@LengthOfTargetWord int
           ,@NewSourceString varchar(max)

    SET @LengthOfTargetWord = LEN(@TargetWord)
    SET @LengthOfString = LEN(@SourceString)
    SET @NumTimesRepeated = 0
    SET @CurrentStringPosition = 0
    SET @PatternStartsAtPosition = 0
    SET @NewSourceString = @SourceString

    -- Guard: an empty (or all-space) target would otherwise loop forever.
    IF @LengthOfTargetWord = 0
        RETURN 0

    -- Keep searching while the remaining text could still hold the target.
    WHILE LEN(@NewSourceString) >= @LengthOfTargetWord
    BEGIN
        SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
        IF @PatternStartsAtPosition <> 0
        BEGIN
            SET @NumTimesRepeated = @NumTimesRepeated + 1
            SET @CurrentStringPosition = @CurrentStringPosition +
                @PatternStartsAtPosition + @LengthOfTargetWord
            -- Drop everything up to and including this match.
            SET @NewSourceString = SUBSTRING(@NewSourceString,
                @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
        END
        ELSE
        BEGIN
            -- No further matches: empty the string to end the loop.
            SET @NewSourceString = ''
        END
    END
    RETURN @NumTimesRepeated
END
Word Count (Hadoop)
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
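For readers who do not know Pig, the same pipeline can be traced relation by relation in plain Python. This is a single-machine sketch, not what Pig executes: the tokenize helper only splits on whitespace and commas as a rough stand-in for TOKENIZE, and the second sample line is invented for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def tokenize(line: str) -> List[str]:
    """Rough stand-in for Pig's TOKENIZE: split on whitespace and commas."""
    return [token for token in re.split(r"[\s,]+", line) if token]


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    a = list(lines)                                        # a = load ...
    b = [word for line in a for word in tokenize(line)]    # b = foreach ... flatten(TOKENIZE(...))
    c: Dict[str, List[str]] = {}                           # c = group b by word
    for word in b:
        c.setdefault(word, []).append(word)
    d = {word: len(bag) for word, bag in c.items()}        # d = foreach c generate COUNT(b), group
    return d                                               # store d ...


if __name__ == "__main__":
    sample = [
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
        "lorem ipsum",  # hypothetical second record
    ]
    counts = word_count(sample)
    # Sanity check against the standard library's own counter.
    assert counts == Counter(w for line in sample for w in tokenize(line))
    for word, count in sorted(counts.items()):
        print(count, word)
```

As in the Pig script, matching is case-sensitive and punctuation stays attached to a word unless the tokenizer strips it, so "Lorem" and "lorem" are counted separately.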
Getting Started (Complex Analysis)
Either:
1. Lab Environment (Virtualized)
2. Install Hortonworks Sandbox
Or:
1. Set up Azure account
2. HDInsight
Theoretically, can scale to PB, but
no idea what that will cost you.
Note that the interface highlights
Hive (with Stinger); Pig commands
are run through PowerShell
In Conclusion
Lots of vocabulary
HDFS, Pig, Hive, MapReduce
Map to SQL Server (RDBMS) vocabulary
Different Use Cases
Massive Data
Complex Analysis
Questions & Feedback
Contact Me
Stuart R. Ainsworth
Twitter: @codegumbo
Email: stuart@codegumbo.com
SpeakerRate: http://spkr8.com/t/33521
Big Data - Dangerous
http://www.thefacehawk.com/
