Intro to Big Data
Askhat Murzabayev
Explicit attempt at self-promotion
• 23 years old
• Suleyman Demirel University, BSc in CS, 2012
• Chocomart.kz - Apr 2013 - present
• Product Manager at Twitter, Dec 2012 - May 2013 (Twitter API & Android app)
• SE at Twitter, Sep 2011 - Dec 2012 (Search, Relevance and Machine Learning dept.)
• Sold diploma thesis (CV algorithm) to Microsoft; used in Bing Maps
• Sold an image processing algorithm (better pattern recognition of objects) to Microsoft Research
• Scalable Machine Learning algorithms are my passion
Numbers
• 1 zettabyte = 1,000,000 petabytes
• 2006 - 0.18 zettabytes
• 2011 - 1.8 zettabytes
• 2012 - 2.8 zettabytes (3% analyzed)
• Estimate: 2020 - 40 zettabytes
Numbers Everyone Should Know
• L1 cache reference: 0.5 ns
• Branch mispredict: 5 ns
• L2 cache reference: 7 ns
• Mutex lock/unlock: 100 ns
• Main memory reference: 100 ns
• Compress 1K bytes with Zippy: 10,000 ns
• Send 2K bytes over 1 Gbps network: 20,000 ns
• Read 1 MB sequentially from memory: 0.25 ms
Numbers Everyone Should Know, part 2
• Round trip within same datacenter: 0.5 ms
• Disk seek: 10 ms
• Read 1 MB sequentially from network: 10 ms
• Read 1 MB sequentially from disk: 30 ms
• Send packet CA -> Netherlands -> CA: 150 ms
• Send package via Kazpost: everlasting
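Scaling the per-megabyte rates above to 1 GB gives roughly 0.25 s from memory, 10 s over a 1 Gbps network, and 30 s from disk - the ordering summarized in the conclusion that follows.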
Conclusion
• time(CPU) < time(RAM) < time(Disk) < time(Network)
• amount(CPU) < amount(RAM) <<< amount(Disk) < amount(Network)
Problem statement
• Tons of data
• F*cking tons of data
• We need to process it
• Process it fast
• Idea is to “parallelize” processing of data
The “Joys” of Real Hardware (typical counts per year for a large cluster)
• ~0.5 overheating events (power down most machines in <5 mins, ~1-2 days to recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
• ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
• ~1 network rewiring (rolling ~5% of machines down over a 2-day span)
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for DNS
• ~1000 individual machine failures
• ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky machines, etc.
Problem statement (2)
• A lot of data
• Fast processing
• Reliable
• “Cheap”
• Scale
• Shouldn’t require much manual work
• Should work with many programming languages and platforms
• Google File System (GFS)
  • Distributed filesystem
  • Fault tolerant
• MapReduce
  • Distributed processing framework
Apache Hadoop
• “Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.”
Ecosystem
• Apache Hadoop
  • Commons
  • HDFS (Hadoop Distributed FileSystem)
  • MapReduce (v1, v2)
• Apache HBase
• Apache Pig
• Apache Hive
• Apache ZooKeeper
• Apache Oozie
• Apache Sqoop
Example
awk processing
MapReduce
Input to the Map function - (byte offset, raw record) pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

Output from the Map function - (year, temperature) pairs, which become the input for the Reduce function:
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)

After shuffle and sort, the values are grouped by year for the Reduce function:
(1949, [111, 78])
(1950, [0, 22, −11])

Output from the Reduce function - the maximum temperature per year:
(1949, 111)
(1950, 22)
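A minimal sketch of the Mapper and Reducer that would produce the pairs above (new MapReduce API; the class names match the MaxTemperatureReducer referenced on the combiner slide, while the substring offsets are assumptions about the record layout and quality checks are omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Shown together for brevity; each class would normally live in its own file.
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);                    // e.g. "1950"
    int airTemp = Integer.parseInt(line.substring(87, 92));  // e.g. "+0022" -> 22 (Java 7+ accepts a leading '+')
    context.write(new Text(year), new IntWritable(airTemp));
  }
}

class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int max = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      max = Math.max(max, value.get());  // keep the largest temperature seen for this year
    }
    context.write(key, new IntWritable(max));
  }
}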
Data locality optimization
• HDFS block size: 64 MB (default)
MapReduce dataflow
Combiner Functions
• map1 output: (1950, 0), (1950, 20), (1950, 10)
• map2 output: (1950, 25), (1950, 15)
• Without a combiner, reduce input: (1950, [0, 20, 10, 25, 15]); output: (1950, 25)
• With the max-temperature reducer also run as a combiner on each map’s output, reduce would instead receive (1950, [20, 25]) and still produce (1950, 25)
• job.setCombinerClass(MaxTemperatureReducer.class);
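A minimal driver sketch that wires this together (MaxTemperatureDriver and the paths taken from args are assumed names for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperatureDriver.class);
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);  // reducer doubles as combiner
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as the combiner is only correct because taking a maximum can be applied repeatedly without changing the result; an average, for example, could not be combined this way.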
HDFS
Design and Concepts
The Design of HDFS
• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
HDFS is not a good fit for:
• Low-latency data access (use HBase instead)
• Lots of small files
• Multiple writers, arbitrary file modifications
HDFS Concepts: Blocks
• Block size on a “normal” filesystem: 512 bytes
• Block size in HDFS: 64 MB
• A file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage
Why is the block size so large?
• Disk seek time: 10 ms
• Transfer rate: 100 MB/s
• Goal is to make the seek time 1% of the transfer time
• We need a block size of around 100 MB
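Working the arithmetic: if a 10 ms seek is to be 1% of the transfer time, the transfer must last about 1 s, and at 100 MB/s that moves ~100 MB - hence blocks in the tens-to-hundreds of megabytes rather than kilobytes.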
Why blocks?
• A file can be larger than any single disk in the network
• Making the unit of abstraction a block rather than a file simplifies the storage subsystem
• Simplifies the storage subsystem (blocks have a fixed size, so it is easy to calculate how many fit on a disk)
• Eliminates per-block metadata (permissions, creation time, owner, etc. don’t need to be stored with each block)
Namenodes and Datanodes
• Namenode = master
  • Manages the filesystem namespace (filesystem tree, metadata for dirs and files)
  • Namespace image and edit log are stored persistently on disk
  • Keeps track of which datanodes hold the blocks of a given file (held in RAM)
• Datanodes = workers (slaves)
  • Store and retrieve blocks when needed
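A small sketch of how a client can ask the namenode for this block-to-datanode mapping through the FileSystem API (the filesystem URI and file path are made-up examples):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode1/"), new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
    // The offsets, lengths and hostnames below come from the namenode's in-memory block map
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset " + block.getOffset()
          + " length " + block.getLength()
          + " hosts " + String.join(",", block.getHosts()));
    }
  }
}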
Troubles
• If namenode fails - God save us, it hurts…
Solutions
• Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems
• Secondary namenode: main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large
HDFS Federation (since 2.x)
• Allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace, e.g.:
  • /user
  • /share
• Each namenode manages a namespace volume, made up of the metadata for its namespace plus a block pool containing all the blocks for the files in that namespace
• Namespace volumes are independent of each other
• Block pool storage is not partitioned: datanodes register with every namenode in the cluster and store blocks from multiple block pools
HDFS High Availability (since 2.x)
• The namenode is still a SPOF (Single Point of Failure)
  • If it fails, you cannot run MR jobs or read/write/list files
• Recovery algorithm (could take ~30 minutes):
  • load the namespace image into memory,
  • replay the edit log, and
  • receive enough block reports from the datanodes to leave safe mode
HDFS HA
• Switching namenodes could take 1-2 minutes
• The namenodes must use highly available shared storage to share the edit log
• Datanodes must send block reports to both namenodes, because the block mappings are stored in a namenode’s memory and not on disk
• Clients must be configured to handle namenode failover, using a mechanism that is transparent to users
Reading data
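A minimal client-side sketch of reading a file from HDFS with the FileSystem API (the filesystem URI and path are made-up examples): the client asks the namenode for block locations and then streams the data from the datanodes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode1/"), new Configuration());
    // open() gets block locations from the namenode; bytes are then read from the datanodes
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // copy the file contents to stdout
    }
  }
}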
Network distance in Hadoop
Writing data
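A matching sketch of writing a file to HDFS (URI and path again made up): create() asks the namenode to add the file, and the data is then written out through a pipeline of datanodes holding the replicas.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode1/"), new Configuration());
    // create() registers the new file with the namenode; writes are pipelined to datanodes
    try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
      out.writeBytes("hello hdfs\n");
    }
  }
}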
Moving large datasets to HDFS
• Apache Flume
  • Moves large quantities of streaming data into HDFS, e.g. collecting log data from one system - a bank of web servers - and aggregating it in HDFS for later analysis
  • Supports tail, syslog, and Apache log4j sources
• Apache Sqoop
  • Designed for performing bulk imports of data into HDFS from structured data stores, such as relational databases
  • An example Sqoop use case: an organization runs a nightly Sqoop import to load the day’s data from a production database into a Hive data warehouse for analysis
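An illustrative invocation for such a nightly import (the JDBC URL, table name, and target directory are placeholders, not from the talk):
% sqoop import --connect jdbc:mysql://db.example.com/sales --table orders --target-dir /warehouse/orders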
Parallel Copying with distcp
• % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
• Creates a foo directory inside bar on namenode2
• Runs as map-only jobs (no reducers); pass the -m option to set the number of map tasks
• % hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo
• % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar (webhdfs is useful when the two clusters run different Hadoop versions)
Balancer
• Only one balancer can run against a cluster at a time
• Utilization is usage over total capacity
• The balancer moves blocks until the utilization of every datanode differs from the utilization of the cluster by no more than the threshold value
• Calling the balancer (the threshold argument is optional; the default is 10%):
  % start-balancer.sh -threshold 10
Hadoop Archives (HAR)
• HDFS stores small files inefficiently: the namenode holds metadata for every file and block in memory, regardless of file size
• Note: small files do not take up any more disk space than is required to store their raw contents
  • A 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB
• The archiver tool runs as a MapReduce job
• A HAR is a directory, not a single file
• % hadoop archive -archiveName files.har /my/files /my
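Once created, the archive behaves as a filesystem of its own via the har:// scheme; for the example above, the archived files could be listed with:
% hadoop fs -ls har:///my/files.har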
Limitations of HAR
• No compression
• Immutable - to add or remove files, the archive must be re-created
• MapReduce input splitting is still inefficient - a job still processes each original small file separately
Thanks
Questions?
