SlideShare uma empresa Scribd logo
1 de 46
MapR, Implications for Integration CHUG – August 2011
Outline MapR system overview Map-reduce review MapR architecture Performance Results Map-reduce on MapR Architectural implications Search indexing / deployment EM algorithm for machine learning … and more …
Map-Reduce Shuffle Input Output
Bottlenecks and Issues Read-only files Many copies in I/O path Shuffle based on HTTP Can’t use new technologies Eats file descriptors Spills go to local file space Bad for skewed distribution of sizes
MapR Areas of Development
MapR Improvements Faster file system Fewer copies Multiple NICS No file descriptor or page-buf competition Faster map-reduce Uses distributed file system Direct RPC to receiver Very wide merges
MapR Innovations Volumes Distributed management Data placement Read/write random access file system Allows distributed meta-data Improved scaling Enables NFS access Application-level NIC bonding Transactionally correct snapshots and mirrors
MapR'sContainers Files/directories are sharded into blocks, whichare placed into mini NNs (containers ) on disks ,[object Object]
Directories & files
Data blocks
Replicated on servers
No need to manage directlyContainers are 16-32 GB segments of disk, placed on nodes
Container locations and replication CLDB N1, N2 N1 N3, N2 N1, N2 N2 N1, N3 N3, N2 N3 Container location database (CLDB) keeps track of nodes hosting each container
MapR Scaling Containers represent 16 - 32GB of data ,[object Object]
100M containers =  ~ 2 Exabytes  (a very large cluster)250 bytes DRAM to cache a container ,[object Object]
But not necessary, can page to disk
Typical large 10PB cluster needs 2GBContainer-reports are 100x - 1000x  <  HDFS block-reports ,[object Object]
Increase container size to 64G to serve 4EB cluster
Map/reduce not affected,[object Object]
Terasort on MapR 10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm Elapsed time (mins) Lower is better
HBase on MapR YCSB Random Read with 1 billion 1K records 10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM Recordspersecond Higher is better
Small Files (Apache Hadoop, 10 nodes) Out of box Op:  - create file         - write 100 bytes         - close Notes: - NN not replicated - NN uses 20G DRAM - DN uses  2G  DRAM Tuned Rate (files/sec) # of files (m)
MUCH faster for some operations Same 10 nodes … Create Rate # of files (millions)
What MapR is not Volumes != federation MapR supports > 10,000 volumes all with independent placement and defaults Volumes support snapshots and mirroring NFS != FUSE Checksum and compress at gateway IP fail-over Read/write/update semantics at full speed MapR != maprfs
New Capabilities
NFS mounting models Export to the world NFS gateway runs on selected gateway hosts Local server NFS gateway runs on local host Enables local compression and check summing Export to self NFS gateway runs on all data nodes, mounted from localhost
Export to the world NFS Server NFS Server NFS Server NFS Server NFS Client
Local server Client Application NFS Server Cluster Nodes
Universal export to self Cluster Nodes Cluster Node Task NFS Server
Cluster Node Task NFS Server Cluster Node Task Cluster Node Task NFS Server NFS Server Nodes are identical
Application architecture So now we have a hammer Let’s find us some nails!
Sharded text Indexing Index text to local disk and then copy index to distributed file store Assign documents to shards Map Reducer Clustered index storage Input documents Copy to local disk typically required before index can be loaded Local disk Search Engine Local disk
Shardedtext indexing Mapper assigns document to shard Shard is usually hash of document id Reducer indexes all documents for a shard Indexes created on local disk On success, copy index to DFS On failure, delete local files Must avoid directory collisions  can’t use shard id! Must manage and reclaim local disk space
Conventional data flow Failure of search engine requires another download of the index from clustered storage. Map Failure of a reducer causes garbage to accumulate in the local disk Reducer Clustered index storage Input documents Local disk Search Engine Local disk
Simplified NFS data flows Index to task work directory via NFS Map Reducer Search Engine Input documents Clustered index storage Failure of a reducer is cleaned up by map-reduce framework Search engine reads mirrored index directly.
Simplified NFS data flows Search Engine Mirroring allows exact placement of index data Map Reducer Input documents Search Engine Aribitrary levels of replication also possible Mirrors
How about another one?
K-means Classic E-M based algorithm Given cluster centroids, Assign each data point to nearest centroid Accumulate new centroids Rinse, lather, repeat
K-means, the movie Centroids Assign to Nearest centroid I n p u t Aggregate new centroids
But …
Parallel Stochastic Gradient Descent Model Train sub model I n p u t Average models
VariationalDirichlet Assignment Model Gather sufficient statistics I n p u t Update model
Old tricks, new dogs Mapper Assign point to cluster Emit cluster id, (1, point) Combiner and reducer Sum counts, weighted sum of points Emit cluster id, (n, sum/n) Output to HDFS Read from local disk from distributed cache Read from HDFS to local disk by distributed cache Written by map-reduce
Old tricks, new dogs Mapper Assign point to cluster Emit cluster id, (1, point) Combiner and reducer Sum counts, weighted sum of points Emit cluster id, (n, sum/n) Output to HDFS Read from NFS Written by map-reduce MapR FS
Poor man’s Pregel Mapper Lines in bold can use conventional I/O via NFS while not done:     read and accumulate input models     for each input:        accumulate model     write model    synchronize     reset input format emit summary 37
Click modeling architecture Map-reduce Side-data Now via NFS Feature extraction and down sampling I n p u t Data join Sequential SGD Learning

Mais conteúdo relacionado

Mais procurados

Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the ElephantDataWorks Summit
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 
Oscon data-2011-ted-dunning
Oscon data-2011-ted-dunningOscon data-2011-ted-dunning
Oscon data-2011-ted-dunningTed Dunning
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basicHafizur Rahman
 
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, FacebookМасштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebookyaevents
 
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overviewHdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overviewNitesh Ghosh
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2Tianwei Liu
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APImcsrivas
 
JFall 2011 no sql workshop
JFall 2011 no sql workshopJFall 2011 no sql workshop
JFall 2011 no sql workshopfvanvollenhoven
 
Quantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSQuantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSbigdatagurus_meetup
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and PipesHanborq Inc.
 

Mais procurados (20)

Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the Elephant
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
Oscon data-2011-ted-dunning
Oscon data-2011-ted-dunningOscon data-2011-ted-dunning
Oscon data-2011-ted-dunning
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
RuG Guest Lecture
RuG Guest LectureRuG Guest Lecture
RuG Guest Lecture
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
 
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, FacebookМасштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
 
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overviewHdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
Hdfs, Map Reduce & hadoop 1.0 vs 2.0 overview
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
 
JFall 2011 no sql workshop
JFall 2011 no sql workshopJFall 2011 no sql workshop
JFall 2011 no sql workshop
 
01 hbase
01 hbase01 hbase
01 hbase
 
Quantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSQuantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFS
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
 

Destaque

Prezentarea agentiei Justpixel
Prezentarea agentiei JustpixelPrezentarea agentiei Justpixel
Prezentarea agentiei JustpixelUngureanu Lucian
 
OpenFest 2013 Open Source Hardware (OSHW) made in Bulgaria
OpenFest 2013 Open Source Hardware (OSHW) made in BulgariaOpenFest 2013 Open Source Hardware (OSHW) made in Bulgaria
OpenFest 2013 Open Source Hardware (OSHW) made in BulgariaOlimex Bulgaria
 
Come (e perchè) un'istituzione finanziaria può costruire un ottimo blog azien...
Come (e perchè) un'istituzione finanziaria può costruire un ottimo blog azien...Come (e perchè) un'istituzione finanziaria può costruire un ottimo blog azien...
Come (e perchè) un'istituzione finanziaria può costruire un ottimo blog azien...AdviseOnly
 
Aiguille du Midi en France
Aiguille du Midi  en FranceAiguille du Midi  en France
Aiguille du Midi en FranceBalcon60
 
Verden trenger mer sjømat - langsiktig megatrend - Holberg Fondene
Verden trenger mer sjømat - langsiktig megatrend - Holberg FondeneVerden trenger mer sjømat - langsiktig megatrend - Holberg Fondene
Verden trenger mer sjømat - langsiktig megatrend - Holberg FondeneNordnet Norge
 
Stream Data into the Cloud with Raspberry Pi and Windows 10 IoT Core
Stream Data into the Cloud with Raspberry Pi and Windows 10 IoT CoreStream Data into the Cloud with Raspberry Pi and Windows 10 IoT Core
Stream Data into the Cloud with Raspberry Pi and Windows 10 IoT CoreMike Branstein
 
Peningkatan mutu agregat ringan beton bertulang ringan struktural untuk bangu...
Peningkatan mutu agregat ringan beton bertulang ringan struktural untuk bangu...Peningkatan mutu agregat ringan beton bertulang ringan struktural untuk bangu...
Peningkatan mutu agregat ringan beton bertulang ringan struktural untuk bangu...Krismanto Mahendra
 
Çiçeklerin Dünyası, World of Flowers IV
Çiçeklerin Dünyası, World of Flowers IVÇiçeklerin Dünyası, World of Flowers IV
Çiçeklerin Dünyası, World of Flowers IV***
 
Маркетинг на вдъхновението
Маркетинг на вдъхновениетоМаркетинг на вдъхновението
Маркетинг на вдъхновениетоJustine Toms
 
L298N 碳刷馬達驅動
L298N 碳刷馬達驅動L298N 碳刷馬達驅動
L298N 碳刷馬達驅動Ziyuan Chen
 
Платформа и решения НРЕ для больших данных
Платформа и решения НРЕ для больших данныхПлатформа и решения НРЕ для больших данных
Платформа и решения НРЕ для больших данныхAndrey Karpov
 
Kya aap jantay hain
Kya aap jantay hainKya aap jantay hain
Kya aap jantay hainrubab fatima
 
Dziś już nie ma offline - Paweł Loedl
Dziś już nie ma offline - Paweł LoedlDziś już nie ma offline - Paweł Loedl
Dziś już nie ma offline - Paweł LoedlSchool of New Media
 
مذكرة The future للصف الثالث الابتدائى الترم الثانى
مذكرة The future للصف الثالث الابتدائى الترم الثانىمذكرة The future للصف الثالث الابتدائى الترم الثانى
مذكرة The future للصف الثالث الابتدائى الترم الثانىSalah Abdelsalam
 
Geopolítica do Petróleo UENF - 30 AGO 2016
Geopolítica do Petróleo UENF - 30 AGO 2016Geopolítica do Petróleo UENF - 30 AGO 2016
Geopolítica do Petróleo UENF - 30 AGO 2016Lincoln Weinhardt
 
Syllabus leertraject koninkrijkszaken
Syllabus leertraject koninkrijkszakenSyllabus leertraject koninkrijkszaken
Syllabus leertraject koninkrijkszakenSibrenne Wagenaar
 
Güller,Roses
Güller,RosesGüller,Roses
Güller,Roses***
 

Destaque (20)

Prezentarea agentiei Justpixel
Prezentarea agentiei JustpixelPrezentarea agentiei Justpixel
Prezentarea agentiei Justpixel
 
OpenFest 2013 Open Source Hardware (OSHW) made in Bulgaria
OpenFest 2013 Open Source Hardware (OSHW) made in BulgariaOpenFest 2013 Open Source Hardware (OSHW) made in Bulgaria
OpenFest 2013 Open Source Hardware (OSHW) made in Bulgaria
 
Come (e perchè) un'istituzione finanziaria può costruire un ottimo blog azien...
Come (e perchè) un'istituzione finanziaria può costruire un ottimo blog azien...Come (e perchè) un'istituzione finanziaria può costruire un ottimo blog azien...
Come (e perchè) un'istituzione finanziaria può costruire un ottimo blog azien...
 
Aiguille du Midi en France
Aiguille du Midi  en FranceAiguille du Midi  en France
Aiguille du Midi en France
 
Verden trenger mer sjømat - langsiktig megatrend - Holberg Fondene
Verden trenger mer sjømat - langsiktig megatrend - Holberg FondeneVerden trenger mer sjømat - langsiktig megatrend - Holberg Fondene
Verden trenger mer sjømat - langsiktig megatrend - Holberg Fondene
 
Stream Data into the Cloud with Raspberry Pi and Windows 10 IoT Core
Stream Data into the Cloud with Raspberry Pi and Windows 10 IoT CoreStream Data into the Cloud with Raspberry Pi and Windows 10 IoT Core
Stream Data into the Cloud with Raspberry Pi and Windows 10 IoT Core
 
Peningkatan mutu agregat ringan beton bertulang ringan struktural untuk bangu...
Peningkatan mutu agregat ringan beton bertulang ringan struktural untuk bangu...Peningkatan mutu agregat ringan beton bertulang ringan struktural untuk bangu...
Peningkatan mutu agregat ringan beton bertulang ringan struktural untuk bangu...
 
Çiçeklerin Dünyası, World of Flowers IV
Çiçeklerin Dünyası, World of Flowers IVÇiçeklerin Dünyası, World of Flowers IV
Çiçeklerin Dünyası, World of Flowers IV
 
TITANIC II
TITANIC IITITANIC II
TITANIC II
 
Маркетинг на вдъхновението
Маркетинг на вдъхновениетоМаркетинг на вдъхновението
Маркетинг на вдъхновението
 
Big Data y Salud. Un enfoque orientado a resultados
Big Data y Salud. Un enfoque orientado a resultadosBig Data y Salud. Un enfoque orientado a resultados
Big Data y Salud. Un enfoque orientado a resultados
 
L298N 碳刷馬達驅動
L298N 碳刷馬達驅動L298N 碳刷馬達驅動
L298N 碳刷馬達驅動
 
Платформа и решения НРЕ для больших данных
Платформа и решения НРЕ для больших данныхПлатформа и решения НРЕ для больших данных
Платформа и решения НРЕ для больших данных
 
Kya aap jantay hain
Kya aap jantay hainKya aap jantay hain
Kya aap jantay hain
 
Dziś już nie ma offline - Paweł Loedl
Dziś już nie ma offline - Paweł LoedlDziś już nie ma offline - Paweł Loedl
Dziś już nie ma offline - Paweł Loedl
 
калин 100
калин 100калин 100
калин 100
 
مذكرة The future للصف الثالث الابتدائى الترم الثانى
مذكرة The future للصف الثالث الابتدائى الترم الثانىمذكرة The future للصف الثالث الابتدائى الترم الثانى
مذكرة The future للصف الثالث الابتدائى الترم الثانى
 
Geopolítica do Petróleo UENF - 30 AGO 2016
Geopolítica do Petróleo UENF - 30 AGO 2016Geopolítica do Petróleo UENF - 30 AGO 2016
Geopolítica do Petróleo UENF - 30 AGO 2016
 
Syllabus leertraject koninkrijkszaken
Syllabus leertraject koninkrijkszakenSyllabus leertraject koninkrijkszaken
Syllabus leertraject koninkrijkszaken
 
Güller,Roses
Güller,RosesGüller,Roses
Güller,Roses
 

Semelhante a MapR, Implications for Integration

Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Data mining-2011-09
Data mining-2011-09Data mining-2011-09
Data mining-2011-09Ted Dunning
 
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...Amazon Web Services
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013WANdisco Plc
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Ted Dunning - Whither Hadoop
Ted Dunning - Whither HadoopTed Dunning - Whither Hadoop
Ted Dunning - Whither HadoopEd Kohlwey
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base InstallCloudera, Inc.
 

Semelhante a MapR, Implications for Integration (20)

Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Data mining-2011-09
Data mining-2011-09Data mining-2011-09
Data mining-2011-09
 
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Inven...
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Ted Dunning - Whither Hadoop
Ted Dunning - Whither HadoopTed Dunning - Whither Hadoop
Ted Dunning - Whither Hadoop
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Unit 1
Unit 1Unit 1
Unit 1
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Data Science
Data ScienceData Science
Data Science
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
 

Mais de trihug

TriHUG October: Apache Ranger
TriHUG October: Apache RangerTriHUG October: Apache Ranger
TriHUG October: Apache Rangertrihug
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Productiontrihug
 
TriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache SentryTriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache Sentrytrihug
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Sharktrihug
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
Practical pig
Practical pigPractical pig
Practical pigtrihug
 
Financial services trihug
Financial services trihugFinancial services trihug
Financial services trihugtrihug
 
TriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris ShainTriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris Shaintrihug
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gatestrihug
 
TriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan GatesTriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan Gatestrihug
 

Mais de trihug (11)

TriHUG October: Apache Ranger
TriHUG October: Apache RangerTriHUG October: Apache Ranger
TriHUG October: Apache Ranger
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Production
 
TriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache SentryTriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache Sentry
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
Practical pig
Practical pigPractical pig
Practical pig
 
Financial services trihug
Financial services trihugFinancial services trihug
Financial services trihug
 
TriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris ShainTriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris Shain
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gates
 
TriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan GatesTriHUG November Pig Talk by Alan Gates
TriHUG November Pig Talk by Alan Gates
 

Último

Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

MapR, Implications for Integration

  • 1. MapR, Implications for Integration CHUG – August 2011
  • 2. Outline MapR system overview Map-reduce review MapR architecture Performance Results Map-reduce on MapR Architectural implications Search indexing / deployment EM algorithm for machine learning … and more …
  • 4. Bottlenecks and Issues Read-only files Many copies in I/O path Shuffle based on HTTP Can’t use new technologies Eats file descriptors Spills go to local file space Bad for skewed distribution of sizes
  • 5. MapR Areas of Development
  • 6. MapR Improvements Faster file system Fewer copies Multiple NICS No file descriptor or page-buf competition Faster map-reduce Uses distributed file system Direct RPC to receiver Very wide merges
  • 7. MapR Innovations Volumes Distributed management Data placement Read/write random access file system Allows distributed meta-data Improved scaling Enables NFS access Application-level NIC bonding Transactionally correct snapshots and mirrors
  • 8.
  • 12. No need to manage directlyContainers are 16-32 GB segments of disk, placed on nodes
  • 13. Container locations and replication CLDB N1, N2 N1 N3, N2 N1, N2 N2 N1, N3 N3, N2 N3 Container location database (CLDB) keeps track of nodes hosting each container
  • 14.
  • 15.
  • 16. But not necessary, can page to disk
  • 17.
  • 18. Increase container size to 64G to serve 4EB cluster
  • 19.
  • 20. Terasort on MapR 10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm Elapsed time (mins) Lower is better
  • 21. HBase on MapR YCSB Random Read with 1 billion 1K records 10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM Recordspersecond Higher is better
  • 22. Small Files (Apache Hadoop, 10 nodes) Out of box Op: - create file - write 100 bytes - close Notes: - NN not replicated - NN uses 20G DRAM - DN uses 2G DRAM Tuned Rate (files/sec) # of files (m)
  • 23. MUCH faster for some operations Same 10 nodes … Create Rate # of files (millions)
  • 24. What MapR is not Volumes != federation MapR supports > 10,000 volumes all with independent placement and defaults Volumes support snapshots and mirroring NFS != FUSE Checksum and compress at gateway IP fail-over Read/write/update semantics at full speed MapR != maprfs
  • 26. NFS mounting models Export to the world NFS gateway runs on selected gateway hosts Local server NFS gateway runs on local host Enables local compression and check summing Export to self NFS gateway runs on all data nodes, mounted from localhost
  • 27. Export to the world NFS Server NFS Server NFS Server NFS Server NFS Client
  • 28. Local server Client Application NFS Server Cluster Nodes
  • 29. Universal export to self Cluster Nodes Cluster Node Task NFS Server
  • 30. Cluster Node Task NFS Server Cluster Node Task Cluster Node Task NFS Server NFS Server Nodes are identical
  • 31. Application architecture So now we have a hammer Let’s find us some nails!
  • 32. Sharded text Indexing Index text to local disk and then copy index to distributed file store Assign documents to shards Map Reducer Clustered index storage Input documents Copy to local disk typically required before index can be loaded Local disk Search Engine Local disk
  • 33. Shardedtext indexing Mapper assigns document to shard Shard is usually hash of document id Reducer indexes all documents for a shard Indexes created on local disk On success, copy index to DFS On failure, delete local files Must avoid directory collisions can’t use shard id! Must manage and reclaim local disk space
  • 34. Conventional data flow Failure of search engine requires another download of the index from clustered storage. Map Failure of a reducer causes garbage to accumulate in the local disk Reducer Clustered index storage Input documents Local disk Search Engine Local disk
  • 35. Simplified NFS data flows Index to task work directory via NFS Map Reducer Search Engine Input documents Clustered index storage Failure of a reducer is cleaned up by map-reduce framework Search engine reads mirrored index directly.
  • 36. Simplified NFS data flows Search Engine Mirroring allows exact placement of index data Map Reducer Input documents Search Engine Aribitrary levels of replication also possible Mirrors
  • 38. K-means Classic E-M based algorithm Given cluster centroids, Assign each data point to nearest centroid Accumulate new centroids Rinse, lather, repeat
  • 39. K-means, the movie Centroids Assign to Nearest centroid I n p u t Aggregate new centroids
  • 41. Parallel Stochastic Gradient Descent Model Train sub model I n p u t Average models
  • 42. VariationalDirichlet Assignment Model Gather sufficient statistics I n p u t Update model
  • 43. Old tricks, new dogs Mapper Assign point to cluster Emit cluster id, (1, point) Combiner and reducer Sum counts, weighted sum of points Emit cluster id, (n, sum/n) Output to HDFS Read from local disk from distributed cache Read from HDFS to local disk by distributed cache Written by map-reduce
  • 44. Old tricks, new dogs Mapper Assign point to cluster Emit cluster id, (1, point) Combiner and reducer Sum counts, weighted sum of points Emit cluster id, (n, sum/n) Output to HDFS Read from NFS Written by map-reduce MapR FS
  • 45. Poor man’s Pregel Mapper Lines in bold can use conventional I/O via NFS while not done: read and accumulate input models for each input: accumulate model write model synchronize reset input format emit summary 37
  • 46. Click modeling architecture Map-reduce Side-data Now via NFS Feature extraction and down sampling I n p u t Data join Sequential SGD Learning
  • 47. Click modeling architecture Map-reduce Map-reduce Side-data Map-reduce cooperates with NFS Sequential SGD Learning Feature extraction and down sampling Sequential SGD Learning I n p u t Data join Sequential SGD Learning Sequential SGD Learning
  • 49. Hybrid model flow Map-reduce Map-reduce Feature extraction and down sampling Down stream modeling Deployed Model ?? SVD (PageRank) (spectral)
  • 50.
  • 51. Hybrid model flow Feature extraction and down sampling Down stream modeling Deployed Model Sequential Map-reduce SVD (PageRank) (spectral)
  • 53. Trivial visualization interface Map-reduce output is visible via NFS Legacy visualization just works $ R > x <- read.csv(“/mapr/my.cluster/home/ted/data/foo.out”) > plot(error ~ t, x) > q(save=‘n’)
  • 54. Conclusions We used to know all this Tab completion used to work 5 years of work-arounds have clouded our memories We just have to remember the future