SlideShare uma empresa Scribd logo
1 de 55
MapR, Architecture, Philosophy and Applications NY HUG – October 2011
Outline Architecture (MapR) Philosophy Architectural (Machine learning)
Map-Reduce, the Original Mission Shuffle Input Output
Bottlenecks and Issues Read-only files Many copies in I/O path Shuffle based on HTTP Can’t use new technologies Eats file descriptors Spills go to local file space Bad for skewed distribution of sizes
MapR Areas of Development
MapR Improvements Faster file system Fewer copies Multiple NICS No file descriptor or page-buf competition Faster map-reduce Uses distributed file system Direct RPC to receiver Very wide merges
MapR Innovations Volumes Distributed management Data placement Read/write random access file system Allows distributed meta-data Improved scaling Enables NFS access Application-level NIC bonding Transactionally correct snapshots and mirrors
MapR'sContainers Files/directories are sharded into blocks, whichare placed into mini NNs (containers ) on disks ,[object Object]
Directories & files
Data blocks
Replicated on servers
No need to manage directlyContainers are 16-32 GB segments of disk, placed on nodes
MapR'sContainers ,[object Object]
Updates are transactional
Failures are handled by rearranging replication,[object Object]
Directories & files
Data blocks
Replicated on servers
No need to manage directlyContainers are 16-32 GB segments of disk, placed on nodes
Container locations and replication CLDB N1, N2 N1 N3, N2 N1, N2 N2 N1, N3 N3, N2 N3 Container location database (CLDB) keeps track of nodes hosting each container and replication chain order
MapR Scaling Containers represent 16 - 32GB of data ,[object Object]
100M containers =  ~ 2 Exabytes  (a very large cluster)250 bytes DRAM to cache a container ,[object Object]
But not necessary, can page to disk
Typical large 10PB cluster needs 2GBContainer-reports are 100x - 1000x  <  HDFS block-reports ,[object Object]
Increase container size to 64G to serve 4EB cluster
Map/reduce not affected,[object Object]
Terasort on MapR 10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm Elapsed time (mins) Lower is better
HBase on MapR YCSB Random Read with 1 billion 1K records 10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM Recordspersecond Higher is better
Small Files (Apache Hadoop, 10 nodes) Out of box Op:  - create file         - write 100 bytes         - close Notes: - NN not replicated - NN uses 20G DRAM - DN uses  2G  DRAM Tuned Rate (files/sec) # of files (m)
MUCH faster for some operations Same 10 nodes … Create Rate # of files (millions)
What MapR is not Volumes != federation MapR supports > 10,000 volumes all with independent placement and defaults Volumes support snapshots and mirroring NFS != FUSE Checksum and compress at gateway IP fail-over Read/write/update semantics at full speed MapR != maprfs
Philosophy
Physics of startup companies
For startups History is always small The future is huge Must adopt new technology to survive Compatibility is not as important In fact, incompatibility is assumed
Physics of large companies Absolute growth still very large Startup phase
For large businesses Present state is always large Relative growth is much smaller Absolute growth rate can be very large Must adopt new technology to survive Cautiously! But must integrate technology with legacy Compatibility is crucial
The startup technology picture No compatibility requirement Old computers and software Expected hardware and software growth Current computers and software
The large enterprise picture Must work together ? Proof of concept Hadoop cluster Current hardware and software Long-term Hadoop cluster
What does this mean? Hadoop is very, very good at streaming through things in batch jobs Hbase is good at persisting data in very write-heavy workloads Unfortunately, the foundation of both systems is HDFS which does not export or import well
Narrow Foundations Pig Hive Big data is heavy and expensive to move. Web Services Sequential File Processing OLAP OLTP Map/ Reduce Hbase RDBMS NAS HDFS
Narrow Foundations Because big data has inertia, it is difficult to move It costs time to move It costs reliability because of more moving parts The result is many duplicate copies
One Possible Answer Widen the foundation Use standard communication protocols Allow conventional processing to share with parallel processing
Broad Foundation Pig Hive Web Services Sequential File Processing OLAP OLTP Map/ Reduce Hbase RDBMS NAS HDFS MapR
New Capabilities
Export to the world NFS Server NFS Server NFS Server NFS Server NFS Client
Local server Client Application NFS Server Cluster Nodes
Universal export to self Cluster Nodes Cluster Node Task NFS Server
Cluster Node Task NFS Server Cluster Node Task Cluster Node Task NFS Server NFS Server Nodes are identical
Application architecture High performance map-reduce is nice But algorithmic flexibility is even nicer
Hybrid model flow Map-reduce Map-reduce Feature extraction and  down sampling Down  stream  modeling Deployed Model ?? SVD (PageRank) (spectral)
Hybrid model flow Feature extraction and  down sampling Down  stream  modeling Deployed Model Sequential Map-reduce SVD (PageRank) (spectral)
Shardedtext indexing Mapper assigns document to shard Shard is usually hash of document id Reducer indexes all documents for a shard Indexes created on local disk On success, copy index to DFS On failure, delete local files Must avoid directory collisions  can’t use shard id! Must manage and reclaim local disk space
Sharded text Indexing Index text to local disk and then copy index to distributed file store Assign documents to shards Map Reducer Clustered index storage Input documents Copy to local disk typically required before index can be loaded Local disk Search Engine Local disk
Conventional data flow Failure of search engine requires another download of the index from clustered storage. Map Failure of a reducer causes garbage to accumulate in the local disk Reducer Clustered index storage Input documents Local disk Search Engine Local disk

Mais conteúdo relacionado

Mais procurados

MapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR HadoopMapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR Hadoopabord
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Shivkumar Babshetty
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityEdureka!
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integrationtrihug
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)mundlapudi
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesDataWorks Summit/Hadoop Summit
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of HadoopKnoldus Inc.
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterAltoros
 
Hw09 Monitoring Best Practices
Hw09   Monitoring Best PracticesHw09   Monitoring Best Practices
Hw09 Monitoring Best PracticesCloudera, Inc.
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuningAnil Reddy
 

Mais procurados (20)

MapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR HadoopMapReduce Improvements in MapR Hadoop
MapReduce Improvements in MapR Hadoop
 
Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization Advanced Hadoop Tuning and Optimization
Advanced Hadoop Tuning and Optimization
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Apache Hadoop 0.22 and Other Versions
Apache Hadoop 0.22 and Other VersionsApache Hadoop 0.22 and Other Versions
Apache Hadoop 0.22 and Other Versions
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integration
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
Architecture of Hadoop
Architecture of HadoopArchitecture of Hadoop
Architecture of Hadoop
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop Cluster
 
Hw09 Monitoring Best Practices
Hw09   Monitoring Best PracticesHw09   Monitoring Best Practices
Hw09 Monitoring Best Practices
 
White paper hadoop performancetuning
White paper hadoop performancetuningWhite paper hadoop performancetuning
White paper hadoop performancetuning
 
ha_module5
ha_module5ha_module5
ha_module5
 

Destaque

Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Mathieu Dumoulin
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?Attunity
 
Design Patterns for working with Fast Data in Kafka
Design Patterns for working with Fast Data in KafkaDesign Patterns for working with Fast Data in Kafka
Design Patterns for working with Fast Data in KafkaIan Downard
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksMapR Technologies
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014John Berns
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...The Hive
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleIan Downard
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseHortonworks
 

Destaque (13)

Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Philly DB MapR Overview
Philly DB MapR OverviewPhilly DB MapR Overview
Philly DB MapR Overview
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapR
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?
 
Design Patterns for working with Fast Data in Kafka
Design Patterns for working with Fast Data in KafkaDesign Patterns for working with Fast Data in Kafka
Design Patterns for working with Fast Data in Kafka
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014IoT and Big Data - Iot Asia 2014
IoT and Big Data - Iot Asia 2014
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating Example
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 

Semelhante a NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications

Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyJay Nagar
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base InstallCloudera, Inc.
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013WANdisco Plc
 
Architectural Evolution Starting from Hadoop
Architectural Evolution Starting from HadoopArchitectural Evolution Starting from Hadoop
Architectural Evolution Starting from HadoopSpagoWorld
 
Quantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSQuantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSbigdatagurus_meetup
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Yahoo Developer Network
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, FacebookМасштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebookyaevents
 

Semelhante a NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications (20)

Llnl talk
Llnl talkLlnl talk
Llnl talk
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data Technology
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
Hadoop Research
Hadoop Research Hadoop Research
Hadoop Research
 
Architectural Evolution Starting from Hadoop
Architectural Evolution Starting from HadoopArchitectural Evolution Starting from Hadoop
Architectural Evolution Starting from Hadoop
 
HADOOP
HADOOPHADOOP
HADOOP
 
Quantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFSQuantcast File System (QFS) - Alternative to HDFS
Quantcast File System (QFS) - Alternative to HDFS
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop
HadoopHadoop
Hadoop
 
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, FacebookМасштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
 

Mais de Jason Shao

Managing Hadoop with Puppet
Managing Hadoop with PuppetManaging Hadoop with Puppet
Managing Hadoop with PuppetJason Shao
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 
NYC Java Meetup - Profiling and Performance
NYC Java Meetup - Profiling and PerformanceNYC Java Meetup - Profiling and Performance
NYC Java Meetup - Profiling and PerformanceJason Shao
 
Sakai NYC User Group
Sakai NYC User GroupSakai NYC User Group
Sakai NYC User GroupJason Shao
 

Mais de Jason Shao (6)

Tune hadoop
Tune hadoopTune hadoop
Tune hadoop
 
Sgi hadoop
Sgi hadoopSgi hadoop
Sgi hadoop
 
Managing Hadoop with Puppet
Managing Hadoop with PuppetManaging Hadoop with Puppet
Managing Hadoop with Puppet
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
NYC Java Meetup - Profiling and Performance
NYC Java Meetup - Profiling and PerformanceNYC Java Meetup - Profiling and Performance
NYC Java Meetup - Profiling and Performance
 
Sakai NYC User Group
Sakai NYC User GroupSakai NYC User Group
Sakai NYC User Group
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 

Último (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications

  • 1. MapR, Architecture, Philosophy and Applications NY HUG – October 2011
  • 2. Outline Architecture (MapR) Philosophy Architectural (Machine learning)
  • 3. Map-Reduce, the Original Mission Shuffle Input Output
  • 4. Bottlenecks and Issues Read-only files Many copies in I/O path Shuffle based on HTTP Can’t use new technologies Eats file descriptors Spills go to local file space Bad for skewed distribution of sizes
  • 5. MapR Areas of Development
  • 6. MapR Improvements Faster file system Fewer copies Multiple NICS No file descriptor or page-buf competition Faster map-reduce Uses distributed file system Direct RPC to receiver Very wide merges
  • 7. MapR Innovations Volumes Distributed management Data placement Read/write random access file system Allows distributed meta-data Improved scaling Enables NFS access Application-level NIC bonding Transactionally correct snapshots and mirrors
  • 8.
  • 12. No need to manage directlyContainers are 16-32 GB segments of disk, placed on nodes
  • 13.
  • 15.
  • 19. No need to manage directlyContainers are 16-32 GB segments of disk, placed on nodes
  • 20. Container locations and replication CLDB N1, N2 N1 N3, N2 N1, N2 N2 N1, N3 N3, N2 N3 Container location database (CLDB) keeps track of nodes hosting each container and replication chain order
  • 21.
  • 22.
  • 23. But not necessary, can page to disk
  • 24.
  • 25. Increase container size to 64G to serve 4EB cluster
  • 26.
  • 27. Terasort on MapR 10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm Elapsed time (mins) Lower is better
  • 28. HBase on MapR YCSB Random Read with 1 billion 1K records 10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM Recordspersecond Higher is better
  • 29. Small Files (Apache Hadoop, 10 nodes) Out of box Op: - create file - write 100 bytes - close Notes: - NN not replicated - NN uses 20G DRAM - DN uses 2G DRAM Tuned Rate (files/sec) # of files (m)
  • 30. MUCH faster for some operations Same 10 nodes … Create Rate # of files (millions)
  • 31. What MapR is not Volumes != federation MapR supports > 10,000 volumes all with independent placement and defaults Volumes support snapshots and mirroring NFS != FUSE Checksum and compress at gateway IP fail-over Read/write/update semantics at full speed MapR != maprfs
  • 33. Physics of startup companies
  • 34. For startups History is always small The future is huge Must adopt new technology to survive Compatibility is not as important In fact, incompatibility is assumed
  • 35. Physics of large companies Absolute growth still very large Startup phase
  • 36. For large businesses Present state is always large Relative growth is much smaller Absolute growth rate can be very large Must adopt new technology to survive Cautiously! But must integrate technology with legacy Compatibility is crucial
  • 37. The startup technology picture No compatibility requirement Old computers and software Expected hardware and software growth Current computers and software
  • 38. The large enterprise picture Must work together ? Proof of concept Hadoop cluster Current hardware and software Long-term Hadoop cluster
  • 39. What does this mean? Hadoop is very, very good at streaming through things in batch jobs Hbase is good at persisting data in very write-heavy workloads Unfortunately, the foundation of both systems is HDFS which does not export or import well
  • 40. Narrow Foundations Pig Hive Big data is heavy and expensive to move. Web Services Sequential File Processing OLAP OLTP Map/ Reduce Hbase RDBMS NAS HDFS
  • 41. Narrow Foundations Because big data has inertia, it is difficult to move It costs time to move It costs reliability because of more moving parts The result is many duplicate copies
  • 42. One Possible Answer Widen the foundation Use standard communication protocols Allow conventional processing to share with parallel processing
  • 43. Broad Foundation Pig Hive Web Services Sequential File Processing OLAP OLTP Map/ Reduce Hbase RDBMS NAS HDFS MapR
  • 45. Export to the world NFS Server NFS Server NFS Server NFS Server NFS Client
  • 46. Local server Client Application NFS Server Cluster Nodes
  • 47. Universal export to self Cluster Nodes Cluster Node Task NFS Server
  • 48. Cluster Node Task NFS Server Cluster Node Task Cluster Node Task NFS Server NFS Server Nodes are identical
  • 49. Application architecture High performance map-reduce is nice But algorithmic flexibility is even nicer
  • 50. Hybrid model flow Map-reduce Map-reduce Feature extraction and down sampling Down stream modeling Deployed Model ?? SVD (PageRank) (spectral)
  • 51.
  • 52. Hybrid model flow Feature extraction and down sampling Down stream modeling Deployed Model Sequential Map-reduce SVD (PageRank) (spectral)
  • 53. Shardedtext indexing Mapper assigns document to shard Shard is usually hash of document id Reducer indexes all documents for a shard Indexes created on local disk On success, copy index to DFS On failure, delete local files Must avoid directory collisions can’t use shard id! Must manage and reclaim local disk space
  • 54. Sharded text Indexing Index text to local disk and then copy index to distributed file store Assign documents to shards Map Reducer Clustered index storage Input documents Copy to local disk typically required before index can be loaded Local disk Search Engine Local disk
  • 55. Conventional data flow Failure of search engine requires another download of the index from clustered storage. Map Failure of a reducer causes garbage to accumulate in the local disk Reducer Clustered index storage Input documents Local disk Search Engine Local disk
  • 56. Simplified NFS data flows Index to task work directory via NFS Map Reducer Search Engine Input documents Clustered index storage Failure of a reducer is cleaned up by map-reduce framework Search engine reads mirrored index directly.
  • 57. Simplified NFS data flows Search Engine Mirroring allows exact placement of index data Map Reducer Input documents Search Engine Aribitrary levels of replication also possible Mirrors
  • 58. K-means Classic E-M based algorithm Given cluster centroids, Assign each data point to nearest centroid Accumulate new centroids Rinse, lather, repeat
  • 59. K-means, the movie Centroids Assign to Nearest centroid I n p u t Aggregate new centroids
  • 60. Parallel Stochastic Gradient Descent Model Train sub model I n p u t Average models
  • 61. VariationalDirichlet Assignment Model Gather sufficient statistics I n p u t Update model
  • 62. Old tricks, new dogs Mapper Assign point to cluster Emit cluster id, (1, point) Combiner and reducer Sum counts, weighted sum of points Emit cluster id, (n, sum/n) Output to HDFS Read from local disk from distributed cache Read from HDFS to local disk by distributed cache Written by map-reduce
  • 63. Old tricks, new dogs Mapper Assign point to cluster Emit cluster id, (1, point) Combiner and reducer Sum counts, weighted sum of points Emit cluster id, (n, sum/n) Output to HDFS Read from NFS Written by map-reduce MapR FS
  • 64. Poor man’s Pregel Mapper Lines in bold can use conventional I/O via NFS while not done: read and accumulate input models for each input: accumulate model write model synchronize reset input format emit summary 51
  • 65. Click modeling architecture Map-reduce Side-data Now via NFS Feature extraction and down sampling I n p u t Data join Sequential SGD Learning
  • 66. Click modeling architecture Map-reduce Map-reduce Side-data Map-reduce cooperates with NFS Sequential SGD Learning Feature extraction and down sampling Sequential SGD Learning I n p u t Data join Sequential SGD Learning Sequential SGD Learning
  • 67. Trivial visualization interface Map-reduce output is visible via NFS Legacy visualization just works $ R > x <- read.csv(“/mapr/my.cluster/home/ted/data/foo.out”) > plot(error ~ t, x) > q(save=‘n’)
  • 68. Conclusions We used to know all this Tab completion used to work 5 years of work-arounds have clouded our memories We just have to remember the future

Notas do Editor

  1. Constant time implies constantfactor of growth. Thus the accumulation of all of history before 10 time units ago is less than half the accumulation in the last 10 units alone. This is true at all time.
  2. Startups use this fact to their advantage and completely change everything to allow time-efficient development initially with conversion to computer-efficient systems later.
  3. Here the later history is shown after the initial exponential growth phase. This changes the economics of the company dramatically.
  4. The startup can throw away history because it is so small. That means that the startup has almost no compatibility requirement because the data lost due to lack of compatibility is a small fraction of the total data.
  5. A large enterprise cannot do that. They have to have access to the old data and have to share between old data and Hadoop accessible data.This doesn’t have to happen with the proof of concept level, but it really must happen when hadoop first goes to production.
  6. But stock Hadoop does not handle this well.
  7. This is because Hadoop and other data silos have different foundations. What is worse, there is a semantic wall that separates HDFS from normal resources.
  8. Here is a picture that shows how MapR can replace the foundation and provide compatibility. Of course, MapR provide much more than just the base, but the foundation is what provides the fundamental limitation or lack of limit in MapR’s case.