Slides from: http://www.meetup.com/Hadoop-NYC/events/34411232/
There are a number of assumptions that come with using standard Hadoop that are based on Hadoop's initial architecture. Many of these assumptions can be relaxed with more advanced architectures such as those provided by MapR. These changes in assumptions have ripple effects throughout the system architecture. This is significant because many systems like Mahout provide multiple implementations of various algorithms with very different performance and scaling implications.
I will describe several case studies and use these examples to show how these changes can simplify systems or, in some cases, make certain classes of programs run an order of magnitude faster.
About the speaker: Ted Dunning - Chief Application Architect (MapR)
Ted has held Chief Scientist positions at Veoh Networks, ID Analytics, and MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, ZooKeeper and HBase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado; an MS degree in computer science from New Mexico State University; and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at one of the very first Hadoop User Group meetings.
4. Bottlenecks and Issues. Read-only files. Many copies in the I/O path. Shuffle based on HTTP: can't use new technologies, eats file descriptors. Spills go to local file space: bad for skewed distribution of sizes.
6. MapR Improvements. Faster file system: fewer copies, multiple NICs, no file descriptor or page-buf competition. Faster map-reduce: uses the distributed file system, direct RPC to receiver, very wide merges.
7. MapR Innovations. Volumes: distributed management, data placement. Read/write random-access file system: allows distributed metadata, improved scaling, enables NFS access. Application-level NIC bonding. Transactionally correct snapshots and mirrors.
27. Terasort on MapR. 10+1 nodes: 8 cores, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm. (Chart: elapsed time in minutes; lower is better.)
28. HBase on MapR. YCSB random read with 1 billion 1K records. 10+1 node cluster: 8 cores, 24 GB DRAM, 11 x 1 TB 7200 RPM. (Chart: records per second; higher is better.)
29. Small Files (Apache Hadoop, 10 nodes). Op: create file, write 100 bytes, close. Notes: NN not replicated; NN uses 20 GB DRAM; DN uses 2 GB DRAM. (Chart: rate in files/sec vs. number of files in millions, out of the box and tuned.)
30. MUCH faster for some operations. Same 10 nodes. (Chart: create rate vs. number of files in millions.)
31. What MapR is not. Volumes != federation: MapR supports > 10,000 volumes, all with independent placement and defaults; volumes support snapshots and mirroring. NFS != FUSE: checksum and compress at the gateway, IP fail-over, read/write/update semantics at full speed. MapR != maprfs.
34. For startups: history is always small; the future is huge. Must adopt new technology to survive. Compatibility is not as important; in fact, incompatibility is assumed.
35. Physics of large companies. Absolute growth still very large. (Chart: growth curve with the startup phase marked.)
36. For large businesses: the present state is always large; relative growth is much smaller, but the absolute growth rate can be very large. Must adopt new technology to survive, but cautiously, and must integrate that technology with legacy systems. Compatibility is crucial.
37. The startup technology picture. No compatibility requirement. (Diagram: old computers and software, current computers and software, and expected hardware and software growth.)
38. The large enterprise picture. Must work together? (Diagram: current hardware and software, a proof-of-concept Hadoop cluster, and a long-term Hadoop cluster.)
39. What does this mean? Hadoop is very, very good at streaming through things in batch jobs. HBase is good at persisting data in very write-heavy workloads. Unfortunately, the foundation of both systems is HDFS, which does not export or import well.
40. Narrow Foundations. Big data is heavy and expensive to move. (Diagram: Pig, Hive, Map/Reduce, and HBase sit on HDFS, while web services, sequential file processing, OLAP, and OLTP sit on RDBMS and NAS.)
41. Narrow Foundations. Because big data has inertia, it is difficult to move: it costs time to move, and it costs reliability because of more moving parts. The result is many duplicate copies.
42. One Possible Answer. Widen the foundation: use standard communication protocols, and allow conventional processing to share with parallel processing.
43. Broad Foundation. (Diagram: the same applications, Pig, Hive, Map/Reduce, HBase, web services, sequential file processing, OLAP, and OLTP, now sharing MapR as a common foundation alongside HDFS, RDBMS, and NAS.)
50. Hybrid model flow. (Diagram: map-reduce feature extraction and down-sampling, then downstream modeling, SVD / PageRank / spectral, whose execution model is marked "??", feeding a deployed model.)
52. Hybrid model flow. (Diagram: map-reduce feature extraction and down-sampling, then sequential downstream modeling, SVD / PageRank / spectral, feeding a deployed model.)
53. Sharded text indexing. The mapper assigns each document to a shard; the shard is usually a hash of the document id. The reducer indexes all documents for a shard. Indexes are created on local disk: on success, copy the index to the DFS; on failure, delete the local files. Must avoid directory collisions, so the shard id alone can't be used! Must manage and reclaim local disk space.
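To make the shard-assignment step concrete, here is a minimal Hadoop Streaming-style mapper sketch in Python. The tab-separated "doc_id, text" input layout, the CRC-based hash, and the shard count are assumptions for illustration, not details from the slides.

#!/usr/bin/env python
# Hadoop Streaming-style mapper sketch: route each document to a shard.
# Assumes tab-separated "doc_id<TAB>text" input; NUM_SHARDS is an assumed value
# chosen to match the number of reducers.
import sys
import zlib

NUM_SHARDS = 16  # assumption

for line in sys.stdin:
    doc_id, _, text = line.rstrip("\n").partition("\t")
    # Shard is a hash of the document id, so a given document always
    # lands on the same reducer.
    shard = zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS
    print("%d\t%s\t%s" % (shard, doc_id, text))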
54. Sharded text indexing. Index text to local disk and then copy the index to the distributed file store. (Diagram: input documents → map (assign documents to shards) → reducer → local disk → clustered index storage; a copy back to local disk is typically required before the search engine can load the index.)
55. Conventional data flow. (Diagram: input documents → map → reducer (local disk) → clustered index storage → local disk → search engine. Failure of a reducer causes garbage to accumulate on the local disk; failure of a search engine requires another download of the index from clustered storage.)
56. Simplified NFS data flows. Index to the task work directory via NFS. (Diagram: input documents → map → reducer → clustered index storage → search engine. Failure of a reducer is cleaned up by the map-reduce framework; the search engine reads the mirrored index directly.)
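A sketch of what the simplified flow buys the reducer: with the cluster file system NFS-mounted, the index can be built directly at its destination with ordinary file I/O, so there is no local staging copy and nothing to reclaim after a failure. It consumes the mapper output sketched above. The mount point, output layout, and index_shard() helper are illustrative assumptions, not MapR APIs.

#!/usr/bin/env python
# Hadoop Streaming-style reducer sketch: build each shard's index directly
# in a directory on the NFS-mounted cluster file system, so there is no
# "index locally, then copy to DFS" step and no local files to reclaim.
import os
import sys
from collections import defaultdict

OUT_ROOT = "/mapr/my.cluster/indexes"  # assumed NFS mount path

def index_shard(shard, docs):
    # Hypothetical stand-in for a real indexer (e.g. Lucene); here we just
    # write one line per document so the sketch stays runnable.
    out_dir = os.path.join(OUT_ROOT, "shard-%s" % shard)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "index.txt"), "w") as f:
        for doc_id, text in docs:
            f.write("%s\t%d\n" % (doc_id, len(text.split())))

docs_by_shard = defaultdict(list)
for line in sys.stdin:
    shard, doc_id, text = line.rstrip("\n").split("\t", 2)
    docs_by_shard[shard].append((doc_id, text))

for shard, docs in docs_by_shard.items():
    index_shard(shard, docs)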
57. Simplified NFS data flows. Mirroring allows exact placement of index data; arbitrary levels of replication are also possible. (Diagram: input documents → map → reducer → mirrors → search engines.)
58. K-means. Classic E-M based algorithm. Given cluster centroids: assign each data point to the nearest centroid, then accumulate new centroids. Rinse, lather, repeat.
59. K-means, the movie. (Diagram: input → assign to nearest centroid → aggregate new centroids, which feed back in as the centroids for the next pass.)
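As a reference point, one E-M iteration of this loop in plain sequential Python might look like the following sketch; the data layout and the random initialization are assumptions.

import random

def kmeans_step(points, centroids):
    """One E-M iteration: assign each point to its nearest centroid,
    then return the mean of each cluster as the new centroids."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # E-step: nearest centroid by squared Euclidean distance
        i = min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[i] += 1
        sums[i] = [s + a for s, a in zip(sums[i], p)]
    # M-step: new centroid = mean of assigned points (keep old one if empty)
    new_centroids = []
    for i in range(len(centroids)):
        if counts[i]:
            new_centroids.append([s / counts[i] for s in sums[i]])
        else:
            new_centroids.append(centroids[i])
    return new_centroids

# Rinse, lather, repeat: a fixed iteration count stands in for a convergence test.
points = [(random.random(), random.random()) for _ in range(1000)]
centroids = [list(p) for p in random.sample(points, 3)]
for _ in range(10):
    centroids = kmeans_step(points, centroids)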
62. Old tricks, new dogs. Mapper: assign point to cluster; emit (cluster id, (1, point)). Combiner and reducer: sum counts, weighted sum of points; emit (cluster id, (n, sum/n)). Output to HDFS. (Annotations: centroids are read from HDFS to local disk by the distributed cache and read from local disk by the mapper; results are written by map-reduce.)
63. Old tricks, new dogs. Mapper: assign point to cluster; emit (cluster id, (1, point)). Combiner and reducer: sum counts, weighted sum of points; emit (cluster id, (n, sum/n)). Output to HDFS. (Annotations: centroids are read from MapR FS directly via NFS; results are written by map-reduce.)
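A hedged sketch of that map-reduce formulation in Hadoop Streaming-style Python follows. The centroid file location on the NFS mount, the whitespace-separated point format, and the map/reduce mode switch are assumptions for illustration; the same reduce function also works as the combiner, since it carries counts along with the weighted sums.

#!/usr/bin/env python
# K-means step as a Hadoop Streaming-style mapper + reducer sketch.
# The centroid file lives on the NFS-mounted cluster FS (path is an assumption),
# so the mapper reads it with ordinary file I/O instead of the distributed cache.
import sys

CENTROIDS_PATH = "/mapr/my.cluster/kmeans/centroids.txt"  # assumed location

def read_centroids(path):
    with open(path) as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]

def mapper():
    centroids = read_centroids(CENTROIDS_PATH)
    for line in sys.stdin:
        p = [float(x) for x in line.split()]
        i = min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        # Emit: cluster id, (count = 1, point)
        print("%d\t1\t%s" % (i, " ".join(map(str, p))))

def reducer():
    # Also usable as the combiner: sums counts and count-weighted coordinates
    # per cluster, so partially aggregated records combine correctly.
    totals = {}
    for line in sys.stdin:
        cid, n, coords = line.split("\t")
        n = int(n)
        p = [float(x) for x in coords.split()]
        c, s = totals.get(cid, (0, [0.0] * len(p)))
        totals[cid] = (c + n, [a + n * b for a, b in zip(s, p)])
    for cid, (n, s) in sorted(totals.items()):
        # Emit: cluster id, (n, sum/n) -- the new centroid
        print("%s\t%d\t%s" % (cid, n, " ".join(str(x / n) for x in s)))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()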
64. Poor man's Pregel. Mapper (the lines marked in bold on the slide can use conventional I/O via NFS):
while not done:
    read and accumulate input models
    for each input:
        accumulate model
    write model
    synchronize
    reset input format
emit summary
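One way that loop could look as a runnable sketch, with the model exchange done through ordinary file I/O on an NFS-mounted directory. The shared directory, task count, and toy accumulate function are assumptions, and the barrier is a deliberately crude file-count check rather than anything the framework provides.

#!/usr/bin/env python
# Sketch of the "poor man's Pregel" loop inside a long-running mapper.
# Models are exchanged with ordinary file I/O on an NFS-mounted cluster FS.
import json
import os
import time

SHARED_DIR = os.environ.get("SHARED_DIR", "/tmp/pregel")  # e.g. /mapr/my.cluster/pregel
NUM_TASKS = int(os.environ.get("NUM_TASKS", "1"))         # cooperating mapper tasks

def read_model(path):
    with open(path) as f:
        return json.load(f)

def write_model(model, path):
    with open(path, "w") as f:
        json.dump(model, f)

def accumulate(into, other):
    # Toy merge: sum counts keyed by token; a real job might average model weights.
    for k, v in other.items():
        into[k] = into.get(k, 0) + v
    return into

def barrier(iter_dir):
    # Crude synchronization: wait until every task has written its model.
    while len(os.listdir(iter_dir)) < NUM_TASKS:
        time.sleep(1)

def run(task_id, num_iterations, records):
    model = {}
    for it in range(num_iterations):
        model = {}
        # "read and accumulate input models": merge what every task wrote last time
        if it > 0:
            prev_dir = os.path.join(SHARED_DIR, "iter-%d" % (it - 1))
            for name in os.listdir(prev_dir):
                accumulate(model, read_model(os.path.join(prev_dir, name)))
        # "for each input: accumulate model"
        for rec in records:
            accumulate(model, {rec: 1})
        # "write model", then "synchronize"
        iter_dir = os.path.join(SHARED_DIR, "iter-%d" % it)
        os.makedirs(iter_dir, exist_ok=True)
        write_model(model, os.path.join(iter_dir, "model-%d" % task_id))
        barrier(iter_dir)
    # "emit summary"
    print(json.dumps(model))

if __name__ == "__main__":
    run(task_id=0, num_iterations=3, records=["a", "b", "a"])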
65. Click modeling architecture. Side-data now via NFS. (Diagram: input → map-reduce feature extraction and down-sampling → data join → sequential SGD learning.)
66. Click modeling architecture. Map-reduce cooperates with NFS. (Diagram: input → map-reduce feature extraction and down-sampling → map-reduce data join → several sequential SGD learners running in parallel, with side-data shared via NFS.)
67. Trivial visualization interface. Map-reduce output is visible via NFS; legacy visualization just works.
$ R
> x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
> plot(error ~ t, x)
> q(save="no")
68. Conclusions. We used to know all this; tab completion used to work. 5 years of work-arounds have clouded our memories. We just have to remember the future.
Editor's Notes
With exponential growth, a constant amount of time implies a constant factor of growth. Thus the accumulation of all history before 10 time units ago is less than half the accumulation in the last 10 units alone. This is true at all times.
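A short sketch of the arithmetic behind this note, assuming the data volume arrives at an exponential rate $v(t) = v_0 e^{rt}$ with constant $r$:

\[
\int_{-\infty}^{t-T} v_0 e^{rs}\,ds \;=\; \frac{v_0}{r}\,e^{r(t-T)},
\qquad
\int_{t-T}^{t} v_0 e^{rs}\,ds \;=\; \frac{v_0}{r}\left(e^{rt} - e^{r(t-T)}\right),
\]
\[
\frac{\text{history before } t-T}{\text{last } T \text{ units}}
\;=\; \frac{e^{r(t-T)}}{e^{rt}-e^{r(t-T)}}
\;=\; \frac{1}{e^{rT}-1}
\;=\; \frac{1}{k-1},
\qquad k = e^{rT}.
\]

So whenever the volume grows by a factor of three or more over a 10-unit window, everything accumulated before that window is less than half of what arrived during it; and because the growth factor is constant, this holds at every point in time.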
Startups use this fact to their advantage and completely change everything to allow time-efficient development initially with conversion to computer-efficient systems later.
Here the later history is shown after the initial exponential growth phase. This changes the economics of the company dramatically.
The startup can throw away history because it is so small. That means that the startup has almost no compatibility requirement because the data lost due to lack of compatibility is a small fraction of the total data.
A large enterprise cannot do that. They have to have access to the old data and have to share between old data and Hadoop-accessible data. This doesn't have to happen at the proof-of-concept level, but it really must happen when Hadoop first goes to production.
But stock Hadoop does not handle this well.
This is because Hadoop and other data silos have different foundations. What is worse, there is a semantic wall that separates HDFS from normal resources.
Here is a picture that shows how MapR can replace the foundation and provide compatibility. Of course, MapR provides much more than just the base, but the foundation is what imposes the fundamental limitation, or in MapR's case, the lack of one.