These slides are from a recent talk I gave at Lawrence Livermore Labs.
The talk gives an architectural outline of the MapR system and then discusses how this architecture facilitates large scale machine learning algorithms.
4. Bottlenecks and Issues Read-only files Many copies in I/O path Shuffle based on HTTP Can’t use new technologies Eats file descriptors Spills go to local file space Bad for skewed distribution of sizes
5. MapR Improvements Faster file system Fewer copies Multiple NICS No file descriptor or page-buf competition Faster map-reduce Uses distributed file system Direct RPC to receiver Very wide merges
6. MapR Innovations Volumes Distributed management Data placement Read/write random access file system Allows distributed meta-data Improved scaling Enables NFS access Application-level NIC bonding Transactionally correct snapshots and mirrors
22. NFS mounting models Export to the world NFS gateway runs on selected gateway hosts Local server NFS gateway runs on local host Enables local compression and check summing Export to self NFS gateway runs on all data nodes, mounted from localhost
23. Export to the world NFS Server NFS Server NFS Server NFS Server NFS Client
26. Cluster Node Application NFS Server Cluster Node Application Cluster Node Application NFS Server NFS Server Nodes are identical
27. Shardedtext indexing Mapper assigns document to shard Shard is usually hash of document id Reducer indexes all documents for a shard Indexes created on local disk On success, copy index to DFS On failure, delete local files Must avoid directory collisions can’t use shard id! Must manage local disk space
28. Conventional data flows Failure of search engine requires another download of the index from clustered storage. Map Failure of a reducer causes garbage to accumulate in the local disk Reducer Clustered index storage Input documents Local disk Search Engine Local disk
29. Simplified NFS data flows Map Reducer Search Engine Input documents Clustered index storage Failure of a reducer is cleaned up by map-reduce framework Search engine reads mirrored index directly.
31. K-means Classic E-M based algorithm Given cluster centroids, Assign each data point to nearest centroid Accumulate new centroids Rinse, lather, repeat
32. K-means, the movie Centroids Assign to Nearest centroid I n p u t Aggregate new centroids
36. Old tricks, new dogs Mapper Assign point to cluster Emit cluster id, (1, point) Combiner and reducer Sum counts, weighted sum of points Emit cluster id, (n, sum/n) Output to HDFS Read from local disk from distributed cache Read from HDFS to local disk by distributed cache Written by map-reduce
37. Old tricks, new dogs Mapper Assign point to cluster Emit cluster id, 1, point Combiner and reducer Sum counts, weighted sum of points Emit cluster id, n, sum/n Output to HDFS Read from NFS Written by map-reduce MapR FS
38. Click modeling architecture Map-reduce Side-data Now via NFS Feature extraction and down sampling I n p u t Data join Sequential SGD Learning
39. Poor man’s Pregel Mapper Lines in bold can use conventional I/O via NFS while not done: read and accumulate input models for each input: accumulate model write model synchronize reset input format emit summary 31
40. Trivial visualization interface Map-reduce output is visible via NFS Legacy visualization just works $ R > x <- read.csv(“/mapr/my.cluster/home/ted/data/foo.out”) > plot(error ~ t, x) > q(save=‘n’)
41. Conclusions We used to know all this Tab completion used to work 5 years of work-arounds have clouded our memories We just have to remember the future