Slides from: http://www.meetup.com/Hadoop-NYC/events/34411232/
There are a number of assumptions that come with using standard Hadoop that are based on Hadoop's initial architecture. Many of these assumptions can be relaxed with more advanced architectures such as those provided by MapR. These changes in assumptions have ripple effects throughout the system architecture. This is significant because many systems like Mahout provide multiple implementations of various algorithms with very different performance and scaling implications.
I will describe several case studies and use these examples to show how these changes can simplify systems or, in some cases, make certain classes of programs run an order of magnitude faster.
About the speaker: Ted Dunning - Chief Application Architect (MapR)
Ted has held Chief Scientist positions at Veoh Networks, ID Analytics, and MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, ZooKeeper and HBase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado; an MS degree in computer science from New Mexico State University; and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at one of the very first Hadoop User Group meetings.
4. Bottlenecks and Issues. Read-only files. Many copies in the I/O path. Shuffle based on HTTP: can't use new technologies, eats file descriptors. Spills go to local file space: bad for skewed distribution of sizes.
6. MapR Improvements. Faster file system: fewer copies, multiple NICs, no file descriptor or page-buf competition. Faster map-reduce: uses the distributed file system, direct RPC to receiver, very wide merges.
7. MapR Innovations. Volumes: distributed management, data placement. Read/write random-access file system: allows distributed metadata, improved scaling, enables NFS access. Application-level NIC bonding. Transactionally correct snapshots and mirrors.
27. Terasort on MapR. 10+1 nodes: 8 cores, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm. (Chart: elapsed time in minutes; lower is better.)
28. HBase on MapR. YCSB random read with 1 billion 1K records. 10+1 node cluster: 8 cores, 24 GB DRAM, 11 x 1 TB 7200 RPM. (Chart: records per second; higher is better.)
29. Small Files (Apache Hadoop, 10 nodes). Op: create file, write 100 bytes, close. Notes: NN not replicated; NN uses 20 GB DRAM; DN uses 2 GB DRAM. (Chart: rate in files/sec vs. number of files in millions, out of the box and tuned.)
30. MUCH faster for some operations. Same 10 nodes. (Chart: create rate vs. number of files in millions.)
31. What MapR is not. Volumes != federation: MapR supports > 10,000 volumes, all with independent placement and defaults; volumes support snapshots and mirroring. NFS != FUSE: checksum and compress at the gateway, IP fail-over, read/write/update semantics at full speed. MapR != maprfs.
34. For startups: history is always small; the future is huge. Must adopt new technology to survive. Compatibility is not as important; in fact, incompatibility is assumed.
35. Physics of large companies. Absolute growth still very large. (Chart: growth curve with the startup phase marked.)
36. For large businesses: the present state is always large; relative growth is much smaller, but the absolute growth rate can be very large. Must adopt new technology to survive, but cautiously, and must integrate that technology with legacy systems. Compatibility is crucial.
37. The startup technology picture. No compatibility requirement. (Diagram: old computers and software, current computers and software, and expected hardware and software growth.)
38. The large enterprise picture. Must work together? (Diagram: current hardware and software, a proof-of-concept Hadoop cluster, and a long-term Hadoop cluster.)
39. What does this mean? Hadoop is very, very good at streaming through things in batch jobs. HBase is good at persisting data in very write-heavy workloads. Unfortunately, the foundation of both systems is HDFS, which does not export or import well.
40. Narrow Foundations. Big data is heavy and expensive to move. (Diagram: Pig, Hive, Map/Reduce, and HBase sit on HDFS, while web services, sequential file processing, OLAP, and OLTP sit on RDBMS and NAS.)
41. Narrow Foundations. Because big data has inertia, it is difficult to move: it costs time to move, and it costs reliability because of more moving parts. The result is many duplicate copies.
42. One Possible Answer. Widen the foundation: use standard communication protocols, and allow conventional processing to share with parallel processing.
43. Broad Foundation. (Diagram: the same applications, Pig, Hive, Map/Reduce, HBase, web services, sequential file processing, OLAP, and OLTP, now sharing MapR as a common foundation alongside HDFS, RDBMS, and NAS.)
50. Hybrid model flow. (Diagram: map-reduce feature extraction and down-sampling, then downstream modeling, SVD / PageRank / spectral, whose execution model is marked "??", feeding a deployed model.)
52. Hybrid model flow. (Diagram: map-reduce feature extraction and down-sampling, then sequential downstream modeling, SVD / PageRank / spectral, feeding a deployed model.)
53. Sharded text indexing. The mapper assigns each document to a shard; the shard is usually a hash of the document id. The reducer indexes all documents for a shard. Indexes are created on local disk: on success, copy the index to the DFS; on failure, delete the local files. Must avoid directory collisions, so the shard id alone can't be used! Must manage and reclaim local disk space.
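To make the shard-assignment step concrete, here is a minimal Hadoop Streaming-style mapper sketch in Python. The tab-separated "doc_id, text" input layout, the CRC-based hash, and the shard count are assumptions for illustration, not details from the slides.

#!/usr/bin/env python
# Hadoop Streaming-style mapper sketch: route each document to a shard.
# Assumes tab-separated "doc_id<TAB>text" input; NUM_SHARDS is an assumed value
# chosen to match the number of reducers.
import sys
import zlib

NUM_SHARDS = 16  # assumption

for line in sys.stdin:
    doc_id, _, text = line.rstrip("\n").partition("\t")
    # Shard is a hash of the document id, so a given document always
    # lands on the same reducer.
    shard = zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS
    print("%d\t%s\t%s" % (shard, doc_id, text))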
54. Sharded text indexing. Index text to local disk and then copy the index to the distributed file store. (Diagram: input documents → map (assign documents to shards) → reducer → local disk → clustered index storage; a copy back to local disk is typically required before the search engine can load the index.)
55. Conventional data flow. (Diagram: input documents → map → reducer (local disk) → clustered index storage → local disk → search engine. Failure of a reducer causes garbage to accumulate on the local disk; failure of a search engine requires another download of the index from clustered storage.)
56. Simplified NFS data flows. Index to the task work directory via NFS. (Diagram: input documents → map → reducer → clustered index storage → search engine. Failure of a reducer is cleaned up by the map-reduce framework; the search engine reads the mirrored index directly.)
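A sketch of what the simplified flow buys the reducer: with the cluster file system NFS-mounted, the index can be built directly at its destination with ordinary file I/O, so there is no local staging copy and nothing to reclaim after a failure. It consumes the mapper output sketched above. The mount point, output layout, and index_shard() helper are illustrative assumptions, not MapR APIs.

#!/usr/bin/env python
# Hadoop Streaming-style reducer sketch: build each shard's index directly
# in a directory on the NFS-mounted cluster file system, so there is no
# "index locally, then copy to DFS" step and no local files to reclaim.
import os
import sys
from collections import defaultdict

OUT_ROOT = "/mapr/my.cluster/indexes"  # assumed NFS mount path

def index_shard(shard, docs):
    # Hypothetical stand-in for a real indexer (e.g. Lucene); here we just
    # write one line per document so the sketch stays runnable.
    out_dir = os.path.join(OUT_ROOT, "shard-%s" % shard)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "index.txt"), "w") as f:
        for doc_id, text in docs:
            f.write("%s\t%d\n" % (doc_id, len(text.split())))

docs_by_shard = defaultdict(list)
for line in sys.stdin:
    shard, doc_id, text = line.rstrip("\n").split("\t", 2)
    docs_by_shard[shard].append((doc_id, text))

for shard, docs in docs_by_shard.items():
    index_shard(shard, docs)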
57. Simplified NFS data flows. Mirroring allows exact placement of index data; arbitrary levels of replication are also possible. (Diagram: input documents → map → reducer → mirrors → search engines.)
58. K-means. Classic E-M based algorithm. Given cluster centroids: assign each data point to the nearest centroid, then accumulate new centroids. Rinse, lather, repeat.
59. K-means, the movie. (Diagram: input → assign to nearest centroid → aggregate new centroids, which feed back in as the centroids for the next pass.)
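As a reference point, one E-M iteration of this loop in plain sequential Python might look like the following sketch; the data layout and the random initialization are assumptions.

import random

def kmeans_step(points, centroids):
    """One E-M iteration: assign each point to its nearest centroid,
    then return the mean of each cluster as the new centroids."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # E-step: nearest centroid by squared Euclidean distance
        i = min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[i] += 1
        sums[i] = [s + a for s, a in zip(sums[i], p)]
    # M-step: new centroid = mean of assigned points (keep old one if empty)
    new_centroids = []
    for i in range(len(centroids)):
        if counts[i]:
            new_centroids.append([s / counts[i] for s in sums[i]])
        else:
            new_centroids.append(centroids[i])
    return new_centroids

# Rinse, lather, repeat: a fixed iteration count stands in for a convergence test.
points = [(random.random(), random.random()) for _ in range(1000)]
centroids = [list(p) for p in random.sample(points, 3)]
for _ in range(10):
    centroids = kmeans_step(points, centroids)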
62. Old tricks, new dogs. Mapper: assign point to cluster; emit (cluster id, (1, point)). Combiner and reducer: sum counts, weighted sum of points; emit (cluster id, (n, sum/n)). Output to HDFS. (Annotations: centroids are read from HDFS to local disk by the distributed cache and read from local disk by the mapper; results are written by map-reduce.)
63. Old tricks, new dogs. Mapper: assign point to cluster; emit (cluster id, (1, point)). Combiner and reducer: sum counts, weighted sum of points; emit (cluster id, (n, sum/n)). Output to HDFS. (Annotations: centroids are read from MapR FS directly via NFS; results are written by map-reduce.)
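A hedged sketch of that map-reduce formulation in Hadoop Streaming-style Python follows. The centroid file location on the NFS mount, the whitespace-separated point format, and the map/reduce mode switch are assumptions for illustration; the same reduce function also works as the combiner, since it carries counts along with the weighted sums.

#!/usr/bin/env python
# K-means step as a Hadoop Streaming-style mapper + reducer sketch.
# The centroid file lives on the NFS-mounted cluster FS (path is an assumption),
# so the mapper reads it with ordinary file I/O instead of the distributed cache.
import sys

CENTROIDS_PATH = "/mapr/my.cluster/kmeans/centroids.txt"  # assumed location

def read_centroids(path):
    with open(path) as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]

def mapper():
    centroids = read_centroids(CENTROIDS_PATH)
    for line in sys.stdin:
        p = [float(x) for x in line.split()]
        i = min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        # Emit: cluster id, (count = 1, point)
        print("%d\t1\t%s" % (i, " ".join(map(str, p))))

def reducer():
    # Also usable as the combiner: sums counts and count-weighted coordinates
    # per cluster, so partially aggregated records combine correctly.
    totals = {}
    for line in sys.stdin:
        cid, n, coords = line.split("\t")
        n = int(n)
        p = [float(x) for x in coords.split()]
        c, s = totals.get(cid, (0, [0.0] * len(p)))
        totals[cid] = (c + n, [a + n * b for a, b in zip(s, p)])
    for cid, (n, s) in sorted(totals.items()):
        # Emit: cluster id, (n, sum/n) -- the new centroid
        print("%s\t%d\t%s" % (cid, n, " ".join(str(x / n) for x in s)))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()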
64. Poor man's Pregel. Mapper (the lines marked in bold on the slide can use conventional I/O via NFS):
while not done:
    read and accumulate input models
    for each input:
        accumulate model
    write model
    synchronize
    reset input format
emit summary
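One way that loop could look as a runnable sketch, with the model exchange done through ordinary file I/O on an NFS-mounted directory. The shared directory, task count, and toy accumulate function are assumptions, and the barrier is a deliberately crude file-count check rather than anything the framework provides.

#!/usr/bin/env python
# Sketch of the "poor man's Pregel" loop inside a long-running mapper.
# Models are exchanged with ordinary file I/O on an NFS-mounted cluster FS.
import json
import os
import time

SHARED_DIR = os.environ.get("SHARED_DIR", "/tmp/pregel")  # e.g. /mapr/my.cluster/pregel
NUM_TASKS = int(os.environ.get("NUM_TASKS", "1"))         # cooperating mapper tasks

def read_model(path):
    with open(path) as f:
        return json.load(f)

def write_model(model, path):
    with open(path, "w") as f:
        json.dump(model, f)

def accumulate(into, other):
    # Toy merge: sum counts keyed by token; a real job might average model weights.
    for k, v in other.items():
        into[k] = into.get(k, 0) + v
    return into

def barrier(iter_dir):
    # Crude synchronization: wait until every task has written its model.
    while len(os.listdir(iter_dir)) < NUM_TASKS:
        time.sleep(1)

def run(task_id, num_iterations, records):
    model = {}
    for it in range(num_iterations):
        model = {}
        # "read and accumulate input models": merge what every task wrote last time
        if it > 0:
            prev_dir = os.path.join(SHARED_DIR, "iter-%d" % (it - 1))
            for name in os.listdir(prev_dir):
                accumulate(model, read_model(os.path.join(prev_dir, name)))
        # "for each input: accumulate model"
        for rec in records:
            accumulate(model, {rec: 1})
        # "write model", then "synchronize"
        iter_dir = os.path.join(SHARED_DIR, "iter-%d" % it)
        os.makedirs(iter_dir, exist_ok=True)
        write_model(model, os.path.join(iter_dir, "model-%d" % task_id))
        barrier(iter_dir)
    # "emit summary"
    print(json.dumps(model))

if __name__ == "__main__":
    run(task_id=0, num_iterations=3, records=["a", "b", "a"])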
65. Click modeling architecture. Side-data now via NFS. (Diagram: input → map-reduce feature extraction and down-sampling → data join → sequential SGD learning.)
66. Click modeling architecture. Map-reduce cooperates with NFS. (Diagram: input → map-reduce feature extraction and down-sampling → map-reduce data join → several sequential SGD learners running in parallel, with side-data shared via NFS.)
67. Trivial visualization interface. Map-reduce output is visible via NFS; legacy visualization just works.
$ R
> x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out")
> plot(error ~ t, x)
> q(save="no")
68. Conclusions. We used to know all this; tab completion used to work. 5 years of work-arounds have clouded our memories. We just have to remember the future.
Editor's Notes
With exponential growth, a constant amount of time implies a constant factor of growth. Thus the accumulation of all history before 10 time units ago is less than half the accumulation in the last 10 units alone. This is true at all times.
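A short sketch of the arithmetic behind this note, assuming the data volume arrives at an exponential rate $v(t) = v_0 e^{rt}$ with constant $r$:

\[
\int_{-\infty}^{t-T} v_0 e^{rs}\,ds \;=\; \frac{v_0}{r}\,e^{r(t-T)},
\qquad
\int_{t-T}^{t} v_0 e^{rs}\,ds \;=\; \frac{v_0}{r}\left(e^{rt} - e^{r(t-T)}\right),
\]
\[
\frac{\text{history before } t-T}{\text{last } T \text{ units}}
\;=\; \frac{e^{r(t-T)}}{e^{rt}-e^{r(t-T)}}
\;=\; \frac{1}{e^{rT}-1}
\;=\; \frac{1}{k-1},
\qquad k = e^{rT}.
\]

So whenever the volume grows by a factor of three or more over a 10-unit window, everything accumulated before that window is less than half of what arrived during it; and because the growth factor is constant, this holds at every point in time.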
Startups use this fact to their advantage and completely change everything to allow time-efficient development initially with conversion to computer-efficient systems later.
Here the later history is shown after the initial exponential growth phase. This changes the economics of the company dramatically.
The startup can throw away history because it is so small. That means that the startup has almost no compatibility requirement because the data lost due to lack of compatibility is a small fraction of the total data.
A large enterprise cannot do that. They have to have access to the old data and have to share between old data and Hadoop-accessible data. This doesn't have to happen at the proof-of-concept level, but it really must happen when Hadoop first goes to production.
But stock Hadoop does not handle this well.
This is because Hadoop and other data silos have different foundations. What is worse, there is a semantic wall that separates HDFS from normal resources.
Here is a picture that shows how MapR can replace the foundation and provide compatibility. Of course, MapR provides much more than just the base, but the foundation is what imposes the fundamental limitation, or in MapR's case, the lack of one.