2. My Background
University, Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– big data since before it was big
Open source
– even before the internet
– Apache Hadoop, Mahout, Zookeeper, Drill
– bought the beer at first HUG
MapR
Founding member of Apache Drill
©MapR Technologies - Confidential 2
3. MapR Technologies
Enterprise quality distribution for Hadoop
– Many extensions beyond basic Hadoop
Super strong team
– Long history of successful startups
Strong supporter of Apache Drill
– and open source in general
5. meta
Meta- (from Greek: μετά = "after", "beyond",
"with", "adjacent", "self"), is a…
6. Beyond ≠ answering yesterday’s problems
8. The study of the past
(what came before now)
9. What is the future?
(it comes after now)
12. But the future also
has a past!
13. the future of the past
is not
the past of the future
18. Those are
yesterday’s
answers
19. and also
the seeds
of tomorrow
20. Guys wearing
Fedoras
21. Hadoop has
a history
22. Hadoop also
has a
future
23. The Old Future of Hadoop
Implementing yet another Google paper
– Map-reduce and HDFS, and Yarn and Tez
– more and more, but not really different
Eco-system additions (more Google papers)
– simpler programming (Hive and Pig and Crunch) (Sawzall, FlumeJava, etc)
– key-value store (Bigtable)
– ad hoc query (Dremel)
– also not really different
Stands apart from other computing
– required by HDFS and other limitations
24. The New Future of Hadoop
Real-time processing
– Combines real-time and long-time
Integration with traditional IT
– No need to stand apart
Integration with new technologies
– Solr, Node.js, Twisted all should work directly on Hadoop
Fast and flexible computation
– Drill logical plan language
25. Example #1
Search Abuse
26. History matrix
One row per user
One column per thing
27. Recommendation based on
cooccurrence
Cooccurrence gives item-item
mapping
One row and column per thing
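The step from a per-user history to an item-item mapping can be sketched as follows. This is a minimal illustration using raw pair counts on made-up data; the user and item names are hypothetical, and it does not reproduce whatever scoring or filtering Mahout applies in the pipeline described in this deck.

```python
from collections import Counter
from itertools import combinations

# Hypothetical histories: one entry per user, listing the things that user touched.
histories = {
    "u1": ["a", "b"],
    "u2": ["a", "b", "c"],
    "u3": ["b", "c"],
    "u4": ["d"],
}

def cooccurrence(histories):
    """Count, for every pair of things, how many users have both in their history."""
    counts = Counter()
    for items in histories.values():
        # De-duplicate so a repeated interaction counts once per user.
        for x, y in combinations(sorted(set(items)), 2):
            counts[(x, y)] += 1
            counts[(y, x)] += 1  # keep the mapping symmetric
    return counts

cooc = cooccurrence(histories)
```

Each key of `cooc` is a (thing, thing) pair, so the result has one row and one column per thing, stored sparsely.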
29. [Diagram: complete history, cooccurrence (Mahout), item meta-data, Solr indexing, Solr indexers, index shards]
30. [Diagram: user history, web tier, Solr search, item meta-data, Solr indexers, index shards]
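One way to read the search-based design is that each item document stores the items that cooccur with it, and a user's history is used as the query. The sketch below imitates that idea in memory with a simple overlap score. The `indicators` field name, the sample index, and the scoring are assumptions for illustration; a real Solr deployment would use its own relevance scoring.

```python
# Hypothetical index: each item document carries an "indicators" field listing
# the items that cooccur with it (as produced by an offline cooccurrence job).
index = {
    "a": {"indicators": {"b", "c"}},
    "b": {"indicators": {"a", "c"}},
    "c": {"indicators": {"a", "b"}},
    "d": {"indicators": set()},
}

def recommend(user_history, index, limit=3):
    """Rank unseen items by the overlap between the history and their indicators."""
    seen = set(user_history)
    scored = []
    for item, doc in index.items():
        if item in seen:
            continue  # do not recommend what the user already has
        score = len(seen & doc["indicators"])
        if score > 0:
            scored.append((score, item))
    # Highest overlap first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:limit]]

recs = recommend(["a", "b"], index)
```

The appeal of this arrangement is that the web tier only issues an ordinary search query at request time, while the cooccurrence analysis runs offline.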
31. Objective Results
At a very large credit card company
History is all transactions, all web interaction
Processing time cut from 20 hours per day to 3
Recommendation engine load time decreased from 8 hours to 3 minutes
32. Scaling Estimates – Twitter Fire hose
Old School – 8+ separate clusters, 20-25 nodes
– >3 Kafka nodes
– >2 TwitterLogger
– 5-10 Hadoop
– >3 Storm
– 3 zookeepers (or not?)
– NAS for web storage
– >2 web servers
MapR – one platform
– 5-10 nodes total, any node does any job
– Full HA included, backups included, disaster recovery included
33. Example #2
Web Technology
34. [Diagram: real-time data, fast analysis (Storm), raw logs, analytic output]
35. [Diagram: large analysis (map-reduce), raw logs, analytic output]
36. [Diagram: browser, query, presentation tier (d3 + node.js), analytic output, raw logs]
37. Old School Storm: Complex architecture
[Diagram: Twitter API, TwitterLogger, Kafka cluster, Storm, Flume, HDFS, Hadoop, NAS, web data, http web-server]
38. MapR: One Platform with Streaming Writes
[Diagram: Twitter API, TwitterLogger, Catcher, Storm, NFS, HDFS API (optional), MapReduce, topic queue, web data, http web-server, all on MapR]
Users can also run extended
analytics/MapReduce on the stored
data
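The streaming-writes idea, where one component appends to a file on a shared mount and another reads the new data as it arrives, can be sketched with ordinary file operations. This is only an illustration under assumptions: the path is a temporary directory standing in for an NFS-mounted cluster path, and the function names and JSON-lines record format are hypothetical.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Append one JSON record as a line, as a logger might on a mounted file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def read_new_records(path, offset):
    """Read complete records written since `offset`; return them and the new offset."""
    records = []
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a partially written line; retry later
            records.append(json.loads(line))
            offset += len(line)
    return records, offset

# Stand-in for an NFS-mounted cluster path (hypothetical; a temp dir is used here).
log_path = os.path.join(tempfile.mkdtemp(), "tweets.log")

append_record(log_path, {"id": 1, "text": "hello"})
first, pos = read_new_records(log_path, 0)
append_record(log_path, {"id": 2, "text": "world"})
second, pos = read_new_records(log_path, pos)
```

Because the reader tracks a byte offset, the same file can serve both a real-time consumer and a later batch job over the full stored data.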
40. Objective Results
Real-time + long-time analysis is seamless
Web tier can be rooted directly on Hadoop cluster
No need to move data
41. The future is
not what we
thought it
would be
43. Get Involved!
Tweet:
#strataconf
#mapr
@ted_dunning
44. Get Involved!
Join Apache Drill!
– drill-dev-subscribe@incubator.apache.org
– Follow @apachedrill
Join MapR!
– jobs@mapr.com
Download these slides
– http://www.mapr.com/company/events/strata-conference-2-2-27-13
Contact me:
– tdunning@maprtech.com
– tdunning@apache.org
– @ted_dunning
Editor's Notes
Take all of Twitter: 400 x 10^6 tweets per day < 400 GB per day < 40 MB/s.
Kafka is a message queuing system.
Catcher is a processor.
All of the systems can be run out of Hadoop. Warden can be configured to run Storm as well.
Simple architecture – all from one platform. The green blocks are data that is available for other analytics.
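The throughput estimate in the notes can be checked with a few lines of arithmetic. The 1 kB per tweet figure is an assumption, inferred from the notes' bound of 400 GB per day for 400 x 10^6 tweets; decimal units (1 GB = 10^9 bytes) are also assumed.

```python
TWEETS_PER_DAY = 400 * 10**6   # from the notes
BYTES_PER_TWEET = 1000         # assumption: about 1 kB per tweet
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = TWEETS_PER_DAY * BYTES_PER_TWEET
gb_per_day = bytes_per_day / 10**9
mb_per_second = bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"{gb_per_day:.0f} GB/day, average {mb_per_second:.1f} MB/s")
```

Under these assumptions the average rate works out to a little under 5 MB/s, so the 40 MB/s figure in the notes is a loose upper bound rather than an average.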