7. Hadoop
A framework for distributed processing of large data sets
• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics
• Not a database
• No indexes
• Batch processing
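Hadoop's batch model reduces to a map phase, a framework-managed shuffle, and a reduce phase. A minimal sketch of that flow in plain Python (simulating the shuffle in-process; this is an illustration of the model, not an actual Hadoop job):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word, as a Hadoop Streaming mapper would.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key -- the framework does this between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Sum the counts for one word.
    return key, sum(values)

lines = ["Hadoop is a framework", "a framework for batch processing"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts["framework"])  # 2
```

The same three functions, distributed across many machines and fed terabytes instead of two strings, is the essence of a Hadoop batch job.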
11. Commerce Use Case
Applications powered by MongoDB:
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis powered by Hadoop:
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
Linked by the MongoDB Connector for Hadoop
12. Insurance Use Case
Applications powered by MongoDB:
• Customer profiles
• Insurance policies
• Session data
• Call center data
Analysis powered by Hadoop:
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates
Linked by the MongoDB Connector for Hadoop
13. Fraud Detection Use Case
(Diagram) Online payments processing writes payments to MongoDB. A nightly analysis job, fed through the MongoDB Connector for Hadoop together with 3rd-party data sources, builds fraud models; results are written back to a results cache in MongoDB, which the fraud detection service reads (query only).
17. MongoDB Connector for Hadoop
Applications (MongoDB):
• Low latency
• Rich, fast querying
• Flexible indexing
• Aggregations in database
• Known data relationships
• Great for any subset of data
Distributed Analytics (Hadoop):
• Longer jobs
• Batch analytics
• Highly parallel processing
• Unknown data relationships
• Great for looking at all data or large subsets
Linked by the MongoDB Connector for Hadoop
19. MongoDB Data Operations Spectrum
• Document retrieval – ~1 ms if in cache, ~10 ms from spinning disk
• .find() – per-document cost similar to single-document retrieval
– _id range
– any secondary index range; can be a composite key
– intersection of two indexes
– covered queries are even faster
• .count(), .distinct(), .group() – fast; may be covered by an index
• .aggregate() – retrieval cost like .find(), plus pipeline operations
– $match, $group
– $project, $redact
• .mapReduce() – in-database JavaScript
• Hadoop Connector
– mongo.input.query for an indexed partial scan
– full scan
(Listed roughly fastest to slowest)
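The shapes of these operations can be made concrete without a server. A sketch using PyMongo-style query and pipeline dicts (the collection and field names custId and total are invented for illustration):

```python
# Indexed range on a secondary index -- fast, per-document cost like .find():
find_filter = {"custId": {"$gte": 1000, "$lt": 2000}}

# Covered query: project only indexed fields so documents need not be fetched:
projection = {"_id": 0, "custId": 1}

# Aggregation: same retrieval cost as the find above, plus pipeline work
# ($match can use the index; $group then runs inside the database):
pipeline = [
    {"$match": find_filter},
    {"$group": {"_id": "$custId", "total": {"$sum": "$total"}}},
]

print(pipeline[0]["$match"]["custId"]["$lt"])  # 2000
```

With a live connection these would be passed to collection.find(find_filter, projection) and collection.aggregate(pipeline).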
23. MetLife – Single View
(Diagram) Siloed source systems (Cards Silo 1 … Cards Silo N) feed, via pub-sub/ETL, an Operational Data Layer serving a Single CSR Application, a Unified Customer Portal, and Operational Reporting.
Operational Data Layer (MongoDB):
• Insurance policies
• Demographic data
• Customer web data
• Call center data
DW/Data Lake (via MongoDB Connector for Hadoop):
• Churn prediction algorithms
• Customer clustering
• Churn analysis
• Predictive analytics
24. Foursquare
• k-nearest neighbor problems
– similarity of venues, people, or brands
• MongoDB data has advantages when used with MapReduce
– log files can be stale
– log files may not contain as much information
– you can scan much less data
(Diagram: MongoDB → BSON dump → Hadoop, via the MongoDB Connector for Hadoop)
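The k-nearest-neighbor problems mentioned above can be sketched in a few lines; the venue names and coordinate vectors here are invented for illustration (at Foursquare's scale, this brute-force scan is what MapReduce parallelizes):

```python
import math

def knn(query, points, k):
    # Brute-force k-nearest neighbors by Euclidean distance.
    dist = lambda p: math.dist(query, p[1])
    return [name for name, _ in sorted(points, key=dist)[:k]]

# Hypothetical venues with 2-D feature vectors:
venues = [("cafe", (0.0, 1.0)), ("bar", (5.0, 5.0)), ("gym", (0.5, 0.5))]
print(knn((0.0, 0.0), venues, 2))  # ['gym', 'cafe']
```

Real venue similarity would use many more dimensions and a similarity measure tuned to the domain, but the structure of the computation is the same.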
32. Pig
• High-level platform for creating MapReduce jobs
• Pig Latin abstracts Java into easier-to-use notation
• Executed as a series of MapReduce applications
• Supports user-defined functions (UDFs)
33.
samples = LOAD 'mongodb://127.0.0.1:27017/sensor.logs'
    USING com.mongodb.hadoop.pig.MongoLoader('deviceId:int,value:double');
grouped = GROUP samples BY deviceId;
sample_stats = FOREACH grouped {
    mean = AVG(samples.value);
    GENERATE group AS deviceId, mean AS mean;
};
STORE sample_stats INTO 'mongodb://127.0.0.1:27017/sensor.stats'
    USING com.mongodb.hadoop.pig.MongoStorage;
34. Hive
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query, and analysis
• HiveQL is an SQL-like query language
• Supports user-defined functions (UDFs)
37. Spark
An engine for processing Hadoop data. Can perform MapReduce in addition to streaming, interactive queries, and machine learning.
• Powerful built-in transformations and actions
– map, reduceByKey, union, distinct, sample, intersection, and more
– foreach, count, collect, take, and many more
39. Operate through Spark on the RDD Object
val fiveMinBars = groupBars.map(
  g => (
    g.head.get("_id"),
    new BasicBSONObject(g.head.toMap()).
      append("Close",  g.last.get("Close")).
      append("High",   g.map(b => b.get("High").toString.toFloat).reduceLeft(math.max)).
      append("Low",    g.map(b => b.get("Low").toString.toFloat).reduceLeft(math.min)).
      append("Volume", g.map(b => b.get("Volume").toString.toInt).foldLeft(0)(_ + _))
  )
)
40. Put It Back Where You Found It
// Create a separate Configuration for saving data back to MongoDB.
val outputConfig = new Configuration()
outputConfig.set("mongo.output.format", "com.mongodb.hadoop.MongoOutputFormat")
outputConfig.set("mongo.output.uri",
  "mongodb://" + mongoPort + "/marketdata.fiveminutebars")
fiveMinBars.saveAsNewAPIHadoopFile(
  "file:///dummy",      // path is unused; output goes to MongoDB
  classOf[Any],
  classOf[Any],
  classOf[MongoOutputFormat[_, _]],
  outputConfig)
42. More Complete EDM Architecture & Data Lake
(Diagram) Siloed source databases, external feeds (batch), and streams feed a data processing pipeline (pub-sub, ETL, file imports, stream processing) into two destinations:
• Operational Data Layer (MongoDB) – serves operational applications & reporting (Single CSR Application, Unified Digital Apps, Operational Reporting, downstream systems); the optimal location for providing operational response times & slices
• Data Lake with distributed processing – serves analytic reporting, customer clustering, churn analysis, and predictive analytics; can run processing on all data or slices
Governance determines where to load and process data.
43. Code “JakeAngerman” gets 25% off
Super Early Bird Registration Ends March 25, 2016
June 28 - 29, 2016
New York, NY
www.mongodbworld.com
44. Links
• Schema design basics
– https://www.mongodb.com/presentations/schema-design-basics-1
• 6 Rules of Thumb for MongoDB Schema Design
– http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1
– http://blog.mongodb.org/post/87892923503/6-rules-of-thumb-for-mongodb-schema-design-part-2
– http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3
• Multi-tenant Architecture
– https://www.mongodb.com/presentations/securing-mongodb-to-serve-an-aws-based-multi-tenant-security-fanatic-saas-application
• MongoDB and Spark Tutorial
– https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup