MongoDB and Hadoop can work together to solve the big data problems facing today's enterprises. We will look in depth at how the two technologies complement each other, enabling richer analyses and greater intelligence. We will then take a deep dive into the MongoDB Connector for Hadoop, show how it enables new business insights with MapReduce, Pig, and Hive, and demo a Spark application that drives product recommendations.
5. Operational: MongoDB
[Diagram: spectrum of enterprise workloads: Real-Time Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis]
6. MongoDB
• Store and read data frequently
• Easy administration
• Built-in analytical tools
– aggregation framework
– JavaScript MapReduce
– Geo/text indexes
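As a taste of the aggregation framework, the pipeline below counts orders and revenue per customer (a minimal sketch; the `orders` collection and its `status`, `customer_id`, and `total` fields are hypothetical, and with PyMongo the pipeline would be run as `db.orders.aggregate(pipeline)`):

```python
# Aggregation pipeline: filter completed orders, then group by
# customer to get order counts and revenue per customer.
pipeline = [
    {"$match": {"status": "complete"}},   # filter stage
    {"$group": {                          # group stage
        "_id": "$customer_id",
        "order_count": {"$sum": 1},
        "revenue": {"$sum": "$total"},
    }},
    {"$sort": {"revenue": -1}},           # highest spenders first
]

# Against a live server, with PyMongo:
#   results = db.orders.aggregate(pipeline)
```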
7. Analytical: Hadoop
[Same workload diagram as slide 5]
8. Hadoop
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics
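The "simple programming models" here are map and reduce functions; the pure-Python sketch below illustrates the model itself with a word count (an illustration only, not a runnable Hadoop job, where the same two functions would be distributed across the cluster):

```python
# Pure-Python illustration of the MapReduce programming model:
# a word count expressed as a map function and a reduce function.
from collections import defaultdict

def map_phase(records):
    # Emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield word, 1

def reduce_phase(pairs):
    # Sum the counts emitted for each distinct key.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_phase(map_phase(["big data", "big clusters"]))
# counts == {"big": 2, "data": 1, "clusters": 1}
```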
9. Operational vs. Analytical: Lifecycle
[Same workload diagram as slide 5]
11. Batch Aggregation
[Diagram: applications powered by MongoDB on one side, analysis powered by Hadoop on the other, linked by the MongoDB Connector for Hadoop]
● Need more than MongoDB aggregation
● Need offline processing
● Results sent back to MongoDB
● Can be left as BSON on HDFS for further analysis
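Wiring up a batch job like this is mostly configuration; the sketch below shows the connector properties involved (the host and the `events`/`daily_totals` collections are placeholders; `mongo.input.uri`, `mongo.output.uri`, and `mongo.input.query` are the connector's property names):

```python
# Sketch of the Hadoop job properties that wire the connector in:
# read the working set from a live MongoDB collection and write
# aggregated results back to another collection.
job_config = {
    # where the mappers read their input splits from
    "mongo.input.uri": "mongodb://host:27017/test.events",
    # where the results are inserted
    "mongo.output.uri": "mongodb://host:27017/test.daily_totals",
    # optional: push a query down so only matching documents are read
    "mongo.input.query": '{"status": "complete"}',
}
```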
12. Commerce
[Diagram: commerce applications powered by MongoDB, analysis powered by Hadoop, linked by the MongoDB Connector for Hadoop]
Applications (MongoDB):
• Products & inventory
• Recommended products
• Customer profile
• Session management
Analysis (Hadoop):
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
13. Fraud Detection
[Diagram: online payments processing writes payments to MongoDB; nightly analysis in Hadoop combines them with 3rd-party data sources for fraud modeling; results flow back through the MongoDB Connector for Hadoop into a results cache that the fraud detection service reads (query only)]
16. Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS:
• Dynamic queries work with the most recent data, but put load on the operational database
• Snapshots move the processing load to Hadoop, and add only predictable load to MongoDB
17. Connector Operation
1. Split according to the given InputFormat
   - many options available for reading from a live cluster
   - configure key pattern, split strategy
2. Write splits file
3. Output to a BSON file or live MongoDB
   - BSON file splits written automatically for future tasks
   - MongoDB insertions round-robin across collections
18. Getting Splits
• Split on a sharded cluster
  – split by chunk
  – split by shard
• Splits on a replica set/standalone
  – splitVector command
• BSON files
  – specify max docs
  – split per input file
[Diagram: MongoDB Connector for Hadoop reading through a mongos from a sharded cluster: config servers and three shards, each holding several chunks]
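The "specify max docs" strategy for BSON files is simple enough to sketch in pure Python: cut a new split after every max_docs documents (an illustration of the idea only, not the connector's actual implementation):

```python
def split_by_max_docs(docs, max_docs):
    """Partition a sequence of documents into splits of at most
    max_docs documents each, mirroring the 'specify max docs'
    strategy for BSON input files."""
    splits = []
    for i in range(0, len(docs), max_docs):
        splits.append(docs[i:i + max_docs])
    return splits

# Ten documents with a max of 4 per split -> splits of 4, 4, and 2.
docs = [{"_id": n} for n in range(10)]
splits = split_by_max_docs(docs, 4)
```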
22. Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and the data URI
• Load/save data using the SparkContext Hadoop file API
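Concretely, loading a collection through the connector from PySpark might look like the sketch below (assumptions: the connector jar is on Spark's classpath, the URI is a placeholder, and the Text/MapWritable key and value classes follow the connector's examples; `newAPIHadoopRDD` is PySpark's generic entry point for Hadoop InputFormats):

```python
# Hadoop Configuration entries for the connector, as a plain dict;
# the URI below is a placeholder for a real cluster address.
config = {"mongo.input.uri": "mongodb://host:27017/test.events"}

def load_collection(sc):
    """Read the collection named in config into an RDD of
    (key, document) pairs via the connector's MongoInputFormat."""
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=config,
    )
```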
23. Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONSerDe
24. Hive Support
● Types given by the schema
● May use structs to project fields out of documents and ease access
● Can explode nested fields to make them top-level, e.g.
  {"customer": {"name": "Bart"}}
  can be accessed as "customer.name".

MongoDB                             Hive
Primitive type (int, String, etc.)  Primitive type (int, float, etc.)
Document                            Row
Sub-document                        Struct, Map, or exploded field
Array                               Array or exploded field
25. Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection'
    USING com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
    USING com.mongodb.hadoop.pig.BSONStorage;
26. Pig Mappings
● Organize and prune documents by specifying a schema
● Access the full document in a Map without needing a schema

MongoDB                             Pig
Primitive type (int, String, etc.)  Primitive type (int, chararray, etc.)
Document                            Tuple (schema given)
Document                            Tuple containing a Map (no schema)
Sub-document                        Map
Array                               Bag