MongoDB and Hadoop can work together to solve the big data problems facing today's enterprises. We will look in depth at how the two technologies complement each other, enabling richer analyses and greater intelligence. We will then take a deep dive into the MongoDB Connector for Hadoop, show how it enables new business insights with MapReduce, Pig, and Hive, and demo a Spark application that drives product recommendations.
5. Operational: MongoDB
[Diagram: spectrum of enterprise workloads: Real-Time Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis]
6. MongoDB
• Store and read data frequently
• Easy administration
• Built-in analytical tools
– aggregation framework
– JavaScript MapReduce
– Geo/text indexes
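As a taste of the aggregation framework, the pipeline below counts orders and revenue per customer (a minimal sketch; the `orders` collection and its `status`, `customer_id`, and `total` fields are hypothetical, and with PyMongo the pipeline would be run as `db.orders.aggregate(pipeline)`):

```python
# Aggregation pipeline: filter completed orders, then group by
# customer to get order counts and revenue per customer.
pipeline = [
    {"$match": {"status": "complete"}},   # filter stage
    {"$group": {                          # group stage
        "_id": "$customer_id",
        "order_count": {"$sum": 1},
        "revenue": {"$sum": "$total"},
    }},
    {"$sort": {"revenue": -1}},           # highest spenders first
]

# Against a live server, with PyMongo:
#   results = db.orders.aggregate(pipeline)
```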
7. Analytical: Hadoop
[Same workload diagram as slide 5]
8. Hadoop
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics
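The "simple programming models" here are map and reduce functions; the pure-Python sketch below illustrates the model itself with a word count (an illustration only, not a runnable Hadoop job, where the same two functions would be distributed across the cluster):

```python
# Pure-Python illustration of the MapReduce programming model:
# a word count expressed as a map function and a reduce function.
from collections import defaultdict

def map_phase(records):
    # Emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield word, 1

def reduce_phase(pairs):
    # Sum the counts emitted for each distinct key.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_phase(map_phase(["big data", "big clusters"]))
# counts == {"big": 2, "data": 1, "clusters": 1}
```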
9. Operational vs. Analytical: Lifecycle
[Same workload diagram as slide 5]
11. Batch Aggregation
[Diagram: applications powered by MongoDB on one side, analysis powered by Hadoop on the other, linked by the MongoDB Connector for Hadoop]
● Need more than MongoDB aggregation
● Need offline processing
● Results sent back to MongoDB
● Can be left as BSON on HDFS for further analysis
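Wiring up a batch job like this is mostly configuration; the sketch below shows the connector properties involved (the host and the `events`/`daily_totals` collections are placeholders; `mongo.input.uri`, `mongo.output.uri`, and `mongo.input.query` are the connector's property names):

```python
# Sketch of the Hadoop job properties that wire the connector in:
# read the working set from a live MongoDB collection and write
# aggregated results back to another collection.
job_config = {
    # where the mappers read their input splits from
    "mongo.input.uri": "mongodb://host:27017/test.events",
    # where the results are inserted
    "mongo.output.uri": "mongodb://host:27017/test.daily_totals",
    # optional: push a query down so only matching documents are read
    "mongo.input.query": '{"status": "complete"}',
}
```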
12. Commerce
[Diagram: commerce applications powered by MongoDB, analysis powered by Hadoop, linked by the MongoDB Connector for Hadoop]
Applications (MongoDB):
• Products & inventory
• Recommended products
• Customer profile
• Session management
Analysis (Hadoop):
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
13. Fraud Detection
[Diagram: online payments processing writes payments to MongoDB; nightly analysis in Hadoop combines them with 3rd-party data sources for fraud modeling; results flow back through the MongoDB Connector for Hadoop into a results cache that the fraud detection service reads (query only)]
16. Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS:
• Dynamic queries work with the most recent data, but put load on the operational database
• Snapshots move the processing load to Hadoop, and add only predictable load to MongoDB
17. Connector Operation
1. Split according to the given InputFormat
   - many options available for reading from a live cluster
   - configure key pattern, split strategy
2. Write splits file
3. Output to a BSON file or live MongoDB
   - BSON file splits written automatically for future tasks
   - MongoDB insertions round-robin across collections
18. Getting Splits
• Split on a sharded cluster
  – split by chunk
  – split by shard
• Splits on a replica set/standalone
  – splitVector command
• BSON files
  – specify max docs
  – split per input file
[Diagram: MongoDB Connector for Hadoop reading through a mongos from a sharded cluster: config servers and three shards, each holding several chunks]
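The "specify max docs" strategy for BSON files is simple enough to sketch in pure Python: cut a new split after every max_docs documents (an illustration of the idea only, not the connector's actual implementation):

```python
def split_by_max_docs(docs, max_docs):
    """Partition a sequence of documents into splits of at most
    max_docs documents each, mirroring the 'specify max docs'
    strategy for BSON input files."""
    splits = []
    for i in range(0, len(docs), max_docs):
        splits.append(docs[i:i + max_docs])
    return splits

# Ten documents with a max of 4 per split -> splits of 4, 4, and 2.
docs = [{"_id": n} for n in range(10)]
splits = split_by_max_docs(docs, 4)
```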
22. Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and the data URI
• Load/save data using the SparkContext Hadoop file API
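Concretely, loading a collection through the connector from PySpark might look like the sketch below (assumptions: the connector jar is on Spark's classpath, the URI is a placeholder, and the Text/MapWritable key and value classes follow the connector's examples; `newAPIHadoopRDD` is PySpark's generic entry point for Hadoop InputFormats):

```python
# Hadoop Configuration entries for the connector, as a plain dict;
# the URI below is a placeholder for a real cluster address.
config = {"mongo.input.uri": "mongodb://host:27017/test.events"}

def load_collection(sc):
    """Read the collection named in config into an RDD of
    (key, document) pairs via the connector's MongoInputFormat."""
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=config,
    )
```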
23. Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONSerDe
24. Hive Support
● Types given by the schema
● May use structs to project fields out of documents and ease access
● Can explode nested fields to make them top-level, e.g.
  {"customer": {"name": "Bart"}}
  can be accessed as "customer.name".

MongoDB                             Hive
Primitive type (int, String, etc.)  Primitive type (int, float, etc.)
Document                            Row
Sub-document                        Struct, Map, or exploded field
Array                               Array or exploded field
25. Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection'
    USING com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
    USING com.mongodb.hadoop.pig.BSONStorage;
26. Pig Mappings
● Organize and prune documents by specifying a schema
● Access the full document in a Map without needing a schema

MongoDB                             Pig
Primitive type (int, String, etc.)  Primitive type (int, chararray, etc.)
Document                            Tuple (schema given)
Document                            Tuple containing a Map (no schema)
Sub-document                        Map
Array                               Bag