MongoDB and Hadoop 
Luke Lovett 
Software Engineer, MongoDB
Agenda 
• Complementary Approaches to Data 
• MongoDB & Hadoop Use Cases 
• MongoDB Connector Overview and Features 
• Demo
Complementary Approaches to Data
Operational: MongoDB 
[Figure: use-case spectrum. Labels: Real-Time Analytics; Product/Asset Catalogs; Security & Fraud; Internet of Things; Mobile Apps; Customer Data Mgmt; Single View; Social; Churn Analysis; Recommender; Warehouse & ETL; Risk Modeling; Trade Surveillance; Predictive Analytics; Ad Targeting; Sentiment Analysis]
MongoDB 
• Store and read data frequently 
• Easy administration 
• Built-in analytical tools 
– aggregation framework 
– JavaScript MapReduce 
– Geo/text indexes
Analytical: Hadoop 
[Figure: the use-case spectrum from the "Operational: MongoDB" slide, repeated]
Hadoop 
The Apache Hadoop software library is a framework that allows for the 
distributed processing of large data sets across clusters of computers 
using simple programming models. 
• Terabyte and Petabyte datasets 
• Data warehousing 
• Advanced analytics
Operational vs. Analytical: Lifecycle 
[Figure: the use-case spectrum from the "Operational: MongoDB" slide, repeated]
MongoDB & Hadoop Use Cases
Batch Aggregation 
[Figure: applications and analysis, each labeled "powered by" with a logo, linked by the MongoDB Connector for Hadoop]
● Need more than MongoDB aggregation 
● Need offline processing 
● Results sent back to MongoDB 
● Can be left as BSON on HDFS for further analysis
Commerce 
[Figure: applications and analysis, each labeled "powered by" with a logo, linked by the MongoDB Connector for Hadoop]
Applications
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
Fraud Detection 
[Figure: fraud detection pipeline. Labels: Payments; Online payments processing; Fraud Detection (query only); 3rd Party Data Sources; Nightly Analysis; Fraud modeling; MongoDB Connector for Hadoop; Results Cache]
MongoDB Connector for Hadoop
Connector Overview 
• Hadoop: MapReduce, Hive, Pig, Spark
• HDFS / S3: text files and BSON files (via the Hadoop Connector)
• MongoDB: single node, replica set, or cluster (via the Hadoop Connector)
• Distributions: Apache Hadoop / Cloudera CDH / Hortonworks HDP / Amazon EMR
Data Movement 
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries with most recent data
• Puts load on operational database
• Snapshots move load to Hadoop
• Snapshots add predictable load to MongoDB
Connector Operation 
1. Split according to given InputFormat
- many options available for reading from live cluster
- configure key pattern, split strategy
2. Write splits file
3. Output to BSON file or live MongoDB
- BSON file splits written automatically for future tasks
- Mongo insertion round-robin across collections
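
The splitting step can be pictured with a small sketch. This is a hypothetical illustration, not the connector's code: it assumes a split strategy has already produced an ordered list of boundary keys, and it turns them into contiguous, half-open key ranges, one per input split. The names `make_splits` and `to_query` are invented for this example.

```python
from typing import Any, Dict, List, Optional, Tuple

# A split is a half-open key range [lower, upper); None means unbounded.
Split = Tuple[Optional[Any], Optional[Any]]


def make_splits(boundaries: List[Any]) -> List[Split]:
    """Turn ordered boundary keys into contiguous half-open ranges."""
    edges = [None] + sorted(boundaries) + [None]
    return [(edges[i], edges[i + 1]) for i in range(len(edges) - 1)]


def to_query(split: Split, key: str = "_id") -> Dict[str, Dict[str, Any]]:
    """Express one split as a MongoDB-style range filter on the split key."""
    lower, upper = split
    bounds: Dict[str, Any] = {}
    if lower is not None:
        bounds["$gte"] = lower
    if upper is not None:
        bounds["$lt"] = upper
    return {key: bounds} if bounds else {}


# Three boundary keys yield four splits that cover the whole key space.
for split in make_splits([100, 200, 300]):
    print(split, to_query(split))
```

Because the ranges are half-open and share their edges, every document falls into exactly one split, which is what lets each Hadoop task read its range independently.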
Getting Splits 
• Split on a sharded cluster
– Split by chunk
– Split by shard
• Splits on replica set/standalone
– splitVector command
• BSON files
– specify max docs
– split per input file
[Figure: MongoDB Connector for Hadoop reading from a sharded cluster: config servers, a mongos, and three shards holding three chunks each]
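
For the BSON file case, splitting by a maximum document count is possible because every BSON document begins with a 4-byte little-endian length. The sketch below is an illustration of that idea only, not the connector's implementation; the function names and the tiny hand-built dump are assumptions made for the example.

```python
import struct
from typing import List, Tuple


def document_offsets(data: bytes) -> List[int]:
    """Return the starting byte offset of every BSON document in a dump."""
    offsets = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated length prefix at byte %d" % pos)
        (size,) = struct.unpack_from("<i", data, pos)
        # The smallest valid document is 5 bytes: length prefix + terminator.
        if size < 5 or pos + size > len(data):
            raise ValueError("invalid document size at byte %d" % pos)
        offsets.append(pos)
        pos += size
    return offsets


def splits_by_max_docs(data: bytes, max_docs: int) -> List[Tuple[int, int]]:
    """Group documents into (start_offset, byte_length) splits."""
    if max_docs < 1:
        raise ValueError("max_docs must be at least 1")
    offsets = document_offsets(data)
    splits = []
    for i in range(0, len(offsets), max_docs):
        start = offsets[i]
        nxt = i + max_docs
        end = offsets[nxt] if nxt < len(offsets) else len(data)
        splits.append((start, end - start))
    return splits


# Hand-built documents: {} is 5 bytes, {"a": 1} (int32) is 12 bytes.
EMPTY = b"\x05\x00\x00\x00\x00"
A_ONE = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
dump = EMPTY + A_ONE + EMPTY
print(splits_by_max_docs(dump, max_docs=2))  # [(0, 17), (17, 5)]
```

Each split is a byte range that starts on a document boundary, so a task can seek to its offset and decode documents without scanning the rest of the file.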
MapReduce Configuration 
• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
MapReduce Configuration 
• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
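
One way to apply the settings on the two configuration slides is to pass them as `-D` properties on the Hadoop command line, which works when the job's driver parses Hadoop's generic options. The sketch below only renders the arguments; `to_hadoop_args` is a hypothetical helper. The class names are written with the `com.mongodb.hadoop` package, an assumption that matches the `com.mongodb.hadoop.hive` and `com.mongodb.hadoop.pig` classes on the Hive and Pig slides.

```python
from typing import Dict, List

# Reading from and writing to a live MongoDB deployment.
MONGO_JOB: Dict[str, str] = {
    "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
    "mongo.input.uri": "mongodb://mydb:27017/db1.collection1",
    "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
    "mongo.output.uri": "mongodb://mydb:27017/db1.collection2",
}

# Reading from and writing to BSON files on HDFS.
BSON_JOB: Dict[str, str] = {
    "mongo.job.input.format": "com.mongodb.hadoop.BSONFileInputFormat",
    "mapred.input.dir": "hdfs:///tmp/database.bson",
    "mongo.job.output.format": "com.mongodb.hadoop.BSONFileOutputFormat",
    "mapred.output.dir": "hdfs:///tmp/output.bson",
}


def to_hadoop_args(props: Dict[str, str]) -> List[str]:
    """Render properties as -Dkey=value arguments, in insertion order."""
    return ["-D%s=%s" % (key, value) for key, value in props.items()]


for arg in to_hadoop_args(MONGO_JOB):
    print(arg)
```

Keeping the two property sets side by side shows that switching between a live cluster and a BSON snapshot changes only the format classes and the location settings, not the job itself.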
Spark Usage 
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API
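
A minimal PySpark-flavoured sketch of those three bullets follows. It assumes `sc` is a `pyspark.SparkContext` with the connector jars on its classpath, and it uses the `Text` and `MapWritable` class names as key and value types, which is an assumption rather than something stated in the slides. `input_conf` and `load_collection` are hypothetical helper names.

```python
from typing import Any, Dict

# Assumed class names; adjust to the connector version actually in use.
INPUT_FORMAT = "com.mongodb.hadoop.MongoInputFormat"
KEY_CLASS = "org.apache.hadoop.io.Text"
VALUE_CLASS = "org.apache.hadoop.io.MapWritable"


def input_conf(uri: str) -> Dict[str, str]:
    """Build the Hadoop configuration entries for reading one collection."""
    return {"mongo.input.uri": uri}


def load_collection(sc: Any, uri: str) -> Any:
    """Load a collection as an RDD through the Hadoop InputFormat API.

    `sc` is expected to behave like pyspark.SparkContext; nothing here
    imports Spark, so the function can be read and tested without it.
    """
    return sc.newAPIHadoopRDD(
        inputFormatClass=INPUT_FORMAT,
        keyClass=KEY_CLASS,
        valueClass=VALUE_CLASS,
        conf=input_conf(uri),
    )
```

With a real context, `load_collection(sc, "mongodb://mydb:27017/db1.collection1")` would return an RDD of key/value pairs; saving works the same way in reverse with an output format and an output URI.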
Hive Support 
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables 
• Use with MongoStorageHandler or BSONSerDe
Hive Support 
● Types given by schema 
● May use structs to project fields out of documents and ease access 
● Can explode nested fields to make them top-level: 
{"customer": {"name": "Bart"}}
can be accessed with "customer.name".
MongoDB                              Hive
Primitive type (int, String, etc.)   Primitive type (int, float, etc.)
Document                             Row
Sub-document                         Struct, Map, or exploded field
Array                                Array or exploded field
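
The "exploded field" idea can be sketched in a few lines: a dotted path such as "customer.name" is followed through the nested document, and the value found becomes a top-level column. This is an illustration of the concept, not the connector's code; `get_path`, `to_row`, and the column names are hypothetical.

```python
from typing import Any, Dict


def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'customer.name' through nested documents."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # a missing field becomes an empty (NULL-like) value
        current = current[part]
    return current


def to_row(doc: Dict[str, Any], columns: Dict[str, str]) -> Dict[str, Any]:
    """Build a flat row from a document, given column name -> document path."""
    return {column: get_path(doc, path) for column, path in columns.items()}


doc = {"_id": 1, "customer": {"name": "Bart"}}
row = to_row(doc, {"id": "_id", "customer_name": "customer.name"})
print(row)  # {'id': 1, 'customer_name': 'Bart'}
```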
Pig Mappings 
• Input: BSONLoader and MongoLoader 
data = LOAD 'mongodb://mydb:27017/db.collection'
USING com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
USING com.mongodb.hadoop.pig.BSONStorage;
Pig Mappings 
● Organize and prune documents by specifying a schema 
● Access full document in a Map without needing a schema 
MongoDB                              Pig
Primitive type (int, String, etc.)   Primitive type (int, chararray, etc.)
Document                             Tuple (schema given)
Document                             Tuple containing a Map (no schema)
Sub-document                         Map
Array                                Bag
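
The mapping table can be mimicked with plain Python values to show the difference a schema makes. This is a rough analogy under stated assumptions, not how the loader is implemented: a Python tuple stands in for a Pig tuple, a dict for a Map, and a list of one-element tuples for a Bag (a Pig bag holds tuples). The function names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


def to_pig_value(value: Any) -> Any:
    """Convert a nested value using the table's mapping."""
    if isinstance(value, dict):
        # Sub-document -> Map
        return {key: to_pig_value(item) for key, item in value.items()}
    if isinstance(value, list):
        # Array -> Bag; each element is wrapped in a one-field tuple.
        return [(to_pig_value(item),) for item in value]
    return value  # primitive types pass through unchanged


def to_pig_tuple(doc: Dict[str, Any],
                 schema: Optional[List[str]] = None) -> Tuple[Any, ...]:
    """With a schema, keep only the named fields; without one, wrap a Map."""
    if schema is None:
        return (to_pig_value(doc),)
    return tuple(to_pig_value(doc.get(name)) for name in schema)


doc = {"name": "Bart", "tags": ["a", "b"], "address": {"city": "Springfield"}}
print(to_pig_tuple(doc, ["name", "tags"]))  # ('Bart', [('a',), ('b',)])
print(to_pig_tuple(doc))
```

Giving a schema prunes the document to the listed fields in a fixed order, while omitting it keeps everything reachable through a single Map.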
Demo!
Questions?
