MongoDB + Hadoop
Wingman, MongoDB Inc.
Norberto Leite
#bcnmug
Agenda
•  MongoDB
•  Hadoop
•  MongoDB + Hadoop Connector
•  How it works
•  What can we do with it
MongoDB
MongoDB
The leading NoSQL database
•  Document Database
•  Open-Source
•  General Purpose
Performance
•  Better Data Locality
•  In-Memory Caching
•  In-Place Updates
Scalability
Auto-Sharding
•  Increase capacity as you go
•  Commodity and cloud architectures
•  Improved operational simplicity and cost visibility
High Availability
•  Automated replication and failover
•  Multi-data center support
•  Improved operational simplicity (e.g., HW swaps)
•  Data durability and consistency
Shell and Drivers
•  Shell – command-line shell for interacting directly with the database
•  Drivers – drivers for most popular programming languages and frameworks (see the Java sketch below)
> db.collection.insert({product: "MongoDB",
type: "Document Database"})
>
> db.collection.findOne()
{
"_id" : ObjectId("5106c1c2fc629bfe52792e86"),
"product" : "MongoDB",
"type" : "Document Database"
}
Java
Python
Perl
Ruby
Haskell
JavaScript
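As a minimal sketch of the driver side, here is the same insert/findOne pair from Java (assuming the synchronous Java driver, org.mongodb:mongodb-driver-sync; the connection string and class name are illustrative, not from the deck):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class QuickStart {
    public static void main(String[] args) {
        // Connect to a local mongod (URI is an assumption for the example)
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> collection =
                client.getDatabase("test").getCollection("collection");
            // Insert the same document as the shell example
            collection.insertOne(new Document("product", "MongoDB")
                .append("type", "Document Database"));
            // Read it back – the driver equivalent of findOne()
            System.out.println(collection.find().first().toJson());
        }
    }
}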
Hadoop
Hadoop
•  Google publishes seminal papers
–  GFS – global file system (Oct 2003)
–  MapReduce – divide and conquer (Dec 2004)
–  How they indexed the internet
•  Yahoo builds and open sources (2006)
–  Doug Cutting lead, now at Cloudera
–  Most others now at Hortonworks
•  Commonly referred to as:
–  The elephant in the room!
Hadoop
•  Primary components
–  HDFS – Hadoop Distributed File System
–  MapReduce – parallel processing engine
•  Ecosystem
–  HIVE
–  HBASE
–  PIG
–  Oozie
–  Sqoop
–  Zookeeper
MongoDB+Hadoop Connector
MongoDB Hadoop Connector
•  http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html
•  Allows interoperation of MongoDB and Hadoop
•  “Give power to the people”
•  Allows processing across multiple sources
•  Avoid custom hacky exports and imports
•  Scalability and flexibility to accommodate Hadoop and/or MongoDB changes
MongoDB with Hadoop
[Architecture diagrams: MongoDB as the operational store; MongoDB feeding a data warehouse; MongoDB in an ETL pipeline]
Benefits and Features
•  Full multi-core parallelism to process data in
MongoDB
•  Full integration w/ Hadoop and JVM ecosystem
•  Can be used on Amazon Elastic MapReduce
•  Read and write backup files to local filesystem, HDFS, and S3
•  Vanilla Java MapReduce
•  But not only: Hadoop Streaming is supported too
Benefits and Features
•  Supports HIVE
•  Supports PIG
How it works
•  Adapter examines MongoDB input collection and
calculates a set of splits from data
•  Each split is assigned to a Hadoop node
•  In parallel, Hadoop pulls data from splits on MongoDB (or BSON) and starts processing locally
•  Hadoop merges results and streams output back to
MongoDB (or BSON) output collection
Example Tour
Tour bus stops
•  Java MapReduce w/ MongoDB-Hadoop Connector
•  Using Hadoop Streaming
•  Pig and Hive
Data Set
•  ENRON emails corpus (501,513 records, 1.75 GB)
•  Each document is one email
•  https://www.cs.cmu.edu/~enron/
{
"_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
"body" : "Here is our forecastnn ",
"filename" : "1.",
"headers" : {
"From" : "phillip.allen@enron.com",
"Subject" : "Forecast Info",
"X-bcc" : "",
"To" : "tim.belden@enron.com",
"X-Origin" : "Allen-P",
"X-From" : "Phillip K Allen",
"Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
"X-To" : "Tim Belden ",
"Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
"Content-Type" : "text/plain; charset=us-ascii",
"Mime-Version" : "1.0"
}
}
Document Example
Let’s build a graph between
senders and recipients and
count the messages exchanged
Graph Sketch
{"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
{"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
{"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
{"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
{"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
Receiver Sender Pairs
Vanilla Java MapReduce
@Override
public void map(NullWritable key, BSONObject val, final Context context)
        throws IOException, InterruptedException {
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
Map Phase – each document goes through the mapper function
public void reduce(final MailPair pKey, final Iterable<IntWritable> pValues,
                   final Context pContext)
        throws IOException, InterruptedException {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    BSONObject outDoc = BasicDBObjectBuilder.start()
        .add("f", pKey.from)
        .add("t", pKey.to)
        .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write(pkeyOut, new IntWritable(sum));
}
Reduce Phase – mapper outputs are grouped by key and passed to the reducer
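To run these two phases as a job, a driver wires the mapper and reducer to the connector's input/output formats. A minimal sketch, assuming hypothetical EnronMailMapper/EnronMailReducer class names for the methods above, and using the mongo.input.uri / mongo.output.uri properties shown next:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class EnronDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The connector picks up its source and sink from these properties
        conf.set("mongo.input.uri", "mongodb://my-db:27017/enron.messages");
        conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");

        Job job = Job.getInstance(conf, "enron sender/recipient graph");
        job.setJarByClass(EnronDriver.class);
        job.setMapperClass(EnronMailMapper.class);   // hypothetical class holding map() above
        job.setReducerClass(EnronMailReducer.class); // hypothetical class holding reduce() above
        job.setMapOutputKeyClass(MailPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(BSONWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}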
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages
mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson
mapred.input.dir=hdfs:///tmp/messages.bson
mapred.input.dir=s3:///tmp/messages.bson
Read From MongoDB (or BSON)
mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out
mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson
mapred.output.dir=hdfs:///tmp/results.bson
mapred.output.dir=s3:///tmp/results.bson
Write To MongoDB (or BSON)
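Because these are ordinary Hadoop configuration properties, the same job can switch between a live cluster and BSON backup files without code changes. A sketch (assuming the job is launched through connector tooling that honors these keys):

Configuration conf = new Configuration();
// Read from a BSON dump on HDFS instead of a live mongod
conf.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
conf.set("mapred.input.dir", "hdfs:///tmp/messages.bson");
// ...while still writing results back to MongoDB
conf.set("mongo.job.output.format", "com.mongodb.hadoop.MongoOutputFormat");
conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");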
mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "2586207@www4.imakenews.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "40enron@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..davis@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..hughes@enron.com" }, "count" : 4 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..lindholm@enron.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..schroeder@enron.com" }, "count" : 1 }
Query Data
Hadoop Streaming
Same thing, but let's use Python
PYTHON FTW!!!
Hadoop Streaming: How it works
> pip install pymongo_hadoop
Install the Python library
import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Map Phase
import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
Reduce Phase
Surviving Hadoop
MapReduce easier w/ PIG and Hive
•  PIG
–  Powerful language
–  Generates sophisticated map/reduce workflows from simple scripts
•  HIVE
–  Similar to PIG
–  SQL as language
data = LOAD 'mongodb://localhost:27017/db.collection'
using com.mongodb.hadoop.pig.MongoLoader;
STORE data INTO 'file:///output.bson'
using com.mongodb.hadoop.pig.BSONStorage;
MongoDB+Hadoop and PIG
MongoDB + Hadoop and PIG
•  PIG has some special datatypes
–  Bags
–  Maps
–  Tuples
•  MongoDB+Hadoop Connector converts between PIG
and MongoDB datatypes
raw = LOAD 'hdfs:///messages.bson'
    using com.mongodb.hadoop.pig.BSONLoader('', 'headers:[]');
send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;
-- filter && split
send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
send_recip_split = FOREACH send_recip_filtered GENERATE
    from as from, TRIM(FLATTEN(TOKENIZE(to))) as to;
-- group && count
send_recip_grouped = GROUP send_recip_split BY (from, to);
send_recip_counted = FOREACH send_recip_grouped GENERATE
    group, COUNT($1) as count;
STORE send_recip_counted INTO 'file:///enron_results.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
PIG
Upcoming
Roadmap Features
•  Performance Improvements – Lazy BSON
•  Full-Featured Hive Support
•  Support multi-collection input
•  API for custom splitter implementations
•  And lots more …
Recap
•  Use Hadoop for massive MapReduce computations on
big data sets stored in MongoDB
•  MongoDB can be used as a Hadoop filesystem
•  There are lots of tools to make it easier
–  Streaming
–  Hive
–  PIG
–  EMR
•  https://github.com/mongodb/mongo-hadoop/tree/master/examples
Q&A
MongoDB + Hadoop
Senior Solutions Architect, MongoDB Inc.
Norberto Leite
#bcnmug
