Table of Contents

      1. Structure:
        1. Markus
        2. Flavio
      2. Who are we?
        1. Markus Gattol
        2. Flavio Percoco Premoli
      3. Introduction Part 1
        1. What I am going to tell you
      4. Integration with other Technologies
      5. Frequently Asked Questions
        1. Basics
          1. Are there any Reasons not to use MongoDB?
          2. What are the supported Programming Languages?
          3. What is the Status of Python 3 Support?
          4. What is the difference in the main Building-blocks to RDBMSs?
        2. Administration
          1. Is there a Web GUI? What about a REST Interface/API?
          2. Can I rename a Database?
          3. How do I physically migrate a Database?
            1. Secure Copy .... as in scp
            2. Minimum Downtime
          4. How do I update to a new MongoDB version?
          5. What is the default listening Port and IP?
          6. Is there a Way to do automatic Backups?
          7. What is getSisterDB() good for?
          8. How can I make MongoDB automatically start/restart on Server boot/reboot?
        3. Resource Usage
          1. Why is my Database growing so fast?
          2. What Caching Algorithm does MongoDB use?
          3. Why does MongoDB use so much RAM?
          4. What is the so-called Working Set Size?
          5. How much RAM does MongoDB need?
            1. Speed Impact of not having enough RAM
          6. Can I limit MongoDB's RAM Usage?
          7. What can I do about Out Of Memory Errors?
            1. OpenVZ
          8. Does MongoDB use more than one CPU Core?
          9. How can I tell how many clients are connected?
         10. How many parallel Client Connections to MongoDB can there be?
         11. Does MongoDB do Connection Pooling?
         12. Is there a Size limit of how much Data can be stored inside MongoDB?
         13. Do embedded Documents count toward the 4 MiB BSON Document Size Limit?
         14. Does Document Size impact read/write Performance?
         15. Is there a Way to tell the Size of a specific Document?
         16. How can I tell the Size of a Collection and its Indexes?
        4. Collections / Namespaces
          1. What is a Capped Collection? Why use it?
          2. Can I rename a Collection?
          3. What is a Virtual Collection? Why use it?
          4. Can I use a larger Number of Collections/Namespaces?
          5. How about cloning a Collection?
          6. Can I merge two or more Collections into one?
          7. How can I get a list of Collections in my Database?
          8. How do I delete a Collection?
          9. What is a Namespace with regards to MongoDB?
         10. How can I get a list of Namespaces in Database?
        5. Statistics / Monitoring
          1. The Server Status, what does it tell?
        6. Schema / Configuration
        7. Indexes / Search / Metadata
        8. Map / Reduce
        9. GridFS / Data Size
          1. What is GridFS?
            1. What can we do with GridFS
          2. Why use GridFS over ordinary Filesystem Storage?
       10. Scalability / Fault Tolerance / Load Balancing
       11. Miscellaneous
      6. Use Case
      7. Summary Part 1
      8. Introduction Part 2
      9. Existing Technologies
     10. SQL to MongoDB Query Translation
     11. Keeping things lazy...
     12. Keeping Relations or Embedding?
        1. Using References:
        2. Without references:
        3. Light and fast (For registered users):
        4. Heavy and slow (For any user):
        5. Lazy relations or mongodb like ones:
     13. Taking Advantage from schema-less Databases for Web Development
     14. Summary Part 2



• 2min: tell the audience what I am going to tell them (a summary) and why I think it's worth mentioning
• 3min: I'll start with a big picture view (how MongoDB just integrates nicely with existing setups eg folks can continue on using
  dm-crypt/luks) basic principles like
• 5min: pick a few FAQs items and elaborate on them eg "Why is MongoDB using so much RAM"
• 5min: I will then go on taking a use case as an example (a web application built with Django and MongoDB) from the financial
  domain where we need transactions/locking/ACID and talk about the differences to eg MySQL/PostgreSQL
• 5min: also, with this use case, other things like: storing various precision numbers
• 5min: summarize what I've told them

        You start after me and drill down on details (the stuff you mentioned in your email ~9 days ago) or whatever you/we see fit.


• 2min: I'll tell the audience the topics I'll talk about and how they help us with mongodb and django integration
• 5min: Mappers & Stack, I'll list some of the current ODMs used to integrate mongodb and django and how django-mongodb-engine
  integrates with django and mongodb.
• 5min: I'll talk about queries, what we have in sql that we don't have in mongodb and how we can obtain the same results using it
        ◦ perfect, nothing to add/change here
• 3min: I'll talk about embedding and referencing, when each one is worth doing and why
• 5min: I'll talk about how it is possible to take advantage of schemaless databases in web programming (django oriented)
        ◦ ok sounds good, not sure I understand exactly; approach me today on #sunoano and give me an example
• 5min: Summarize and maybe some benchmark!!!
Who are we?

 Still, with all the technology we have these days, at the end of the day it is all about the people ...

       /me definitely not a
Markus Gattol

• grown up in Carinthia (southernmost Austrian state, bordering Italy), lives in the UK now
• technical background, MSc (Computer Science, Electrical Engineering)
• with Linux (Debian) since 1995, Contributor
• RDBMSs, the usual ...
• Open Source Developer/Contributor in general
• website
• works for Heart Internet Ltd., NSN before that
Flavio Percoco Premoli

• GNOME a11y Contributor (MouseTrap)
• Open Source Developer/Contributor (Web and Desktop)
• R&D Developer at The Net Planet Europe
        ◦ NoSQL Technologies
        ◦ Cloud Computing
        ◦ Knowledge Management Systems
• Linux Lover/User and Mac user too
• website:
• Twitter: FlaPer87
• Github: FlaPer87
• Bitbucket: FlaPer87
• Everywhere else: FlaPer87
Introduction Part 1

 The why ...

   1.   why are you here today?
   2.   why does some business want to know about new technology?
   3.   why are we looking to move away from RDBMSs to NoSQL DBMSs?
   4.   "Hardware and software are good when they can be understood while being used, and not when one could perhaps
        fly to Mars with them."

                    Part 1 is mainly about MongoDB itself and not about Django/Python .... Part 2? .... Django!

   What I am going to tell you

        Best listener experience possible ...

               Introduction Part 1 ... Tell the audience what you're going to tell them
               Tell them
                     Integration with other Technologies
                     Frequently Asked Questions
                     Use Case
               Summary Part 1 ... Tell the audience what you told them
Integration with other Technologies

   • How can I get MongoDB?
   • Ok, have it! Now what?

   1.   full-disk encryption / filesystem-level encryption
   2.   backup technologies, Rsync/Unison, Bacula, Amanda
   3.   LVM
   4.   VPN, SSH
   5.   Virtualization, OpenVZ
Frequently Asked Questions

 Well, just because ...

Before we start running we need to be able to walk ...
Are there any Reasons not to use MongoDB?

1.   We need transactions (ACID (Atomicity, Consistency, Isolation, Durability)).
2.   Our data is very relational.
3.   Related to 2, we want to be able to do joins on the server (but can not do embedded objects / arrays).
4.   We need triggers on our tables. There might be triggers available soon however.
5.   We rely on triggers (or similar functionality) for cascading updates or deletes.
6.   We need the database to enforce referential integrity (MongoDB has no notion of this at all).
7.   We need 100% per-node durability.
8.   We need a write-ahead log. MongoDB does not have one simply because it does not need one.
9.   Dynamic aggregation with ad-hoc queries; Crystal Reports, reporting, business logic, ... the RDBMSs' heartland ...
What are the supported Programming Languages?

Right now (June 2010) we can use MongoDB from at least C, C++, C#, .NET, ColdFusion, Erlang, Factor, Java,
Javascript, PHP, Python, Ruby, Perl. Of course, there might be more languages available in the future.
What is the Status of Python 3 Support?

The current thought is to use Django as more or less a signal for when adding full support for Python 3 makes sense.
MongoDB can probably support it a bit earlier than Django does, but that is certainly not something the MongoDB community
wants to rush and then have to support two totally different code bases.
What is the difference in the main Building-blocks to RDBMSs?

We have RDBMSs such as MySQL, Oracle and PostgreSQL, and then there are NoSQL DBMSs such as MongoDB.
Below is a breakdown of how the main building blocks of each side correspond to one another:

 MySQL, PostgreSQL, Oracle
  - Database
    - Table
      - Row

 MongoDB
  - Database
    - Collection
      - Document
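
As a rough illustration only, the hierarchy above can be mimicked with plain Python containers. This is an analogy and not the API of any MongoDB driver; the collection name customers, the field names and the helper find() are made up for the example.

```python
# Plain-Python analogy of the MongoDB hierarchy (not driver code).
# A document is a dict, a collection is a list of documents,
# and a database is a dict that maps collection names to collections.

database = {
    "customers": [  # collection, the counterpart of a table
        {"_id": 1, "name": "Alice", "city": "Graz"},  # document, the counterpart of a row
        {"_id": 2, "name": "Bob"},
    ],
}


def find(db, collection, **criteria):
    """Return all documents in a collection whose fields match the criteria."""
    return [
        doc for doc in db[collection]
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


print(find(database, "customers", name="Alice"))
```

Calling find() without any criteria returns every document in the collection, much like selecting all rows of a table.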

The usual handicraft work ... get and keep it running ... if in doubt, automate!
Is there a Web GUI? What about a REST Interface/API?

• assuming a mongod process is running on localhost then we can access some statistics at http://localhost:28017/ and
• In order to have a REST interface to MongoDB, same as CouchDB has it, we have to start mongod with the --rest switch.
        ◦ Note however that this is just a read-only REST interface.
• For a read and/or write REST interface:
• If we wanted real-time updates from the CLI, then we could also use mongostat.
Can I rename a Database?

Yes, but it is not as easy as renaming a collection. As of now, the recommended way to rename a database is to clone it
and thereby rename it. This will require enough additional free disk space to fit the current/old database at least twice.
How do I physically migrate a Database?

There is even a clone command for that. Note however that neither copyDatabase() nor cloneDatabase() actually performs a
point-in-time snapshot of the entire database -- what they basically do is query the source database and then
replicate to the target database, i.e. if we use copyDatabase() or cloneDatabase() on a source database which is online
and has operations performed on it, then the target database is not a point-in-time snapshot of the source as it was
at the exact time the command was issued. Rather, at some point in time, it will/might have the same data/state as
its source database.

   Secure Copy .... as in scp

     A bit of downtime, but with the chance to resume a canceled transfer ....

        • shut down mongod on the old machine
        • copy/sync the database directory to the new machine
        • start mongod on the new machine with dbpath set appropriately

   Minimum Downtime

     Below is what we could do in order to have as little downtime as possible:

        • stop and re-start the existing mongod as master (if it is not already running as master that is)
        • install mongod on the new machine and configure it as slave using --slave and --source
        • wait while the slave copies the database, re-indexes and then catches up with its master (this happens
          automatically when we point a slave to its master)
        • once the slave has caught up, we disable writes to the master (clients can still read/query)
        • once all outstanding writes have been committed on the master and the slave has caught up, we shut down the master
          and restart the slave as the new master. The old master can now be removed entirely.
        • now we point all traffic at the new master
        • finally we enable writes on the new master again, ... Et voilà!

Of course, we might also use OpenVZ and its live-migration feature ...
How do I update to a new MongoDB version?

If it is a drop-in replacement we just need to shut down the older version and start the new one with the
appropriate dbpath. Otherwise, i.e. if it is not a drop-in replacement, we would use mongoexport followed by
What is the default listening Port and IP?

We can use netstat to find out:

 wks:/home/sa# netstat -tulpena | grep mongo
 tcp 0 0* LISTEN 124               1474236 8822/mongod
 tcp 0 0* LISTEN 124               1474237 8822/mongod

The default listening port for mongod is 27017. 28017 is where we can point our web browser in order to get some
statistics. By default mongod listens on all local IP addresses, i.e. 0/0, which covers every address configured
on the local machine.

And yes, this includes the loopback device/address/network, the private class A network, the
private class B network and of course also the private class C network amongst others.

Both the listening port and the IP address can be changed, either by using the CLI switches --port and --bind_ip or through the
configuration file, whose location we can figure out by looking at the runtime configuration.
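
If netstat is not at hand, a quick check from Python can tell us whether something accepts TCP connections on the default ports. This is a minimal sketch using only the standard library; the helper name is_listening() is made up for the example, and a successful connection only shows that some process listens on that port, not that it is mongod.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 27017 is mongod's default port, 28017 serves the statistics page.
for port in (27017, 28017):
    print(port, is_listening("127.0.0.1", port))
```

If we changed the port with --port or the address with --bind_ip, we would pass those values instead of the defaults.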
Is there a Way to do automatic Backups?

What is getSisterDB() good for?

We can use it to get ourselves references to databases which not just saves a lot of typing but is, once we got used to
using it, a lot more intuitive:

  1   sa@wks:~/mm/new$ mongo
  2   MongoDB shell version: 1.5.2-pre-
  3   url: test
  4   connecting to: test
  5   type "help" for help
  6   > db.getCollectionNames();
  7   [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
  8   > reference_to_test_db = db.getSisterDB('test');
  9   test
 10   > reference_to_test_db.getCollectionNames();
 11   [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
 12   > use admin
 13   switched to db admin
 14   > reference_to_test_db.getCollectionNames();
 15   [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
 16   > bye
 17   sa@wks:~/mm/new$

Note how we get a reference to our test database in line 8 and how it is used in lines 10 and even line 14, after switching from
our test database to the admin database. getCollectionNames() has just been chosen as an example, it could have been any
other command as well of course.
How can I make MongoDB automatically start/restart on Server boot/reboot?

One way would be to use the @reboot directive with Cron. However, .deb and .rpm packages install init scripts (sysv or
upstart style, as appropriate) on Debian, Ubuntu, Fedora, and CentOS already so MongoDB will restart there without
further need from us to do anything special.

     • For other constellations, an init.d script for Unix-like systems based on
     • For Mac OS X, people have reported that launchctl configurations like
       Launchctl/blob/master/org.mongo.mongod.plist work.
     • For Windows, we have documentation.
Resource Usage

Lots of confusion amongst beginners ...
Why is my Database growing so fast?

The first file for a database is dbname.0, then dbname.1, etc. dbname.0 will be 64 MiB, dbname.1 128 MiB, ... up to 2 GiB.
Once the files reach 2 GiB in size, each successive file is also 2 GiB.
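
To get a feeling for the numbers, the allocation scheme can be sketched in a few lines of Python. The sketch assumes that file sizes double at each step, which the 64 MiB, 128 MiB progression suggests; the function name datafile_sizes() is made up for the example.

```python
MIB = 1024 * 1024
FIRST_FILE = 64 * MIB       # size of dbname.0
MAX_FILE = 2 * 1024 * MIB   # files stop growing at 2 GiB


def datafile_sizes(count):
    """Return the sizes in bytes of dbname.0 .. dbname.(count - 1)."""
    sizes = []
    size = FIRST_FILE
    for _ in range(count):
        sizes.append(size)
        size = min(size * 2, MAX_FILE)  # assumed doubling, capped at 2 GiB
    return sizes


sizes = datafile_sizes(8)
print([s // MIB for s in sizes])  # [64, 128, 256, 512, 1024, 2048, 2048, 2048]
print(sum(sizes) // MIB)          # 8128
```

Under these assumptions, eight data files already occupy roughly 8 GiB on disk, regardless of how much of the last file is actually in use.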

So, if we have, say, database files up to dbname.n, then dbname.n-1 might be 90% unused but dbname.n has already been
allocated once we start using dbname.n-1. The reasoning here is simple: we do not want to wait for new database files
when we need them so we always allocate the next one in the background as soon as we start to use an empty one.

Note that deleting data and/or dropping a collection or index will not release already allocated disk space since it is
allocated per database. Disk space will only be released if a database is repaired or the database is dropped altogether.
What Caching Algorithm does MongoDB use?

Actually, that is done by the OS using the LRU (Least Recently Used) caching pattern.
Why does MongoDB use so much RAM?

  Well, it does not actually, it is just that most folks do not really understand memory management -- there is more to it than
  just "is in RAM" or "is not in RAM".

  The current default storage engine for MongoDB is called MongoMemMapped_RecStore. It uses memory-mapped files for
  all disk I/O operations. Using this strategy, the operating system's virtual memory manager is in charge of caching.
  This has several implications:

• There is no redundancy between file system cache and database cache, actually, they are one and the same.
• MongoDB can use all free memory on the server for cache space automatically without any configuration of a cache size.
• Virtual memory size and RSS (Resident Set Size) will appear to be very large for the mongod process. This is benign
  however -- virtual memory space will be just larger than the size of the datafiles open and mapped, while resident size will
  vary depending on the amount of memory not used by other processes on the machine.
• Caching behavior (such as LRU'ing out of pages, and laziness of page writes) is controlled by the operating system. The
  quality of the VMM (Virtual Memory Manager) implementation will vary by OS.

  As of now, an alternative storage engine (CachedBasicRecStore), which does not use memory-mapped files, is under
  development. This engine is more traditional in design with its own page cache. With this store the database has more control
  over the exact timing of reads and writes, and of the cache LRU strategy.

  Generally, the memory-mapped store (MongoMemMapped_RecStore) works quite well. The alternative store will be useful in
  cases where an operating system's VMM behaves suboptimally.
What is the so-called Working Set Size?

Working set size can roughly be thought of as how much data we will need MongoDB (or any other DBMS, relational or
non-relational) to access in a period of time.

For example, YouTube has ridiculous amounts of data, but only 1% may be accessed at any given time. If, however, we are
in the rare case where all the data we store is accessed at the same rate at all times (LRU), then our working set size can be
defined as our entire data set stored in MongoDB.
How much RAM does MongoDB need?

We now know MongoDB's caching pattern and we also know what a working set size is. Therefore we can state the following rule
of thumb for how much RAM a machine needs in order to work properly.

It is the working set size plus MongoDB's indexes which should reside in RAM at all times i.e. the amount of available
RAM should be at least the working set size plus the size of indexes plus what the rest of the OS and other software running
on the same machine needs.
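
The rule of thumb is simple arithmetic, sketched below in Python. All figures are hypothetical and only serve to show the calculation; the function names are made up for the example.

```python
GIB = 1024 ** 3


def required_ram(working_set, index_size, other_software):
    """Rule of thumb in bytes: working set + indexes + OS and other software."""
    return working_set + index_size + other_software


def fits_in_ram(installed_ram, working_set, index_size, other_software):
    """Return True if the installed RAM covers the rule-of-thumb requirement."""
    return installed_ram >= required_ram(working_set, index_size, other_software)


# Hypothetical machine: 6 GiB working set, 1.5 GiB of indexes,
# 2 GiB for the OS and other software, 8 GiB of RAM installed.
needed = required_ram(6 * GIB, int(1.5 * GIB), 2 * GIB)
print(needed / GIB)                                            # 9.5
print(fits_in_ram(8 * GIB, 6 * GIB, int(1.5 * GIB), 2 * GIB))  # False
```

In this made-up case the machine is 1.5 GiB short, so parts of the working set or the indexes would have to be read from disk.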

   Speed Impact of not having enough RAM

     Generally, when databases are too big to fit into RAM entirely, and if we are doing random access, we are in
     trouble as HDDs are slow at that (roughly 100 operations per second per drive).

     One solution is to have lots of HDDs (10, 100, ...). Another one is to use SSDs (Solid State Drives) or, even better,
     add more RAM. Now that being said, the key factor here is random access. If we do sequential access to data
     bigger than RAM, then that is fine.

     So, it is ok if the database is huge (more than RAM size), but if we do a lot of random access to data, it is best if
     the working set fits in RAM entirely.

     However, there are some nuances around having indexes bigger than RAM with MongoDB. For example, we can
     speed up inserts if the index keys have certain properties -- if inserts are an issue, then that would help.
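
     A back-of-the-envelope estimate shows why random access on HDDs hurts. The sketch below uses the figure of roughly 100 operations per second per drive from above and assumes, optimistically, that the load spreads evenly over all drives; the function name is made up for the example.

```python
def random_read_seconds(operations, drives, ops_per_drive=100):
    """Estimate seconds needed for random reads spread evenly over the drives."""
    return operations / (drives * ops_per_drive)


# One million random reads that miss RAM, on 1, 10 and 100 drives.
for drives in (1, 10, 100):
    print(drives, random_read_seconds(1_000_000, drives))
# 1 10000.0
# 10 1000.0
# 100 100.0
```

     With a single drive, a million uncached random reads take close to three hours, which is why the working set should fit in RAM.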
Can I limit MongoDB's RAM Usage?

No, it is not designed to do that, it is designed for speed and scalability.

If we wanted to run MongoDB on the same physical machine alongside some web server and for example some application
server like Django, then we could ensure memory limits on each one by simply using virtualization and putting each one in
its own VE (Virtual Environment). In the end we would thus have a web application made of MongoDB, Django and for
example Cherokee, all running on the same physical machine but being limited to whatever limits we set on each VE they run in.
What can I do about Out Of Memory Errors?

If we are getting something like this Fri May 21 08:29:52 JS Error: out of memory (or akin stuff) in our logs, then we hit a
memory limit.

As we already know, MongoDB takes all RAM it can get i.e. RAM, or more precisely RSS (Resident Set Size), itself part of
virtual memory, will appear to be very large for the mongod process.

The important point here is how it is handled by the OS. If the OS just blocks any attempt to get more virtual
memory or, even worse, kills the process (e.g. mongod) which tries to get more virtual memory, then we have got a
problem. What can be done is to elevate/alter a few settings:

  1   sa@wks:~$ ulimit -a | egrep 'virtual|open'
  2   open files                       (-n) 1024
  3   virtual memory           (kbytes, -v) unlimited
  4   sa@wks:~$ lsb_release -irc
  5   Distributor ID: Debian
  6   Release:        unstable
  7   Codename:       sid
  8   sa@wks:~$ uname -a
  9   Linux wks 2.6.32-trunk-amd64 #1 SMP Sun Jan 10 22:40:40 UTC 2010 x86_64 GNU/Linux
 10   sa@wks:~$

As we can see from lines 5 to 9, I am on Debian sid (still in development) running the 2.6.32 Linux kernel.

The settings we are interested in are on lines 2 and 3. Virtual memory is unlimited by default so that is fine already --
a limit here is actually what causes the most problems so we need to make sure virtual memory is either reasonably high or, even
better, set to unlimited as shown above. With regards to allowed open file descriptors -- by default we are limited to 1024
open files which, in some cases, might pose a problem -- simply elevating it might be enough already and make memory
errors go away.

Note that we need to run these commands (e.g. ulimit -v unlimited) in the same user context as mongod i.e. we basically
want to script them as part of our mongod startup process.
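
The same two limits can also be read from Python with the standard resource module, which is available on Unix-like systems only. This is a minimal sketch; it reports the limits of the Python process itself, so it has to run in the same user context as mongod to be meaningful. The helper names are made up for the example.

```python
import resource


def describe(limit):
    """Render a limit the way ulimit does, with 'unlimited' for no limit."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def current_limits():
    """Return the soft limits for open files and virtual memory."""
    open_files, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIMIT_AS is reported in bytes, whereas ulimit -v prints kbytes.
    virtual_memory, _ = resource.getrlimit(resource.RLIMIT_AS)
    return {
        "open files": describe(open_files),
        "virtual memory": describe(virtual_memory),
    }


print(current_limits())
```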


     If we are running MongoDB with OpenVZ then there are some more settings we might want to tune in order to keep the
     OOM (Out of memory) killer from kicking in, or to avoid simply hitting the virtual memory ceiling if it is not set to unlimited.
     Special attention should be paid to the OpenVZ memory settings i.e. they should be set to reflect MongoDB's memory usage.
Does MongoDB use more than one CPU Core?

For write operations MongoDB makes use of one CPU core. For read operations however, which tend to be the
majority of operations, MongoDB uses all CPU cores available to it.

In short: one will notice a speed increase going from a single-core CPU to dual-core or even higher e.g. quad-core
or maybe even octo-core since the speed increase is roughly proportional to the available CPU cores.
How can I tell how many clients are connected?

We can look at the connections field (current) with the server status:

 sa@wks:~$ mongo --quiet
 type "help" for help
 > db.serverStatus();

 [skipping a lot of lines ...]

           "connections" : {
                   "current" : 2,
                   "available" : 19998

 [skipping a lot of lines ...]

 > bye
How many parallel Client Connections to MongoDB can there be?

Have a look at the connections field (available) with the server status.
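
A small sketch of how the two fields relate, assuming that current plus available adds up to the connection ceiling of
the server. The dict is just the connections part of the serverStatus output shown in this document and the function name
connection_usage is made up for this illustration.

```python
def connection_usage(server_status):
    """Return (ceiling, fraction in use) from a serverStatus-like dict."""
    conn = server_status["connections"]
    ceiling = conn["current"] + conn["available"]
    return ceiling, conn["current"] / ceiling


status = {"connections": {"current": 2, "available": 19998}}
ceiling, in_use = connection_usage(status)
print("ceiling:", ceiling, "fraction in use:", in_use)
```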
Does MongoDB do Connection Pooling?

Yes, connection pooling is done for performance reasons and to optimize overall resource usage -- without it things
would be a lot slower and more resource intensive. As of now (June 2010) most of the client drivers do connection
pooling; how exactly it is done varies from driver to driver (PyMongo being one example).
Is there a Size limit of how much Data can be stored inside MongoDB?

4 MiB is the limit on individual documents, but GridFS spreads a file across many documents, so practically speaking
there is no limit.

While the above is true for x86-64, it is not entirely true for x86 (32 bit) -- because of how memory mapped files
work there is a limit of 2 GiB per database.
Do embedded Documents count toward the 4 MiB BSON Document Size Limit?

Yes, the entire BSON (Binary JSON) document (including all embedded documents, etc.) cannot be more than 4 MiB in size.
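
To see why embedded documents count, here is a rough size estimator that follows the BSON layout for a handful of types
(int32 length prefix, one type byte plus NUL-terminated key per element, embedded documents and arrays stored inline).
This is only a sketch: the names bson_size and _value_size are made up, only str, bool, int, float, None, dict and list
are handled, and the _id that MongoDB adds on insert is not included. For real numbers use Object.bsonsize() or a driver.

```python
LIMIT = 4 * 1024 * 1024  # the 4 MiB document limit discussed above


def bson_size(document):
    """Approximate BSON size in bytes of a dict (subset of types only)."""
    if not isinstance(document, dict):
        raise TypeError("a BSON document must be a dict")
    size = 4 + 1  # int32 length prefix + trailing NUL byte
    for key, item in document.items():
        size += 1 + len(key.encode("utf-8")) + 1  # type byte + key as C string
        size += _value_size(item)
    return size


def _value_size(item):
    if isinstance(item, bool):  # check before int: bool is a subclass of int
        return 1
    if item is None:
        return 0
    if isinstance(item, int):
        return 4 if -2**31 <= item < 2**31 else 8  # int32 or int64
    if isinstance(item, float):
        return 8
    if isinstance(item, str):
        return 4 + len(item.encode("utf-8")) + 1  # length prefix + bytes + NUL
    if isinstance(item, dict):
        return bson_size(item)  # embedded document, stored inline
    if isinstance(item, list):
        return bson_size({str(i): v for i, v in enumerate(item)})  # arrays use "0", "1", ... as keys
    raise TypeError("unsupported type: %r" % type(item))


post = {"title": "Hello", "comments": [{"user": "FlaPer87", "text": "My Comment"}]}
print(bson_size(post), "bytes, fits:", bson_size(post) <= LIMIT)
```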
Does Document Size impact read/write Performance?

Yes, but this is mostly due to network limitations e.g. one will max out a GigE link with inserts before document size starts
to slow down MongoDB itself.
Is there a Way to tell the Size of a specific Document?

Yes, one can use Object.bsonsize(db.whatever.findOne()) in the shell like this:

 sa@wks:~$ mongo
 MongoDB shell version: 1.5.1-pre-
 url: test
 connecting to: test
 type "help" for help
 >{ name : "katze" });
 > Object.bsonsize(db.test.findOne({ name : "katze"}))
 > bye
How can I tell the Size of a Collection and its Indexes?

 sa@wks:~$ mongo --quiet
 type "help" for help
 > db.getCollectionNames();
 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
 > db.test.dataSize();
 > db.test.storageSize();
 > db.test.totalIndexSize();
 > db.test.totalSize();

We are using the test collection here. dataSize() is self-explanatory. storageSize() includes our data and all the still free
but already allocated disk space to this collection. totalIndexSize() is the size in bytes of all the indexes in this
collection and totalSize() is all the storage allocated for all data and indexes in this collection. If we need/want a
more detailed view we could also have a look at

 > db.test.validate();
          "ns" : "test.test",
          "result" : "
   firstExtent:2:2b00 ns:test.test
   lastExtent:2:2b00 ns:test.test
   # extents:1
   datasize?:160 nrecords?:4 lastExtentSize:2304
      first extent:
        loc:2:2b00 xnext:null xprev:null
        size:2304 firstRecord:2:2be8 lastRecord:2:2c58
      4 objects found, nobj:4
      224 bytes data w/headers
      160 bytes data wout/headers
      deletedList: 0000001000000000000
      deleted: n: 1 size: 1904
        test.test.$_id_ keys:4
            "ok" : 1,
            "valid" : true,
            "lastExtentSize" : 2304
 > bye

Note that while MongoDB generally does a lot of pre-allocation, we can remedy this by starting mongod with --noprealloc
and --smallfiles.
Collections / Namespaces

Needs to be known, plain and simple ...
What is a Capped Collection? Why use it?

• Size:
• Time (TTL Collections):
Can I rename a Collection?

Yes. Using help(); from MongoDB's interactive shell we get, amongst others, db.test.renameCollection( newName ,
<dropTarget> ) which renames the collection. So yes, we could do db.foo.renameCollection('bar'); and have the collection foo
renamed to bar. Renaming a collection is an atomic operation by the way.
What is a Virtual Collection? Why use it?

It refers to the ability to reference embedded documents as if they were a first-class collection of top level
documents, querying on them and returning them as stand-alone entities, etc.
Can I use a larger Number of Collections/Namespaces?

There is a limit to how many collections/namespaces we can have within a single MongoDB database. It is ~24000
namespaces per database. This is essentially the number of collections plus the number of indexes.
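
A minimal sketch of that arithmetic. The names are copied from the system.namespaces listing found elsewhere in this
document; the function count_namespaces and the rule "a name containing $ is an index" are assumptions made for this
illustration.

```python
NAMESPACE_LIMIT = 24000  # approximate per-database limit quoted above


def count_namespaces(names):
    """Return (number of collections, number of indexes) for a list of namespace names."""
    indexes = [n for n in names if "$" in n]          # e.g. 'test.things.$_id_'
    collections = [n for n in names if "$" not in n]  # e.g. 'test.things'
    return len(collections), len(indexes)


names = [
    "test.system.indexes", "test.fs.files", "test.fs.files.$_id_",
    "test.fs.files.$filename_1", "test.fs.chunks", "test.fs.chunks.$_id_",
    "test.fs.chunks.$files_id_1_n_1", "test.things", "test.things.$_id_",
    "test.mycollection", "test.mycollection.$_id_",
]
n_collections, n_indexes = count_namespaces(names)
print("collections:", n_collections, "indexes:", n_indexes,
      "namespaces left:", NAMESPACE_LIMIT - (n_collections + n_indexes))
```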
How about cloning a Collection?

Yes, possible. Have a look at mongoexport and mongoimport.
Can I merge two or more Collections into one?

Yes, we read from all collections we want to merge and use insert() to write it into our single target collection. This
can be done on the server (using MongoDB's interactive shell) or from a client.
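
The read-then-insert loop can be sketched as follows. To keep the example runnable without a server, plain lists stand in
for the source collections and list.append stands in for the insert call; with a real driver one would pass cursors and
the target collection's insert method instead (an assumption, check the driver's documentation). The function name
merge_collections is made up for this illustration.

```python
def merge_collections(sources, insert):
    """Feed every document from each source iterable to `insert`; return how many were copied."""
    copied = 0
    for source in sources:
        for document in source:
            insert(document)
            copied += 1
    return copied


posts_2009 = [{"title": "a"}, {"title": "b"}]
posts_2010 = [{"title": "c"}]
target = []
print(merge_collections([posts_2009, posts_2010], target.append), "documents copied")
```

Note that documents carrying the same _id in different source collections would collide in a real target collection, so
that case needs to be decided up front.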
How can I get a list of Collections in my Database?

We can use getCollectionNames() as shown below in lines 8 and 9. Yet another possibility is shown in lines 23 to 28. Of
course, since every collection is also a namespace, we can find them alongside the indexes in lines 11 to 21:

  1   sa@wks:~$ mongo
  2   MongoDB shell version: 1.2.4
  3   url: test
  4   connecting to: test
  5   type "help" for help
  6   > db
  7   test
  8   > db.getCollectionNames();
  9   [ "fs.chunks", "fs.files", "mycollection", "system.indexes", "things" ]
 10   > db.system.namespaces.find();
 11   { "name" : "test.system.indexes" }
 12   { "name" : "test.fs.files" }
 13   { "name" : "test.fs.files.$_id_" }
 14   { "name" : "test.fs.files.$filename_1" }
 15   { "name" : "test.fs.chunks" }
 16   { "name" : "test.fs.chunks.$_id_" }
 17   { "name" : "test.fs.chunks.$files_id_1_n_1" }
 18   { "name" : "test.things" }
 19   { "name" : "test.things.$_id_" }
 20   { "name" : "test.mycollection" }
 21   { "name" : "test.mycollection.$_id_" }
 23   > show collections
 24   fs.chunks
 25   fs.files
 26   mycollection
 27   system.indexes
 28   things
 29   > bye
 30   sa@wks:~$
How do I delete a Collection?

db.collection.drop() but there is no undo so beware.
What is a Namespace with regards to MongoDB?

Collections can be organized in namespaces. These are named groups of collections defined using a dot notation. For
example, we could define collections blog.posts and blog.authors, both reside under the namespace blog but are two separate
collections.

Namespaces can then be used to access these collections using the dot notation e.g. db.blog.posts.find(); will return all
documents from the collection blog.posts but nothing from the collection blog.authors.

Namespaces simply provide an organizational mechanism for the user i.e. the collection namespace is flat from the
database point of view, which means that blog.authors really just is a collection on its own and not some collection authors
grouped under some namespace blog. Technically speaking, blog.authors is no different than a collection named foo -- the
grouping just helps the humans keep track ...
How can I get a list of Namespaces in Database?

One way to list all namespaces for a particular database would be to enter MongoDB's interactive shell:

 sa@wks:~$ mongo
 MongoDB shell version: 1.2.4
 url: test
 connecting to: test
 type "help" for help
 > db.system.namespaces.find();
 { "name" : "test.system.indexes" }
 { "name" : "test.fs.files" }
 { "name" : "test.fs.files.$_id_" }
 { "name" : "test.fs.files.$filename_1" }
 { "name" : "test.fs.chunks" }
 { "name" : "test.fs.chunks.$_id_" }
 { "name" : "test.fs.chunks.$files_id_1_n_1" }
 { "name" : "test.things" }
 { "name" : "test.things.$_id_" }
 { "name" : "test.mycollection" }
 { "name" : "test.mycollection.$_id_" }
 > db.system.namespaces.count();
 > bye

The system namespace in MongoDB is special since it contains database system information (read: metadata). There are
several collections, for example system.namespaces, which can be used to get information about all the
namespaces within a database.
Statistics / Monitoring

Because pilots need to know ...
The Server Status, what does it tell?

 sa@wks:~$ mongo --quiet
 type "help" for help
 > db.serverStatus();
         "uptime" : 6695,
         "localTime" : "Sun Apr 11 2010 11:22:19 GMT+0200 (CEST)",
         "globalLock" : {
                 "totalTime" : 6694193239,
                 "lockTime" : 45048,
                 "ratio" : 0.000006729414343397326
         "mem" : {
                 "resident" : 3,
                 "virtual" : 138,
                 "supported" : true,
                 "mapped" : 0

Most of it is obvious like for example uptime. The globalLock part is interesting. totalTime is the same as uptime but in
microseconds. lockTime is the amount of time the global lock has been held i.e. the total time spent waiting for write
queries until a lock has been assigned and thus a write could be made.

One may ask what is the point of having both, uptime and totalTime? Well, totalTime will rollover faster since it is in
microseconds, at some point they diverge. The rollover is coordinated between totalTime and lockTime.
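
The ratio field can be reproduced from the other two, which also makes it easy to track over time. A minimal sketch using
the numbers from the output above; the function name lock_ratio is made up for this illustration.

```python
def lock_ratio(global_lock):
    """Fraction of totalTime during which the global lock was held."""
    return global_lock["lockTime"] / global_lock["totalTime"]


global_lock = {"totalTime": 6694193239, "lockTime": 45048}  # microseconds, from the output above
ratio = lock_ratio(global_lock)
print("ratio:", ratio)
print("totalTime in seconds:", global_lock["totalTime"] / 1e6)  # compare with uptime
```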

All mem values are in MiB. resident is what is in physical memory (also known as RAM), virtual is the virtual
address space, mapped is the memory-mapped space, and supported tells whether memory info is supported on our platform.
          "connections" : {
                  "current" : 2,
                  "available" : 19998
          "extra_info" : {
                  "note" : "fields vary by platform",
                  "heap_usage_bytes" : 146048,
                  "page_faults" : 57
          "indexCounters" : {
                  "btree" : {
                           "accesses" : 0,
                           "hits" : 0,
                           "misses" : 0,
                           "resets" : 0,
                           "missRatio" : 0
          "backgroundFlushing" : {
                  "flushes" : 111,
                  "total_ms" : 2,
                  "average_ms" : 0.018018018018018018,
                  "last_ms" : 0,
                  "last_finished" : "Sun Apr 11 2010 11:21:45 GMT+0200 (CEST)"

connections tells us how many client connections we can open against mongod. More precisely, current tells us how
many client connections to mongod exist right now and available shows us how many we have left.

Within the extra_info part we have heap_usage_bytes which is the main memory needed by the database.
           "opcounters" : {
                    "insert" : 16513,
                    "query" : 1482263,
                    "update" : 141594,
                    "delete" : 38,
                    "getmore" : 246889,
                    "command" : 1247316
           "asserts" : {
                    "regular" : 0,
                    "warning" : 0,
                    "msg" : 0,
                    "user" : 0,
                    "rollovers" : 0
           "ok" : 1
 > bye

The opcounters part is also pretty interesting. insert, query, update, and delete are self-explanatory but getmore and
command are probably not. When we do a query, we get results in batches. The first batch is counted in query, all
subsequent in getmore. commands are things like count, group, distinct, etc.

And yes, taking those numbers and dividing them by time (delta or total) will give us operations/time e.g. operations
per second or operations since mongod got started. In fact, there is a Munin plugin which makes use of this.
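
The delta variant can be sketched like this: take two opcounters snapshots a known number of seconds apart and divide the
differences by the interval. The function name ops_per_second and the "before" snapshot are hypothetical; the "after"
values are the ones from the output above.

```python
def ops_per_second(before, after, seconds):
    """Per-operation rates between two opcounters snapshots taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {op: (after[op] - before[op]) / seconds for op in after}


before = {"insert": 16453, "query": 1481663, "update": 141534}  # hypothetical earlier snapshot
after = {"insert": 16513, "query": 1482263, "update": 141594}   # values from the output above
for op, rate in sorted(ops_per_second(before, after, 60).items()):
    print(op, rate)
```

A monitoring script would also have to cope with counters that have rolled over between two snapshots, which shows up as
a negative difference.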
Schema / Configuration

  Sorry folks, no can do, lack of time ... go to
Indexes / Search / Metadata

  Sorry folks, no can do, lack of time ... go to
Map / Reduce

 Sorry folks, no can do, lack of time ... go to
GridFS / Data Size

Store tons of data reliably and smartly ...
What is GridFS?

Basically a collection of normal documents. We have two collections, one for metadata (fs.files) and one consisting of
chunks of data (fs.chunks).

The GridFS spec provides a mechanism for transparently dividing a large file among multiple documents. This allows
us to efficiently store large objects, and in the case of especially large files, such as videos, permits range operations
(e.g., fetching only the first n bytes of a file).
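
The idea can be sketched without a database: split the bytes into numbered chunk documents and, for a range read, touch
only the chunks that overlap the range. The field names files_id and n match the fs.chunks index shown in this document;
the function names, the in-memory list standing in for fs.chunks and the chunk size are assumptions made for this
illustration, not the driver's actual implementation.

```python
CHUNK_SIZE = 256 * 1024  # assumed chunk size for this sketch


def split_into_chunks(files_id, data, chunk_size=CHUNK_SIZE):
    """Return chunk documents (files_id, n, data) in the order of n."""
    return [
        {"files_id": files_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]


def read_range(chunks, offset, length, chunk_size=CHUNK_SIZE):
    """Read `length` bytes starting at `offset` using only the chunks that are needed."""
    if length <= 0:
        return b""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    needed = b"".join(chunk["data"] for chunk in chunks[first:last + 1])
    start = offset - first * chunk_size
    return needed[start:start + length]


data = bytes(range(256)) * 10  # 2560 bytes of sample data
chunks = split_into_chunks("some-file-id", data, chunk_size=1000)
print(len(chunks), "chunks;", read_range(chunks, 990, 30, chunk_size=1000) == data[990:1020])
```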

   What can we do with GridFS

Store ridiculous amounts of data in a smart way.
Why use GridFS over ordinary Filesystem Storage?

If we use the filesystem we would have to handle backup/replication/scaling ourselves. We would also have to come up
with some sort of hashing scheme ourselves plus we would need to take care of cleanup/sorting/moving because
filesystems do not love lots of small files.

With GridFS, we can use MongoDB's built-in replication/backup/scaling e.g. scale reads by adding more read-only
slaves and writes by using sharding. We also get out of the box hashing (read UUID (Universally Unique Identifier)) for
stored content plus we do not suffer from filesystem performance degradation because of a myriad of small files.

Also, we can easily access information from random sections of large files, another thing traditional tools working with
data right off the filesystem are not good at. Last but not least, we can keep information associated with the file (who has
edited it, download count, description, etc.) right with the file itself.
Scalability / Fault Tolerance / Load Balancing

  Sorry folks, no can do, lack of time ... go to

  Sorry folks, no can do, lack of time ... go to
Use Case

   This should have been my major part
         ◦ locking (read transactions)
         ◦ asynchronous as opposed to synchronous operations
         ◦ numbers (double precision)

                   Again, lack of time ... go to
Summary Part 1

 Tell them what you told them ... simple as that ...
Introduction Part 2

 Before starting with MongoDB-specific topics it's important to know that we don't dislike relational databases. We know they
 are good for many things, but we also know that the success of web applications is mainly based on their performance and
 speed, so that's what we're after and that's why we're all here.
Existing Technologies

    • MongoKit (Nicolas Clairon):
          ◦ Great for completely unstructured model programming. It has structure validation, but I've never used it; I prefer
            to use MongoKit on models whose structure may be constantly changing.

    • mongoengine (Harry Marr):
          ◦ It allows you to define schemas for documents and query collections using django-like syntax.

    • django-mongodb-engine (Alberto Paro and myself):
           ◦ This is a real Django backend based on django-mongodb and mongoengine, adapted to work with django-
             nonrel and mongodb without changing anything in the code.
SQL to MongoDB Query Translation....

 "What matters is who adapts faster to the changing conditions"
     - Charles Darwin

  The first thing we should remember when moving from SQL databases to NoSQL ones is that models were made to model data, but
  models can be remodeled too. What I mean is that people tend to adapt database features to their models instead of adapting
  their models to the database. I'll try to cover some of the common questions found on the mailing list:

       • Let's start with JOINs. Why JOINs? Because we don't have those in MongoDB and we might need them, so we have to
         figure out the best workaround. The best thing you can do here is forget about JOINs: you won't have
         them. We are not talking about highly relational databases but about non-relational ones, so there can't be joins
         between 2 collections if there's no relation between them. One of the things we did was remodel the way we stored
         data. We embedded what could be embedded and did 2 or more queries where embedding was not possible.

       • What about ForeignKeys, do we have those? Yes, or kind of. We have DBRef, which is a kind of ForeignKey, but I
         personally wouldn't use refs in MongoDB. As I said, MongoDB is not about references and relations between
         collections; it is about performance based on dynamism.

       • If MongoDB barely has references, you could guess that many-to-many relations are insignificant. Instead, I would
         start thinking in terms of dictionaries/maps and lists/arrays.

       • And last but not least, if you really need to do a query that joins 2 collections based on a field reference that
         should handle a many-to-many relation, then you have map/reduce.
Keeping things lazy...

 Yes, because we’re lazy people so we do lazy things ...

 When getting ORMs to work with MongoDB it is important that we keep things lazy to avoid bottlenecks in our web applications.
 MongoDB doesn't have many-to-many relations but it can store lists and dictionaries. For example:

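      from django.db import models
      from djangotoolbox.fields import ListField  # assumption: the ListField shipped with django-nonrel's djangotoolbox

      class User(models.Model):
          nickname = models.CharField(max_length=255)
          full_name = models.CharField(max_length=255)
          friends = ListField()
          groups = ListField()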

 In the User model we have 2 ListFields that may cause some slowdowns in our web application. The first one is a list
 containing the ids/names of the user's friends and the second one contains the groups the user is related to. Now think of a
 popular user who has many friends and is related to many groups: that's a lot of data transfer and many instantiations for
 our code, because each object/id in the ListField has to be instantiated. This might sound obvious but trust me, nothing is
 obvious when doing web programming.
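
 One way to keep such a list lazy, sketched in plain Python: store only the raw ids and build an object the first time an
 item is actually accessed. The class LazyList and the loader function are made up for this illustration; this is not how
 django-mongodb-engine implements its fields.

```python
class LazyList:
    """Holds raw ids and only builds objects for the items that are actually accessed."""

    def __init__(self, ids, loader):
        self._ids = list(ids)
        self._loader = loader  # callable: id -> object, e.g. a database lookup
        self._cache = {}

    def __len__(self):
        return len(self._ids)  # no objects are instantiated for len()

    def __getitem__(self, index):
        # Integer indexes only; slices are left out to keep the sketch short.
        key = self._ids[index]
        if key not in self._cache:
            self._cache[key] = self._loader(key)
        return self._cache[key]


loaded = []  # records which ids were really loaded


def load_user(user_id):
    loaded.append(user_id)
    return {"id": user_id}


friends = LazyList(range(1000), load_user)
first_friend = friends[0]
print(len(friends), "friends,", len(loaded), "instantiated")
```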
Keeping Relations or Embedding?

 This is a common question when moving from relational databases to non-relational ones. Should we keep our models related or
 embed the smaller ones into the bigger ones? The answer is: embed, you shouldn't keep them related. For example, a common
 situation (or one commonly used to show how MongoDB works) is a blog engine with posts and comments. Let's see how we could
 handle comments (not threaded) in our blog engine:
Using References:

   class Comment(models.Model):
       post = models.ForeignKey(Post)
       user = models.ForeignKey(User)
       text = models.CharField(max_length=255)

   my_comment, created = Comment.objects.get_or_create(post=my_post, user=my_user, text=my_text)
Without references:

     class Post(models.Model):
         comments = ListField()

     post.comments.append({'user' : user, 'text' : text})

The first example is the most used because it is the way we're used to thinking when we write our models, but the second one
is the right one when talking about NoSQL databases because references make things slower.

The bad thing about embedding our comments like that is that we have to worry about the 4 MiB document limit, so if we are
really popular on the net and many people come to our blog and comment on our posts, that might be a problem for us. Even
so, this is great: we have removed a model from our app so it should be easier to maintain, shouldn't it? But what
is user supposed to be? Is it an embedded user object? Is it a ForeignKey? How should we handle users there?

Again, it depends on how you'd like to do things. For example, it is possible to save the username as it should be shown and
then just show that username when the comments are loaded. Those wanting to know more about the user can click on the
username, which loads the user's personal info. Here are some examples:
Light and fast (For registered users):

   post.comments.append({'user' : 'FlaPer87', 'text' : 'My Comment'})

Heavy and slow (For any user):

   post.comments.append({'user' : {'username' : 'FlaPer87',
                                   'email'    : '',
                                   'url'      : ''},
                         'text' : 'My Comment'})
Lazy relations or mongodb like ones:

   # Automatic serialization done in django-mongodb-engine
   post.comments.append({'user' : {'_app'   : model._meta.app_label,
                                   '_model' : model._meta.module_name,
                                   '_type'  : "django"},
                         'text' : 'My Comment'})
Taking Advantage from schema-less Databases for Web

 One of the things I like most about MongoDB is that it is schema-less. People tend to think of schema-less databases as a
 mess, which they're not. Schema-less databases do have a structure; the difference between them and schema-based ones is that
 schema-less structures are dynamic, which means that they can be modified at any time and they're not typed. You can think of
 schema-less databases (just like MongoDB does) as JSON-based maps.

 This kind of structure can be really helpful when doing web programming. In our case it lets us save any kind of data in our
 collections and have generic structures that change over time. For example, let's try to improve our Comment model (in
 case we decided to have some relations).
   class Comment(models.Model):
       post = models.ForeignKey(Post)
       user = GenericField()
       text = models.CharField(max_length=255)

   my_user = "FlaPer87"  # Known user

   my_comment, created = Comment.objects.get_or_create(post=my_post, user=my_user,
                                                       text=my_text, defaults={})

   my_user = {'nickname'  : 'FlaPer87',
              'full_name' : 'Flavio Percoco Premoli',
              'email'     : '',
              'url'       : ''}  # Anonymous user

   my_comment2, created = Comment.objects.get_or_create(post=my_post, user=my_user,
                                                        text=my_text, defaults={})

Using a GenericField we can save anything into that attribute, but we have to do our checks and validation on the code side.
In this case the schema-less collection helped us get/save the anonymous user's information without having to create a record
in our Users table and without forcing the user to register.
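
Such a code-side check could look like the sketch below, which picks the name to display for either kind of user value.
The function name display_name and the fallback to "anonymous" are assumptions made for this illustration; the dict keys
are the ones used in the example above.

```python
def display_name(user):
    """Return the name to show for a comment's user value.

    `user` is either a plain username (known user) or a dict of details
    supplied by an anonymous commenter.
    """
    if isinstance(user, str):
        return user
    if isinstance(user, dict):
        return user.get("nickname") or user.get("full_name") or "anonymous"
    raise TypeError("unsupported user value: %r" % (user,))


print(display_name("FlaPer87"))
print(display_name({"nickname": "FlaPer87", "full_name": "Flavio Percoco Premoli"}))
```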
Summary Part 2

•   Re-model your models
•   Be Lazy to be faster
•   Forget about relations, they will slow you down
•   Remember that dynamism is better than restrictions
Plug saiku
Plug   saikuPlug   saiku
Plug saiku
Huguk lily
Huguk lilyHuguk lily
Huguk lily
Bootstrapping a-devops-matter
Bootstrapping a-devops-matterBootstrapping a-devops-matter
Bootstrapping a-devops-matter

Mongouk talk june_18

  • 2. Table of Contents
      1. Structure: (4)
        1. Markus (4)
        2. Flavio (5)
      2. Who are we? (6)
        1. Markus Gattol (7)
        2. Flavio Percoco Premoli (8)
      3. Introduction Part 1 (9)
        1. What I am going to tell you (9)
      4. Integration with other Technologies (10)
      5. Frequently Asked Questions (11)
        1. Basics (12)
          1. Are there any Reasons not to use MongoDB? (13)
          2. What are the supported Programming Languages? (14)
          3. What is the Status of Python 3 Support? (15)
          4. What is the difference in the main Building-blocks to RDBMSs? (16)
        2. Administration (17)
          1. Is there a Web GUI? What about a REST Interface/API? (18)
          2. Can I rename a Database? (19)
          3. How do I physically migrate a Database? (20)
            1. Secure Copy .... as in scp (20)
            2. Minimum Downtime (20)
          4. How do I update to a new MongoDB version? (22)
          5. What is the default listening Port and IP? (23)
          6. Is there a Way to do automatic Backups? (24)
          7. What is getSisterDB() good for? (25)
          8. How can I make MongoDB automatically start/restart on Server boot/reboot? (26)
        3. Resource Usage (27)
          1. Why is my Database growing so fast? (28)
          2. What Caching Algorithm does MongoDB use? (29)
          3. Why does MongoDB use so much RAM? (30)
  • 3.
          4. What is the so-called Working Set Size? (31)
          5. How much RAM does MongoDB need? (32)
            1. Speed Impact of not having enough RAM (32)
          6. Can I limit MongoDB's RAM Usage? (33)
          7. What can I do about Out Of Memory Errors? (34)
            1. OpenVZ (35)
          8. Does MongoDB use more than one CPU Core? (36)
          9. How can I tell how many clients are connected? (37)
          10. How many parallel Client Connections to MongoDB can there be? (38)
          11. Does MongoDB do Connection Pooling? (39)
          12. Is there a Size limit of how much Data can be stored inside MongoDB? (40)
          13. Do embedded Documents count toward the 4 MiB BSON Document Size Limit? (41)
          14. Does Document Size impact read/write Performance? (42)
          15. Is there a Way to tell the Size of a specific Document? (43)
          16. How can I tell the Size of a Collection and its Indexes? (44)
        4. Collections / Namespaces (46)
          1. What is a Capped Collection? Why use it? (47)
          2. Can I rename a Collection? (48)
          3. What is a Virtual Collection? Why use it? (49)
          4. Can I use a larger Number of Collections/Namespaces? (50)
          5. How about cloning a Collection? (51)
          6. Can I merge two or more Collections into one? (52)
          7. How can I get a list of Collections in my Database? (53)
          8. How do I delete a Collection? (55)
          9. What is a Namespace with regards to MongoDB? (56)
          10. How can I get a list of Namespaces in Database? (57)
        5. Statistics / Monitoring (58)
          1. The Server Status, what does it tell? (59)
        6. Schema / Configuration (62)
        7. Indexes / Search / Metadata (63)
        8. Map / Reduce (64)
        9. GridFS / Data Size (65)
  • 4.
          1. What is GridFS? (66)
            1. What can we do with GridFS (66)
          2. Why use GridFS over ordinary Filesystem Storage? (67)
        10. Scalability / Fault Tolerance / Load Balancing (68)
        11. Miscellaneous (69)
      6. Use Case (70)
      7. Summary Part 1 (71)
      8. Introduction Part 2 (72)
      9. Existing Technologies (73)
      10. SQL to MongoDB Query Translation (74)
      11. Keeping things lazy... (75)
      12. Keeping Relations or Embedding? (76)
        1. Using References: (77)
        2. Without references: (78)
        3. Light and fast (For registered users): (79)
        4. Heavy and slow (For any user): (79)
        5. Lazy relations or mongodb like ones: (80)
      13. Taking Advantage from schema-less Databases for Web Development (81)
      14. Summary Part 2 (83)

      Structure:
      Markus
      • 2min: tell the audience what I am going to tell them (a summary) and why I think it's worth mentioning
      • 3min: I'll start with a big picture view (how MongoDB just integrates nicely with existing setups eg folks can continue on using dm-crypt/luks) basic principles like
      • 5min: pick a few FAQs items and elaborate on them eg "Why is MongoDB using so much RAM"
  • 5.
      • 5min: I will then go on to take a use case as an example (a web application built with Django and MongoDB) from the financial domain, where we need transactions/locking/ACID, and talk about the differences to e.g. MySQL/PostgreSQL
      • 5min: also, with this use case, other things like storing numbers of various precision
      • 5min: summarize what I've told them
      You start after me and drill down on details (the stuff you mentioned in your email ~9 days ago) or whatever you/we see fit.
      Flavio
      • 2min: I'll tell the audience the topics I'll talk about and how they help us with MongoDB and Django integration
      • 5min: Mappers & Stack, I'll list some of the current ODMs used to integrate MongoDB and Django and how django-mongodb-engine integrates with Django and MongoDB.
      • 5min: I'll talk about queries, what we have in SQL that we don't have in MongoDB and how we can obtain the same results using it
        ◦ perfect, nothing to add/change here
      • 3min: I'll talk about embedding and referencing, when each is worth doing and why
      • 5min: I'll talk about how it is possible to take advantage of schemaless databases in web programming (Django oriented)
        ◦ ok sounds good, not sure I understand exactly; approach me today on #sunoano and give me an example
      • 5min: Summarize and maybe some benchmark!!!
  • 6. Who are we? Still, with all the technology we have these days, at the end of the day it is all about the people ... /me definitely not a
  • 7. Markus Gattol
      • grown up in Carinthia (southernmost Austrian state, bordering Italy), lives in the UK now
      • technical background, MSc (Computer Science, Electrical Engineering)
      • with Linux (Debian) since 1995, Contributor
      • RDBMSs, the usual ...
      • Open Source Developer/Contributor in general
      • website
      • works for Heart Internet Ltd., NSN before that
  • 8. Flavio Percoco Premoli
      • GNOME a11y Contributor (MouseTrap)
      • Open Source Developer/Contributor (Web and Desktop)
      • R&D Developer at The Net Planet Europe
        ◦ NoSQL Technologies
        ◦ Cloud Computing
        ◦ Knowledge Management Systems
      • Linux Lover/User and Mac user too
      • website:
      • Twitter: FlaPer87
      • Github: FlaPer87
      • Bitbucket: FlaPer87
      • Everywhere else: FlaPer87
  • 9. Introduction Part 1
      The why ...
      1. why are you here today?
      2. why does some business want to know about new technology?
      3. why are we looking to move away from RDBMSs to NoSQL DBMSs?
      4. Hardware and software are good when they can be understood while one is using them, and not when one could perhaps fly to Mars with them.
      Part 1 is mainly about MongoDB itself and not about Django/Python .... Part 2? .... Django!
      What I am going to tell you
      Best listener experience possible ...
      • Introduction Part 1 ... Tell the audience what you're going to tell them
      • Tell them
        ◦ Integration with other Technologies
        ◦ Frequently Asked Questions
        ◦ Use Case
      • Summary Part 1 ... Tell the audience what you told them
  • 10. Integration with other Technologies • How can I get MongoDB? • Ok, have it! Now what? 1. full-disk encryption / filesystem-level encryption 2. backup technologies, Rsync/Unison, Bacula, Amanda 3. LVM 4. VPN, SSH 5. Virtualization, OpenVZ
  • 11. Frequently Asked Questions Well, just because ...
  • 12. Basics Before we start running we need to be able to walk ...
  • 13. Are there any Reasons not to use MongoDB?
      1. We need transactions (ACID (Atomicity, Consistency, Isolation, Durability)).
      2. Our data is very relational.
      3. Related to 2, we want to be able to do joins on the server (but cannot do embedded objects / arrays).
      4. We need triggers on our tables. There might be triggers available soon however.
      5. We rely on triggers (or similar functionality) for cascading updates or deletes.
      6. We need the database to enforce referential integrity (MongoDB has no notion of this at all).
      7. We need 100% per-node durability.
      8. Write ahead log. MongoDB does not have one simply because it does not need one.
      9. Dynamic aggregation with ad-hoc queries; Crystal reports, reporting, business logic, ... RDBMSs heartland ...
  • 14. What are the supported Programming Languages? Right now (June 2010) we can use MongoDB from at least C, C++, C#, .NET, ColdFusion, Erlang, Factor, Java, Javascript, PHP, Python, Ruby, Perl. Of course, there might be more languages available in the future.
  • 15. What is the Status of Python 3 Support? The current thought is to use Django as more or less a signal for when adding full support for Python 3 makes sense. MongoDB can probably support it a bit earlier than Django does, but that is certainly not something the MongoDB community wants to rush and then have to support two totally different code bases.
  • 16. What is the difference in the main Building-blocks to RDBMSs?
      We have RDBMSs such as MySQL, Oracle and PostgreSQL, and then there are NoSQL DBMSs such as MongoDB. Below is a breakdown of how the main building blocks of each side correspond to one another:

      MySQL, PostgreSQL, Oracle
      --------------------------------------------
      Server:Port - Database - Table - Row

      MongoDB
      --------------------------------------------
      Server:Port - Database - Collection - Document
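To make the Row versus Document comparison a little more tangible, here is a minimal Python sketch. The field names and values are invented for illustration and do not come from the talk. The point it tries to show is that a row gets its meaning from the column list of its table, while a document carries its own field names and may nest sub-documents and arrays.

```python
# Hypothetical data: one person stored relationally and as a document.

# RDBMS: Server:Port - Database - Table - Row
columns = ("id", "name", "city")      # defined once, by the table schema
row = (1, "Ada", "London")            # every row must match the columns

# MongoDB: Server:Port - Database - Collection - Document
document = {
    "_id": 1,
    "name": "Ada",
    "address": {"city": "London"},    # embedded sub-document
    "languages": ["Python", "C"],     # embedded array
}


def row_to_document(columns, row):
    """Turn a flat row into a flat document by pairing columns with values."""
    return dict(zip(columns, row))


flat = row_to_document(columns, row)
```

A flat row always converts into a (flat) document; going the other way usually means splitting nested fields into additional tables.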
  • 17. Administration The usual handicraft work ... get and keep it running ... if in doubt, automate!
  • 18. Is there a Web GUI? What about a REST Interface/API?
      • Assuming a mongod process is running on localhost, we can access some statistics at http://localhost:28017/ and http://localhost:28017/_status
      • In order to have a REST interface to MongoDB, same as CouchDB has it, we have to start mongod with the --rest switch.
        ◦ Note however that this is just a read-only REST interface.
      • For a read and/or write REST interface:
      • If we wanted real-time updates from the CLI, then we could also use mongostat.
  • 19. Can I rename a Database? Yes, but it is not as easy as renaming a collection. As of now, the recommended way to rename a database is to clone it and thereby rename it. This will require enough additional free disk space to fit the current/old database at least twice.
  • 20. How do I physically migrate a Database?
      There is even a clone command for that. Note however that neither copyDatabase() nor cloneDatabase() actually performs a point-in-time snapshot of the entire database -- what they basically do is query the source database and then replicate to the target database. That is, if we use copyDatabase() or cloneDatabase() on a source database which is online and has operations performed on it, then the target database cannot be a point-in-time snapshot of the exact moment the command was issued. Rather, at some point in time, it will/might have the same data/state as its source database.

      Secure Copy .... as in scp
      A bit of downtime but the chance to resume a canceled transfer ....
      • shut down mongod on the old machine
      • copy/sync the database directory to the new machine
      • start mongod on the new machine with dbpath set appropriately

      Minimum Downtime
      Below is what we could do in order to have as little downtime as possible:
      • stop and re-start the existing mongod as master (if it is not already running as master that is)
      • install mongod on the new machine and configure it as slave using --slave and --source
      • wait while the slave copies the database, re-indexes and then catches up with its master (this happens automatically when we point a slave to its master)
      • once the slave has caught up, disable writes to the master (clients can still read/query)
      • once all outstanding writes have been committed on the master and the slave has caught up, shut down the master and restart the slave as the new master. The old master can now be removed entirely.
      • now we point all traffic at the new master
  • 21.
      • finally we enable writes on the new master again, ... Et voilà!
      Of course, we might also use OpenVZ and its live-migration feature ...
  • 22. How do I update to a new MongoDB version? If it is a drop-in replacement we just need to shutdown the older version and start the new one with the appropriate dbpath. Otherwise, i.e. if it is not a drop-in replacement, we would use mongoexport followed by mongoimport.
  • 23. What is the default listening Port and IP?
      We can use netstat to find out:
        wks:/home/sa# netstat -tulpena | grep mongo
        tcp 0 0* LISTEN 124 1474236 8822/mongod
        tcp 0 0* LISTEN 124 1474237 8822/mongod
        wks:/home/sa#
      The default listening port for mongod is 27017. 28017 is where we can point our web browser in order to get some statistics. The default listening IPs are all local IPs, i.e. 0/0, which matches every address of the local machine. And yes, this includes the loopback device/address/network, the private class A network, the private class B network and of course also the private class C network amongst others.
      Both listening port and IP address can be changed, either by using the CLI switches --port and --bind_ip or via the configuration file, which we can figure out by looking at the runtime configuration.
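As a complement to netstat, a small Python sketch can probe whether something accepts TCP connections on a given port. The helper name `is_listening` is made up for this example; the two port numbers are the defaults quoted above. It only tells us that some process is listening, not that the process is mongod.

```python
import socket

DEFAULT_MONGOD_PORT = 27017   # default mongod port, from the text above
DEFAULT_HTTP_PORT = 28017     # statistics page, from the text above


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0
```

Calling `is_listening("127.0.0.1", DEFAULT_MONGOD_PORT)` on the database host should return True while mongod runs with its default settings.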
  • 24. Is there a Way to do automatic Backups? Yes,
  • 25. What is getSisterDB() good for?
      We can use it to get ourselves references to databases, which not only saves a lot of typing but is, once we get used to it, a lot more intuitive:
         1 sa@wks:~/mm/new$ mongo
         2 MongoDB shell version: 1.5.2-pre-
         3 url: test
         4 connecting to: test
         5 type "help" for help
         6 > db.getCollectionNames();
         7 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
         8 > reference_to_test_db = db.getSisterDB('test');
         9 test
        10 > reference_to_test_db.getCollectionNames();
        11 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
        12 > use admin
        13 switched to db admin
        14 > reference_to_test_db.getCollectionNames();
        15 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
        16 > bye
        17 sa@wks:~/mm/new$
      Note how we get a reference to our test database in line 8 and how it is used in line 10 and even in line 14, after switching from our test database to the admin database. getCollectionNames() has just been chosen as an example; it could have been any other command as well, of course.
  • 26. How can I make MongoDB automatically start/restart on Server boot/reboot?
      One way would be to use the @reboot directive with Cron. However, .deb and .rpm packages already install init scripts (sysv or upstart style, as appropriate) on Debian, Ubuntu, Fedora, and CentOS, so MongoDB will restart there without us having to do anything special.
      • For other constellations, an init.d script for Unix-like systems based on
      • For Mac OS X, people have reported that launchctl configurations like Launchctl/blob/master/org.mongo.mongod.plist work.
      • For Windows, we have documentation.
  • 27. Resource Usage Lots of confusion amongst beginners ...
  • 28. Why is my Database growing so fast? The first file for a database is dbname.0, then dbname.1, etc. dbname.0 will be 64 MiB, dbname.1 128 MiB, ... up to 2 GiB. Once the files reach 2 GiB in size, each successive file is also 2 GiB. So, if we have, say, database files up to dbname.n, then dbname.n-1 might be 90% unused but dbname.n has already been allocated once we start using dbname.n-1. The reasoning here is simple: we do not want to wait for new database files when we need them, so we always allocate the next one in the background as soon as we start to use an empty one. Note that deleting data and/or dropping a collection or index will not release already allocated disk space since it is allocated per database. Disk space will only be released if a database is repaired or the database is dropped altogether.
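The doubling rule above can be turned into a few lines of arithmetic. The following Python sketch only encodes the numbers stated in the text (64 MiB first file, doubling, 2 GiB cap, next file allocated ahead of time); the function names are made up for this example.

```python
FIRST_FILE_MIB = 64      # size of dbname.0
MAX_FILE_MIB = 2048      # 2 GiB cap per data file


def datafile_size_mib(n):
    """Size in MiB of data file dbname.n under the doubling rule."""
    return min(FIRST_FILE_MIB * 2 ** n, MAX_FILE_MIB)


def allocated_mib(files_in_use):
    """Disk space allocated when `files_in_use` files hold data.

    One extra file is counted because the next file is allocated in the
    background as soon as the newest one starts being used.
    """
    return sum(datafile_size_mib(n) for n in range(files_in_use + 1))


sizes = [datafile_size_mib(n) for n in range(7)]
# sizes is [64, 128, 256, 512, 1024, 2048, 2048]
```

For example, a database that has only just started writing into dbname.0 already occupies `allocated_mib(1)`, i.e. 64 + 128 = 192 MiB on disk.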
  • 29. What Caching Algorithm does MongoDB use? Actually, that is done by the OS using the LRU (Least Recently Used) caching pattern.
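For readers who have not met LRU before, here is a toy Python sketch of the pattern. It is an illustration of the idea only and makes no claim about how any operating system implements its page cache; the class name, page names and capacity are invented.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache holding at most `capacity` pages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def get(self, key):
        if key not in self.pages:
            return None                      # cache miss
        self.pages.move_to_end(key)          # mark as most recently used
        return self.pages[key]

    def put(self, key, value):
        self.pages[key] = value
        self.pages.move_to_end(key)
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict least recently used


cache = LRUCache(capacity=2)
cache.put("page-a", b"...")
cache.put("page-b", b"...")
cache.get("page-a")            # page-a is now the most recently used
cache.put("page-c", b"...")    # cache is full, so page-b gets evicted
```

The page that was touched longest ago is the one that has to make room, which is why frequently accessed data tends to stay in RAM.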
  • 30. Why does MongoDB use so much RAM?
      Well, it does not actually, it is just that most folks do not really understand memory management -- there is more to it than just "is in RAM" or "is not in RAM". The current default storage engine for MongoDB is called MongoMemMapped_RecStore. It uses memory-mapped files for all disk I/O operations. Using this strategy, the operating system's virtual memory manager is in charge of caching. This has several implications:
      • There is no redundancy between file system cache and database cache; actually, they are one and the same.
      • MongoDB can use all free memory on the server for cache space automatically without any configuration of a cache size.
      • Virtual memory size and RSS (Resident Set Size) will appear to be very large for the mongod process. This is benign however -- virtual memory space will be just larger than the size of the datafiles open and mapped; resident size will vary depending on the amount of memory not used by other processes on the machine.
      • Caching behavior (such as LRU'ing out of pages, and laziness of page writes) is controlled by the operating system. The quality of the VMM (Virtual Memory Manager) implementation will vary by OS.
      As of now, an alternative storage engine (CachedBasicRecStore), which does not use memory-mapped files, is under development. This engine is more traditional in design with its own page cache. With this store the database has more control over the exact timing of reads and writes, and of the cache LRU strategy. Generally, the memory-mapped store (MongoMemMapped_RecStore) works quite well. The alternative store will be useful in cases where an operating system's VMM is behaving suboptimally.
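What "memory-mapped file" means in practice can be tried out with Python's standard mmap module. This is a generic sketch of the mechanism, not MongoDB code; the temporary file merely stands in for a data file.

```python
import mmap
import os
import tempfile

# Create a small stand-in "data file" (hypothetical, not a real MongoDB file).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello mongodb" + b"\x00" * 4083)   # 4096 bytes in total
os.close(fd)

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mapped:
        first = mapped[:5]          # reading looks like plain memory access
        mapped[0:5] = b"HELLO"      # writing changes the mapped pages
        mapped.flush()              # ask the OS to write dirty pages to disk

with open(path, "rb") as f:
    on_disk = f.read(13)            # the change has reached the file

os.remove(path)
```

The program never calls read() or write() on the mapping; which pages live in RAM at any moment, and when dirty pages reach the disk, is left to the operating system unless a flush is requested.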
  • 31. What is the so-called Working Set Size? Working set size can roughly be thought of as how much data we will need MongoDB (or any other DBMS, relational or non-relational) to access in a period of time. For example, YouTube has ridiculous amounts of data, but only 1% may be accessed at any given time. If, however, we are in the rare case where all the data we store is accessed at the same rate at all times (LRU), then our working set size can be defined as our entire data set stored in MongoDB.
  • 32. How much RAM does MongoDB need?
      We now know MongoDB's caching pattern, and we also know what a working set size is. Therefore we can state the following rule of thumb on how much RAM a machine needs in order to work properly: the working set plus MongoDB's indexes should reside in RAM at all times, i.e. the amount of available RAM should be at least the working set size plus the size of the indexes plus whatever the rest of the OS and other software running on the same machine needs.

      Speed Impact of not having enough RAM
      Generally, when databases are too big to fit into RAM entirely, and if we are doing random access, we are in trouble as HDDs are slow at that (roughly 100 operations per second per drive). One solution is to have lots of HDDs (10, 100, ...). Another one is to use SSDs (Solid State Drives) or, even better, add more RAM. That being said, the key factor here is random access. If we do sequential access to data bigger than RAM, then that is fine. So, it is ok if the database is huge (more than RAM size), but if we do a lot of random access to data, it is best if the working set fits in RAM entirely. However, there are some nuances around having indexes bigger than RAM with MongoDB. For example, we can speed up inserts if the index keys have certain properties -- if inserts are an issue, then that would help.
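The rule of thumb and the "roughly 100 operations per second per drive" figure lend themselves to a quick back-of-the-envelope calculation. The Python sketch below uses made-up example sizes (6 GiB working set, 2 GiB indexes, 1.5 GiB for everything else) purely to show the arithmetic.

```python
def required_ram_gib(working_set_gib, index_gib, os_and_other_gib):
    """Rule of thumb from the text: working set + indexes + everything else."""
    return working_set_gib + index_gib + os_and_other_gib


def random_read_seconds(operations, drives=1, ops_per_second_per_drive=100):
    """Rough time for random reads that miss RAM and hit spinning disks."""
    return operations / (drives * ops_per_second_per_drive)


# Hypothetical machine: the three sizes below are invented for the example.
needed = required_ram_gib(working_set_gib=6, index_gib=2, os_and_other_gib=1.5)

one_drive = random_read_seconds(10_000)              # 100.0 seconds
ten_drives = random_read_seconds(10_000, drives=10)  # 10.0 seconds
```

With these example numbers the machine should have at least 9.5 GiB of RAM, and 10,000 random reads that miss RAM cost on the order of 100 seconds on one drive, which is why adding drives, SSDs or RAM makes such a difference.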
  • 33. Can I limit MongoDB's RAM Usage? No, it is not designed to do that, it is designed for speed and scalability. If we wanted to run MongoDB on the same physical machine alongside some web server and for example some application server like Django, then we could ensure memory limits on each one by simply using virtualization and putting each one in its own VE (Virtual Environment). In the end we would thus have a web application made of MongoDB, Django and for example Cherokee, all running on the same physical machine but being limited to whatever limits we set on each VE they run in.
  • 34. What can I do about Out Of Memory Errors? If we are getting something like Fri May 21 08:29:52 JS Error: out of memory (or similar) in our logs, then we have hit a memory limit. As we already know, MongoDB takes all the RAM it can get, i.e. RSS (Resident Set Size), itself part of virtual memory, will appear to be very large for the mongod process. The important point here is how this is handled by the OS. If the OS just blocks any attempt to get more virtual memory or, even worse, kills the process (e.g. mongod) which tries to get more virtual memory, then we have got a problem. What can be done is to elevate/alter a few settings:
       1 sa@wks:~$ ulimit -a | egrep 'virtual|open'
       2 open files (-n) 1024
       3 virtual memory (kbytes, -v) unlimited
       4 sa@wks:~$ lsb_release -irc
       5 Distributor ID: Debian
       6 Release: unstable
       7 Codename: sid
       8 sa@wks:~$ uname -a
       9 Linux wks 2.6.32-trunk-amd64 #1 SMP Sun Jan 10 22:40:40 UTC 2010 x86_64 GNU/Linux
      10 sa@wks:~$
    As we can see from lines 5 to 9, I am on Debian sid (still in development) running the 2.6.32 Linux kernel. The settings we are interested in are in lines 2 and 3. Virtual memory is unlimited by default so that is fine already -- this is actually what causes the most problems, so we need to make sure virtual memory is either reasonably high or, even better, set to unlimited as shown above. With regards to allowed open file descriptors -- by default we are limited to 1024 open files which, in some cases, might pose a problem -- simply elevating the limit might be enough already to make memory errors go away.
  • 35. Note that we need to run these commands (e.g. ulimit -v unlimited) in the same user context as mongod, i.e. we basically want to script them as part of our mongod startup process. OpenVZ: If we are running MongoDB with OpenVZ then there are some more settings we might want to tune in order to keep the OOM (Out Of Memory) killer from kicking in, or to avoid hitting the virtual memory ceiling if it is not set to unlimited. Special attention should be paid to the OpenVZ memory settings, i.e. they should be set to reflect MongoDB's memory usage.
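    The two limits discussed above can also be read from inside a process. What follows is a minimal sketch using Python's standard resource module (Unix only); the helper names describe_limit and current_limits are made up for illustration. It only reports the limits of the process it runs in, so it would have to be started in the same user context as mongod to say anything about mongod.

```python
import resource


def describe_limit(value):
    """Render an rlimit value the way `ulimit` does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return the soft limits for open files and virtual memory.

    Note: RLIMIT_AS is reported in bytes, whereas `ulimit -v` prints kbytes.
    """
    nofile_soft, _nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    as_soft, _as_hard = resource.getrlimit(resource.RLIMIT_AS)
    return {
        "open files": describe_limit(nofile_soft),
        "virtual memory": describe_limit(as_soft),
    }


for name, value in current_limits().items():
    print(f"{name}: {value}")
```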
  • 36. Does MongoDB use more than one CPU Core? For write operations MongoDB makes use of one CPU core. For read operations however, which tend to be the majority of operations, MongoDB uses all CPU cores available to it. In short: one will notice a speed increase going from a single-core CPU to dual-core or even higher e.g. quad-core or maybe even octo-core since the speed increase is roughly proportional to the available CPU cores.
  • 37. How can I tell how many clients are connected? We can look at the connections field (current) with the server status:
      sa@wks:~$ mongo --quiet
      type "help" for help
      > db.serverStatus();
      {
        [skipping a lot of lines ...]
        "connections" : { "current" : 2, "available" : 19998 },
        [skipping a lot of lines ...]
      }
      > bye
      sa@wks:~$
  • 38. How many parallel Client Connections to MongoDB can there be? Have a look at the connections field (available) with the server status.
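    To turn the current and available numbers into something we can alert on, a small sketch like the following may help. It works on a plain dict shaped like the serverStatus output shown above; in a real setup that dict would come from a driver (with PyMongo, for example, db.command("serverStatus")). The function name connection_usage is hypothetical.

```python
def connection_usage(server_status):
    """Return (current, total, fraction in use) from a serverStatus-like dict."""
    conns = server_status["connections"]
    current = conns["current"]
    total = current + conns["available"]
    return current, total, current / total


# Sample values taken from the serverStatus output in this FAQ.
status = {"connections": {"current": 2, "available": 19998}}
current, total, fraction = connection_usage(status)
print(f"{current} of {total} connections in use ({fraction:.2%})")
```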
  • 39. Does MongoDB do Connection Pooling? Yes, we can do connection pooling for performance reasons and overall resource usage optimization -- without it things would be a lot slower and more resource intensive. As of now (June 2010) most of the client drivers, e.g. PyMongo, do connection pooling; how exactly it is done varies from driver to driver.
  • 40. Is there a Size limit of how much Data can be stored inside MongoDB? 4 MiB is the limit on individual documents, but GridFS uses many documents, so technically/practically speaking there is no limit. While this holds for x86-64, it is not entirely true for x86 (32 bit) -- there, because of how memory mapped files work, we are limited to 2 GiB per database.
  • 41. Do embedded Documents count toward the 4 MiB BSON Document Size Limit? Yes, the entire BSON (Binary JSON) document (including all embedded documents, etc.) cannot be more than 4 MiB in size.
  • 42. Does Document Size impact read/write Performance? Yes, but this is mostly due to network limitations e.g. one will max out a GigE link with inserts before document size starts to slow down MongoDB itself.
  • 43. Is there a Way to tell the Size of a specific Document? Yes, one can use Object.bsonsize(db.whatever.findOne()) in the shell like this:
      sa@wks:~$ mongo
      MongoDB shell version: 1.5.1-pre-
      url: test
      connecting to: test
      type "help" for help
      >{ name : "katze" });
      > Object.bsonsize(db.test.findOne({ name : "katze"}))
      38
      > bye
      sa@wks:~$
  • 44. How can I tell the Size of a Collection and its Indexes?
      sa@wks:~$ mongo --quiet
      type "help" for help
      > db.getCollectionNames();
      [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
      > db.test.dataSize();
      160
      > db.test.storageSize();
      2304
      > db.test.totalIndexSize();
      8192
      > db.test.totalSize();
      10496
    We are using the test collection here. dataSize() is self-explanatory. storageSize() includes our data plus all the disk space already allocated to this collection but still free. totalIndexSize() is the size in bytes of all the indexes in this collection and totalSize() is all the storage allocated for all data and indexes in this collection. If we need/want a more detailed view we could also have a look at
      > db.test.validate();
      {
        "ns" : "test.test",
        "result" : "
      validate
        firstExtent:2:2b00 ns:test.test
        lastExtent:2:2b00 ns:test.test
        # extents:1
        datasize?:160 nrecords?:4 lastExtentSize:2304
  • 45. padding:1
        first extent:
          loc:2:2b00 xnext:null xprev:null
          nsdiag:test.test
          size:2304 firstRecord:2:2be8 lastRecord:2:2c58
        4 objects found, nobj:4
        224 bytes data w/headers
        160 bytes data wout/headers
        deletedList: 0000001000000000000
        deleted: n: 1 size: 1904
        nIndexes:1
          test.test.$_id_ keys:4
      ",
        "ok" : 1,
        "valid" : true,
        "lastExtentSize" : 2304
      }
      > bye
      sa@wks:~$
    Note that while MongoDB generally does a lot of pre-allocation, we can remedy this by starting mongod with --noprealloc and --smallfiles.
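    How the four size numbers relate to each other can be checked with a bit of arithmetic. The sketch below uses the values from the test collection above; the function name size_breakdown and the key names are invented for illustration, and the "not holding document data" figure includes record headers, not just free space.

```python
def size_breakdown(data_size, storage_size, total_index_size):
    """Relate the values reported by dataSize(), storageSize() and totalIndexSize()."""
    return {
        # What totalSize() reports: storage for data plus storage for indexes.
        "total": storage_size + total_index_size,
        # Allocated to the collection but not holding document data
        # (this still includes per-record headers).
        "allocated_not_data": storage_size - data_size,
    }


report = size_breakdown(data_size=160, storage_size=2304, total_index_size=8192)
print(report)
```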
  • 46. Collections / Namespaces Needs to be known, plain and simple ...
  • 47. What is a Capped Collection? Why use it? • Size: • Time (TTL Collections):
  • 48. Can I rename a Collection? Yes. Using help(); from MongoDB's interactive shell we get, amongst others, db.test.renameCollection( newName , <dropTarget> ) which renames the collection. So yes, we could call renameCollection with 'bar' on the collection foo and have foo renamed to bar. Renaming a collection is an atomic operation, by the way.
  • 49. What is a Virtual Collection? Why use it? It refers to the ability to reference embedded documents as if they were a first-class collection of top level documents, querying on them and returning them as stand-alone entities, etc.
  • 50. Can I use a larger Number of Collections/Namespaces? There is a limit to how many collections/namespaces we can have within a single MongoDB database. It is ~24000 namespaces per database. This is essentially the number of collections plus the number of indexes.
  • 51. How about cloning a Collection? Yes, possible. Have a look at mongoexport and mongoimport.
  • 52. Can I merge two or more Collections into one? Yes, we read the documents from all the collections we want to merge and use insert() to write them into our single target collection. This can be done on the server (using MongoDB's interactive shell) or from a client.
  • 53. How can I get a list of Collections in my Database? We can use getCollectionNames() as shown below in lines 8 and 9. Yet another possibility is shown in lines 23 to 28. Of course, since every collection is also a namespace, we can find them alongside the indexes in lines 11 to 21:
       1 sa@wks:~$ mongo
       2 MongoDB shell version: 1.2.4
       3 url: test
       4 connecting to: test
       5 type "help" for help
       6 > db
       7 test
       8 > db.getCollectionNames();
       9 [ "fs.chunks", "fs.files", "mycollection", "system.indexes", "things" ]
      10 > db.system.namespaces.find();
      11 { "name" : "test.system.indexes" }
      12 { "name" : "test.fs.files" }
      13 { "name" : "test.fs.files.$_id_" }
      14 { "name" : "test.fs.files.$filename_1" }
      15 { "name" : "test.fs.chunks" }
      16 { "name" : "test.fs.chunks.$_id_" }
      17 { "name" : "test.fs.chunks.$files_id_1_n_1" }
      18 { "name" : "test.things" }
      19 { "name" : "test.things.$_id_" }
      20 { "name" : "test.mycollection" }
      21 { "name" : "test.mycollection.$_id_" }
      23 > show collections
      24 fs.chunks
      25 fs.files
      26 mycollection
  • 54. 27 system.indexes
      28 things
      29 > bye
      30 sa@wks:~$
  • 55. How do I delete a Collection? db.collection.drop() but there is no undo so beware.
  • 56. What is a Namespace with regards to MongoDB? Collections can be organized in namespaces. These are named groups of collections defined using a dot notation. For example, we could define collections blog.posts and blog.authors; both reside under the namespace blog but are two separate collections. Namespaces can then be used to access these collections using the dot notation, e.g. a query against blog.posts will return all documents from the collection blog.posts but nothing from the collection blog.authors. Namespaces simply provide an organizational mechanism for the user, i.e. the collection namespace is flat from the database point of view, which means that blog.authors really is just a collection on its own and not some collection authors grouped under some namespace blog. Technically speaking blog.authors is no different than foo -- grouping just helps the humans keep track ...
  • 57. How can I get a list of Namespaces in a Database? One way to list all namespaces for a particular database would be to enter MongoDB's interactive shell:
      sa@wks:~$ mongo
      MongoDB shell version: 1.2.4
      url: test
      connecting to: test
      type "help" for help
      > db.system.namespaces.find();
      { "name" : "test.system.indexes" }
      { "name" : "test.fs.files" }
      { "name" : "test.fs.files.$_id_" }
      { "name" : "test.fs.files.$filename_1" }
      { "name" : "test.fs.chunks" }
      { "name" : "test.fs.chunks.$_id_" }
      { "name" : "test.fs.chunks.$files_id_1_n_1" }
      { "name" : "test.things" }
      { "name" : "test.things.$_id_" }
      { "name" : "test.mycollection" }
      { "name" : "test.mycollection.$_id_" }
      > db.system.namespaces.count();
      11
      > bye
      sa@wks:~$
    The system namespace in MongoDB is special since it contains database system information (read metadata). It holds several collections, for example system.namespaces, which can be used to get information about all the namespaces within a database.
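    Since a namespace listing mixes collections and indexes, it can be handy to pull them apart on the client. Here is a minimal sketch; it assumes, based purely on the listing above, that index namespaces are the ones containing a "$", and the function name split_namespaces is made up.

```python
def split_namespaces(names, database):
    """Split full namespace names into (collections, indexes) for one database."""
    prefix = database + "."
    collections, indexes = [], []
    for name in names:
        # Drop the leading "<database>." so we end up with the local name.
        local = name[len(prefix):] if name.startswith(prefix) else name
        # Assumption: index namespaces carry a "$", e.g. "things.$_id_".
        (indexes if "$" in local else collections).append(local)
    return collections, indexes


names = [
    "test.system.indexes", "test.fs.files", "test.fs.files.$_id_",
    "test.fs.files.$filename_1", "test.fs.chunks", "test.fs.chunks.$_id_",
    "test.fs.chunks.$files_id_1_n_1", "test.things", "test.things.$_id_",
    "test.mycollection", "test.mycollection.$_id_",
]
collections, indexes = split_namespaces(names, "test")
print(sorted(collections))
print(len(collections) + len(indexes))
```

    The sorted collection names come out the same as what getCollectionNames() returned earlier, and collections plus indexes add up to the count of 11.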
  • 58. Statistics / Monitoring Because pilots need to know ...
  • 59. The Server Status, what does it tell?
      sa@wks:~$ mongo --quiet
      type "help" for help
      > db.serverStatus();
      {
        "uptime" : 6695,
        "localTime" : "Sun Apr 11 2010 11:22:19 GMT+0200 (CEST)",
        "globalLock" : { "totalTime" : 6694193239, "lockTime" : 45048, "ratio" : 0.000006729414343397326 },
        "mem" : { "resident" : 3, "virtual" : 138, "supported" : true, "mapped" : 0 },
    Most of it is obvious, like for example uptime. The globalLock part is interesting. totalTime is the same as uptime but in microseconds. lockTime is the amount of time the global lock has been held, i.e. the total time spent waiting for write queries until a lock has been assigned and thus a write could be made. One may ask what is the point of having both uptime and totalTime? Well, totalTime will roll over faster since it is in microseconds, so at some point they diverge. The rollover is coordinated between totalTime and lockTime. mem units are in MiB, all of them. resident is what is in physical memory (also known as RAM), virtual is the virtual address space, mapped is the memory-mapped space, and supported tells us whether memory info is supported on our platform.
  • 60.
        "connections" : { "current" : 2, "available" : 19998 },
        "extra_info" : { "note" : "fields vary by platform", "heap_usage_bytes" : 146048, "page_faults" : 57 },
        "indexCounters" : { "btree" : { "accesses" : 0, "hits" : 0, "misses" : 0, "resets" : 0, "missRatio" : 0 } },
        "backgroundFlushing" : { "flushes" : 111, "total_ms" : 2, "average_ms" : 0.018018018018018018, "last_ms" : 0, "last_finished" : "Sun Apr 11 2010 11:21:45 GMT+0200 (CEST)" },
    connections tells us how many client connections we can open against mongod, more precisely, current tells us how many existing client connections to mongod there are right now and available shows us how many we got left. Within the extra_info part we have heap_usage_bytes which is the main memory needed by the database.
  • 61.
        "opcounters" : { "insert" : 16513, "query" : 1482263, "update" : 141594, "delete" : 38, "getmore" : 246889, "command" : 1247316 },
        "asserts" : { "regular" : 0, "warning" : 0, "msg" : 0, "user" : 0, "rollovers" : 0 },
        "ok" : 1
      }
      > bye
      sa@wks:~$
    The opcounters part is also pretty interesting. insert, query, update, and delete are self-explanatory but getmore and command are probably not. When we do a query, we get results in batches. The first batch is counted in query, all subsequent ones in getmore. Commands are things like count, group, distinct, etc. And yes, taking those numbers and dividing them by time (delta or total) will give us operations/time, e.g. operations per second or operations since mongod got started. In fact, there is a Munin plugin which uses this.
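    The "divide by time" idea can be sketched in a few lines. The function below takes two opcounters snapshots and the number of seconds between them; the name ops_per_second and the earlier sample values are made up for illustration, only the later sample uses numbers from the output above.

```python
def ops_per_second(earlier, later, seconds):
    """Per-second rate for each counter between two opcounters snapshots."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {op: (later[op] - earlier[op]) / seconds for op in later}


# `later` uses values from the serverStatus output; `earlier` is hypothetical.
earlier = {"insert": 16413, "query": 1481263}
later = {"insert": 16513, "query": 1482263}
print(ops_per_second(earlier, later, seconds=10))
```

    Dividing a single snapshot by uptime instead would give the average rate since mongod got started.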
  • 62. Schema / Configuration Sorry folks, no can do, lack of time ... go to
  • 63. Indexes / Search / Metadata Sorry folks, no can do, lack of time ... go to
  • 64. Map / Reduce Sorry folks, no can do, lack of time ... go to
  • 65. GridFS / Data Size Store tons of data reliably and smartly ...
  • 66. What is GridFS? Basically a collection of normal documents. We have two collections, one for metadata (fs.files) and one consisting of chunks of data (fs.chunks). The GridFS spec provides a mechanism for transparently dividing a large file among multiple documents. This allows us to efficiently store large objects and, in the case of especially large files such as videos, permits range operations (e.g. fetching only the first n bytes of a file). What can we do with GridFS? Store ridiculous amounts of data in a smart way.
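    To make the chunking idea concrete, here is a rough, driver-free sketch of how a file could be divided among chunk documents and how a byte range could be read back touching only the chunks that cover it. The field names files_id and n are taken from the fs.chunks index seen in the namespace listings; the data field, the function names and the tiny chunk size are assumptions for illustration, not the actual GridFS implementation.

```python
def split_into_chunks(files_id, data, chunk_size):
    """Divide `data` into chunk documents numbered n = 0, 1, 2, ..."""
    return [
        {"files_id": files_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]


def read_range(chunks, start, length, chunk_size):
    """Read `length` bytes from offset `start`; `chunks` must be ordered by n."""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    # Only the chunks covering the requested range are joined.
    joined = b"".join(chunk["data"] for chunk in chunks[first:last + 1])
    skip = start - first * chunk_size
    return joined[skip:skip + length]


chunks = split_into_chunks("file-1", bytes(range(10)), chunk_size=4)
print([len(chunk["data"]) for chunk in chunks])
print(read_range(chunks, start=3, length=4, chunk_size=4))
```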
  • 67. Why use GridFS over ordinary Filesystem Storage? If we use the filesystem we would have to handle backup/replication/scaling ourselves. We would also have to come up with some sort of hashing scheme ourselves plus we would need to take care about cleanup/sorting/moving because filesystems do not love lots of small files. With GridFS, we can use MongoDB's built-in replication/backup/scaling e.g. scale reads by adding more read-only slaves and writes by using sharding. We also get out of the box hashing (read UUID (Universally Unique Identifier)) for stored content plus we do not suffer from filesystem performance degradation because of a myriad of small files. Also, we can easily access information from random sections of large files, another thing traditional tools working with data right off the filesystem are not good at. Last but not least, we can keep information associated with the file (who has edited it, download count, description, etc.) right with the file itself.
  • 68. Scalability / Fault Tolerance / Load Balancing Sorry folks, no can do, lack of time ... go to mongodb.html#faqs_scalability_fault_tolerance_load_balancing
  • 69. Miscellaneous Sorry folks, no can do, lack of time ... go to
  • 70. Use Case This should have been my major part ◦ locking (read transactions) ◦ asynchronous as opposed to synchronous operations ◦ numbers (double precision) Again, lack of time ... go to
  • 71. Summary Part 1 Tell them what you told them ... simple as that ...
  • 72. Introduction Part 2 Before starting with MongoDB-specific topics it's important to know that we don't dislike relational databases. We know they are good for many things, but we also know that a web application's success is mainly based on its performance and speed, so that's what we're after and that's why we're all here.
  • 73. Existing Technologies • MongoKit (Nicolas Clairon): ◦ Great for completely unstructured model programming. It has structure validation but I've never used it; I prefer to use MongoKit on models that may be constantly changing their structure. • mongoengine (Harry Marr): ◦ It allows you to define schemas for documents and query collections using Django-like syntax. • django-mongodb-engine (Alberto Paro and myself): ◦ This is a real Django backend based on django-mongodb and mongoengine, adapted to work with django-nonrel and MongoDB without changing anything in the code.
  • 74. SQL to MongoDB Query Translation.... "What matters is who adapts faster to the changing conditions" - Charles Darwin The first thing we should remember when moving from SQL databases to NoSQL ones is that models were made to model data, but models can be modeled too. What I mean is that people tend to adapt database features to their models instead of adapting their models to the database. I'll try to cover some of the common questions found on the mailing list: • Let's start with JOINs. Why JOINs? Because we don't have them in MongoDB and we might need them, so we have to figure out the best workaround. The best thing you can do here is forget about JOINs; you won't have them. We are not talking about highly relational databases but about non-relational ones, so there can't be joins between 2 collections if there's no relation between them. One of the things we did was remodel the way we stored data: we embedded what could be embedded and did 2 or more queries where embedding was not possible. • What about ForeignKeys, do we have those? Yes, or kind of. We have DBRef, which is a kind of ForeignKey, but I personally wouldn't use refs in MongoDB. As I said, MongoDB is not about references and relations between collections; it is about performance based on dynamism. • If MongoDB barely has references, you can guess that many-to-many relations are insignificant; instead I would start thinking in terms of dictionaries/maps and lists/arrays. • And last but not least, if you really need a query that joins 2 collections based on a field reference and handles a many-to-many relation, then you have map/reduce.
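    The "2 or more queries where embedding was not possible" approach boils down to stitching results together in the application. The sketch below shows that step only; plain lists stand in for the results of two separate queries, and the field names (author_id, _id, name) as well as the function name attach_authors are hypothetical.

```python
def attach_authors(posts, authors):
    """Join posts to their authors in application code instead of in the database."""
    authors_by_id = {author["_id"]: author for author in authors}
    # Each post gets its author document attached (None if the author is unknown).
    return [dict(post, author=authors_by_id.get(post["author_id"])) for post in posts]


# Stand-ins for the results of query 1 (posts) and query 2 (their authors).
posts = [
    {"title": "Hello", "author_id": 1},
    {"title": "World", "author_id": 2},
]
authors = [{"_id": 1, "name": "Markus"}, {"_id": 2, "name": "Flavio"}]

for post in attach_authors(posts, authors):
    print(post["title"], "by", post["author"]["name"])
```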
  • 75. Keeping things lazy... Yes, because we're lazy people so we do lazy things ... When getting ORMs to work with MongoDB it is important to keep things lazy to avoid bottlenecks in our web applications. MongoDB doesn't have many-to-many relations but it can store lists and dictionaries. For example:
      class User(models.Model):
          nickname = models.CharField(max_length=255)
          full_name = models.CharField(max_length=255)
          friends = ListField()
          groups = ListField()
    In the User model we have 2 ListFields that may cause slowdowns in our web application. The first one is a list containing the ids/names of the user's friends and the second one contains the groups the user is related to. Now think of a popular user who has many friends and is related to many groups: that's a lot of data transfer and many instantiations for our code, because each object/id in the ListField has to be instantiated. This might sound obvious but trust me, nothing is obvious when doing web programming.
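    One way to picture "keeping things lazy" is a list that holds only the raw ids and builds an object the first time an item is actually accessed. This is a plain-Python sketch of the idea, not how any of the libraries mentioned here implement it; LazyList and load_user are made-up names.

```python
class LazyList:
    """Holds raw ids and only instantiates an object when an item is accessed."""

    def __init__(self, ids, loader):
        self._ids = list(ids)
        self._loader = loader  # callable turning an id into an object
        self._cache = {}

    def __len__(self):
        # Counting items never triggers any instantiation.
        return len(self._ids)

    def __getitem__(self, index):
        raw_id = self._ids[index]
        if raw_id not in self._cache:
            self._cache[raw_id] = self._loader(raw_id)
        return self._cache[raw_id]


loaded = []


def load_user(user_id):
    """Hypothetical loader; a real one would fetch the user from the database."""
    loaded.append(user_id)
    return {"id": user_id}


friends = LazyList(["anna", "bob", "carl"], load_user)
print(len(friends), loaded)
print(friends[1], loaded)
```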
  • 76. Keeping Relations or Embedding? This is a common question when moving from relational databases to non-relational ones. Should we keep our models related or embed the smaller ones into the bigger ones? The answer is NO, you shouldn't keep them related. For example, a common situation (or one commonly used to show how MongoDB works) is a blog engine with posts and comments. Let's see how we could handle comments (not threaded) in our blog engine:
  • 77. Using References:
      class Comment(models.Model):
          post = models.ForeignKey(Post)
          user = models.ForeignKey(User)
          text = models.CharField(max_length=255)

      my_comment, created = Comment.objects.get_or_create(post=my_post, user=my_user, text=my_text, defaults={})
  • 78. Without references:
      class Post(models.Model):
          ....
          comments = ListField()

      post.comments.append({'user' : user, 'text' : text})
    The first example is the most used because it is the way we're used to thinking when we write our models, but the second one is the right one when talking about NoSQL databases because references make things slower. The bad thing about embedding our comments like that is that we have to worry about our 4 MiB document limit, so if we are really popular on the net and many people come to our blog and comment on our posts, that might be a problem for us. Even so, this is great: we have removed a model from our app so it should be easier to maintain, shouldn't it? But what is user supposed to be? Is it an embedded user object? Is it a ForeignKey? How should we handle users there? Again, it depends on how you'd like to do things. For example, it is possible to save the username as it should be shown and then just show the username when the comments are loaded; for those wanting to know more about this user, clicking on the username can load the user's personal info. Here are some examples:
  • 79. Light and fast (For registered users):
      post.comments.append({'user' : 'FlaPer87', 'text' : 'My Comment'})
    Heavy and slow (For any user):
      post.comments.append({'user' : {'username' : 'FlaPer87', 'email' : '', 'url' : ''}, 'text' : 'My Comment'})
  • 80. Lazy relations or mongodb like ones: #Automatic serialization done in django-mongodb-engine post.comments.append({'user' : {'_app': model._meta.app_label, '_model': model._meta.module_name, 'pk':, '_type': "django"}, 'text' : 'My Comment'})
  • 81. Taking Advantage of schema-less Databases for Web Development One of the things I like most about MongoDB is that it is schema-less. People tend to think of schema-less databases as a mess, which they are not. Schema-less databases do have a structure; the difference between them and schema-based ones is that schema-less structures are dynamic, which means that they can be modified at any time and that they are not typed. You can think of schema-less databases (just like MongoDB does) as JSON-based maps. These kinds of structures can be really helpful when doing web programming; in our case they let us save any kind of data in our collections and have generic structures that change over time. For example, let's try to improve our Comment model (in case we decided to have some relations).
  • 82.
      class Comment(models.Model):
          post = models.ForeignKey(Post)
          user = GenericField()
          text = models.CharField(max_length=255)

      my_user = "FlaPer87"  # Known User
      my_comment, created = Comment.objects.get_or_create(post=my_post, user=my_user, text=my_text, defaults={})

      my_user = {'nickname' : 'FlaPer87', 'full_name' : 'Flavio Percoco Premoli', 'email' : '', 'url' : ''}  # Anonymous User
      my_comment2, created = Comment.objects.get_or_create(post=my_post, user=my_user, text=my_text, defaults={})
    Using a GenericField we are able to save anything into that attribute, and we have to do our checks and controls on the code side. In this case the schema-less collection helped us to get/save the anonymous user's information without having to create a record in our Users table and without forcing the user to register.
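    The "checks and controls code side" could look roughly like the following. It is only a sketch under the assumption, taken from the example above, that a known user is stored as a plain nickname string and an anonymous user as a dict; the function name normalize_comment_user and the registered flag are invented for illustration.

```python
def normalize_comment_user(user):
    """Validate a value coming out of a generic 'user' field and give it one shape."""
    if isinstance(user, str):
        # Known user: only the nickname was stored.
        if not user:
            raise ValueError("nickname must not be empty")
        return {"nickname": user, "registered": True}
    if isinstance(user, dict):
        # Anonymous user: the details were embedded with the comment.
        if not user.get("nickname"):
            raise ValueError("anonymous user needs a nickname")
        return dict(user, registered=False)
    raise TypeError("user must be a nickname string or a dict")


print(normalize_comment_user("FlaPer87"))
print(normalize_comment_user({"nickname": "FlaPer87", "full_name": "Flavio Percoco Premoli"}))
```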
  • 83. Summary Part 2 • Re-model your models • Be Lazy to be faster • Forget about relations, they will slow you down • Remember that dynamism is better than restrictions