This document provides an overview of MongoDB including:
- MongoDB is an open-source document database that is schemaless and document-oriented.
- It has advantages like rich querying, horizontal scalability, high availability, and flexibility in schemas.
- The document includes information on MongoDB's data model, querying capabilities, indexing, availability through replication, and scaling through sharding.
- Case studies are presented showing how companies like Mailbox, Visual China, and Youku use MongoDB for applications processing large amounts of data.
4. What’s Mongo?
MongoDB (from "humongous") is an open-source document database and the
leading NoSQL database. It is written in C++.
It is the most SQL-like of the NoSQL databases.
MongoDB is an open, schemaless, document-oriented NoSQL database with rich
queries, high performance, high availability, high scalability, and high flexibility.
5.
1. Document Data Model. Documents, BSON.
2. Rich Query Model. Full index support, various query types.
3. Idiomatic Drivers. Drivers for more than 17 languages.
4. Horizontal Scalability. Easy to add capacity.
5. High Availability. HA, journaling, automatic recovery.
6. In-Memory Performance. Memory-mapped files, reads/writes in RAM.
7. Flexibility. Schema-free, multi-datacenter deployments, tunable consistency, widely
used across many industries.
10. Query Type
1. Key-value queries.
2. Range queries.
3. Text search: AND, OR, NOT, etc.
4. Aggregation: count, min, max, average, etc.
5. MapReduce.
11. Cursor
A query returns a cursor.
Iterate the cursor to get the results.
By default the first batch holds 101 documents or about 1 MB of data, whichever limit is reached first.
batchSize or limit overrides this default; a batch never exceeds 16 MB.
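The batching rule on this slide can be sketched as a small model. This is a simplified illustration in plain Python, not driver or server code: the function name `first_batch`, the assumption that the document crossing the size threshold is still included, and the choice to apply the 16 MB cap whenever `batch_size` is given are all assumptions made for the example. `limit` is left out for brevity.

```python
from typing import Iterable, List, Optional

DEFAULT_FIRST_BATCH_DOCS = 101             # default document cap from the slide
DEFAULT_FIRST_BATCH_BYTES = 1024 * 1024    # roughly 1 MB, from the slide
MAX_BATCH_BYTES = 16 * 1024 * 1024         # upper bound from the slide


def first_batch(doc_sizes: Iterable[int],
                batch_size: Optional[int] = None) -> List[int]:
    """Return the sizes of the documents that would land in the first batch.

    Hypothetical model: documents are added until the document cap is
    reached or the cumulative size passes the byte cap.
    """
    if batch_size is None:
        max_docs, max_bytes = DEFAULT_FIRST_BATCH_DOCS, DEFAULT_FIRST_BATCH_BYTES
    else:
        # Assumption: an explicit batch size replaces both defaults and
        # leaves only the 16 MB ceiling in place.
        max_docs, max_bytes = batch_size, MAX_BATCH_BYTES

    batch: List[int] = []
    total = 0
    for size in doc_sizes:
        if len(batch) >= max_docs:
            break
        batch.append(size)
        total += size
        if total > max_bytes:
            break
    return batch


small = first_batch([1500] * 1000)              # 1500-byte documents
large = first_batch([100 * 1024] * 1000)        # 100 KB documents
custom = first_batch([1500] * 1000, batch_size=500)
print(len(small), len(large), len(custom))
```

With small documents the document cap is hit first; with large documents the size cap ends the batch earlier.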
14. Index
Index types:
1. Single Field Indexes
2. Compound Indexes
3. Array Indexes
4. Geospatial Indexes
5. Hashed Indexes
6. Text Search Indexes (V2.4, Beta)
Index properties:
1. Unique Indexes
2. Sparse Indexes
15. Index
Each index takes at least 8 KB.
Indexes have a negative performance impact on write operations, so they are expensive for
collections with a high write-to-read ratio.
They benefit collections with a high read-to-write ratio.
Indexes consume disk space and memory, so track and plan them carefully.
23. Basic Concepts
• Config Servers: exist in sets of three; maintain metadata; are mongod instances
• Shards: contain fractions of the global data; are replica sets in production; can be queried directly by clients (not recommended)
• Replica Set: a group of mongod processes; includes a Primary and Secondaries
• Mongos: process app requests; direct requests to shards; direct results to clients; exist as 1+; are mongos instances; cache metadata
29. Schema Design
• Remember, "schemaless" doesn't mean you don't need to design your schema!
• Considerations to avoid the pitfalls of MongoDB schema design:
1. Avoid growing documents
3. Pay attention to BSON data types
5. Field names take up space
6. Consider using _id for your own purposes
7. Can you use covered indexes?
8. Use collections and databases to your advantage
• Test everything
• Schema design affects performance
• Schema design affects infrastructure: RAM > indexes + hot data = better performance
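The rule of thumb "RAM > indexes + hot data" can be checked with simple arithmetic. The sketch below is only an illustration: the function names and the sample sizes are made up for the example, and in practice the inputs would come from measurements such as the index size reported by db.stats().

```python
GIB = 1024 ** 3  # bytes per gibibyte


def working_set_fits(ram_bytes: int, index_bytes: int, hot_data_bytes: int) -> bool:
    """True when RAM is larger than indexes plus hot data (the slide's rule)."""
    return ram_bytes > index_bytes + hot_data_bytes


def ram_headroom(ram_bytes: int, index_bytes: int, hot_data_bytes: int) -> int:
    """Bytes of RAM left over; a negative value means the working set spills to disk."""
    return ram_bytes - (index_bytes + hot_data_bytes)


# Illustrative numbers only: 64 GiB of RAM, 12 GiB of indexes, 40 GiB of hot data.
print(working_set_fits(64 * GIB, 12 * GIB, 40 * GIB))
print(ram_headroom(64 * GIB, 12 * GIB, 40 * GIB) // GIB, "GiB of headroom")
```

Shorter field names and smaller BSON types shrink both terms on the right-hand side, which is why the schema considerations above affect hardware sizing.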
30. MongoDB for MDS – Sharding Strategy
• When do you need to shard?
– Your data set approaches or exceeds the storage capacity of a single MongoDB instance.
– The size of your system’s active working set will soon exceed the capacity of your system’s maximum RAM.
– A single MongoDB instance cannot meet the demands of your write operations, and all other approaches have not
reduced contention.
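The three conditions above can be written as a simple check. This is a hypothetical sketch: the `InstanceMetrics` fields, the `reasons_to_shard` helper, and the 80% threshold used to interpret "approaches" are assumptions for illustration, not values from the slides or from MongoDB.

```python
from dataclasses import dataclass
from typing import List

NEAR_LIMIT = 0.8  # assumption: "approaches" means 80% of capacity


@dataclass
class InstanceMetrics:
    data_bytes: int                # current data set size
    storage_capacity_bytes: int    # storage available to one instance
    working_set_bytes: int         # active working set
    ram_bytes: int                 # maximum RAM of the system
    write_ops_per_sec: float       # observed write demand
    max_write_ops_per_sec: float   # what one instance can sustain


def reasons_to_shard(m: InstanceMetrics, near_limit: float = NEAR_LIMIT) -> List[str]:
    """Return which of the three slide conditions currently hold."""
    reasons = []
    if m.data_bytes >= near_limit * m.storage_capacity_bytes:
        reasons.append("storage")
    if m.working_set_bytes >= near_limit * m.ram_bytes:
        reasons.append("working set")
    if m.write_ops_per_sec > m.max_write_ops_per_sec:
        reasons.append("writes")
    return reasons


GB = 10 ** 9
# Illustrative numbers only.
metrics = InstanceMetrics(900 * GB, 1000 * GB, 20 * GB, 64 * GB, 500.0, 2000.0)
print(reasons_to_shard(metrics))
```

An empty list means none of the conditions hold yet, so sharding can wait.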
• Considerations for sharding
– There are multiple ways to model a domain problem
– Understand the key use cases of your app
– Balance ease of query against ease of write
– Avoid random I/O
• Meeting behavior and sharding considerations (from 10G)
– Scheduled meetings: ~800K meeting writes/day
– ~20% instant meetings
– Scalability best practice: don’t scale by using replication; scale by using local read nodes.
– Recommendation: implement local writes to meet the requirements of the JOIN-meeting use case
31. Cross DC latency Testing
Local vs. remote write/read latency test:
Scenario:
Create two shards, each a three-member replica set. Make sure that the Primary of one runs in the local DC (SJ), whereas the Primary
of the second runs in the remote DC (TX). Run a small number of writes from the local DC against the Replica1 Primary and then run the same against
the Replica2 Primary. Write concern = majority. Average object size is 1500 bytes. (Ping time from the local DC (SJ) to the remote DC (TX) is 46 ms.)
Local vs. remote insert tests (YCSB test):
32. Replication delay cross DC
• Replication lag between data centers:
• Scenario: In the local DC (SJ), where the replication Primary is running, insert 500 records at a time, up to a total of 550,000 records.
Record the record count and the current timestamp at the end of every 500 insertions. Note that this is a single-threaded operation: only
one process is inserting these records. In the remote DC (TX), where the third secondary is running (this node is the farthest of all
the secondaries and so is not part of the initial write), call db.collection.count() in a loop, and whenever the count is a
multiple of 500, record the count and the current timestamp. From the data collected on the Primary and the remote secondary, compute the
replication delay.
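The final step, computing the delay from the two logs, might look like the sketch below. The function name, the log format (pairs of record count and timestamp in milliseconds), and the sample values are assumptions for illustration. The sketch also assumes the clocks in both data centers are synchronized, since the delay is a difference between timestamps taken on different machines.

```python
from typing import Dict, List, Tuple

LogEntry = Tuple[int, int]  # (record count, timestamp in milliseconds)


def replication_delays(primary_log: List[LogEntry],
                       secondary_log: List[LogEntry]) -> Dict[int, int]:
    """Map each record count to the delay between Primary and secondary.

    Only counts seen on both sides are reported; the first sighting on
    the secondary is used if a count was logged more than once.
    """
    primary_ts = dict(primary_log)
    delays: Dict[int, int] = {}
    for count, ts in secondary_log:
        if count in primary_ts and count not in delays:
            delays[count] = ts - primary_ts[count]
    return delays


# Made-up sample data, not measurements from the test described above.
primary = [(500, 1000), (1000, 1400), (1500, 1800)]
secondary = [(500, 1060), (1000, 1475)]  # the polling loop missed count 1500

delays = replication_delays(primary, secondary)
print(delays)
print(sum(delays.values()) / len(delays), "ms average delay")
```

Because the secondary is polled in a loop, some multiples of 500 can be missed; the sketch simply skips those counts.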
33. MongoDB for MDS – Sharding
Goals:
- write to a shard primary node with physical proximity to the application server
- keep the shard primary node in close proximity to the application server [monitor the primary node of the replica set and if possible, restore the primary t
- reduce 'scatter/gather' on reads - use smart shard keys
Solution:
Add a geo-location based field in the schema, create a shard index based on that field, assign a tag to each shard and assign specific shard index field ra
e.g., Say we can add a 'DC' field into our collection. Assuming that the application somehow knows the data center it is running on, it can use this value for
Associate the tag ranges with the specific tagged shards.
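The idea of routing by a 'DC' field through tag ranges can be modeled in a few lines. This is a plain-Python illustration of the concept, not MongoDB code: the shard names, tag names, and range bounds are hypothetical, and the DC values SJ and TX are borrowed from the latency test slides. In a real cluster the equivalent setup would be done with the shell helpers sh.addShardTag and sh.addTagRange.

```python
from typing import Dict, List, Optional, Tuple

# Each range covers shard-key values from low (inclusive) to high (exclusive).
TagRange = Tuple[str, str, str]  # (low, high, tag)

# Hypothetical configuration: one tag per data center.
TAG_RANGES: List[TagRange] = [
    ("SJ", "SK", "sj"),   # every key whose DC value is exactly "SJ"
    ("TX", "TY", "tx"),   # every key whose DC value is exactly "TX"
]
SHARD_TAGS: Dict[str, str] = {"shard0": "sj", "shard1": "tx"}


def shard_for(dc: str) -> Optional[str]:
    """Return the shard whose tag range contains the given DC value."""
    for low, high, tag in TAG_RANGES:
        if low <= dc < high:
            for shard, shard_tag in SHARD_TAGS.items():
                if shard_tag == tag:
                    return shard
    return None  # no tag range matches this DC value


# The application stamps each document with the DC it runs in.
doc = {"DC": "SJ", "meeting": "weekly sync"}
print(shard_for(doc["DC"]))
```

Writes from an application server in SJ then land on the shard tagged for SJ, which is the local-write goal stated above.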
Inferred Technical Requirements
1. MongoDB sharding (shard keys: region + siteId + userId, region + siteId + meetingUUID) to support 3 regions
(US, EMEA, APAC)
2. Sharding by siteId + userId or siteId + meetingUUID allows hosts from the same company (siteId) and the same region
to create meetings in different shards. If we need to scale horizontally, the shard configuration adds another shard
for the same siteId
3. Based on the shard keys, we can support the requirements for local writes and local reads
4. Replication requirement: replicate 600,000 meetings/day within 15 minutes between 2 nodes (remark: early
benchmarking shows 11M meetings' data replicated across 3 sites within 4 minutes)
5. Availability requirement: a primary node fails over to a secondary node within the same data center in <= 30
sec; a primary node fails over to a secondary in a different data center in <= 10 minutes
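34. MongoDB Use Cases
• BillRun billing system
Ofer Cohen released BillRun, a next-generation open-source billing solution that uses MongoDB as its back-end store. The billing system already runs in production at Israel's fastest-growing mobile
operator and processes more than 500M call data records (CDRs) per month.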
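• Visual China
Stores comments, feeds, and full-text search
Problems:
Fail-over did not work because the replica set was not configured correctly; it needs at least 1 primary + 2 secondaries + n arbiters.
Out-of-memory errors caused crashes. Fix: add memory and use the correct driver (not a development build)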
• Youku
Youku's online comment service has been partially migrated to MongoDB; operational data analysis and mining currently use Hadoop/HBase.
• Qihoo 360
Documents > 100 million
Problem: timeouts (data larger than memory, random reads and writes, chunk-move time)
Solution: add memory (or even use SSDs) and save space (schema refactoring); schedule the balancer to avoid peak hours
• Mailbox
100 million messages per day; email and related data are stored in MongoDB
https://tech.dropbox.com/2013/09/scaling-mongodb-at-mailbox/
Lesson: write lock contention. Solution: move the hot collection to a standalone cluster; sharding
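• Other
Baidu Open Cloud, Cloud Database: the non-relational database offering uses MongoDB, and many small and medium-sized developers build on MongoDB
Amazon E2: MongoDB as the back-end database, if the application data on it
MongoDB originated in 2007 as a project at 10gen. The goal of the project was to create a PaaS platform similar to Google App Engine that would manage software and hardware infrastructure automatically so that developers could concentrate on program design. This also took away much of the developers' autonomy, and the response was not very good. The original PaaS platform consisted of an application service and a database; when it turned out that people were more interested in the database, the company focused on the database part, which is today's MongoDB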
Dwight Merriman & Kevin Ryan
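MongoDB became a rising star among big data startups in 2013. The company, founded in 2007, recently raised US$231 million in funding and thereby became the first open-source startup valued at more than US$1 billion. The industry currently values the company at as much as US$1.2 billion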
To support hash-based sharding, MongoDB provides a hashed index type.
The sparse property of an index ensures that the index contains entries only for documents that have the indexed field.
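Both notes can be illustrated with a small plain-Python model. This is a conceptual sketch, not how MongoDB implements either feature: the function names are hypothetical, MD5 with a modulo is only a stand-in for the server's hash function and chunk ranges, and the sample documents are made up.

```python
import hashlib
from typing import Any, Dict, List


def hashed_shard(key: Any, num_shards: int) -> int:
    """Pick a shard from a hash of the key (stand-in for hash-based sharding)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_sparse_index(docs: List[Dict[str, Any]], field: str) -> Dict[Any, List[int]]:
    """Index only the documents that have the field, as a sparse index does."""
    index: Dict[Any, List[int]] = {}
    for position, doc in enumerate(docs):
        if field in doc:  # documents without the field get no index entry
            index.setdefault(doc[field], []).append(position)
    return index


docs = [
    {"_id": 1, "email": "a@example.com"},
    {"_id": 2},                              # no email field
    {"_id": 3, "email": "b@example.com"},
]

print(build_sparse_index(docs, "email"))
# Sequential keys spread across shards instead of piling onto one.
print([hashed_shard(doc["_id"], 4) for doc in docs])
```

The second document has no entry in the sparse index, so the index stays smaller than one that records a missing value for every document.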