3. About IGN’s Social Platform. An API to connect the gamer community with editors, games, and other gamers, and to help lay the foundation for premium content discovery as well as user-generated content (UGC). In beta since Sept 2010. 5M+ activities. 20K unique visitors (UVs) a day, ~100K page views (PVs) a day.
4. Architecture. REST-based API, built in Java. Entities are People, MediaItems, Activities, Comments, Notifications, and Status. Interfaces across IGN.com as well as other social networks. Caching tier based on memcached. MySQL and MongoDB for persistence. PHP/Zend front end.
5. MongoDB Usage. Activity Streams: the ActivityStrea.ms standard. Activity Caching (more on this later!). Activity Commenting. Points: also extends to badges. Blocklists and ban lists. Notifications: system notifications. Analytics: activity snapshot for a user.
6. Alternatives: MySQL. The obvious alternative, already used for storing person data, game data, and relationships. Did not work for activities: massive joins to filter newsfeeds (i.e., activities from friends); a fairly normalized schema for activities; too many schema changes as requirements changed and new types of activities came into the picture; ALTER TABLE started to take hours; optimization led to a large number of indexes, slowing down writes.
7. Alternatives: Voldemort. Used for the initial release, Sept 2010. Fast and simple implementation of Amazon Dynamo. Did not work out for long: we needed the ability to query the data; we needed more than key-value pairs; there were no in-place updates out of the box, so we had to write custom code to handle concurrent update conflicts (read-repair). Not a lot of developer velocity compared to MongoDB.
8. Other alternatives. Cassandra: learning curve, lack of querying; we did not want to bite off more than we could chew. CouchDB: map-reduce queries and views; the REST-based API is good, but performance suffers from a chatty HTTP interface for a database.
9. Configuration. Server: 1 master, 2 slaves (load balanced through NetScaler), plus 2 extra slaves which are not queried (replicate!!); version 1.6.1. Client: Java driver (2.1), Ruby driver (1.2). Mappers: Morphia for Java. Connections per host: 200; number of hosts: 4. Oplog size: 1 GB, about 2.5 hours. Syncdelay: 60s (default). Hardware: 2-core, 6 GB virtualized machine.
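As a rough back-of-the-envelope check on the figures above, the oplog size and window imply an average write volume, and the pool settings imply a connection ceiling. This is only a sketch: it assumes 1 GB means 1024 MB, that the write rate is steady, and that "#hosts = 4" refers to application hosts each holding its own pool. The slide states none of these.

```python
# Figures taken from the slide; the interpretation of each is an assumption.
OPLOG_SIZE_MB = 1024          # 1 GB oplog, assuming 1 GB = 1024 MB
OPLOG_WINDOW_HOURS = 2.5      # time covered before the oplog wraps around
CONNECTIONS_PER_HOST = 200    # driver pool size per application host (assumed meaning)
APP_HOSTS = 4                 # assumed to be application hosts


def oplog_rate_mb_per_hour(size_mb, window_hours):
    """Average oplog write volume implied by its size and time window."""
    return size_mb / window_hours


def max_pooled_connections(per_host, hosts):
    """Upper bound on connections the database could see from all pools."""
    return per_host * hosts


rate = oplog_rate_mb_per_hour(OPLOG_SIZE_MB, OPLOG_WINDOW_HOURS)
ceiling = max_pooled_connections(CONNECTIONS_PER_HOST, APP_HOSTS)
print(f"oplog churn: about {rate:.0f} MB/hour")
print(f"connection ceiling: {ceiling}")
```

The practical reading is that a slave that stays offline longer than the oplog window (about 2.5 hours here) can no longer catch up from the oplog and needs a full resync.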
10. Maintenance. Data defragmentation: on slaves, by running it on a different port; on the master, by taking downtime. Collection trimming: the scripts block during remove; bulk removes kill the slaves, spiking CPU to 100%.
11. Monitoring. Nagios: TCP port monitoring, disk space monitoring, CPU monitoring. Munin: Mongo connections, memory usage, ops/second, write lock %, collection sizes (in number of documents).
12. Backup, or prepping for O Shit! NetApp Filer-based snapshots: make sure to do {fsync:1} and {lock:1} on one slave. Hourly dumps via cron job, using mongodump. Incremental backup via the oplog: replay the oplog instead of relying on a snapshot. Delayed slaves: not recommended, as this almost guarantees data loss proportional to the delay, which is inversely proportional to the time-to-react.
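The idea of replaying the oplog on top of an older copy of the data can be illustrated with a small in-memory sketch. The entry fields `ts`, `op`, `o`, and `o2` follow the general shape of MongoDB oplog entries, but everything here is simplified: timestamps are plain integers, an update's `o` is treated as a set of fields to overwrite, and the function name `replay` is hypothetical. It is not the deck's actual backup script.

```python
def replay(snapshot, snapshot_ts, oplog):
    """Apply oplog entries newer than snapshot_ts to a copy of the snapshot.

    snapshot: dict mapping _id -> document (a stand-in for a restored dump)
    oplog: list of entries ordered by "ts" (simplified, see lead-in)
    """
    docs = {doc_id: dict(doc) for doc_id, doc in snapshot.items()}
    for entry in oplog:
        if entry["ts"] <= snapshot_ts:
            continue  # already reflected in the snapshot
        if entry["op"] == "i":      # insert: "o" is the full document
            docs[entry["o"]["_id"]] = dict(entry["o"])
        elif entry["op"] == "u":    # update: "o2" selects, "o" holds new fields
            docs[entry["o2"]["_id"]].update(entry["o"])
        elif entry["op"] == "d":    # delete: "o" identifies the document
            docs.pop(entry["o"]["_id"], None)
    return docs


snapshot = {1: {"_id": 1, "points": 10}}
oplog = [
    {"ts": 5, "op": "i", "o": {"_id": 1, "points": 10}},   # older than snapshot
    {"ts": 11, "op": "u", "o2": {"_id": 1}, "o": {"points": 15}},
    {"ts": 12, "op": "i", "o": {"_id": 2, "points": 3}},
    {"ts": 13, "op": "d", "o": {"_id": 2}},
]
restored = replay(snapshot, snapshot_ts=10, oplog=oplog)
print(restored)
```

Because the replay can stop at any timestamp, this approach lets you recover to a point just before a bad write, which a fixed snapshot or hourly dump cannot do on its own.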
13. Tools to be familiar with. mongostat: look at queue lengths, memory, connections, and operation mix. db.serverStatus(): server status with sync, page faults, locks, and index misses. atop. iostat. db.stats(): overall info at the database level. db.<coll_name>.stats(): overall info at the collection level. db.printReplicationInfo(): info about the oplog size and time. db.printSlaveReplicationInfo(): info about the master, the last sync timestamp, and how far behind the master the slave is.
14. Challenges with ActivityStreams. Lots of data: a large amount of data comes out as a result. Reverse sorting: the data has to be sorted in reverse natural order ($natural : -1), and we do not use capped collections. Aggregation of similar activities: impacts pagination. Fetching self activities (profile) and newsfeed (self + others). Filtering based on the activity type: people want to see Game updates or Blog updates from their friends. Hydration of activities for dynamic data: the thumbnail and level of the actor may change. Comments: when an activity is rendered, the initial comments and count have to be pulled ($slice). TODO: rant about missing $size operator.
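To see why aggregating similar activities complicates pagination, consider the following in-memory sketch. It assumes a newest-first list of activities and collapses consecutive runs that share an actor and verb. The grouping rule and the names `aggregate` and `page` are illustrative assumptions, not the platform's actual logic.

```python
from itertools import groupby


def aggregate(activities):
    """Collapse consecutive activities that share the same actor and verb."""
    groups = []
    same_kind = lambda a: (a["actor"], a["verb"])
    for (actor, verb), run in groupby(activities, key=same_kind):
        groups.append({
            "actor": actor,
            "verb": verb,
            "objects": [a["object"] for a in run],
        })
    return groups


def page(activities, page_size, page_number):
    """Paginate over aggregated entries rather than raw documents."""
    grouped = aggregate(activities)
    start = page_number * page_size
    return grouped[start:start + page_size]


# Newest first, as with a reverse natural-order sort.
stream = [
    {"actor": "A", "verb": "FOLLOW", "object": "game1"},
    {"actor": "A", "verb": "FOLLOW", "object": "game2"},
    {"actor": "A", "verb": "FOLLOW", "object": "game3"},
    {"actor": "B", "verb": "POST", "object": "Hello!"},
    {"actor": "A", "verb": "FOLLOW", "object": "game4"},
]
first_page = page(stream, page_size=2, page_number=0)
second_page = page(stream, page_size=2, page_number=1)
print(len(stream), "raw activities ->", len(aggregate(stream)), "displayed entries")
```

Five raw documents collapse into three displayed entries, so a skip/limit computed on raw documents no longer lines up with what the user sees on each page. The sketch sidesteps this by aggregating before slicing, which is only workable when the candidate set is small.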
15. ActivityStreams. Each activity has an ACTOR. Each actor has a TYPE. Each actor performs an action, called a VERB. Each VERB can act upon many objects, called ACTIVITYOBJECTS. Some VERBs may involve a target, called ACTIVITYTARGET. Every entity (Actor, ActivityObject, ActivityTarget) has links to define it. Examples: (1) A writes ‘Hello!’ on B’s wall: Actor => A, ActivityObject => ‘Hello!’ of type WALL_POST, ActivityTarget => B, VERB => POST. (2) A follows a game B: Actor => A, ActivityObject => B of type MEDIA_ITEM, ActivityTarget => null, VERB => FOLLOW. ...and it gets complicated as we go down the rabbit hole!
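The actor/verb/object/target model and the two examples above can be sketched as plain data classes. The class and field names (`Entity`, `Activity`, `content`, `to_document`) and the `PERSON` type are hypothetical, chosen only for illustration; the slide does not show the real schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Entity:
    """An actor, activity object, or activity target."""
    id: str
    type: str
    content: Optional[str] = None               # e.g. the text of a wall post
    links: List[str] = field(default_factory=list)


@dataclass
class Activity:
    actor: Entity
    verb: str
    objects: List[Entity]                       # a verb can act on many objects
    target: Optional[Entity] = None             # only some verbs have a target

    def to_document(self):
        """Nested dict, roughly the shape one might store as a document."""
        return asdict(self)


# Example 1: A writes 'Hello!' on B's wall.
wall_post = Activity(
    actor=Entity(id="A", type="PERSON"),
    verb="POST",
    objects=[Entity(id="post-1", type="WALL_POST", content="Hello!")],
    target=Entity(id="B", type="PERSON"),
)

# Example 2: A follows a game B (no target).
follow = Activity(
    actor=Entity(id="A", type="PERSON"),
    verb="FOLLOW",
    objects=[Entity(id="B", type="MEDIA_ITEM")],
)

print(wall_post.to_document())
print(follow.to_document())
```

Keeping the target optional and the objects as a list mirrors the two rules on the slide: some verbs have no target, and one verb may act on several objects.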
16. Caching using MongoDB. Caching the entire streams: a bad idea (or bad implementation?); the expired objects sat in the db, bloating the database, and removal did not free up space, so we ran out. Use Mongo as a cache-key index: cache the streams in memcached and, for invalidation, keep the index of the memcached keys in MongoDB. Works!
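The cache-key-index pattern can be sketched with two dictionaries standing in for memcached and for a MongoDB collection. The class name `StreamCache`, the key format, and the per-user indexing are assumptions for illustration; the slide does not describe how the real index is keyed.

```python
class StreamCache:
    """In-memory stand-in for 'streams in memcached, key index in MongoDB'."""

    def __init__(self):
        self.memcached = {}   # stand-in for memcached: cache key -> stream
        self.key_index = {}   # stand-in for a MongoDB collection: user id -> cache keys

    def put(self, user_id, variant, stream):
        """Cache one rendering of a user's stream and record its key."""
        key = f"stream:{user_id}:{variant}"      # hypothetical key format
        self.memcached[key] = stream
        self.key_index.setdefault(user_id, set()).add(key)
        return key

    def get(self, user_id, variant):
        return self.memcached.get(f"stream:{user_id}:{variant}")

    def invalidate(self, user_id):
        """Drop every cached stream for a user by looking up its keys."""
        keys = self.key_index.pop(user_id, set())
        for key in keys:
            self.memcached.pop(key, None)
        return len(keys)


cache = StreamCache()
cache.put("u1", "newsfeed:page0", ["a3", "a2", "a1"])
cache.put("u1", "profile:page0", ["a2"])
cache.put("u2", "newsfeed:page0", ["a9"])
removed = cache.invalidate("u1")   # e.g. after u1's friend posts a new activity
print(removed, cache.get("u1", "newsfeed:page0"), cache.get("u2", "newsfeed:page0"))
```

The index exists because memcached offers no supported way to list or delete keys by pattern, so something else has to remember which keys belong to which stream. Only the small key sets live in the database, which avoids the bloat seen when whole streams were stored there.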
17. What we’ve learned. Keep an eye on: page faults, index misses, queue lengths, and database sizes on disk (freed space is reused, not released). Use .explain(): watch for nscanned and indexBounds. Use limit() when using find. While updating, try to load the object into memory so that it's in the working set (findAndModify). Keep the fields being selected to a minimum. Replicate and denormalize instead of using write concerns.
18. Near-term plans. Move to replica sets. Move relationship graphs to MongoDB. Shard the relationships based on the userId. Run multiple mongo processes, splitting collections among multiple databases.
19. Wishlist. Respect indexes in $or queries. A $size operator for arrays. $inc when doing $addToSet. Defragmentation when removing data. Concurrency: too many write lock conditions. A decent start/stop script. Load balancing in the driver (round robin) for reads.
20. We are hiring. Software engineers to help us with exciting initiatives at IGN. Technologies we use: RoR, Java (no J2EE!), Spring, PHP/Zend, jQuery, HTML5, CSS3, Sencha Touch, PhoneGap, MongoDB, memcached, Solr. http://corp.ign.com