Jeremy Edberg is an operations manager at Reddit. He can be contacted via email at jedberg@reddit.com or through his profiles on Reddit, Twitter, Facebook, LinkedIn, and his personal website www.edberg.org/jeremy where he goes by the nickname "Information Cowboy."
Every year, Americans hold a silly event called “The Super Bowl”. It is watched by millions.

There is a myth that says that during halftime, everyone goes to the bathroom, and it causes a drop in water pressure.

While this is a myth, where does it come from? It comes from the fact that oftentimes there IS an explanation for sudden operational changes.

And web operations is no exception.
There was another sporting event recently, the World Cup. Watched by possibly a billion people around the world, speaking 100 languages.

And during that event, reddit kept getting slow. Luckily, we were paying attention to world events, and knew there was a major sporting event. But why was our site getting slow?
Sure, when a goal was scored, there were a bunch of new comments, but nothing out of the ordinary. So what was going on, besides a bump in traffic?

Well, soccer (football) is a global event, and it turns out our website was not at all optimized for that many different language speakers. Luckily, we were aware of the global nature of the event, and we were able to focus on the issue.
On June 25th, 2009, Michael Jackson passed away. It was a seminal day for the internet: the story broke online, and took over an hour to be confirmed by any “traditional” media. It also happened amazingly quickly.
Within 2 minutes of his death, our servers looked like this. Alerts were going off left and right, and we scrambled to figure out what was going on.

The standard diagnostics failed us, revealing just a huge burst in traffic, but no cause. Because the story was so new, it hadn’t hit the front page of reddit yet, and we didn’t think to check our own new story queue, despite the fact that we are a “news” site.

Checking the usual news sites for an external event showed nothing.

Finally, one of us had the idea to check TMZ, an online entertainment tabloid. And sure enough, there it was -- their lead story was the death of Michael Jackson, still completely uncovered by any traditional outlet.

It was only because we read trashy entertainment tabloids that we were able to figure out the problem.
Some of the fine work of the reddit community
Another one of my favorite operations stories is the power grid management in England.

They actually have to watch TV soap operas, because when the show ends, 1.5 million people go and plug in their electric tea kettles, creating a huge surge in demand for power.

And then a few minutes later, they have to reverse everything they just did so they don’t overload the grid.
reddit loves nerdy humor too.
On September 11, 2001, a horrible tragedy occurred. Much has been said, but I want to talk about something else today, an interesting, almost funny story to come out of that day.
The eBay operations center is a windowless room in the headquarters building in San Jose. It has tens of monitors all over the walls showing graphs and stats on pretty much anything an ops manager would be interested in.

All of a sudden, the graphs started dropping. Transactions were down, bids were down, new auctions were down. It was like everyone had just stopped using the internet.

And that was exactly what had happened. Unfortunately, there was no news feed into the ops center. There was no Twitter or anything else, so if you weren’t paying attention, breaking news would pass you by.

It took over an hour before someone came into the ops room and asked if anyone wanted a break to check on their relatives in NY, and they finally realized what was happening.

The next day they had cable installed into the ops center, and set one of those monitors to CNN Headline News 24/7.
Here’s a picture of the internet I found on reddit. Who would have thought that it can fit in a box.
This is 365 Main. For those that don’t know, it is the “Web 2.0” datacenter in San Francisco. It hosts sites like Craigslist, Yelp, Typepad, and Digg (not reddit; we were next door at the time).

In San Francisco, the datacenter cooling plans assume that the outside temperature will never exceed 72°F. On a warm summer day, of course, this plan failed, and the datacenter shut down.

Of course, since all the Web 2.0 news sites that would cover such an outage were located in the building, it was difficult to get the news. A bunch of ops people all over San Francisco were scrambling to find out why their sites were suddenly completely unavailable.

Finally, word started to travel around town by mouth, folks at different companies calling each other asking if their site was down.

Eventually a mob formed at the door (shown here), and it took hours for everyone to get in and fix their machines that hadn’t booted.

For anyone who has been in a datacenter, you know how hard that would be with *every* customer there.
reddit also likes ’80s TV shows.
One last quick anecdote.

A couple of weeks ago were two of the biggest security conferences of the year, Black Hat and DEF CON.

During the conferences, there was a notable drop in attack traffic.

If we hadn’t known about the conferences, we would have thought that the attackers had come up with a totally novel way of attacking us that we couldn’t detect.
Thanks for listening!

If anyone has any questions

(next slide)
You can contact me in one of these ways.

Thank you.