Building Lanyrd

How we built and scaled Lanyrd, using Python, Django, Solr, r

  1. Building Lanyrd. Simon Willison, BrightonPy, 9th August 2011. http://lanyrd.com/sgptt
  2. Lanyrd.com: Definitive database of professional events and speakers
  3. Lanyrd.com: Definitive database of professional events and speakers
     • Social event recommendation
     • Comprehensive speaker profiles
     • Archive of slides, notes and video
  4. A brief history
  5. Casablanca! August 2010
  6. • Aug 31st, 11:22: Launch! (1 Linode)
     • Aug 31st, 12:41: Unlaunch
     • Aug 31st, 12:54: Read only mode
     • Aug 31st, 14:15: DB server (2 Linodes)
     • Sep 1st: Limit 50 on dashboard
     • Sep 1st: disable-dashboard setting
  7. • Sep 3rd: dConstruct (and Twitter bot)
     • Sep 4th: TechCrunched (read only :( )
     • Sep 5th: 3 large EC2 + 1 RDS
     • Sep 6th: Downgrade to 3 small EC2
  8. December (photo: @niqui)
  9. • Dec 8: Calacanis + Scoble at the same time!
       • Upgrade to next size of RDS
       • (Sometimes scaling vertically does the job)
  10. • Jan 26th: Solr-powered dashboard
        • Replicated to 2, then 3 servers
  11. Architecture diagram, components as labelled:
      • lanyrd.com, badges.lanyrd.net
      • Load balancer (nginx)
      • HTTP cache (varnish)
      • 3 app servers (django/mod_wsgi)
      • Database (MySQL RDS)
      • 1 search master and 2 search slaves (solr)
      • Redis (data structures + message queue)
      • logging (MongoDB)
      • 2 workers (celery)
  12. Solr + Haystack
  13. [Screenshot: the Apache Solr website. The most recent news item shown is "May 2011 - Solr 3.2 Released".]
  14. [Screenshot: the Haystack website. Haystack lets you write your search code once and choose the search engine you want it to run on. It is BSD licensed and supports Solr, Whoosh and Xapian. Features listed: More Like This, faceting, stored (non-indexed) fields, highlighting, spelling suggestions, boost.]
  15. Model-oriented search
      • Define search_indexes.py (like admin.py) for your application
      • Hook up default haystack search views
      • Write a quick search.html template
      • Run ./manage.py rebuild_index
  16. [Screenshot: Lanyrd search results for “django” (3 results), with facets to filter by type, topic and place. Current filters: type Sessions, topic NoSQL, place United States.]
  17. from haystack import indexes

      class BookIndex(indexes.SearchIndex):
          text = indexes.CharField(document=True, use_template=True)
          speakers = indexes.MultiValueField()
          topics = indexes.MultiValueField()

          # Haystack calls prepare_<field>() to build each field's value
          def prepare_speakers(self, obj):
              return [a.user.t_id for a in obj.authors.exclude(
                  user = None
              ).select_related('user')]

          def prepare_topics(self, obj):
              return list(obj.topics.values_list('pk', flat=True))
  18. search/indexes/books/book_text.txt

      {{ object.title }}
      {{ object.tagline }}
      {% for author in object.authors.all %}
        {{ author.display_name }}
        {{ author.user.t_screen_name }}
      {% endfor %}
      {% for topic in object.topics.all %}
        {{ topic.name_en }}
      {% endfor %}
  19. Staying fresh
      • Search engines usually don’t like accepting writes too frequently
        • RealTimeSearchIndex for low traffic sites
      • ./manage.py update_index --age=6 (hours)
        • Uses index.get_updated_field()
      • Roll your own (message queue or similar...)
  20. Replication (diagram: one Solr Master replicating to three Solr Slaves)
  21. Smarter indexing

      class Article(models.Model):
          needs_indexing = models.BooleanField(
              default = True, db_index = True
          )
          ...
          def save(self, *args, **kwargs):
              self.needs_indexing = True
              super(Article, self).save(*args, **kwargs)
  22. # Fragment from inside a function, hence the bare return
      index = site.get_index(model)
      updated_pks = []
      objects = index.load_all_queryset().filter(
          needs_indexing=True
      )[:100]
      if not objects:
          return
      for object in objects:
          updated_pks.append(object.pk)
          index.update_object(object)
      index.load_all_queryset().filter(
          pk__in = updated_pks
      ).update(needs_indexing = False)
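The batch loop on this slide can be tried without Django or Solr. The following is only a sketch under stated assumptions: `drain_batch`, the dict-based records and the `index_one` callback are hypothetical stand-ins for the queryset and `index.update_object()`.

```python
def drain_batch(records, index_one, batch_size=100):
    """Index up to batch_size flagged records, then clear their flags.

    Returns the number of records processed (0 means nothing was pending).
    """
    # Equivalent of .filter(needs_indexing=True)[:100]
    pending = [r for r in records if r["needs_indexing"]][:batch_size]
    updated_pks = []
    for record in pending:
        index_one(record)                  # stand-in for index.update_object()
        updated_pks.append(record["pk"])
    # Equivalent of .filter(pk__in=updated_pks).update(needs_indexing=False)
    done = set(updated_pks)
    for record in records:
        if record["pk"] in done:
            record["needs_indexing"] = False
    return len(pending)
```

Calling `drain_batch` repeatedly works through the backlog in bounded chunks, so the search engine receives a limited number of writes per run.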
  23. nginx + Solr replication trick

      # server addresses are omitted on the slide
      upstream solrmaster {
          server;
      }
      upstream solrslaves {
          server;
          server;
          server;
      }
      server {
          listen 8983;
          location /solr/update {
              proxy_pass http://solrmaster;
          }
          location /solr/select {
              proxy_pass http://solrslaves;
          }
      }
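The routing rule in this nginx config can be expressed as a small function: writes go to the master, reads rotate across the slaves. This is a sketch only. The host names are hypothetical (the slide omits the real addresses), and the rotation imitates nginx's default round-robin balancing for an upstream group.

```python
import itertools

# Hypothetical host names; the slide does not show the real ones.
SOLR_MASTER = "solr-master.example:8983"
SOLR_SLAVES = ["solr-slave-1.example:8983", "solr-slave-2.example:8983"]

_slave_cycle = itertools.cycle(SOLR_SLAVES)

def pick_backend(path):
    """Return the backend that should serve a Solr request path."""
    if path.startswith("/solr/update"):
        return SOLR_MASTER            # all writes go to the master
    if path.startswith("/solr/select"):
        return next(_slave_cycle)     # reads rotate across the slaves
    raise ValueError("unrouted path: %r" % path)
```

The application can then talk to a single local port for both reads and writes, and the proxy decides which Solr instance handles each request.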
  24. [Screenshot: the signed-in Lanyrd calendar page showing "Your contacts' calendar": "We've found 182 conferences your Twitter contacts are interested in". Each event lists how many contacts are speaking or tracking.]
  25. # Original implementation
      twitter_ids = [11134, 223455, 33221, ...] # fetch from Twitter
      attendees = Attendee.objects.filter(
          user__t_id__in = twitter_ids
      ).filter(
          conference__start_date__gte = datetime.date.today()
      )
  26. # Current implementation
      twitter_ids = [11134, 223455, 33221, ...] # fetch from Twitter
      sqs = SearchQuerySet()
      sqs = sqs.models(Conference)
      # join() needs strings, so convert the numeric IDs first
      or_string = ' OR '.join(map(str, twitter_ids))
      sqs = sqs.narrow('attendees:(%s)' % or_string)
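The filter string handed to `narrow()` can be built and inspected on its own. A minimal sketch, assuming numeric Twitter IDs; the helper name `attendees_filter` and its `field` parameter are hypothetical.

```python
def attendees_filter(twitter_ids, field="attendees"):
    """Build a Solr filter such as 'attendees:(1 OR 2 OR 3)'."""
    if not twitter_ids:
        raise ValueError("need at least one id")
    # IDs are converted to strings because str.join() rejects integers
    or_string = " OR ".join(str(t_id) for t_id in twitter_ids)
    return "%s:(%s)" % (field, or_string)

query = attendees_filter([11134, 223455, 33221])
```

Here `query` is the string that would be passed to `sqs.narrow(...)`.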
  27. Redis
  28. [Screenshot: the Redis website. "Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets." Latest stable version shown: Redis 2.2.10.]
  29. simonw-follows: {144, 21345, 12328...}
      europython-attendees: {344, 21345, 787...}

      contact_ids = redis.sinter(
          'simonw-follows', 'europython-attendees'
      )
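What `SINTER` computes here can be shown with plain Python sets. This sketch uses a dict as a stand-in for Redis and the truncated example IDs from the slide. It illustrates the set operation only and does not talk to a Redis server.

```python
# Stand-in for Redis: a dict of named sets (IDs copied from the slide, truncated).
store = {
    "simonw-follows": {144, 21345, 12328},
    "europython-attendees": {344, 21345, 787},
}

def sinter(*keys):
    """Mimic Redis SINTER: intersect the sets stored at the given keys.

    A missing key behaves like an empty set, as it does in Redis.
    """
    sets = [store.get(key, set()) for key in keys]
    return set.intersection(*sets) if sets else set()

contact_ids = sinter("simonw-follows", "europython-attendees")
```

The result holds the IDs present in both sets, that is, the people simonw follows who are also attending EuroPython.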
  30. [Screenshot: the Lanyrd page for EuroPython 2011, the European Python Conference (Florence, Italy, 19th–26th June 2011): 97 people attending, 80 tracking, 119 speakers, with a "You're speaking at this event" banner.]
  31. Celery
  32. [Screenshot: the Celery website. "Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well." Latest news shown: Celery 2.2 released, 2011-02-01.]
  33. Tasks?
      • Anything that takes more than about 200ms
        • Updating a search index
        • Resizing images
        • Hitting external APIs
        • Generating reports
  34. Trivial example
      • Fetch the content of a web page

      import urllib
      from celery.task import task

      @task
      def fetch_url(url):
          return urllib.urlopen(url).read()

      >>> result = fetch_url.delay('http://cnn.com/')
      >>> html = result.wait()
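The `delay()` then `wait()` pattern can be tried without a broker or worker. The sketch below swaps Celery for a standard-library thread pool, so it only imitates the calling convention. The `Task` class, the `task` decorator and `word_count` are hypothetical names, not Celery's API.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=2)  # stand-in for Celery workers

class Task(object):
    """Wrap a function so it can run inline or in the background."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        return self.func(*args, **kwargs)                 # synchronous call

    def delay(self, *args, **kwargs):
        return _pool.submit(self.func, *args, **kwargs)   # background call

def task(func):
    return Task(func)

@task
def word_count(text):
    return len(text.split())

future = word_count.delay("how we built and scaled Lanyrd")
count = future.result(timeout=5)   # blocks until the background call finishes
```

Unlike this in-process pool, Celery sends the call through a message queue to a separate worker process, which is what lets slow work move off the web servers.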
  35. [Screenshot: the Lanyrd session page for "Python and MongoDB tutorial" by Andreas Jung at EuroPython 2011 (20th June 2011, 14:30–18:30 CET), with an "Add coverage to this session" field containing a SlideShare URL.]
  36. [Screenshot: the "Add coverage" form for that SlideShare URL, showing the link title, the type of coverage (Link, Audio, Liveblog, Write-up, Sketch notes, Photos, Slides, Transcript, Notes, Video, Handout) and a coverage preview from SlideShare.]
  37. The task itself...
      • Tries using http://embed.ly/ to find a preview
      • Fetches the HTTP headers and first 2048 bytes
      • If HTML, attempts to extract the <title>
      • If other, gets the file type and size from headers
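The "extract the <title> from the first 2048 bytes" step might look roughly like this. It is a sketch using the standard-library HTML parser, not the code Lanyrd ran. `TitleParser` and `extract_title` are hypothetical names, and UTF-8 decoding is an assumption.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def extract_title(body, limit=2048):
    """Return the <title> text found in the first `limit` bytes, or None."""
    parser = TitleParser()
    parser.feed(body[:limit].decode("utf-8", errors="replace"))
    parser.close()  # flush any buffered text
    title = " ".join("".join(parser.parts).split())  # collapse whitespace
    return title or None
```

Reading only a fixed prefix keeps the task cheap for large files, at the cost of missing a title that appears later in the document.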
  38. Behind the scenes...

      ar = enhance_link.delay(url)
      poll_url = '/working/%s/' % signed.dumps({
          'task_id': ar.task_id,
          'on_done_url': on_done_url,
      })
      if 'ajax' in request.POST:
          return render_json(request, {
              'ok': True,
              'poll_url': poll_url,
          })
      else:
          return HttpResponseRedirect(poll_url)
  39. And when it’s done...

      from celery.backends import default_backend
      ...
      task_id = request.REQUEST.get('id', '')
      result = default_backend.get_result(task_id)
  40. Configuration

      # Carrot / Celery: queue uses Redis
      CARROT_BACKEND = "ghettoq.taproot.Redis"
      BROKER_HOST = "" # redis server
      BROKER_PORT = 6379
      BROKER_VHOST = "6"
      # Task results stored in memcached, so they can
      # expire automatically
      CELERY_RESULT_BACKEND = "cache"
      CELERY_CACHE_BACKEND = "memcached://;..."
  41. Tricks
  42. Phantom load testing
      • Deploy a new architecture on a brand new EC2 cluster
      • Leave your existing site on the old cluster
      • Invisibly link to the new stack from an <img width=1 height=1> element on your live site (not for very long though)
      • (sensible alternative: find a way to replay log files)
  43. cache_version
  44. [Screenshot: the Lanyrd "Django conferences" topic page, listing events (EuroPython 2011, DjangoCon US 2011, PyCON FR 2011, PyCon DE 2011) next to coverage counts: 52 videos, 52 slide decks, 3 audio clips, 27 write-ups, 11 handouts, 3 notes.]
  45. class Conference(models.Model):
          ...
          cache_version = models.IntegerField(default = 0)

          def save(self, *args, **kwargs):
              self.cache_version += 1
              super(Conference, self).save(*args, **kwargs)

          def touch(self):
              Conference.objects.filter(pk = self.pk).update(
                  cache_version = F('cache_version') + 1
              )
  46. {% cache 36000 conf-topics conference.pk conference.cache_version %}
        <ul class="tags inline-tags meta">
          {% for topic in conference.topics.all %}
            <li><a href="{{ topic.get_absolute_url }}">{{ topic }}</a></li>
          {% endfor %}
        </ul>
      {% endcache %}
  47. Bulk invalidation

      from django.db.models import F
      topic.conferences.all().update(
          cache_version = F('cache_version') + 1
      )
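Why bumping `cache_version` invalidates a cached fragment can be seen with a dict standing in for memcached. This is an illustrative sketch. `fragment_key`, `cached_fragment` and the dict-based conference are hypothetical, and the key layout is not the one Django's `{% cache %}` tag generates.

```python
cache = {}  # stand-in for memcached

def fragment_key(name, pk, cache_version):
    return "%s:%s:%s" % (name, pk, cache_version)

def cached_fragment(name, conference, render):
    """Return the cached rendering for this version, rendering on a miss."""
    key = fragment_key(name, conference["pk"], conference["cache_version"])
    if key not in cache:
        cache[key] = render(conference)
    return cache[key]

def render_topics(conf):
    return ", ".join(conf["topics"])

conference = {"pk": 1, "cache_version": 0, "topics": ["Django"]}

first = cached_fragment("conf-topics", conference, render_topics)
conference["topics"].append("Python")
stale = cached_fragment("conf-topics", conference, render_topics)  # same key, old output
conference["cache_version"] += 1                                   # what save() and touch() do
fresh = cached_fragment("conf-topics", conference, render_topics)  # new key, re-rendered
```

Nothing is ever deleted explicitly. Entries under old version numbers are simply never requested again and are left to expire or be evicted.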
  48. Signing
  49. Pass data through an untrusted source with confidence that it hasn't been tampered with
  50. Signing uses
      • "Unsubscribe" links in emails
        • lanyrd.com/un/ImN6VyI.ii0Hwm7p71DEcGfaVzziQaxeuu
      • ?redirect_to= URL protection
      • Signed cookies
        • "You are logged in as simonw" without hitting the database
  51. Signing in Django 1.4

      from django.core import signing
      signing.dumps({"foo": "bar"})
      signing.loads(signed_string)
      response.set_signed_cookie(key, value...)
      # signed cookies are read back from the request
      request.get_signed_cookie(key)
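The idea behind signing can be sketched with the standard library alone: serialise the value, append an HMAC computed with a server-side secret, and reject anything whose HMAC does not match. This is not Django's implementation or token format. `SECRET_KEY`, `dumps` and `loads` below are illustrative stand-ins.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"not-a-real-secret"  # hypothetical; keep the real one private

def _b64(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def _unb64(text):
    padded = text + "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(padded.encode("ascii"))

def _signature(payload):
    digest = hmac.new(SECRET_KEY, payload.encode("ascii"), hashlib.sha256).digest()
    return _b64(digest)

def dumps(obj):
    """Serialise obj and append a signature: '<payload>.<signature>'."""
    payload = _b64(json.dumps(obj, sort_keys=True).encode("utf-8"))
    return "%s.%s" % (payload, _signature(payload))

def loads(signed):
    """Verify the signature and return the original object."""
    payload, _, sig = signed.rpartition(".")
    if not hmac.compare_digest(sig, _signature(payload)):
        raise ValueError("bad signature")
    return json.loads(_unb64(payload).decode("utf-8"))
```

The payload is readable by anyone who holds the token, so signing protects integrity, not secrecy.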
  52. Hashed static asset filenames in S3/CloudFront
  53. global.js → global.ed81d119.js → cdn.lanyrd.net/js/global.ed81d119.js
  54. Benefits
      • Far-future expiry headers
        • Cache-Control: max-age=315360000
        • Expires: Fri, 18 Jun 2021 06:45:00 -0000 GMT
      • Guaranteed updated CSS in IE
      • Deploy new assets in advance of application
      • Old versions stick around for rollbacks
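The `max-age` value on this slide is ten 365-day years expressed in seconds. The sketch below derives both headers from that duration. The function name and the start time are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

FAR_FUTURE = timedelta(days=3650)  # ten 365-day years

def far_future_headers(now):
    """Build caching headers that expire FAR_FUTURE after `now` (UTC)."""
    max_age = int(FAR_FUTURE.total_seconds())
    return {
        "Cache-Control": "max-age=%d" % max_age,
        "Expires": format_datetime(now + FAR_FUTURE, usegmt=True),
    }

# Assumed start time, for illustration only.
headers = far_future_headers(datetime(2011, 6, 21, 6, 45, tzinfo=timezone.utc))
```

Headers like these are only safe because a changed file gets a new name, so a cached copy can never be out of date.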
  55. ./manage.py push_static
      • Minifies JavaScript and CSS
      • Renames files to include sha1(contents)[:6]
      • Pushes all assets to S3
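The renaming step can be sketched as a pure function. `hashed_name` is a hypothetical helper, not the actual `push_static` code. The digest length is a parameter because the slides show `[:6]` here but an eight-character hash in the `global.ed81d119.js` example.

```python
import hashlib
import posixpath

def hashed_name(path, contents, length=6):
    """Insert the first `length` hex chars of sha1(contents) before the extension."""
    digest = hashlib.sha1(contents).hexdigest()[:length]
    root, ext = posixpath.splitext(path)
    return "%s.%s%s" % (root, digest, ext)

name_v1 = hashed_name("js/global.js", b"var a = 1;")
name_v2 = hashed_name("js/global.js", b"var a = 2;")
```

Because the name depends only on the file contents, unchanged files keep their URL between deploys and changed files get a fresh one.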
  56. Profiling and debugging production systems
  57. UserBasedExceptionMiddleware

      from django.views.debug import technical_500_response
      import sys

      class UserBasedExceptionMiddleware(object):
          def process_exception(self, request, exception):
              if request.user.is_superuser:
                  return technical_500_response(request, *sys.exc_info())
  58. mysql-proxy
      • Very handy Lua-customisable proxy for all of your MySQL traffic
      • Worst documented software ever
      • log.lua - logs out ALL queries
        • https://gist.github.com/1039751
  59. django_instrumented
      • (Unreleased) code I wrote for Lanyrd
      • Collects various runtime stats about the current request, stashes a profile JSON in memcached
      • Writes out the profile UUID as part of the HTML
      • A bookmarklet to view the profile
  60. MongoDB logging
      • Super-fast inserts, log everything!
      • Capped collections
      • Structured queries
      • Ask me about it in a few months
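A capped collection behaves like a fixed-size ring buffer: once it is full, each insert pushes out the oldest document. The sketch below imitates that with `collections.deque` and adds a tiny equality-based query. `CappedLog` is a hypothetical stand-in, and it caps by document count where MongoDB caps primarily by size in bytes.

```python
from collections import deque

class CappedLog(object):
    """In-memory imitation of a capped collection of log documents."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)  # oldest entries fall off the front

    def insert(self, doc):
        self._docs.append(doc)

    def find(self, **criteria):
        """Return documents whose fields equal all the given criteria."""
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

log = CappedLog(max_docs=3)
for n, status in enumerate([200, 500, 200, 404, 500]):
    log.insert({"n": n, "status": status})

errors = log.find(status=500)
```

After five inserts only the last three documents remain, so the earlier 500 response has already been discarded.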
  61. For the future...
      • Much better profiling, monitoring and alerts
      • Varnish in front of everything
      • Replicated MySQL for analytics + upgrades
  62. Questions?
  63. Thank you! http://lanyrd.com/sgptt