The talk draws on the experience of developing a real-time messenger platform with the following characteristics:
* 100,000+ concurrently connected users
* 100+ servers
* REST API for bots
Talk outline:
* Why build a messenger?
* Current messaging protocols
* Architectural approaches to building a messenger
* Libraries and tools
* Problems and pitfalls
WebCamp 2016: Python. Viacheslav Kakovskyi: Real-time messenger in Python. Specifics of back-end development.
2. Me!
@kakovskyi
Python Developer at SoftServe
Contributor of Atlassian HipChat — Python 2, Twisted
Maintainer of KPIdata — Python 3, asyncio
3. Agenda
● What is an 'instant messenger'?
● Related projects from my experience
● Messaging protocols
● Life of messaging platform
● Lessons learned
● Summary
● Further reading
5. What is an 'instant messenger'?
● online chat
● real-time delivery
● short messages
6. What is an 'instant messenger'?
● history search
● file sharing
● mobile push notifications
● video calling
● bots and integrations
7. Related projects from my experience
● Hosted chat for teams and enterprises
● Founded in 2009 by 3 students
● 100 000+ connected users
● 100+ nodes
● REST API for integrations and bots
● Built with Python 2 and Twisted
8. Messaging protocols
Protocol is about:
● Message format
● Allowed types of messages
● Limitations
● Routine
○ How to encode data?
○ How to establish/close connection?
○ How to authenticate?
○ How to encrypt?
10. XMPP
● XMPP - signaling protocol
● BOSH - transport protocol
● Originated as Jabber in 1999
● XML as a message format
● Stanza - basic unit in XMPP
● Types of stanzas:
○ Message
○ Presence
○ Info/Query
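The three stanza types can be illustrated with a minimal sketch that uses only the standard library. The stanzas below are simplified, the JIDs are made up, and stream-level namespaces are omitted; a real XMPP server would use a streaming parser rather than parsing complete strings.

```python
import xml.etree.ElementTree as ET

# Simplified example stanzas; the JIDs and payloads are made up for illustration.
STANZAS = [
    "<message from='alice@example.com' to='bob@example.com' type='chat'>"
    "<body>Hello!</body></message>",
    "<presence from='alice@example.com'><show>away</show></presence>",
    "<iq from='alice@example.com' type='get' id='roster-1'>"
    "<query xmlns='jabber:iq:roster'/></iq>",
]


def describe(stanza_xml):
    """Return (stanza kind, sender, short payload summary)."""
    root = ET.fromstring(stanza_xml)
    if root.tag == "message":
        summary = root.findtext("body", default="")
    elif root.tag == "presence":
        summary = root.findtext("show", default="available")
    elif root.tag == "iq":
        summary = root.get("type", "")
    else:
        raise ValueError("unknown stanza: %s" % root.tag)
    return root.tag, root.get("from"), summary


for stanza in STANZAS:
    print(describe(stanza))
```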
11. XMPP
● Extensions defined by XEPs (XMPP Extension Protocols):
○ Bidirectional-streams Over Synchronous HTTP (BOSH)
○ Serverless messaging
○ File transfer, etc.
19. WebSocket-based solutions
● WebSocket - transport protocol
● Standardized in 2011 by the IETF (RFC 6455); the browser API is specified by W3C
● Full-duplex communication channel
● JSON as a message format
● Custom message types
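A JSON envelope with custom message types could look like the sketch below. The envelope fields (`type`, `payload`), the two message types, and the handler registry are hypothetical and only show the dispatch idea; the transport itself (the WebSocket connection) is left out.

```python
import json

# Hypothetical registry: message type -> handler function.
HANDLERS = {}


def handles(message_type):
    """Decorator that registers a handler for one message type."""
    def register(func):
        HANDLERS[message_type] = func
        return func
    return register


@handles("message")
def on_message(payload):
    return "%s says: %s" % (payload["from"], payload["body"])


@handles("presence")
def on_presence(payload):
    return "%s is %s" % (payload["from"], payload["status"])


def dispatch(raw_frame):
    """Decode one JSON text frame and route it by its 'type' field."""
    envelope = json.loads(raw_frame)
    handler = HANDLERS.get(envelope.get("type"))
    if handler is None:
        raise ValueError("unsupported message type: %r" % envelope.get("type"))
    return handler(envelope["payload"])


frame = json.dumps({"type": "message", "payload": {"from": "alice", "body": "hi"}})
print(dispatch(frame))
```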
26. WebSocket and Python
● Clients:
○ Autobahn
○ aiohttp
○ Tornado-based example
○ Vanilla websocket-client
● JS client: Socket.IO
27. Life of messaging platform
● Authentication
● Access control checks
● Delivery
○ Messages
○ User's presence
○ Push notifications
● History retrieval
● History search
28. Life of messaging platform
● Parsing
○ Protocol
○ Message content
● Dealing with file uploads
○ Security checks
○ Thumbnails distribution
● Multi-session support
● Reconnection handling
● Rate-limiting
29. Life of messaging platform
● Server keeps connections open for every client
● Large number of long-lived concurrent connections
● A thread-per-connection approach is inefficient due to per-thread overhead
● Requires an I/O multiplexing mechanism (select-style) on the backend:
○ poll
○ epoll
○ kqueue
● Asynchronous Python frameworks are preferred for high-load solutions
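The idea of serving many connections from one thread can be sketched with the standard `selectors` module, which picks epoll, kqueue, poll, or select depending on the platform. This is a toy illustration under simplifying assumptions: `socket.socketpair()` stands in for real client connections, and each connection is served exactly once. Asynchronous frameworks wrap the same mechanism in an event loop.

```python
import selectors
import socket


def echo_once(pairs):
    """Answer one payload per connection using a single thread and one selector."""
    selector = selectors.DefaultSelector()  # epoll, kqueue, poll or select
    for server_side, _client_side in pairs:
        server_side.setblocking(False)
        selector.register(server_side, selectors.EVENT_READ)
    pending = len(pairs)
    while pending:
        # Only sockets that are ready to read are returned; nothing blocks per client.
        for key, _events in selector.select(timeout=1.0):
            conn = key.fileobj
            data = conn.recv(1024)
            conn.sendall(data.upper())
            selector.unregister(conn)
            pending -= 1
    selector.close()


pairs = [socket.socketpair() for _ in range(3)]
for index, (_server_side, client_side) in enumerate(pairs):
    client_side.sendall(b"ping %d" % index)
echo_once(pairs)
replies = [client_side.recv(1024) for _server_side, client_side in pairs]
for server_side, client_side in pairs:
    server_side.close()
    client_side.close()
print(replies)
```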
30. Life of messaging platform
● Authentication
○ OAuth2
○ Run encryption operations in a separate Python thread
○ Cache user identities with Redis/Memcached
● Access-control checks
○ Make the checks lightweight and cheap
○ Raise an exception when operation isn't permitted
EAFP: Easier to ask for forgiveness than permission
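A lightweight access-control check in the EAFP style might look like the sketch below. The `PermissionDenied` exception, the `ROOM_MEMBERS` table, and the function names are hypothetical; a real platform would look memberships up in a cache instead of a module-level dict.

```python
class PermissionDenied(Exception):
    """Raised when a user may not perform an operation (hypothetical name)."""


# Hypothetical in-memory membership table standing in for a cache lookup.
ROOM_MEMBERS = {"dev": {"alice", "bob"}}


def ensure_can_post(user, room):
    # A cheap set lookup; raise instead of returning a flag the caller may ignore.
    if user not in ROOM_MEMBERS.get(room, ()):
        raise PermissionDenied("%s cannot post to %s" % (user, room))


def post_message(user, room, body):
    ensure_can_post(user, room)
    return {"room": room, "from": user, "body": body}


try:
    post_message("mallory", "dev", "hi")
except PermissionDenied as error:
    print("rejected:", error)
```

The caller attempts the operation and handles the exception at one place, rather than checking permissions before every call.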
31. Delivery
● Make message delivery fault-tolerant
● Limit size of a message
● Filter content of messages:
○ Users like to send chars that break all the things
● Reduce presence traffic; it can become a bottleneck for large chats
● Use an asynchronous task queue or broker for delivery when a user is offline (email or push)
○ Celery
○ RQ
○ Amazon Simple Queue Service
○ Huey
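The size limit and content filtering points can be sketched as a small sanitizing step run before delivery. The limit of 1000 characters and the exact set of stripped characters are assumptions for illustration; which characters actually break clients depends on the product.

```python
import unicodedata

MAX_MESSAGE_LENGTH = 1000  # assumed limit, tune for your product

# Bidirectional override/embedding characters that can scramble rendered text.
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}


def sanitize(body):
    """Drop control characters (except newline and tab) and bidi overrides,
    then enforce the size limit."""
    cleaned = "".join(
        ch for ch in body
        if ch in "\n\t"
        or (unicodedata.category(ch) != "Cc" and ch not in BIDI_CONTROLS)
    )
    if len(cleaned) > MAX_MESSAGE_LENGTH:
        raise ValueError("message too long")
    return cleaned


print(repr(sanitize("hi\x00 there\n")))
```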
32. Life of messaging platform
● Push notifications
■ Vendors
● Amazon SNS
● APNS
● Google Cloud Messaging
● Firebase Cloud Messaging
■ Python tools
● PyAPNs
● Python-GCM
● Pusher
● Be careful with device registration
● Make delivery of pushes fault-tolerant
33. History retrieval
● Return last messages for every chat instantly
○ Use double writes
■ In-memory queue only for last messages
■ Persistent storage for all the things
● Most history retrievals are for the last few days
○ Optimize for that case
● Index messages by date
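The double-write idea can be sketched as follows. The `HistoryStore` class, the window of 50 messages, and the plain list standing in for persistent storage are all assumptions; in production the recent window would typically live in something like Redis and the full history in a database.

```python
from collections import defaultdict, deque

RECENT_LIMIT = 50  # assumed size of the "last messages" window


class HistoryStore:
    def __init__(self):
        # Bounded in-memory window per chat: old entries fall off automatically.
        self.recent = defaultdict(lambda: deque(maxlen=RECENT_LIMIT))
        # Stand-in for a real persistent storage that keeps everything.
        self.persistent = []

    def save(self, chat_id, message):
        # Double write: fast path for recent history, durable path for all history.
        self.recent[chat_id].append(message)
        self.persistent.append((chat_id, message))

    def last_messages(self, chat_id):
        return list(self.recent[chat_id])


store = HistoryStore()
for number in range(60):
    store.save("dev", "message %d" % number)
print(len(store.last_messages("dev")), len(store.persistent))
```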
34. History search
● Elasticsearch is the default solution for full-text search
● @a_soldatenko: What is the best full text search engine for Python?
● Add timing for search requests
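Timing search requests can be as simple as a context manager around the call. In this sketch the `timings` list is a stand-in for a real metrics client, and `search` is a naive substring match rather than an Elasticsearch query; both are assumptions for illustration.

```python
import time
from contextlib import contextmanager

timings = []  # stand-in for a metrics client


@contextmanager
def timed(name):
    """Record how long the wrapped block took, even if it raises."""
    started = time.perf_counter()
    try:
        yield
    finally:
        timings.append((name, time.perf_counter() - started))


def search(query, documents):
    # Naive placeholder for a real full-text search request.
    with timed("history.search"):
        return [doc for doc in documents if query.lower() in doc.lower()]


print(search("deploy", ["Deploy done", "lunch?"]))
print(timings[-1][0])
```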
35. Parsing
● Protocol
○ Avoid pure-Python parsers
■ ujson
■ lxml
○ Run benchmarks against your typical cases
● Message content
○ Be careful with regular expressions
■ re2
■ pyre2
○ Alternative parsers in Python
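Benchmarking parsers against a typical payload can be done with `timeit`. The sketch below compares the standard `json` module with `ujson` when it is installed; the sample message and the iteration count are assumptions, and the numbers only mean something for payloads that resemble your own traffic.

```python
import json
import timeit

try:
    import ujson  # optional C-based parser; skipped when not installed
except ImportError:
    ujson = None

# Assumed "typical" message; replace with samples from real traffic.
SAMPLE = json.dumps({"type": "message", "from": "alice", "body": "hello " * 20})


def benchmark(loads, number=10000):
    """Return the total seconds spent parsing SAMPLE `number` times."""
    return timeit.timeit(lambda: loads(SAMPLE), number=number)


results = {"json": benchmark(json.loads)}
if ujson is not None:
    results["ujson"] = benchmark(ujson.loads)
for name, seconds in sorted(results.items()):
    print("%s: %.4f s" % (name, seconds))
```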
36. Dealing with file uploads
● Security checks
○ File upload vulnerabilities
○ Image upload
■ Decompression bomb
■ Other vulnerabilities with Pillow
○ Amazon S3 as file storage
■ boto
■ aiobotocore
■ botornado
● Thumbnails distribution
○ Delegate that to S3
○ Requested by a client even if not needed
37. Life of messaging platform
● Multi-session support
○ Set expiration time
○ Be ready to handle up to 4 simultaneous sessions per user
■ Desktop
■ Mobile
■ Tablet
■ Laptop
● Reconnection handling
○ Spin up a proxy layer between the messaging server and clients
● Rate-limiting
○ Limit the number of heavy operations per user/group
○ Leaky bucket
○ Throttling
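A leaky bucket used as a meter can be sketched in a few lines. The class name, the `capacity` and `leak_rate` parameters, and the injectable `clock` are choices made for this illustration; one bucket would be kept per user or group.

```python
import time


class LeakyBucket:
    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity    # how much burst the bucket can hold
        self.leak_rate = leak_rate  # units drained per second
        self.clock = clock
        self.level = 0.0
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and fill the bucket if the operation fits, else False."""
        now = self.clock()
        # Drain whatever leaked out since the last call.
        self.level = max(0.0, self.level - (now - self.updated) * self.leak_rate)
        self.updated = now
        if self.level + cost > self.capacity:
            return False
        self.level += cost
        return True


bucket = LeakyBucket(capacity=2, leak_rate=1.0)
print([bucket.allow() for _ in range(3)])
```

Passing a fake `clock` makes the limiter easy to unit-test without sleeping.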
38. Lessons learned
● Bursty traffic
○ Load testing is a must, but not always enough
■ Locust
■ Yandex Tank
● Reconnect storm could be a big deal
○ We should handle that on platform and client-side
● AWS issues lead to a bad customer experience
○ Spread nodes across multiple Availability Zones (Multi-AZ)
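On the client side, a common way to soften a reconnect storm is exponential backoff with jitter, so that clients dropped at the same moment do not all return at the same moment. The sketch below is one possible variant ("full jitter"); the base delay and the cap are assumed values, and the talk does not state which strategy was used.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


for attempt in range(5):
    print("attempt %d: wait up to %.0f s, chosen %.2f s"
          % (attempt, min(60.0, 2 ** attempt), reconnect_delay(attempt)))
```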
39. Lessons learned
● Incident prevention is cheaper than resolution
○ Collect stats and metrics about your services and storage
■ Redis for per-chat stats
■ StatsD
■ Grafana
○ Be notified when something starts going wrong
■ Elastalert
■ Monit
■ DataDog
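Emitting a metric to StatsD takes one UDP datagram in the plain-text `name:value|type` format. The sketch below is a hand-rolled illustration, not the API of any StatsD client library; the metric names are made up, and the address assumes a StatsD daemon on its conventional port 8125.

```python
import socket


def format_metric(name, value, metric_type):
    """Build a StatsD line such as b'chat.messages_sent:1|c'."""
    return ("%s:%s|%s" % (name, value, metric_type)).encode("ascii")


def send_metric(name, value, metric_type="c", address=("127.0.0.1", 8125)):
    # Fire-and-forget UDP: metrics must never slow down message delivery.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, metric_type), address)
    finally:
        sock.close()


print(format_metric("chat.messages_sent", 1, "c"))
print(format_metric("history.search_time", 35, "ms"))
```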
40. Lessons learned
● Don't stick with one language/stack
○ Python is great, but for some cases Go, Ruby or PHP are more suitable from the product side
○ Avoid duplicating business logic across several repos; spin up a service and just call the endpoint
● Releasing new features only for certain groups makes product management easier
○ LaunchDarkly
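Releasing a feature only to certain groups can be sketched without any external service. The `FLAGS` table, the flag name, and the `is_enabled` function below are hypothetical and do not reflect LaunchDarkly's API; they only show group targeting combined with a stable percentage rollout.

```python
import hashlib

# Hypothetical flag table: explicit groups plus a percentage rollout.
FLAGS = {
    "video-calls": {"groups": {"staff", "beta"}, "percent": 10},
}


def is_enabled(feature, user_id, user_groups):
    flag = FLAGS.get(feature)
    if flag is None:
        return False
    if flag["groups"] & set(user_groups):
        return True
    # Hash feature and user together so each user lands in a stable bucket 0..99.
    digest = hashlib.sha256(("%s:%s" % (feature, user_id)).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < flag["percent"]


print(is_enabled("video-calls", 42, ["staff"]))
```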
41. Lessons learned
● Don’t F**k the Customer
○ Provide unit/integration tests with every PR
○ Have development environment same as prod
○ Have staging environment same as prod
○ Make deployments fast
○ Rollback faster
○ Have a fallback plan
43. Summary
● Select a messaging protocol which aligns with your needs
● WebSocket + JSON could be a good default for new projects
● Usage of asynchronous frameworks is preferred
● Execute blocking operations in a separate thread
● Collect metrics for common services operations
● Caching saves a lot of time
● Use C or Cython-based solutions for CPU-bound tasks
● Have fast release/deploy/rollback cycle
● Python is great, but don't hesitate to pick other tools
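The advice to execute blocking operations in a separate thread can be sketched with asyncio's default executor. Password hashing is used here as an example of a blocking, CPU-heavy call; the function names, the iteration count, and the demo salt are assumptions for illustration only.

```python
import asyncio
import hashlib
import hmac


def hash_password(password, salt):
    # Deliberately slow key derivation; calling it directly in a coroutine
    # would block the event loop for every other connection.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100000)


async def authenticate(password, salt, expected_digest):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    digest = await loop.run_in_executor(None, hash_password, password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt = b"static-salt-for-demo-only"
stored = hash_password("s3cret", salt)
print(asyncio.run(authenticate("s3cret", salt, stored)))
```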
44. Further reading
● How HipChat Stores and Indexes Billions of Messages Using ElasticSearch
● @kakovskyi: Maintaining a high load Python project for newcomers
● HipChat: Important improvements to staging, presence & database storage
● HipChat and the little connection that could
● Elasticsearch at HipChat: 10x faster queries
● Atlassian: How IT and SRE use ChatOps to run incident management
● A Study of Internet Instant Messaging and Chat Protocols
● What Is Async, How Does It Work, And When Should I Use It?
● Leaky Bucket & Token Bucket - Traffic shaping
● A guide to analyzing Python performance
● Why Leading Companies Dark Launch - LaunchDarkly Blog
● WebCamp 2016: Asyncio-stack for web development (soon)