  1. KYIV 2019 .Net Core in production By Leonid Molotiievskyi .NET CONFERENCE #1 IN UKRAINE
  2. 2 About me • Hands-on software architect and technological consultant • Good at splitting a monolith to microservices • Built a huge enterprise financial solution from scratch • Technical guy who believes that right people decisions are more important than technological ones • Speaker and mentor
  3. 3 Agenda: spoilers about what we are going to talk about • Context overview: the environment that we used to live with • Scaling: how did we scale our services?
  4. 4 • Hell for the DevOps team: do we solve the right problem? • Useful advice: the things that can help you to resolve the problem • Lessons learned: how can we benefit in the future? • Q&A: questions and answers
  5. Context overview Several statements about the project
  6. 6 Context overview • Financial domain • 25+ microservices • Team 70+ people • 20+ environments • Three versions in support one in development
  7. 7 Solution overview: managing workflows
  8. Scaling How did we scale our services?
  9. 9 Notification service
  10. 10 Solution? - And then a set of dummy queues appears, left over after each descale/redeploy
  11. 11 Gateway: infinite redirect - Where do we store keys for cookies?
  12. 12 Gateway: infinite redirect solution
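The solution slide carries no text, so the following is only a sketch of one common fix, not necessarily what the team did. With several gateway replicas, each instance generates its own ASP.NET Core Data Protection key ring by default, so a cookie issued by one replica cannot be decrypted by another and the user is redirected to log in again. Persisting the key ring to a location shared by all replicas avoids that. The extension method name, the application name `"gateway"`, and the `keyRingPath` parameter are assumptions made for illustration.

```csharp
using System.IO;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

public static class DataProtectionSetup
{
    // keyRingPath is assumed to be a volume mounted into every gateway replica.
    public static IServiceCollection AddSharedDataProtection(
        this IServiceCollection services, string keyRingPath)
    {
        services.AddDataProtection()
            // Hypothetical name; it must be identical on all replicas.
            .SetApplicationName("gateway")
            // Store the key ring where every replica can read it.
            .PersistKeysToFileSystem(new DirectoryInfo(keyRingPath));
        return services;
    }
}
```

It would be called from `ConfigureServices`, for example `services.AddSharedDataProtection("/var/keys")`, where the path is again a placeholder.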
  13. Hell for the DevOps team What technology decisions helped us to survive
  14. 14 Each morning… • The Dev/Staging/Prod cluster is down • RabbitMQ/Mongo/Consul/Prometheus is not operational • The fire-fighting team is on duty
  15. 15 Greedy service
  16. 16 Queues are growing… - 1 • “The TTL is too small”, or is it something else?
  17. 17 Queues are growing… - 2 • A queue has a set of consumers • Service A consumes the message • Service A starts processing the message • The health check of the consumer fails due to high load on service A, a network issue, an OOM kill, etc. • A duplicate message appears in the queue
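The slides do not say how the duplicates were handled. One widely used mitigation is to make the consumer idempotent, so that a redelivered message is recognised and skipped. The sketch below assumes every message carries a unique ID; `IdempotentHandler` is a hypothetical helper, and its in-memory set would have to be replaced by a shared store once several replicas consume the same queue.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper: remembers which message IDs were already handled.
public sealed class IdempotentHandler
{
    private readonly ConcurrentDictionary<string, bool> _seen =
        new ConcurrentDictionary<string, bool>();

    // Runs the handler only for the first delivery of a given message ID.
    // Returns true if the handler ran, false if the delivery was a duplicate.
    public bool Handle(string messageId, Action handler)
    {
        if (!_seen.TryAdd(messageId, true))
        {
            return false; // duplicate redelivery: skip it
        }

        try
        {
            handler();
            return true;
        }
        catch
        {
            // Forget the ID so that a later redelivery can retry the work.
            _seen.TryRemove(messageId, out _);
            throw;
        }
    }
}
```

A consumer callback would wrap its processing as `handler.Handle(messageId, () => Process(message))`, where `Process` stands for the service's own logic.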
  18. 18 OOM Killed issue • .NET Core 2.2 doesn’t respect Docker limits: “Server GC was designed with the assumption that the process using Server GC is the dominant process on the machine. By default it uses as many heaps as there are # of processors on the machine.”
  19. 19 Let’s fix the issue by upgrading to .NET Core 3.0?
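To check whether a service is affected, it can help to log which GC mode the process actually runs in and how many processors the runtime sees. The helper below is a minimal diagnostic sketch; the class name `GcInfo` is made up for illustration. Server GC can be turned off with `<ServerGarbageCollector>false</ServerGarbageCollector>` in the project file or `"System.GC.Server": false` in `runtimeconfig.json`; whether the team chose that route is not stated on the slides.

```csharp
using System;
using System.Runtime;

// Hypothetical diagnostic helper: reports the GC settings of the current process.
public static class GcInfo
{
    public static string Describe()
    {
        // IsServerGC is true when the process runs with Server GC,
        // which by default creates one heap per processor.
        return $"ServerGC={GCSettings.IsServerGC}, " +
               $"LatencyMode={GCSettings.LatencyMode}, " +
               $"Processors={Environment.ProcessorCount}";
    }
}
```

Logging `GcInfo.Describe()` once at startup makes it easy to compare the values inside the container with the limits configured for it.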
  20. 20 Socket file descriptor leak in HttpClient
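The slide does not describe the root cause, so this is only a general precaution rather than the fix for the specific leak. A frequent source of exhausted sockets in .NET services is creating a new `HttpClient` per request. The sketch below shares one client for the whole process and recycles pooled connections periodically. The class name `SharedHttp` and both timeout values are assumptions.

```csharp
using System;
using System.Net.Http;

// Hypothetical holder for a single process-wide HttpClient.
public static class SharedHttp
{
    // One handler means one connection pool for the whole process.
    private static readonly SocketsHttpHandler Handler = new SocketsHttpHandler
    {
        // Assumed values: recycle connections so DNS changes are picked up
        // and idle sockets are closed instead of piling up.
        PooledConnectionLifetime = TimeSpan.FromMinutes(2),
        PooledConnectionIdleTimeout = TimeSpan.FromMinutes(1)
    };

    // Reuse this instance everywhere instead of calling new HttpClient().
    public static readonly HttpClient Client = new HttpClient(Handler);
}
```

Callers would use `SharedHttp.Client.GetAsync(...)` directly and never dispose the shared instance.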
  21. 21 Docker: no space left on the device level=info msg="[8] System error: write /sys/fs/cgroup/docker/01f5670fbee1f6687f58f3a943b1e1bdaec2630197fa4da1b19cc3db7e3d3883/cgroup.procs: no space left on device"
  22. 22 Reason:
  23. 23 Prometheus is down
  24. Useful advice What can prevent nasty situations
  25. 25 What can help you to find them? Monitoring configured to track: • Memory consumption • CPU consumption • Number of threads on worker node • Number of open socket descriptors per node/pod • Connection refused errors • Correlation IDs in logs • Number of messages in queues • Number of consumers for queues
  26. 26 Use the standard health check middleware
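As a rough illustration of the standard middleware (available since ASP.NET Core 2.2), the sketch below registers one custom check and exposes it over HTTP. The check name `"queue"`, the `/health` path, and the `QueueConnectionHealthCheck` class with its placeholder condition are assumptions, not details from the talk.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check; a real one would inspect the service's own dependencies.
public sealed class QueueConnectionHealthCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        bool connected = true; // placeholder for a real connectivity test
        return Task.FromResult(connected
            ? HealthCheckResult.Healthy("queue connection is open")
            : HealthCheckResult.Unhealthy("queue connection is closed"));
    }
}

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Register the health check service and the custom check.
        services.AddHealthChecks()
            .AddCheck<QueueConnectionHealthCheck>("queue");
    }

    public void Configure(IApplicationBuilder app)
    {
        // Responds with 200 when healthy and 503 when unhealthy by default.
        app.UseHealthChecks("/health");
    }
}
```

An orchestrator probe can then poll `/health` instead of relying on a hand-written endpoint.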
  27. 27 Set up the environment so that… • Infrastructure services have an HA setup • At least two instances of each service are deployed • Monitoring and alerting are set up • “Temporary data” is sure to disappear after redeployment • Nothing is configured manually
  28. Lessons learned What we get from it
  29. 29 Lessons learned • The “do it as simply as possible” principle doesn’t work; “do it the smart way” does • Think about application scaling from the beginning • Know about open issues in your target framework • Do not blame the DevOps team; try to help them find out what the reason is
  30. 30 Follow me @lmolotii on Q&A