Easy Taxi is present in more than 30 countries and has millions of users, both passengers and taxi drivers. Its app runs on dozens of mobile platforms and supports thousands of simultaneous connections. The application was born in the AWS cloud and makes full use of its resources. In this advanced session, we explore Easy Taxi's architecture and analyze the optimization strategies available to applications deployed on the AWS cloud.
2. Optimizing Web Servers and their components
Davi Menezes & Robert Fuente
Cloud Technical Account Manager | AWS Support
3. Different strategies for better performance
• Leverage newer hardware and software.
• Apply more resources through auto scaling.
• Offload the heavy lifting to someone else.
• Optimize the web server stack.
5. Optimizations by definition are app-specific
• Test and validate together with the application itself.
• There is no substitute for production data.
• Make it an integral part of the application itself.
– E.g. Elastic Beanstalk .ebextensions
7. First understand your workload
• What are we serving?
– Number of transactions
– Transaction size
– Back-end resource consumption
• How much can we do today?
– Theoretical benchmark
– Actual production load (observability / data-driven)
• What is the bottleneck resource?
– “Choose instance type for the bounding resource”
– Workload Analysis vs. Resource Analysis
https://youtu.be/7Cyd22kOqWc
10. CloudWatch Metric Anatomy
• Statistical aggregation
– Min
– Max
– Sum
– Average
– Count
• One data point per minute.
• Can trigger actions via alarms.
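The five statistics above can be sketched in a few lines of C. This is only an illustration of the aggregation idea, not CloudWatch's implementation; the struct datapoint type and the aggregate() function are hypothetical names introduced here.

```c
#include <stddef.h>

/* One aggregated data point, mirroring the five statistics on the slide.
 * Hypothetical type for illustration only. */
struct datapoint {
    double min, max, sum, average;
    size_t count;
};

/* Aggregate the raw samples collected during one period (e.g. one minute).
 * Returns 0 on success, -1 if there is nothing to aggregate. */
int aggregate(const double *samples, size_t n, struct datapoint *out)
{
    if (samples == NULL || out == NULL || n == 0)
        return -1;

    out->min = out->max = samples[0];
    out->sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < out->min) out->min = samples[i];
        if (samples[i] > out->max) out->max = samples[i];
        out->sum += samples[i];
    }
    out->count = n;
    out->average = out->sum / (double)n;
    return 0;
}
```

Note that an average alone hides outliers, which is why Min and Max are worth alarming on separately.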
11. Micro metrics vs. Macro metrics
• Agent-based monitoring
• Available in Amazon Linux
• Provides highly granular, server-specific insights
Source: http://demo.munin-monitoring.org/
12. Coming from a variety of sources
Customer generated
• Kernel and Operating System
• Web Server
• Application Server/Middleware
• Application code
• Instance networking
AWS generated
• Amazon CloudFront
• Amazon Elastic Load Balancing
• Amazon CloudWatch
• Amazon Simple Storage Service
17. ELB Connection Behavior
• No true limits on influx of connections
– But may require preemptive scaling (aka Pre-warming)
• Leverages HTTP Keep-Alives
• Configurable Idle Connection Timeout
• HTTP Session Stickiness & Health-checking
– Fast Registration
• SSL Off-loading and Back-end authentication
18. ELB access logs
HTTP log entries
• Only one side of picture.
• Can’t log custom headers or format logs.
• Logs are delayed.
• Drill down to instance-level responsiveness, but can’t dive into latency outliers
[Chart: Processing Time, showing response_processing_time, request_processing_time, backend_processing_time, and bytes]
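As a rough sketch of how the three processing-time fields could be pulled out of an access log entry, the function below assumes the Classic Load Balancer layout, where the entry starts with timestamp, load balancer name, client:port and backend:port, followed by the three timings and the two status codes. The struct elb_timing type and parse_elb_line() are hypothetical names, and any log line fed to it in a test is sample data rather than real traffic.

```c
#include <stdio.h>

/* Timing and status fields of one ELB access log entry (hypothetical type). */
struct elb_timing {
    double request_processing_time;   /* ELB received request -> sent to back end */
    double backend_processing_time;   /* back end started answering */
    double response_processing_time;  /* ELB received response -> sent to client */
    int elb_status_code;
    int backend_status_code;
};

/* Parse the leading fields of a log line.
 * Assumed layout: timestamp elb client:port backend:port
 *                 request_t backend_t response_t elb_status backend_status ...
 * Returns 0 on success, -1 if the line does not match. */
int parse_elb_line(const char *line, struct elb_timing *t)
{
    int n = sscanf(line, "%*s %*s %*s %*s %lf %lf %lf %d %d",
                   &t->request_processing_time,
                   &t->backend_processing_time,
                   &t->response_processing_time,
                   &t->elb_status_code,
                   &t->backend_status_code);
    return n == 5 ? 0 : -1;
}
```

Summing the three timings per request gives one side of the picture only; client-side latency still has to come from elsewhere.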
19. ELB Key Metrics
• Latency and Request Count
• Surge Queue and Spillover
• ELB 5xx and 4xx
• Back-end Connection Errors
• Healthy and Unhealthy Host Counts
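21. A minimal HTTP server (C sketch)
/* Needs <arpa/inet.h>, <netinet/in.h>, <sys/socket.h>, <unistd.h>.
 * read_request() is the slide's placeholder helper; error checks omitted. */
int cfd, fd = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
struct sockaddr_in si = {0};
socklen_t len = sizeof si;
si.sin_family = AF_INET;
inet_aton("127.0.0.1", &si.sin_addr);
si.sin_port = htons(80);
bind(fd, (struct sockaddr *)&si, sizeof si);
listen(fd, 512); /* 512 = backlog of pending connections */
while ((cfd = accept(fd, (struct sockaddr *)&si, &len)) != -1) {
    static const char resp[] = "HTTP/1.0 200 OK\r\n\r\n"
                               "Welcome to AWS Summit SP 2015.";
    read_request(cfd); /* read(cfd, ...) until "\r\n\r\n" */
    write(cfd, resp, sizeof resp - 1);
    close(cfd);
    len = sizeof si; /* accept() overwrites len, so reset it */
}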
[Diagram: http:80 call sequence socket() → bind() → listen() → accept(), annotated with the number of open file descriptors]
22. The last TCP mile
• Accept Pending Queue
– man listen(2): “(…) backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow.”
– Recv-Q & Send-Q – TCP is stream oriented
• man accept(2): Blocking vs. Non-blocking sockets
24. Queuing at the TCP layer first
• ECONNREFUSED
man listen(2): “if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds” – aka: TCP Retransmit
25. Scaling in the Linux Networking Stack
• Connection States
– man netstat(8)
• Backlog Maximum Length
– Waiting to be accepted: /proc/sys/net/core/somaxconn
– Half-Open connections: /proc/sys/net/ipv4/tcp_max_syn_backlog
– CPU's input packet queue: /proc/sys/net/core/netdev_max_backlog
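A small sketch of how the accept-queue limit interacts with the backlog passed to listen(): on Linux a backlog larger than somaxconn is silently capped to that value. The helper names read_sysctl_long() and effective_backlog() are hypothetical and exist only for this illustration.

```c
#include <stdio.h>

/* Read a single integer from a /proc/sys file.
 * Returns the value, or -1 if the file cannot be read or parsed. */
long read_sysctl_long(const char *path)
{
    long value = -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* The backlog the kernel will actually honour: the requested value,
 * capped at somaxconn. A negative somaxconn means "unknown", in which
 * case the requested value is returned unchanged. */
int effective_backlog(int requested, long somaxconn)
{
    if (somaxconn >= 0 && requested > somaxconn)
        return (int)somaxconn;
    return requested;
}
```

For example, effective_backlog(512, read_sysctl_long("/proc/sys/net/core/somaxconn")) shows whether the listen(fd, 512) call from the earlier slide really gets a queue of 512.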
26. TCP is a Window based protocol
• TCP Receive Window
“considered one of the most important TCP tweaks” (ugh!)
– BDP (bytes) = available bandwidth (KB/s) × RTT (ms)
• Choose an EC2 instance with proper bandwidth
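The BDP formula works out neatly in the slide's units: KB/s times milliseconds gives bytes, because the two factors of 1000 cancel. The sketch below assumes 1 KB = 1000 bytes; bdp_bytes() is a hypothetical helper name.

```c
/* Bandwidth-delay product in bytes.
 * bandwidth_kBps: available bandwidth in kilobytes per second (1 KB = 1000 bytes)
 * rtt_ms:         round-trip time in milliseconds
 * (1000 bytes/s) * (1/1000 s) = 1 byte, so the plain product is already bytes. */
unsigned long long bdp_bytes(unsigned long long bandwidth_kBps,
                             unsigned long long rtt_ms)
{
    return bandwidth_kBps * rtt_ms;
}
```

As a worked example, a 1 Gbit/s path is 125000 KB/s; at an 80 ms RTT the window needed to keep the pipe full is 125000 × 80 = 10,000,000 bytes, far above a 64 KB unscaled window.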
27. TCP Initial Congestion Window
• RFC 3390 – Higher Initial Window
– ip route (…) initcwnd 10 (kernel <2.6.39)
• Disable Slow Start (net.ipv4.tcp_slow_start_after_idle)
• Google Research
– “propose to increase (…) to at least ten segments (about 15KB)”
– Pub: “An Argument for Increasing TCP's Initial Congestion Window”
+/* TCP initial congestion window */
+#define TCP_INIT_CWND 10
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=(…)
Committed to kernel 2.6.39 (May 2011)
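To put the two initial windows side by side, the sketch below computes the RFC 3390 upper bound, min(4*MSS, max(2*MSS, 4380 bytes)), and the ten-segment window from the kernel change. The function names are hypothetical, and the MSS of 1460 bytes used in the example is an assumption (typical for Ethernet), not a value from the slides.

```c
/* RFC 3390 upper bound on the initial window, in bytes:
 * min(4*MSS, max(2*MSS, 4380)). */
unsigned rfc3390_initial_window(unsigned mss)
{
    unsigned floor_bytes = (2 * mss > 4380u) ? 2 * mss : 4380u;
    return (4 * mss < floor_bytes) ? 4 * mss : floor_bytes;
}

/* Initial window of ten segments (TCP_INIT_CWND 10), in bytes. */
unsigned iw10_bytes(unsigned mss)
{
    return 10u * mss;
}
```

With an MSS of 1460 bytes, RFC 3390 allows 4380 bytes (three segments) in the first round trip, while ten segments allow 14600 bytes, which is the "about 15KB" quoted above.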
28. TCP Buffers & Memory Utilization
• Buffering
– Use case: sending/receiving large amounts of data
– Auto-tunable by the kernel
– However, it has bounds: min, default, and max.
– Tune: net.ipv4.tcp_rmem/wmem (in bytes)
• Sockets demand on page allocation
– Tune: net.ipv4.tcp_mem (in pages)
30. About TIME-WAIT state
• TIME-WAIT Assassination RFC (RFC 1337)
• Increase your port range
– net.ipv4.ip_local_port_range
– A ballpark for your maximum rate of new connections: ip_local_port_range / tcp_fin_timeout, which leads to about 500 connections per second!
“The TIME_WAIT state is our friend and is there to help us (i.e., to let old
duplicate segments expire in the network). Instead of trying to avoid the state,
we should understand it.”
Vincent Bernat - (vincent.bernat.im)
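The ballpark above is simple integer arithmetic. In the sketch below, the port range 32768 to 61000 and the 60-second hold time are illustrative assumptions (common defaults on kernels of that era), and max_new_connections_per_second() is a hypothetical helper name.

```c
/* Rough ceiling on new outbound connections per second to one destination:
 * number of usable local ports divided by how long each port stays tied up.
 * Returns -1 for invalid input. */
long max_new_connections_per_second(long port_low, long port_high,
                                    long hold_seconds)
{
    if (hold_seconds <= 0 || port_high < port_low)
        return -1;
    return (port_high - port_low + 1) / hold_seconds;
}
```

With the assumed values this gives 28233 / 60 = 470, in line with the "about 500" figure; widening ip_local_port_range raises the ceiling without touching TIME-WAIT behaviour.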
32. TL;DR: Do *not* enable net.ipv4.tcp_tw_recycle
• Clients behind NAT or a stateful firewall will get dropped
• Should never be enabled 99.99999999% of the time*
• Linux’s TCP protocol man page does not recommend it
* Probably 100%, but there may be a valid case out there
36. Easy Taxi: your way to hail a taxi!
[Map: countries marked Present or Coming soon: S. Korea, Saudi Arabia, Pakistan, Brazil, Argentina, Peru, Mexico, Venezuela, Colombia, Ecuador, S. Africa, Namibia, Angola, Botswana, Kenya, Tanzania, Egypt, Morocco, Tunisia, Nigeria, Ghana, Ivory C., Algeria, Hong Kong, Taiwan, Indonesia, Malaysia, Philippines, Singapore, Thailand, Vietnam, Bolivia, Uruguay, Puerto Rico, Panama, Costa Rica, Guatemala, UAE, Jordan, Chile, India]
• One of the largest taxi apps in the world;
• Launched in Rio de Janeiro, present in more than 30 countries;
• The same app for every country;
• IT based in São Paulo, Brazil
• Millions of customers and hundreds of thousands of taxi drivers
37. Architecture
• More than 400k requests per minute
• 100+ EC2 instances in production, distributed across different Availability Zones in Virtual Private Clouds, with several Elastic Load Balancing load balancers
• RDS clusters, SQS, ElastiCache (Redis), CloudSearch, CloudWatch...
• Managed services let our sysadmins be more productive
[Diagram: two Availability Zones, each with a group of API instances and a Mongo node]
38. 400 errors on the ELB
• We identified an increase in 400 errors on the ELB;
• Together with AWS Enterprise Support, we did a deep dive into the ELB access logs using Elasticsearch
• We found that the events correlated with mobile users on carriers that used NAT on their 3G connections;
• Packet traces taken with tcpdump revealed that connections were being silently dropped;
39. Analysis results
• After the analysis, we discovered that our servers had the following configuration:
– net.ipv4.tcp_tw_recycle & net.ipv4.tcp_tw_reuse enabled
• With recycle enabled, the kernel tries to make decisions based on the timestamps used by remote hosts. It looks up the last timestamp used by each remote host that has a connection in TIME_WAIT and allows the socket to be reused if the timestamp has increased correctly; if the timestamp used by the host has not increased correctly, the kernel drops the packet.
• Many of our customers connect through carriers that use NAT. With a high rate of connections arriving from the same IP, the kernel started refusing those connections because of inconsistent timestamps, resulting in a Bad Request (400) on the ELB.
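The failure mode can be illustrated with a deliberately simplified model. This is not the kernel's code: it only mimics the idea of remembering the last TCP timestamp seen per remote IP and rejecting anything older. struct peer_state and recycle_accepts() are hypothetical names, and the timestamp values in the usage note are made up.

```c
#include <stdint.h>

/* What the simplified model remembers per remote IP address. */
struct peer_state {
    int seen;          /* non-zero once a timestamp has been recorded */
    uint32_t last_ts;  /* most recent TCP timestamp value accepted */
};

/* Returns 1 if a new connection attempt with timestamp tsval is accepted,
 * 0 if it is dropped because the timestamp went backwards for this IP.
 * The signed difference handles 32-bit wrap-around. */
int recycle_accepts(struct peer_state *peer, uint32_t tsval)
{
    if (peer->seen && (int32_t)(tsval - peer->last_ts) < 0)
        return 0;
    peer->seen = 1;
    peer->last_ts = tsval;
    return 1;
}
```

Two phones behind the same carrier NAT share one public IP but keep independent timestamp clocks. If phone A connects with timestamp 900000 and phone B follows with 1200, the model drops B even though B did nothing wrong, which matches the silent drops seen in the packet traces.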
40. Conclusion
• The help from Enterprise Support was extremely important in finding the solution for our case
• Without all the logs and data we gathered for the analysis, it would have been extremely difficult, and we probably would not have been able to work out what was happening.
44. The backlog is back, again!
• Keep an eye on the somaxconn limits
• Understand resource utilization by the web server
– Process Isolation vs. Blast Radius
– Avoid Resource Saturation & Starvation
45. Telling the webserver when to start
• man tcp(7) – tcp_defer_accept: the web server only wakes up when there is data available!
• Reduces the burden on the web server’s process
• The TCP socket is already established (i.e. no SYN flood)
Nginx
• listen [deferred]
Apache
• AcceptFilter http data
• AcceptFilter https data
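At the socket level, the directives above correspond to the Linux-specific TCP_DEFER_ACCEPT option. The sketch below shows one way to set it on a listening socket; enable_defer_accept() is a hypothetical wrapper, and the timeout value is whatever the caller chooses.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask the kernel to wake the listener only once data has arrived on a new
 * connection. timeout_seconds bounds how long the kernel waits for that data.
 * Linux-specific. Returns 0 on success, -1 on error (errno is set). */
int enable_defer_accept(int listen_fd, int timeout_seconds)
{
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                      &timeout_seconds, sizeof timeout_seconds);
}
```

A server would typically call this between socket() and listen(), so that accept() returns sockets that already have a request waiting to be read.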
46. Using the Zero-copy pattern
• man sendfile(2): “copying is done within the kernel”
• I.e. no round trip through user space
Nginx
• sendfile on
Apache
• EnableSendFile on
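A minimal sketch of the zero-copy pattern with the Linux sendfile(2) call follows. send_whole_file() is a hypothetical helper; retry handling for EINTR/EAGAIN is left out to keep it short, so treat it as an illustration rather than production code.

```c
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Copy the entire contents of in_fd (a regular file) to out_fd (a socket or,
 * on recent Linux kernels, another file) without passing through a
 * user-space buffer. Returns the number of bytes sent, or -1 on error. */
long long send_whole_file(int out_fd, int in_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(in_fd, &st) == -1)
        return -1;

    while (offset < st.st_size) {
        /* sendfile() advances offset by the number of bytes it transferred. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0)
            return -1; /* error, or file shrank underneath us */
    }
    return (long long)offset;
}
```

Compared with a read()/write() loop, the file data never enters the web server's address space, which saves both copies and context switches when serving static content.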
47. HTTP Keep-Alive
Nginx
• keepalive_timeout 75s
• keepalive_requests 100
Apache
• KeepAlive On
• KeepAliveTimeout 5
• MaxKeepAliveRequests 100
Ensure it matches your ELB timeout setting; otherwise… look into your ELB’s 5XX metric
48. “The small-packet problem”
Flush() (tcp_cork)
• flush() analogy
• The application needs to “uncork” the stream
• sendfile() is a must
• Auto in Apache (+sendfile option)
• Set tcp_nopush to false in NGINX
Nagle’s algorithm (tcp_nodelay)
• The initial problem: “congestion collapse”
• write() vs. writev()
• Onto the wire ASAP
• Always On in Apache
• Set tcp_nodelay flag in NGINX
49. “The small-packet problem”
/* TCP_NODELAY is weaker than TCP_CORK, so that
* this option on corked socket is remembered, but
* it is not activated until cork is cleared.
*
* However, when TCP_NODELAY is set we make
* an explicit push, which overrides even TCP_CORK
* for currently queued segments.
*/
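The two socket options discussed in the kernel comment can be toggled from an application as sketched below. TCP_CORK is Linux-specific; set_cork() and set_nodelay() are hypothetical wrapper names introduced for this example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Set or clear a boolean TCP-level socket option. Returns 0 or -1. */
static int set_tcp_flag(int fd, int option, int on)
{
    return setsockopt(fd, IPPROTO_TCP, option, &on, sizeof on);
}

/* Cork: hold back partial frames until uncorked (on = 0). Linux-specific. */
int set_cork(int fd, int on)
{
    return set_tcp_flag(fd, TCP_CORK, on);
}

/* Disable Nagle's algorithm so small writes go onto the wire immediately. */
int set_nodelay(int fd, int on)
{
    return set_tcp_flag(fd, TCP_NODELAY, on);
}
```

A common pattern is set_cork(fd, 1), write the headers, sendfile() the body, then set_cork(fd, 0) to flush; the "uncork" step is what the slide means by the application having to release the stream.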
53. Quick review
• Keep the connection open for as long as possible.
• Minimize the latency.
• Increase throughput.
• Most importantly, research what settings make most sense for your environment.
55. Last thoughts
• Monitor everything.
• Tune your server to your workload.
• Improvement must be quantifiable.
• Experiment and continuously re-validate!
And most importantly,
REMEMBER: