Intermittent Management API Timeouts for Apps Hosted in AWS

Problem statement

Random timeouts occur when calling the Auth0 Management API from an application or API hosted in AWS. The typical manifestation is:
“My Java/Python/Golang application is receiving intermittent connection-reset errors while talking to the Auth0 Management API”

Example errors:

HTTPSConnectionPool(host='domain.auth0.com', port=443): Read timed out. (read timeout=5.0)
I/O error on POST request for "https://domain.auth0.com/api/v2/endpoint": domain.auth0.com:443 failed to respond; nested exception is org.apache.http.NoHttpResponseException: domain.auth0.com:443 failed to respond
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='domain.auth0.com', port=443): Read timed out. (read timeout=5.0)
read tcp 10.3.44.61:53066->104.16.82.103:443: read: connection reset by peer

Cause

The root cause has been identified as unexpected timeout behavior from the AWS NAT gateway. Specifically, the NAT gateway times out any idle connection after 350 seconds; if a client later attempts to reuse that connection, the NAT gateway responds with a TCP RST, which confuses the client application.

The AWS documentation explicitly states:

“When a connection times out, a NAT gateway returns an RST packet to any resources behind the NAT gateway that attempt to continue the connection (it does not send a FIN packet).”

Before the migration to the new network edge, the prior architecture's AWS NLB implemented a 350-second timeout, but with friendlier behavior that most applications could handle gracefully.

The new network edge implements a similar timeout, but at 400 seconds. Because its timeout is longer, the edge does not issue any keep-alive traffic to preserve the connection until after the NAT gateway's timeout has already expired. Thus, if the client application has not sent any traffic or keep-alive segments to an Auth0 endpoint within the 350-second window, the NAT gateway silently severs the connection, producing the timeout errors shown above.
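
To illustrate the failure mode, the following sketch (Python with the requests library, matching the errors above) holds a pooled connection idle past the 350-second window and then reuses it. The domain and endpoint are placeholders; run from behind an affected NAT gateway, the second call surfaces the RST as a read timeout or connection-reset error:

    import time
    import requests

    # Minimal reproduction sketch (assumptions: the client sits behind an AWS
    # NAT gateway, and domain.auth0.com stands in for a real tenant domain).
    session = requests.Session()  # the session pools and reuses TCP connections

    # First request opens a TCP connection. Without a token it returns 401,
    # which is fine here: only the underlying socket matters.
    session.get("https://domain.auth0.com/api/v2/users")

    # Stay idle past the NAT gateway's 350-second timeout.
    time.sleep(360)

    # The session reuses the now-severed pooled connection; the NAT gateway
    # answers with a TCP RST, surfacing as a read timeout or connection reset.
    session.get("https://domain.auth0.com/api/v2/users", timeout=5)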

Solution

There are two possible solutions to prevent these timeout errors:

  • At the application level, shorten the TCP keepalive timeout to less than 350 seconds via socket options (see the Python sketch after this list).
  • At the OS level, shorten the TCP keepalive timeout to less than 350 seconds. On typical Linux hosts, including EC2 instances, this can be done by setting net.ipv4.tcp_keepalive_time to a value below 350 seconds in the host's sysctl configuration (see the example below).
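
A minimal sketch of the application-level option, assuming the Python requests/urllib3 stack seen in the errors above; the KeepAliveAdapter class name and the specific probe values are illustrative, and TCP_KEEPIDLE is Linux-specific:

    import socket

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.connection import HTTPConnection

    # Enable TCP keepalive and start probing after 120 seconds of idle time,
    # well under the NAT gateway's 350-second limit. The values are examples;
    # anything below 350 seconds works.
    KEEPALIVE_OPTIONS = HTTPConnection.default_socket_options + [
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120),  # idle time before first probe
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),  # interval between probes
        (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4),     # failed probes before giving up
    ]

    class KeepAliveAdapter(HTTPAdapter):
        """Transport adapter that applies the keepalive socket options."""
        def init_poolmanager(self, *args, **kwargs):
            kwargs["socket_options"] = KEEPALIVE_OPTIONS
            super().init_poolmanager(*args, **kwargs)

    session = requests.Session()
    session.mount("https://", KeepAliveAdapter())
    # session.get("https://domain.auth0.com/api/v2/users") now reuses
    # connections that are kept alive across long idle periods.

Mounting the adapter on the session applies the socket options to every pooled connection, so keepalive probes begin well inside the NAT gateway's 350-second window.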
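
For the OS-level option, an illustrative sysctl configuration for a Linux host (the file path and values are examples):

    # /etc/sysctl.d/99-tcp-keepalive.conf
    # Send the first keepalive probe after 120 seconds of idle time,
    # then probe every 30 seconds, giving up after 4 failed probes.
    net.ipv4.tcp_keepalive_time = 120
    net.ipv4.tcp_keepalive_intvl = 30
    net.ipv4.tcp_keepalive_probes = 4

Apply the settings with sysctl --system. Note that these kernel settings only affect sockets that have SO_KEEPALIVE enabled, so the application or its HTTP library must still opt in to TCP keepalive.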