Intermittent Management API timeouts for apps hosted in AWS

Problem statement

We see random timeouts when calling the Auth0 Management API from our application or API hosted in AWS. The typical manifestation is:
“My Java/Python application is receiving intermittent connection-reset errors while talking to the Auth0 Management API”

Example errors:

HTTPSConnectionPool(host='domain.auth0.com', port=443): Read timed out. (read timeout=5.0)

I/O error on POST request for "https://domain.auth0.com/api/v2/endpoint": domain.auth0.com:443 failed to respond; nested exception is org.apache.http.NoHttpResponseException: domain.auth0.com:443 failed to respond

urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='domain.auth0.com', port=443): Read timed out. (read timeout=5.0)

read tcp 10.3.44.61:53066->104.16.82.103:443: read: connection reset by peer

Troubleshooting

Here are the items to check:

  1. Which SDK are you using?
  2. How are you calling the Management API? If you can share the error stack trace for one of the timeouts, along with the timestamps, that helps. Also, is your application hosted locally or in the cloud, e.g. AWS?
  3. Does your Python/Java/etc. client rely on long-lived connections, or does it open and close a socket for each request? (A quick way to check connection reuse is sketched after this list.)
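For item 3, one quick way to see whether a Python client reuses pooled connections is to turn on debug logging around its HTTP calls. This is a minimal sketch assuming the requests library; the tenant domain and bearer token are placeholders:

```python
import logging

import requests

# Enable debug logging so urllib3's connection handling is visible.
logging.basicConfig(level=logging.DEBUG)

session = requests.Session()
# Placeholders: substitute your own tenant domain and Management API token.
session.headers["Authorization"] = "Bearer MGMT_API_TOKEN"

for _ in range(2):
    # "Starting new HTTPS connection" in the log output means a fresh socket
    # was opened for that request; if it appears only for the first request,
    # the pooled connection is being reused (a long-lived connection).
    session.get("https://domain.auth0.com/api/v2/clients", timeout=5.0)
```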

Cause

We have identified the root cause as unexpected timeout behavior in the AWS NAT gateway. Specifically, the NAT gateway times out any connection that has been idle for 350 seconds; if the client later attempts to reuse that connection, the NAT gateway responds with a TCP RST, which confuses the client application. The AWS documentation explicitly states:

“When a connection times out, a NAT gateway returns an RST packet to any resources behind the NAT gateway that attempt to continue the connection (it does not send a FIN packet).”

Before we migrated to our new network edge, the AWS NLB in our prior architecture implemented the same 350-second timeout, but with friendlier behavior that most applications could handle gracefully.

Our new edge implements a similar idle timeout, but after 400 seconds. Because this is longer than the NAT gateway's limit, the edge does not issue any keepalive traffic to preserve the connection until after the NAT gateway's timeout has already expired. Thus, if the client application has not sent any traffic or keepalive segments to an Auth0 endpoint within that 350-second window, the NAT gateway silently severs the connection, which leads to the timeout errors you are seeing.
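A minimal way to see this interaction, assuming a Python client running behind an AWS NAT gateway (the domain, endpoint, and token below are placeholders):

```python
import time

import requests

session = requests.Session()  # pools and reuses the underlying TCP connection
headers = {"Authorization": "Bearer MGMT_API_TOKEN"}

# The first call establishes a connection through the NAT gateway.
session.get("https://domain.auth0.com/api/v2/clients", headers=headers, timeout=5.0)

# Leave the pooled connection idle past the NAT gateway's 350-second limit.
time.sleep(400)

# Reusing the silently severed connection is what surfaces the read timeouts
# and connection resets shown in the problem statement.
session.get("https://domain.auth0.com/api/v2/clients", headers=headers, timeout=5.0)
```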

Solution

There are two possible solutions to prevent these timeout errors:

  • At the application level, shorten the TCP keepalive timeout to less than 350 seconds via socket options (see the sketch after this list).
  • At the OS level, shorten the TCP keepalive timeout to less than 350 seconds. On typical Linux hosts, including EC2 instances, this can be done by setting net.ipv4.tcp_keepalive_time to a value below 350 seconds in your host's sysctl configuration.
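
For the application-level option, here is a minimal sketch in Python, assuming the requests library and a Linux host (the TCP_KEEP* constant names differ on other platforms). It enables TCP keepalive on the pooled sockets with an idle interval well below 350 seconds, so probes are sent before the NAT gateway's timeout:

```python
import socket

import requests
from requests.adapters import HTTPAdapter
from urllib3.connection import HTTPConnection

# Keep urllib3's defaults (e.g. TCP_NODELAY) and add keepalive options.
KEEPALIVE_OPTIONS = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),      # turn keepalive on
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120),   # idle seconds before first probe
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),   # seconds between probes
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4),      # failed probes before dropping
]


class KeepAliveAdapter(HTTPAdapter):
    """Transport adapter that applies the keepalive socket options above."""

    def init_poolmanager(self, *args, **kwargs):
        kwargs["socket_options"] = KEEPALIVE_OPTIONS
        super().init_poolmanager(*args, **kwargs)


session = requests.Session()
session.mount("https://", KeepAliveAdapter())
# Management API calls made through this session now send keepalive probes
# well before the NAT gateway's 350-second idle limit is reached.
```

For the OS-level option on Linux, the equivalent effect comes from lowering net.ipv4.tcp_keepalive_time below 350 seconds via sysctl, as noted above.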