Intermittent Management API Timeouts for Apps Hosted in AWS

Last Updated: Nov 11, 2024

Overview

Intermittent timeouts occur when calling the Auth0 Management API from an application or API hosted in AWS. The typical manifestation is a report such as:

> My Java/Python/Golang application is receiving intermittent connection-reset errors while talking to the Auth0 Management API.

Example errors:

> HTTPSConnectionPool(host='domain.auth0.com', port=443): Read timed out. (read timeout=5.0)
>
> I/O error on POST request for "https://domain.auth0.com/api/v2/endpoint": domain.auth0.com:443 failed to respond; nested exception is org.apache.http.NoHttpResponseException: domain.auth0.com:443 failed to respond
>
> urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='domain.auth0.com', port=443): Read timed out. (read timeout=5.0)
>
> read tcp 10.3.44.61:53066->104.16.82.103:443: read: connection reset by peer

Applies To

  • Management API
  • AWS
  • Apps Hosted in AWS

Cause

The root cause has been identified as unexpected timeout behavior from the AWS NAT gateway. Specifically, the NAT gateway times out any idle connection after 350 seconds; if a client later attempts to reuse that connection, the NAT gateway responds with a TCP RST, which confuses the client application. The AWS documentation explicitly states:

“When a connection times out, a NAT gateway returns an RST packet to any resources behind the NAT gateway that attempt to continue the connection (it does not send a FIN packet).”

Before the migration to the new network edge, the AWS NLB in the prior architecture also implemented a 350-second timeout, but with friendlier behavior that most applications could handle gracefully.

The new edge implements a similar timeout, but after 400 seconds. Because of this longer period, no keep-alive traffic is issued to preserve the connection until after the NAT gateway's timeout has already expired. Thus, if the client application has not sent any traffic or keep-alive segments to an Auth0 endpoint within the 350-second window, the NAT gateway silently severs the connection, which leads to the timeout errors described above.
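For illustration, the following minimal Python sketch shows the access pattern that triggers the failure, assuming a pooled requests session; the domain and token are placeholders. A connection sits idle in the pool past the NAT gateway's 350-second limit and is then reused:

```python
import time
import requests

# A Session keeps an idle pooled connection open to the Auth0 edge.
session = requests.Session()

session.get(
    "https://domain.auth0.com/api/v2/users",
    headers={"Authorization": "Bearer MGMT_API_TOKEN"},
    timeout=5.0,
)

# Leave the pooled connection idle for longer than the NAT gateway's
# 350-second idle limit.
time.sleep(400)

# The NAT gateway has already dropped the connection, so reusing the pooled
# connection typically surfaces as a read timeout or
# "connection reset by peer" error like the examples above.
session.get(
    "https://domain.auth0.com/api/v2/users",
    headers={"Authorization": "Bearer MGMT_API_TOKEN"},
    timeout=5.0,
)
```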

Solution

There are two possible solutions to prevent these timeout errors:

  • At the application level, shorten the TCP keepalive timeout to less than 350 seconds via socket options (see the sketch after this list).
  • At the OS level, shorten the TCP keepalive timeout to less than 350 seconds. On typical Linux hosts, including EC2 instances, this can be done by setting net.ipv4.tcp_keepalive_time to a value below 350 seconds in the host sysctl configuration.
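As a sketch of the application-level option, the following Python example enables TCP keepalive probes on every pooled connection so that the idle time never reaches 350 seconds. It assumes the application uses the requests/urllib3 stack; the domain, token, and keepalive values are illustrative only.

```python
import socket

import requests
from requests.adapters import HTTPAdapter
from urllib3.connection import HTTPConnection

# Illustrative keepalive settings: enable keepalive, start probing after 120s
# of idle time, probe every 30s, and give up after 4 failed probes. Any values
# that keep the connection's idle time below 350 seconds should work.
KEEPALIVE_OPTIONS = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120),   # Linux-specific constant
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30),
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4),
]


class KeepAliveAdapter(HTTPAdapter):
    """Transport adapter that applies the keepalive socket options above."""

    def init_poolmanager(self, *args, **kwargs):
        kwargs["socket_options"] = KEEPALIVE_OPTIONS
        super().init_poolmanager(*args, **kwargs)


session = requests.Session()
session.mount("https://", KeepAliveAdapter())

# Example Management API call; replace the domain and token with real values.
response = session.get(
    "https://domain.auth0.com/api/v2/users",
    headers={"Authorization": "Bearer MGMT_API_TOKEN"},
    timeout=5.0,
)
```

For the OS-level option, the equivalent on a Linux host would be running sudo sysctl -w net.ipv4.tcp_keepalive_time=300 and persisting the setting in /etc/sysctl.conf, where 300 is simply an illustrative value below the 350-second limit.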