Google App Engine - Intermittent 502 / connection reset by peer


Running a Node.js server on a Google App Engine (GAE) flexible environment instance, I have clients intermittently getting 502 errors from my app. The failing requests never reach my Node service, but they seem to coincide with nginx log entries reporting a connection reset by peer:

[error] 34#34: *25817 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 1.2.3.4, server: , request: "POST /endpoint HTTP/1.1", upstream: "http://172.17.0.1:8080/endpoint"
[error] 34#34: *27919 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 1.2.3.4, server: , request: "POST /endpoint HTTP/1.1", upstream: "http://172.17.0.1:8080/endpoint"
[error] 34#34: *28746 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 1.2.3.4, server: , request: "POST /endpoint HTTP/1.1", upstream: "http://172.17.0.1:8080/endpoint"
[error] 34#34: *28747 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 1.2.3.4, server: , request: "POST /endpoint HTTP/1.1", upstream: "http://172.17.0.1:8080/endpoint"
[error] 34#34: *24022 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 1.2.3.4, server: , request: "POST /endpoint HTTP/1.1", upstream: "http://172.17.0.1:8080/endpoint"
[error] 34#34: *29214 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 1.2.3.4, server: , request: "POST /endpoint HTTP/1.1", upstream: "http://172.17.0.1:8080/endpoint"

What could be causing this behavior? CPU and memory usage are nowhere near the resource limits, though the errors do seem to happen more frequently when the server is under some load.

CodePudding user response:

When you deploy to Google App Engine, a load balancer is placed in front of your instances. This load balancer has an HTTP keep-alive timeout of 600 seconds.

The load balancer then connects to an nginx service on the instance, which uses a keep-alive timeout of 650 seconds. The nginx config even has a helpful comment saying the value needs to be higher to prevent a race condition:

# GCLB uses a 10 minutes keep-alive timeout. Setting it to a bit more here
# to avoid a race condition between the two timeouts.
keepalive_timeout 650;

Finally, nginx reverse proxies to your Node app, which uses a default keep-alive timeout of... 5 seconds.
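
You can confirm that default yourself: a plain Node http server exposes the value (in milliseconds) as keepAliveTimeout. A minimal sketch, no framework assumed:

const http = require('http');

// A bare server created only to inspect the defaults.
const server = http.createServer((req, res) => res.end('ok'));

// Prints 5000 (milliseconds) on current Node.js releases.
console.log(server.keepAliveTimeout);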

This causes a race condition between the timeouts: Node can close an idle keep-alive connection at the same moment nginx tries to reuse it for a new request, so nginx gets a TCP reset and returns a 502 to the client. To avoid this, set the keep-alive timeout of your Node server higher than nginx's 650 seconds. If you are using Express, that looks like this:

const express = require('express');

const app = express();
const server = app.listen(process.env.PORT);

// nginx on GAE uses a 650-second keep-alive timeout. Set ours a bit higher
// (700 seconds, in milliseconds) to avoid a race condition between the two timeouts.
server.keepAliveTimeout = 700000;

// headersTimeout must be set higher than keepAliveTimeout because of this
// Node.js regression bug: https://github.com/nodejs/node/issues/27363
server.headersTimeout = 701000;
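
If you are not using Express, the same two properties exist on the underlying Node http.Server, so the fix is identical. A minimal sketch with a plain http server (the request handler and the 8080 local fallback port are placeholders):

const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok'); // placeholder handler
});

// Same reasoning as above: stay above nginx's 650-second keep-alive.
server.keepAliveTimeout = 700000; // milliseconds
server.headersTimeout = 701000;   // must exceed keepAliveTimeout

server.listen(process.env.PORT || 8080);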

You can check out "Analyze 'Connection reset' error in Nginx upstream with keep-alive enabled" for a technical (TCP-level) explanation of why the upstream servers need larger timeouts.
