Connection timed out exception - client or server problem?

If I get a “[WARN ] i.g.h.a.AsyncHandlerActor - Request ‘login’ failed: java.net.ConnectException: connection timed out:” showing up in the log, is it possible that there’s a configuration problem or limitation with Gatling/my machine, or is it more likely/definitely the server I’m hitting? (I’m doing 100 users/s over 60s.) As I increase the number of users, I’ve been keeping an eye out for “address in use” errors in case I run out of ports, but haven’t seen any.

That’s your server that can’t open more connections.
You might want to check your keep-alive timeout.

The keepalive timeout? Increasing that is usually just fighting the symptoms, not the cause.

What is the maximum size of your connection pool? How many connections are in use when that timeout starts showing up? Are any other resources on that server being overloaded?

100 users / second over 60 sec is 6000 virtual users. It’s not unlikely that if you’re testing only a single server you are actually overloading it in some way.

No, I was suggesting the exact opposite: lowering it if it’s set to a big value. 5 sec is a typical value; 60 sec is too high.

If you start tuning that, lowering it even further can be beneficial. The TCP connection handshake typically completes in a fraction of a second when both ends are servers inside the same datacenter. If it does not complete in half a second or so, it’s a sign that something is wrong. In that case failing fast is better than allowing the open connection handles to eat threads and memory while waiting for something that you know is unlikely to succeed.
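To make the "fail fast" idea concrete, here is a minimal Python sketch that times a single TCP connect and gives up after half a second. The function name `timed_connect` and the 0.5 s default are illustrative assumptions, not anything Gatling or the server provides; run it from a machine in the same network position as the load generator.

```python
import socket
import time

def timed_connect(host, port, timeout_s=0.5):
    """Return the TCP connect time in seconds, or None on failure/timeout.

    timeout_s is a hypothetical fail-fast limit; pick it from your own
    measurements rather than from this default.
    """
    start = time.monotonic()
    try:
        # create_connection performs the TCP handshake and raises an
        # OSError subclass on timeout, refusal, or name resolution failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        return None
```

Calling this in a loop while the load test runs gives a rough picture of when connects start to slow down or fail.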

Of course you should only do that if you know how long it usually takes. That means putting a representative load on the system and measuring. If you know that the handshake completes in less than 300 milliseconds in 99.5% of connection attempts, lowering it to 400 ms or so should be fine.
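As a rough sketch of that rule of thumb: take the measured connect times, find the 99.5th percentile, and add some headroom. The function name, the nearest-rank percentile method, and the 100 ms margin (chosen to mirror the 300 ms to 400 ms example) are all assumptions for illustration.

```python
import math

def suggest_connect_timeout(samples_ms, percentile=99.5, margin_ms=100):
    """Return (percentile_value_ms, suggested_timeout_ms).

    Uses the nearest-rank percentile of the measured connect times and
    adds a fixed margin. Both defaults are illustrative assumptions.
    """
    if not samples_ms:
        raise ValueError("need at least one measurement")
    ordered = sorted(samples_ms)
    # Nearest-rank: smallest value with at least `percentile` % of samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    p_value = ordered[rank - 1]
    return p_value, p_value + margin_ms

# Example: 995 connects at 300 ms and 5 slow outliers at 5000 ms.
p995, timeout = suggest_connect_timeout([300] * 995 + [5000] * 5)
print(p995, timeout)
```

The samples have to come from a representative load; measurements taken on an idle system will suggest a timeout that is too tight.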

I was talking about the keepalive, not the connect timeout. 5 sec is a decent value: the connection can be used for fetching the page and then the resources (assuming they’re on the same domain), and is then forcefully closed by the server. Of course, tuning this depends on your use case.
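A back-of-the-envelope way to see why the keepalive value matters at this arrival rate: in steady state, the number of idle connections the server holds is roughly the arrival rate times the keepalive timeout. The sketch below assumes each virtual user opens one connection and never reuses it after its last request, which is a simplification; the function name is made up for this example.

```python
def idle_keepalive_connections(users_per_second, keepalive_timeout_s,
                               connections_per_user=1):
    """Rough steady-state count of idle keep-alive connections.

    Assumes every new user opens `connections_per_user` connections that
    sit idle for the full keepalive timeout before the server closes them.
    """
    return users_per_second * keepalive_timeout_s * connections_per_user

# 100 users/s, as in the original question.
print(idle_keepalive_connections(100, 5))   # keepalive of 5 s
print(idle_keepalive_connections(100, 60))  # keepalive of 60 s
```

Under these assumptions a 5 s keepalive leaves about 500 idle connections, while 60 s leaves about 6000, that is, every user from the whole 60 s run. With a blocking, one-thread-per-connection server, that difference can be enough to exhaust the pool.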

@Michelle: Both Floris’ suggestion and mine make sense. You have connection handle starvation, which is likely because you’re using blocking IO (like the old servlet model, where 1 open connection takes 1 thread from the servlet pool). It could have many causes, such as:

  • the server takes too long to answer (Floris’ suggestion), so that your servlet thread pool is exhausted
  • your keep-alive connections linger too long because the server doesn’t quickly discard them

See if this page helps:

http://engineering.chartbeat.com/2014/01/02/part-1-lessons-learned-tuning-tcp-and-nginx-in-ec2/

It helped me with my connect timeout issues - the server detected a SYN flood attack and started to drop SYNs.