keep-alive and large number of users

Hello,

I’m testing a very simple scenario using Gatling 2.1.4 with 3000 users (with 20-second pauses). When keep-alive is enabled on both the Gatling side and the server side (15 seconds on the server side), I’m getting timeouts, bad response times and, I think, generally inaccurate results.

In exactly the same situation, with 3000 users but keep-alive disabled on either the Gatling side or the server side, the response times are 10x better and there are no timeouts.

I suspect the problem is the connection pooling used for keep-alive connections. Is it possible to change the size of the connection pool?

Martin

Not a Gatling issue: your system under test can’t deal with 3,000 concurrent open connections. Gatling doesn’t have such limitations (it’s non-blocking, except for DNS resolution, which relies on the standard JDK implementation and is blocking), as long as the OS is fine (typically file descriptor limits).
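To check the file descriptor point on either machine, something like the following could be used. It is only a sketch: it reads the soft limit of the process that runs it (so it should be started from the same shell and user as the load generator or server), and the reserve of 256 descriptors for non-socket files is an arbitrary assumption, not a Gatling or OS constant.

```python
import resource


def enough_descriptors(soft_limit, connections, reserve=256):
    """Return True if soft_limit covers `connections` sockets plus a reserve.

    `reserve` is an assumed allowance for log files, jars, pipes, etc.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= connections + reserve


def current_soft_limit():
    """Soft RLIMIT_NOFILE of the current process (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    limit = current_soft_limit()
    print("soft limit:", limit)
    print("enough for 3000 connections:", enough_descriptors(limit, 3000))
```

A common default soft limit of 1024 would not be enough for 3000 open sockets, while a raised limit such as 65536 would.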

I think my system is fine: I have no problem opening 3000 connections (and keeping them open). For example, this is wrk doing 3000 keep-alive connections, with netstat -tnp showing 3000+ open connections, which is 2000 more than Gatling running 3000 users with 20s pauses.

wrk -c3000 -t100 -d 60 --latency http://192.168.1.51
Running 1m test @ http://192.168.1.51
  100 threads and 3000 connections
  Thread Stats   Avg        Stdev      Max        +/- Stdev
    Latency      541.67ms   40.16ms    708.17ms   91.16%
    Req/Sec      55.73      22.98      169.00     75.10%
  Latency Distribution
     50%  533.97ms
     75%  546.50ms
     90%  560.47ms
     99%  700.18ms
  224821 requests in 1.00m, 1.57GB read
  Socket errors: connect 0, read 0, write 0, timeout 29544
Requests/sec:   3745.35
Transfer/sec:     26.70MB

Also, the problem gets even worse with longer server-side keep-alive timeouts.

From what I see, the usage is a bit different: you ask wrk to keep 3000 connections open, even when you don’t use them, so it probably reconnects in the background (so basically, your keep-alive timeout doesn’t matter much). With Gatling, you have 3000 users that might need to reconnect when trying to perform a request after a pause, and the response time accounts for that reconnection. I also wonder how much wrk reports about having to retry connecting because some attempts failed. Just wondering, I’m not familiar enough with wrk internals.
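To make the reconnection point concrete, here is a rough model of a single user. It is a simplification and not how Gatling or the server actually tracks connections: it assumes the server closes a connection once it has been idle for the keep-alive timeout, and that a pause equal to the timeout already counts as closed. The function name and parameters are made up for illustration.

```python
def connections_opened(pauses_s, keepalive_timeout_s):
    """Count the connections one user opens over its requests.

    pauses_s: pauses (in seconds) between consecutive requests.
    keepalive_timeout_s: idle time after which the server is assumed
    to close the connection.
    """
    opened = 1  # the first request always needs a new connection
    for pause in pauses_s:
        # Assumption: the connection is gone once the idle time
        # reaches the server's keep-alive timeout.
        if pause >= keepalive_timeout_s:
            opened += 1
    return opened


if __name__ == "__main__":
    # Four requests separated by 20 s pauses, as in the scenario above.
    print(connections_opened([20, 20, 20], keepalive_timeout_s=15))
    print(connections_opened([20, 20, 20], keepalive_timeout_s=30))
```

Under these assumptions, 20s pauses against a 15s timeout mean every request opens a fresh connection, so the connection setup is included in every measured response time. With a timeout longer than the pause, the same user would reuse one connection throughout.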

If you can share a reproducer, I could investigate.