I am hitting a basic healthcheck API behind an ELB with Gatling, and I am intermittently getting request timeouts:
18:23:37.694 [WARN ] i.g.h.e.GatlingHttpListener - Request ‘XYZ’ failed for user 37
io.gatling.http.client.impl.RequestTimeoutException: Request timeout to xyz.com/18.104.22.168:80 after 60000 ms
18:23:37.694 [WARN ] i.g.h.e.r.DefaultStatsProcessor - Request ‘XYZ’ failed for user 37: i.g.h.c.i.RequestTimeoutException: Request timeout to xyz.com/22.214.171.124:80 after 60000 ms
To understand why this is happening, here is a quick summary of how a server typically processes requests:
- An incoming connection is established
- A thread is assigned to process that request
- If there are no available threads to process the request, the connection goes into a holding queue
- If the holding queue is full, the connection is prematurely closed
- Requests in the holding queue are processed in the order in which they were received
- If they spend too long in the holding queue, the request may time out before ever being serviced
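The steps above can be sketched as a toy model. This is only an illustration, not how any particular application server is implemented: the function name `simulate` and all of its parameters are made up for this example, every request is assumed to take the same `service_time`, and a request that has waited longer than `timeout` is assumed to be abandoned without using a thread.

```python
from collections import deque


def simulate(arrivals, workers, queue_limit, service_time, timeout):
    """Replay request arrival times through a fixed thread pool with a bounded FIFO queue."""
    free_at = [0.0] * workers   # time at which each worker thread becomes free
    waiting = deque()           # holding queue: arrival times, oldest first
    counts = {"served": 0, "rejected": 0, "timed_out": 0}

    def next_worker():
        # Index of the thread that becomes free soonest.
        return min(range(workers), key=free_at.__getitem__)

    def drain(now):
        # Hand queued requests, in arrival order, to threads that are free by `now`.
        while waiting:
            i = next_worker()
            if free_at[i] > now:
                break
            arrived = waiting.popleft()
            start = free_at[i]
            if start - arrived > timeout:
                counts["timed_out"] += 1      # waited too long, never serviced
            else:
                free_at[i] = start + service_time
                counts["served"] += 1

    for t in sorted(arrivals):
        drain(t)
        i = next_worker()
        if free_at[i] <= t:                   # a thread is available right away
            free_at[i] = t + service_time
            counts["served"] += 1
        elif len(waiting) >= queue_limit:     # holding queue is full
            counts["rejected"] += 1
        else:
            waiting.append(t)
    drain(float("inf"))                       # let the remaining queue finish
    return counts
```

For instance, `simulate([0.0] * 10, workers=2, queue_limit=3, service_time=1.0, timeout=1.5)` models a burst of ten simultaneous requests against two threads: some are served, some are rejected because the queue is full, and the last queued one times out.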
The fact that you are seeing timeouts means that there are periods when requests are taking too long. It could be because the system was busy servicing other requests, or it could be an unreliable network; those are still sometimes a thing.
Assuming it’s not a network issue, there are several solutions to your problem. In order of recommended implementation:
- See if you can increase the number of concurrent threads available on the application server. For example, if it has a thread pool size of 100, try increasing it. Be careful, because too many threads with not enough memory can be detrimental.
- Adjust your load level to match what the system is capable of. More on that below.
- If, for whatever reason, you can’t do either of the above, you could always increase the timeout values in your gatling.conf so the requests no longer time out.
To match your load level to your system capacity, build a test with a scenario that simply hits the health check endpoint. Embed that in a simulation that does a closed model ramp (rampConcurrentUsers) over a very long duration, say from 0 to 100 users over 30 minutes, and then look at the requests per second. You should find that the requests per second peaks within seconds, or at most a minute or two, of simulation start and then remains mostly flat. The response time starts out flat for a very short period of time, and then, as the number of concurrent users grows, it starts to grow too. You want to find the inflection point: the point where the requests per second has reached its maximum but the response time per request has not yet started to rise. If you stay in that sweet spot, you should never see any timeouts.
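If you export the ramp results as rows of (concurrent users, requests per second, mean response time), picking out the sweet spot can be roughly automated. The sketch below is one possible heuristic, not something Gatling provides: `find_sweet_spot` and its two tolerance parameters are hypothetical, and the numbers in the sample data are invented purely for illustration.

```python
def find_sweet_spot(samples, rps_tolerance=0.05, latency_tolerance=0.20):
    """Return the highest user count with near-peak throughput and still-flat latency.

    samples: iterable of (concurrent_users, requests_per_second, mean_response_ms).
    Returns None if no sample satisfies both conditions.
    """
    samples = sorted(samples)
    peak_rps = max(rps for _, rps, _ in samples)
    baseline_ms = samples[0][2]              # response time at the lowest load
    best = None
    for users, rps, ms in samples:
        near_peak = rps >= peak_rps * (1 - rps_tolerance)
        still_flat = ms <= baseline_ms * (1 + latency_tolerance)
        if near_peak and still_flat:
            best = users
    return best


# Invented measurements: throughput saturates around 1000 req/s,
# after which only the response time keeps growing.
ramp = [
    (1, 100, 10.0), (2, 200, 10.0), (4, 400, 10.0), (8, 790, 10.1),
    (10, 980, 10.2), (12, 1000, 13.0), (20, 1000, 20.0),
    (50, 1000, 50.0), (100, 1000, 100.0),
]
sweet_spot = find_sweet_spot(ramp)
```

With this invented data the function picks 10 concurrent users: throughput is within 5% of its peak and the response time is still within 20% of the unloaded value. The tolerances are arbitrary and worth tuning against your own charts.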
I share that because it’s useful to know. But now let’s look at your specific situation:
The endpoint is a health check. A health check should respond nearly instantaneously. There should only ever be one “client” of that service, and that is the infrastructure monitoring, for purposes of load balancing and/or alerting. As such, there should seldom if ever be more than one outstanding request at a time to that endpoint. Which raises the question: do you really need to load test that endpoint?
First of all, thank you for such a detailed explanation, it’s really helpful.
Now about why load test ‘healthcheck’. It’s my mistake that I did not give the context in my original question. I am doing a POC for Gatling in my org and trying to benchmark JMeter against Gatling on the APIs we have. My aim is to move us to Gatling, provided everything goes well. The first API I picked was ‘healthcheck’ so that the complexities of transactional APIs are avoided. The test that I am doing is just a comparison of how Gatling performs with respect to JMeter, given similar resources.
Coming to the original problem: I had mentioned that I am hitting an ELB. On checking the ELB metrics during the failures in my test, I see the number of requests reaching the ELB dropping, so application capacity is not the issue. The ‘Surge Queue Length’ metric for the ELB also shows the ideal value of 0, as does the ‘Spillover Count’ metric. Thus I ruled out an application-side issue. My guess was that the issue must be at the ELB or in Gatling. Since I ran a similar load with JMeter and did not face any such issue, my first instinct was that it must be a Gatling issue, hence I posted the question.
However, I have not ruled out the ELB completely and have raised a support ticket for it. I really hope that it is an ELB issue so that I can convince my org to move to Gatling. This being an intermittent issue is not helping my cause.
I am really grateful for the approach you described to find system capacity. Being new to Gatling, I was really struggling to find out the best way to model injection profiles and I had tried almost all models with varying levels of success.
Will update my findings here once I have more clarity. Thanks for all the inputs.
Okay, it’s a POC. You want to figure out why you see these problems with Gatling but not JMeter.
One possible reason is that JMeter uses a Closed Model Injection Profile. If you configure JMeter to do 100 threads, you have at most 100 concurrent open connections.
Gatling supports both a closed model, and an open model. In the open model, you inject so many users per second. If that count is higher than what the system can sustain, they will queue up. If they queue up too much, they will take so long to complete that they will time out. Which sounds like exactly what you are seeing.
You can’t really compare apples-to-apples between JMeter and Gatling if you do not use the same injection model. So double check: are you using rampConcurrentUsers and constantConcurrentUsers (the closed model injectors), or are you using rampUsersPerSec and constantUsersPerSec (the open model injectors)? If you are using the open model, switch to the closed model and try again.
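To see why the two models behave so differently, here is some back-of-the-envelope arithmetic. It is a simplified fluid approximation, not a description of Gatling internals: both function names are made up, the server is assumed to complete a fixed `capacity` of requests per second, and the rates in the usage lines are hypothetical (only the 60 s timeout comes from the log above).

```python
def open_model_first_timeout(arrival_rate, capacity, timeout_s):
    """Seconds into the test when a newly injected request first waits longer than the timeout.

    Returns None if the system keeps up (arrival_rate <= capacity).
    """
    if arrival_rate <= capacity:
        return None
    # The backlog grows by (arrival_rate - capacity) requests every second, and a
    # request that arrives behind a backlog of B waits about B / capacity seconds.
    return timeout_s * capacity / (arrival_rate - capacity)


def closed_model_response_time(users, capacity, unloaded_response_s):
    """Approximate steady-state response time for a fixed number of looping users, no pauses."""
    # Little's law: users = throughput * response time, with throughput capped at capacity.
    return max(unloaded_response_s, users / capacity)


# Hypothetical numbers: the server sustains 100 req/s, timeout is 60 s.
open_result = open_model_first_timeout(arrival_rate=120, capacity=100, timeout_s=60)
closed_result = closed_model_response_time(users=100, capacity=100, unloaded_response_s=0.01)
```

Under these assumptions, injecting 120 users per second in the open model leads to timeouts after roughly 300 seconds, because the backlog never stops growing. With 100 concurrent users in the closed model, the response time settles at about 1 second, far below the 60 second timeout, since a user cannot send a new request until its previous one has completed.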