Incorrect failed request count shown

While stress testing some APIs with Gatling, we are seeing that Gatling reports an incorrect failed request count: when we check the logs in Kibana for those same requests, we see them passing with a success status code.

We are unable to understand why this behaviour is happening. Can anyone please guide me on this? We are trying to run the script for 1 hour using the params below:

stressPeakUsers(6000).during(Duration.ofMinutes(15))
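
For context, the profile plugs into the simulation roughly like this; the scenario and base URL below are placeholders, not our real ones:

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;
import java.time.Duration;

public class StressPeakSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol = http.baseUrl("https://internal.example.invalid");

    ScenarioBuilder scn = scenario("Create a process")
        .exec(http("ping").get("/health")); // stand-in for the real chain shown further down

    {
        setUp(
            // injects 6000 users over 15 minutes along a smooth step (heaviside)
            // curve, i.e. most arrivals cluster around the midpoint
            scn.injectOpen(stressPeakUsers(6000).during(Duration.ofMinutes(15)))
        ).protocols(httpProtocol)
         .maxDuration(Duration.ofHours(1)); // hard stop after the 1-hour window
    }
}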

Please follow the requirements that were presented to you when you created this post.

  • make sure you’re using the latest version of Gatling (3.13.1 as of now)
  • provide a way for us to reproduce your problem

I am using the latest version. For the repro, what specifically are we looking for? The APIs I am testing are all internal, so they are not exposed outside our VPN.

For example, a sample demo application that counts the requests it receives.
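
Something like this minimal sketch would do, using the JDK's built-in com.sun.net.httpserver; the port and response body are placeholders:

import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicLong;

public class CountingServer {

    public static void main(String[] args) throws Exception {
        AtomicLong received = new AtomicLong();
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // Count every request that actually reaches the server, so the total
        // can be compared with what Gatling reports having sent.
        server.createContext("/", exchange -> {
            long n = received.incrementAndGet();
            byte[] body = ("requests received so far: " + n).getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("Counting server listening on :8080");
    }
}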

Then, are you sure you don't have TCP or TLS errors such as a connect timeout or a handshake timeout? In that case, it's likely that you have a gap between the number of requests Gatling tried to send and the number of requests that could be processed by your application, because the failing component, your edge/load balancer/etc., sits in between.
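
One way to see such errors from the Gatling side: the bundled logback.xml ships with commented-out loggers for exactly this; uncommenting the one below at DEBUG level logs every failing HTTP request to the console (taken from a 3.x logback.xml, so double-check the copy in your own project):

<logger name="io.gatling.http.engine.response" level="DEBUG" />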

We compared the count shown in the attached Gatling report with the requests in Kibana. We also tried to look up the failed requests individually, since every request has a unique traceId in Kibana, and we found them to have succeeded. Between API requests we have a pause of 15 seconds. Out of the ~13K failed requests shown in the report as throwing 500 errors, in Kibana we see only 44 requests failing in total, and the rest succeeded.

We checked the load balancer as well; the count of requests returning 500 there is in sync with the Kibana numbers, but not with Gatling. A sample script is below:

scenario(UUID.randomUUID() + " Create a process")
    .exec(session -> {
        // Seed the session with the credentials and the process id
        session = session.set("JWT_TOKEN", token);
        session = session.set("IDS_SESSION_ID", idsSessionId);
        System.out.println("JWT Token is: " + session.getString("JWT_TOKEN"));
        session = session.set("BPMN_PROCESS_ID", bpmnProcessId);
        System.out.println("bpmn process id is: " + session.getString("BPMN_PROCESS_ID"));
        return session;
    })
    .pause(PAUSE_TIME)
    .exec(http("Create process instance")
        .post(HTM_BASE_URL + "/htm-usertask/api/v1/processes/#{BPMN_PROCESS_ID}/instances")
        .body(StringBody(requestBody))
        .header("x-infa-tid", "PSKH_#{BPMN_PROCESS_ID}_" + UUID.randomUUID())
        .header("Authorization", "Bearer #{JWT_TOKEN}") // use session variable
        .header("IDS-SESSION-ID", "#{IDS_SESSION_ID}") // use session variable
        .header("Accept", "application/json")
        .header("Content-Type", "application/json")
        .check(status().is(201))
        .check(jsonPath("$.processInstanceKey").saveAs("PROCESS_INSTANCE_KEY"))
        .check(bodyString().saveAs("RESPONSE_BODY")))
    .exitHereIfFailed()
    .pause(PAUSE_TIME)
    .exec(session -> {
        if (session.contains("PROCESS_INSTANCE_KEY") && nonNull(session.getString("PROCESS_INSTANCE_KEY"))) {
            System.out.println("PROCESS_INSTANCE_KEY is: " + session.getString("PROCESS_INSTANCE_KEY"));
        }
        return session;
    })
    .pause(PAUSE_TIME)
    // Poll for the candidate user tasks, retrying up to MAX_TRY times
    .tryMax(MAX_TRY).on(
        exec(http("Fetch candidate user tasks of process instance")
            .get(HTM_BASE_URL + "/htm-usertask/api/v1/user-tasks?scope=CANDIDATE&sortByField=DUE_DATE&status=CREATED&sort=DESC&offset=0&limit=50&processInstanceId=#{PROCESS_INSTANCE_KEY}")
            .header("x-infa-tid", "PSKH_#{PROCESS_INSTANCE_KEY}_" + UUID.randomUUID())
            .header("Authorization", "Bearer #{JWT_TOKEN}") // use session variable
            .header("IDS-SESSION-ID", "#{IDS_SESSION_ID}") // use session variable
            .header("Accept", "application/json")
            .header("Content-Type", "application/json")
            .check(status().is(200))
            .check(jsonPath("$.totalCount").saveAs("TASKS_TOTAL_COUNT"))
            .check(jsonPath("$.objects[*].taskId").findAll().saveAs("TASK_OBJECTS"))))
    .exitHereIfFailed()
    .pause(PAUSE_TIME)
    .exec(session -> {
        if (session.contains("PROCESS_INSTANCE_KEY") && nonNull(session.getString("PROCESS_INSTANCE_KEY"))
                && session.contains("TASKS_TOTAL_COUNT") && nonNull(session.getString("TASKS_TOTAL_COUNT"))) {
            System.out.println("Total user tasks for process instance key: " + session.getString("PROCESS_INSTANCE_KEY")
                    + " is: " + session.getString("TASKS_TOTAL_COUNT") + " taskIds: " + session.get("TASK_OBJECTS"));
        }
        return session;
    })
    .pause(PAUSE_TIME)
    // Assign and complete every task returned for the process instance
    .foreach("#{TASK_OBJECTS}", "taskObj").on(
        exec(session -> {
            if (session.contains("taskObj") && nonNull(session.getString("taskObj"))) {
                System.out.println("Processing Item ID: " + session.getString("taskObj"));
            }
            return session;
        })
        .pause(PAUSE_TIME)
        .exec(http("Assign Tasks")
            .patch(HTM_BASE_URL + "/htm-usertask/api/v1/user-tasks/#{taskObj}/assign")
            .body(StringBody(assignedBody))
            .header("x-infa-tid", "PSKH_#{taskObj}_" + UUID.randomUUID())
            .header("Authorization", "Bearer #{JWT_TOKEN}") // use session variable
            .header("IDS-SESSION-ID", "#{IDS_SESSION_ID}") // use session variable
            .header("Accept", "application/json")
            .header("Content-Type", "application/json")
            .check(status().is(204)))
        .exitHereIfFailed()
        .pause(PAUSE_TIME)
        .exec(http("Complete Tasks")
            .patch(HTM_BASE_URL + "/htm-usertask/api/v1/user-tasks/#{taskObj}/complete")
            .body(StringBody(requestBody))
            .header("x-infa-tid", "PSKH_#{taskObj}_" + UUID.randomUUID())
            .header("Authorization", "Bearer #{JWT_TOKEN}") // use session variable
            .header("IDS-SESSION-ID", "#{IDS_SESSION_ID}") // use session variable
            .header("Accept", "application/json")
            .header("Content-Type", "application/json")
            .check(status().is(204)))
        .exitHereIfFailed()
        .pause(PAUSE_TIME));

So first, you do have IO issues like request timeouts and TLS handshake timeouts, and neither your load balancer nor Kibana can see them because the struggling component here is the load balancer itself.

Then, there’s no way Gatling is hallucinating these 500 responses.

You’re most likely not monitoring all your requests.

But @slandelle, TLS and timeout errors are shown as separate request failures in the attached Gatling report, right? What I am talking about are the requests failing with: status.find.is(204), but actually found 500

What you are saying here:

in Kibana we see only 44 requests failing in total, and the rest succeeded

suggests only that the system acknowledged the request, processed it, and sent out a response, while back at Gatling it is still recorded as a 500. That strongly suggests the problem lies between the client (the Gatling host) and your server.
Digging a little deeper: based on your description, most likely your middle layer (proxy / LB / API gateway) misbehaved, which caused the connection between the client and that middle layer to be terminated and a 500 to be returned unexpectedly.

Okay, let me check once more. Thanks @trinp and @slandelle

In order to ensure the infra, the LBs, and everything else are working as expected, I tried to run the same profile using both Gatling and JMeter.

To my surprise, in JMeter all the requests went through successfully.
But in Gatling I faced the same issue: even though all the requests actually passed according to Kibana, the Gatling report still shows some of them as failed.

Attaching both reports.

Also attaching the Jenkins report.

Now I am getting curious about your JMeter script.
Are you sure the way you handle things in JMeter is the same as in Gatling?
Also, in your script I notice:

System.out.println(...)

Did you comment out the println calls during the test? See: Debugging guide for Gatling scripts. A minimal sketch of guarding those prints follows below.
I saw that the request rate from Gatling is different from JMeter's (27 vs 18), and you have exitHereIfFailed in place too. Are both workloads set up the same?
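
Here is that sketch; the DEBUG flag and class are hypothetical, not from your script:

import static io.gatling.javaapi.core.CoreDsl.exec;

import io.gatling.javaapi.core.ChainBuilder;

public class DebugLogging {

    // Hypothetical switch, enabled with -Dsim.debug=true. System.out.println
    // is synchronized, so unconditional printing from thousands of concurrent
    // virtual users can itself distort a load test.
    private static final boolean DEBUG = Boolean.getBoolean("sim.debug");

    public static final ChainBuilder logInstanceKey = exec(session -> {
        if (DEBUG) {
            System.out.println("PROCESS_INSTANCE_KEY is: " + session.getString("PROCESS_INSTANCE_KEY"));
        }
        return session;
    });
}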

Once on this same forum, someone was experiencing unexpected results with Gatling.
At some point, he analyzed Wireshark TCP dumps. It turned out that Gatling was not buggy and that the issue was a race condition in AWS ALB. AWS fixed it.
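
If you go down that route, a capture taken on the Gatling host during a short run is the usual starting point, for example (the load balancer host is a placeholder):

# capture traffic to/from the load balancer for later inspection in Wireshark
sudo tcpdump -i any -w gatling-run.pcap host lb.example.invalid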

I think you’re in a similar situation. Gatling is just the messenger.
Your Gatling and JMeter tests don’t do the same thing. As @trinp noticed, they don’t produce the same throughput, so if your issue is a race condition, it might not trigger with JMeter.

This kind of investigation requires some effort and can only be performed with access to your infrastructure. It's something only you can investigate, or with consulting help.

Okay, for simplicity's sake I kept a simple injection profile: 1000 atOnceUsers in Gatling, and 1000 user threads with no ramp-up in JMeter. Even though the JMeter throughput is higher, all requests succeed there, but that is not the case in Gatling. If the issue were on the LB or infra side, shouldn't JMeter be hitting it too?
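
For reference, the simplified profile on the Gatling side is just this (reusing the placeholder scn and httpProtocol names from the sketch near the top):

setUp(
    scn.injectOpen(atOnceUsers(1000)) // all 1000 virtual users start at t=0
).protocols(httpProtocol);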

Sorry but we really can’t play riddles. One would need your JMeter and Gatling scripts and access to an environment where they can be executed.

I suspect your JMeter script's behavior is different from your Gatling one:

  • You are using Thread Groups in JMeter to segregate the workflow, while your Gatling script makes every user execute all the APIs. That is very different behavior (see the sketch after this list).
  • On top of that, you may have 1000 users running the same Thread Group (one specific API) in the JMeter script, after which all 1000 users move on to the next one, which is a very happy case for a load test. In Gatling, each of your users has to execute all the APIs, which is the realistic case.
  • You may want to modify your JMeter script to put all the requests in one Thread Group, then run again. Also, please disable all the println calls and exitHereIfFailed so that the two scripts are identical.
  • If you do that and still don't get better results, you may want to check things with Wireshark, as @slandelle said.
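
To make the first point concrete, JMeter's one-Thread-Group-per-API layout corresponds roughly to injecting several independent Gatling scenarios in parallel, something like the sketch below, where createProcess and fetchTasks are hypothetical chains, not taken from your script:

// Hypothetical: one scenario per API, injected in parallel, approximating
// JMeter's separate Thread Groups rather than one end-to-end user journey.
setUp(
    scenario("Create process only").exec(createProcess).injectOpen(atOnceUsers(1000)),
    scenario("Fetch tasks only").exec(fetchTasks).injectOpen(atOnceUsers(1000))
).protocols(httpProtocol);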

Okay sure, thanks, let me try.
