Incorrect failed request count shown

While stress testing some APIs with Gatling, we are seeing that Gatling reports an incorrect failed request count: when we check the logs in Kibana for those same requests, we see them passing with a success status code.

We are unable to understand why this behaviour is happening. Can anyone please guide me on this? We are trying to run the script for 1 hour using the params below:

stressPeakUsers(6000).during(Duration.ofMinutes(15))
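
For context, the profile plugs into the simulation roughly like this; the scenario and base URL below are placeholders, not our real ones:

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;
import java.time.Duration;

public class StressPeakSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol = http.baseUrl("https://internal.example.invalid");

    ScenarioBuilder scn = scenario("Create a process")
        .exec(http("ping").get("/health")); // stand-in for the real chain shown further down

    {
        setUp(
            // injects 6000 users over 15 minutes along a smooth step (heaviside)
            // curve, i.e. most arrivals cluster around the midpoint
            scn.injectOpen(stressPeakUsers(6000).during(Duration.ofMinutes(15)))
        ).protocols(httpProtocol)
         .maxDuration(Duration.ofHours(1)); // hard stop after the 1-hour window
    }
}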

Please follow the requirements that were presented to you when you created this post.

  • make sure you’re using the latest version of Gatling (3.13.1 as of now)
  • provide a way for us to reproduce your problem

I am using the latest version. For the repro, what specifically are we looking for? The APIs I am testing are all internal, so they are not exposed outside our VPN.

For example, a sample demo application that counts the requests it receives.
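
Something like this minimal sketch would do, using the JDK's built-in com.sun.net.httpserver; the port and response body are placeholders:

import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicLong;

public class CountingServer {

    public static void main(String[] args) throws Exception {
        AtomicLong received = new AtomicLong();
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // Count every request that actually reaches the server, so the total
        // can be compared with what Gatling reports having sent.
        server.createContext("/", exchange -> {
            long n = received.incrementAndGet();
            byte[] body = ("requests received so far: " + n).getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("Counting server listening on :8080");
    }
}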

Then, are you sure you don't have TCP or TLS errors such as a connect timeout or a handshake timeout? In that case, it's likely that you have a gap between the number of requests Gatling tried to send and the number of requests that could be processed by your application, because the failing component, your edge/load balancer/etc., sits in between.
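
One way to see such errors from the Gatling side: the bundled logback.xml ships with commented-out loggers for exactly this; uncommenting the one below at DEBUG level logs every failing HTTP request to the console (taken from a 3.x logback.xml, so double-check the copy in your own project):

<logger name="io.gatling.http.engine.response" level="DEBUG" />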

We compared the count shown in the attached Gatling report with the requests in Kibana. We also tried to look up the failed requests individually, since every request has a unique traceId in Kibana, and we found them to have succeeded. Between API requests we have a pause of 15 seconds. Out of the ~13K failed requests shown in the report as throwing 500 errors, in Kibana we see only 44 requests failing in total, and the rest succeeded.

We checked the load balancer as well; the count of requests returning 500 there is in sync with the Kibana numbers, but not with Gatling. A sample script is below:

scenario(UUID.randomUUID() + " Create a process")
    .exec(session -> {
        // Seed the session with the credentials and the process id
        session = session.set("JWT_TOKEN", token);
        session = session.set("IDS_SESSION_ID", idsSessionId);
        System.out.println("JWT Token is: " + session.getString("JWT_TOKEN"));
        session = session.set("BPMN_PROCESS_ID", bpmnProcessId);
        System.out.println("bpmn process id is: " + session.getString("BPMN_PROCESS_ID"));
        return session;
    })
    .pause(PAUSE_TIME)
    .exec(http("Create process instance")
        .post(HTM_BASE_URL + "/htm-usertask/api/v1/processes/#{BPMN_PROCESS_ID}/instances")
        .body(StringBody(requestBody))
        .header("x-infa-tid", "PSKH_#{BPMN_PROCESS_ID}_" + UUID.randomUUID())
        .header("Authorization", "Bearer #{JWT_TOKEN}") // use session variable
        .header("IDS-SESSION-ID", "#{IDS_SESSION_ID}") // use session variable
        .header("Accept", "application/json")
        .header("Content-Type", "application/json")
        .check(status().is(201))
        .check(jsonPath("$.processInstanceKey").saveAs("PROCESS_INSTANCE_KEY"))
        .check(bodyString().saveAs("RESPONSE_BODY")))
    .exitHereIfFailed()
    .pause(PAUSE_TIME)
    .exec(session -> {
        if (session.contains("PROCESS_INSTANCE_KEY") && nonNull(session.getString("PROCESS_INSTANCE_KEY"))) {
            System.out.println("PROCESS_INSTANCE_KEY is: " + session.getString("PROCESS_INSTANCE_KEY"));
        }
        return session;
    })
    .pause(PAUSE_TIME)
    // Poll for the candidate user tasks, retrying up to MAX_TRY times
    .tryMax(MAX_TRY).on(
        exec(http("Fetch candidate user tasks of process instance")
            .get(HTM_BASE_URL + "/htm-usertask/api/v1/user-tasks?scope=CANDIDATE&sortByField=DUE_DATE&status=CREATED&sort=DESC&offset=0&limit=50&processInstanceId=#{PROCESS_INSTANCE_KEY}")
            .header("x-infa-tid", "PSKH_#{PROCESS_INSTANCE_KEY}_" + UUID.randomUUID())
            .header("Authorization", "Bearer #{JWT_TOKEN}") // use session variable
            .header("IDS-SESSION-ID", "#{IDS_SESSION_ID}") // use session variable
            .header("Accept", "application/json")
            .header("Content-Type", "application/json")
            .check(status().is(200))
            .check(jsonPath("$.totalCount").saveAs("TASKS_TOTAL_COUNT"))
            .check(jsonPath("$.objects[*].taskId").findAll().saveAs("TASK_OBJECTS"))))
    .exitHereIfFailed()
    .pause(PAUSE_TIME)
    .exec(session -> {
        if (session.contains("PROCESS_INSTANCE_KEY") && nonNull(session.getString("PROCESS_INSTANCE_KEY"))
                && session.contains("TASKS_TOTAL_COUNT") && nonNull(session.getString("TASKS_TOTAL_COUNT"))) {
            System.out.println("Total user tasks for process instance key: " + session.getString("PROCESS_INSTANCE_KEY")
                    + " is: " + session.getString("TASKS_TOTAL_COUNT") + " taskIds: " + session.get("TASK_OBJECTS"));
        }
        return session;
    })
    .pause(PAUSE_TIME)
    // Assign and complete every task returned for the process instance
    .foreach("#{TASK_OBJECTS}", "taskObj").on(
        exec(session -> {
            if (session.contains("taskObj") && nonNull(session.getString("taskObj"))) {
                System.out.println("Processing Item ID: " + session.getString("taskObj"));
            }
            return session;
        })
        .pause(PAUSE_TIME)
        .exec(http("Assign Tasks")
            .patch(HTM_BASE_URL + "/htm-usertask/api/v1/user-tasks/#{taskObj}/assign")
            .body(StringBody(assignedBody))
            .header("x-infa-tid", "PSKH_#{taskObj}_" + UUID.randomUUID())
            .header("Authorization", "Bearer #{JWT_TOKEN}") // use session variable
            .header("IDS-SESSION-ID", "#{IDS_SESSION_ID}") // use session variable
            .header("Accept", "application/json")
            .header("Content-Type", "application/json")
            .check(status().is(204)))
        .exitHereIfFailed()
        .pause(PAUSE_TIME)
        .exec(http("Complete Tasks")
            .patch(HTM_BASE_URL + "/htm-usertask/api/v1/user-tasks/#{taskObj}/complete")
            .body(StringBody(requestBody))
            .header("x-infa-tid", "PSKH_#{taskObj}_" + UUID.randomUUID())
            .header("Authorization", "Bearer #{JWT_TOKEN}") // use session variable
            .header("IDS-SESSION-ID", "#{IDS_SESSION_ID}") // use session variable
            .header("Accept", "application/json")
            .header("Content-Type", "application/json")
            .check(status().is(204)))
        .exitHereIfFailed()
        .pause(PAUSE_TIME));

So first, you do have IO issues like request timeouts and TLS handshake timeouts, and neither your load balancer nor Kibana can see them because the struggling component here is the load balancer itself.

Then, there’s no way Gatling is hallucinating these 500 responses.

You’re most likely not monitoring all your requests.

But @slandelle, TLS and timeout errors are shown as separate request failures in the attached Gatling report, right? What I am talking about are the requests failing with: status.find.is(204), but actually found 500

What you are saying here:

in Kibana we see only 44 requests failing in total, and the rest succeeded

suggests only that the system acknowledged the request, processed it, and sent out a response, while back at Gatling it is still recorded as a 500. That strongly suggests the problem lies between the client (the Gatling host) and your server.
Digging a little deeper: based on your description, most likely your middle layer (proxy / LB / API gateway) misbehaved, which caused the connection between the client and that middle layer to be terminated and a 500 to be returned unexpectedly.

Okay, let me check once more. Thanks @trinp and @slandelle

In order to ensure the infra, the LBs, and everything else are working as expected, I tried to run the same profile using both Gatling and JMeter.

To my surprise, in JMeter all the requests went through successfully.
But in Gatling I faced the same issue: even though all the requests actually passed according to Kibana, the Gatling report still shows some of them as failed.

Attaching both reports.

Also attaching the Jenkins report.

Now I am getting curious about your JMeter script.
Are you sure the way you handle things in JMeter is the same as in Gatling?
Also, in your script I notice:

System.out.println(...)

Did you comment out the println calls during the test? See: Debugging guide for Gatling scripts. A minimal sketch of guarding those prints follows below.
I saw that the request rate from Gatling is different from JMeter's (27 vs 18), and you have exitHereIfFailed in place too. Are both workloads set up the same?
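
Here is that sketch; the DEBUG flag and class are hypothetical, not from your script:

import static io.gatling.javaapi.core.CoreDsl.exec;

import io.gatling.javaapi.core.ChainBuilder;

public class DebugLogging {

    // Hypothetical switch, enabled with -Dsim.debug=true. System.out.println
    // is synchronized, so unconditional printing from thousands of concurrent
    // virtual users can itself distort a load test.
    private static final boolean DEBUG = Boolean.getBoolean("sim.debug");

    public static final ChainBuilder logInstanceKey = exec(session -> {
        if (DEBUG) {
            System.out.println("PROCESS_INSTANCE_KEY is: " + session.getString("PROCESS_INSTANCE_KEY"));
        }
        return session;
    });
}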

Once on this same forum, someone was experiencing unexpected results with Gatling.
At some point, he analyzed Wireshark TCP dumps. It turned out that Gatling was not buggy and that the issue was a race condition in AWS ALB. AWS fixed it.
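
If you go down that route, a capture taken on the Gatling host during a short run is the usual starting point, for example (the load balancer host is a placeholder):

# capture traffic to/from the load balancer for later inspection in Wireshark
sudo tcpdump -i any -w gatling-run.pcap host lb.example.invalid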

I think you’re in a similar situation. Gatling is just the messenger.
Your Gatling and JMeter tests don’t do the same thing. As @trinp noticed, they don’t produce the same throughput, so if your issue is a race condition, it might not trigger with JMeter.

This kind of investigation requires some effort and can only be performed with access to your infrastructure. It's something only you can investigate, or with consulting help.

Okay, for simplicity's sake I kept a simple injection profile: 1000 atOnceUsers in Gatling, and 1000 user threads with no ramp-up in JMeter. Even though the JMeter throughput is higher, all requests succeed there, but that is not the case in Gatling. If the issue were on the LB or infra side, shouldn't JMeter be hitting it too?
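
For reference, the simplified profile on the Gatling side is just this (reusing the placeholder scn and httpProtocol names from the sketch near the top):

setUp(
    scn.injectOpen(atOnceUsers(1000)) // all 1000 virtual users start at t=0
).protocols(httpProtocol);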

Sorry but we really can’t play riddles. One would need your JMeter and Gatling scripts and access to an environment where they can be executed.

I suspect your JMeter script's behavior is different from your Gatling one:

  • You are using Thread Groups in JMeter to segregate the workflow, while your Gatling script makes every user execute all the APIs. That is very different behavior (see the sketch after this list).
  • On top of that, you may have 1000 users running the same Thread Group (one specific API) in the JMeter script, after which all 1000 users move on to the next one, which is a very happy case for a load test. In Gatling, each of your users has to execute all the APIs, which is the realistic case.
  • You may want to modify your JMeter script to put all the requests in one Thread Group, then run again. Also, please disable all the println calls and exitHereIfFailed so that the two scripts are identical.
  • If you do that and still don't get better results, you may want to check things with Wireshark, as @slandelle said.
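
To make the first point concrete, JMeter's one-Thread-Group-per-API layout corresponds roughly to injecting several independent Gatling scenarios in parallel, something like the sketch below, where createProcess and fetchTasks are hypothetical chains, not taken from your script:

// Hypothetical: one scenario per API, injected in parallel, approximating
// JMeter's separate Thread Groups rather than one end-to-end user journey.
setUp(
    scenario("Create process only").exec(createProcess).injectOpen(atOnceUsers(1000)),
    scenario("Fetch tasks only").exec(fetchTasks).injectOpen(atOnceUsers(1000))
).protocols(httpProtocol);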

Okay sure, thanks, let me try.
