Gatling run not finishing

We are intermittently seeing a problem with Gatling (2.0.0 RC3) not finishing at the end of a test run.

Instead it continues with a very small number of users (maybe just one) still in the ‘running’ state. Here’s an example:

Hi Andy,

It looks like an issue in AsyncHttpClient that sometimes fails to notify.
Can you reproduce with RC4? (I made some changes in AHC that shipped in Gatling 2.0.0-RC4.)

Cheers,

Stéphane

Thanks, we’ll test with RC4 tomorrow and let you know.

Unfortunately it still occurs with RC4.

With further testing we’ve seen that:

  • It isn’t related to the “too many open files” error (we increased the hard limit, and no longer get that error message, but the issue still occurs)
  • It happens even with a single scenario (no need for six like in the earlier example)

We gave each of the 4,500 scenarios a unique ID and included that ID in the POST request. By watching which ID never appeared in the “OK” / “KO” list whilst Gatling was waiting for the single user to complete, I could see which specific user had the problem (user 2612 in one particular test, but presumably different every time). I also captured the traffic between the Gatling client and the webserver using tcpdump, but I couldn’t find that ID anywhere in the resulting capture file, suggesting that it never even got as far as sending the request body.
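
For reference, here is a minimal sketch of how those IDs are generated and attached (the request name, URL, body, header name and injection profile are placeholders rather than our real simulation):

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class UniqueIdSimulation extends Simulation {

  // Custom Feeder: next() increments a counter under synchronisation,
  // so every virtual user gets its own distinct "requestId" attribute.
  val uniqueIdFeeder: Iterator[Map[String, String]] = new Iterator[Map[String, String]] {
    private var counter = 0
    def hasNext = true
    def next() = synchronized {
      counter += 1
      Map("requestId" -> counter.toString)
    }
  }

  // The ID goes into the POST body and also into a custom header, so individual
  // requests can be matched against both the debug logs and the tcpdump capture.
  val scn = scenario("unique id post")
    .feed(uniqueIdFeeder)
    .exec(
      http("post request")
        .post("/endpoint")
        .header("X-Request-Id", "${requestId}")
        .body(StringBody("""{"id":"${requestId}"}"""))
    )

  // Placeholder injection profile and base URL.
  setUp(scn.inject(atOnceUsers(4500)).protocols(http.baseURL("http://target.example.com")))
}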

Do you have any other thoughts for how we might investigate this?

Cheers,

Andy

Sad news :frowning:

There’s still a chance that such a bug remains in Gatling, and we honestly do hunt these down.
My main suspect is some race condition in AsyncHttpClient, but I’ll be looking for a needle in a haystack…
I’ll investigate on Monday, sorry for the delay.

These sorts of problems are always the hardest ones to hunt down!

We’ve enabled debug logging in Gatling via conf/logback.xml: we added a ch.qos.logback.core.FileAppender with a ch.qos.logback.classic.filter.LevelFilter set at DEBUG; the ch.qos.logback.core.ConsoleAppender has a ch.qos.logback.classic.filter.LevelFilter set at INFO; the root level is DEBUG; and the io.gatling.http logger is set to TRACE.
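
For reference, the relevant parts of that conf/logback.xml look roughly like this (file path, patterns and filter actions are illustrative):

<configuration>
  <!-- extra file appender capturing the debug detail -->
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <filter class="ch.qos.logback.classic.filter.LevelFilter">
      <level>DEBUG</level>
      <onMatch>ACCEPT</onMatch>
      <onMismatch>NEUTRAL</onMismatch>
    </filter>
    <file>gatling-debug.log</file>
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%-5level] %logger{15} - %msg%n%rEx</pattern>
    </encoder>
  </appender>

  <!-- console kept at INFO so it isn't flooded -->
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <filter class="ch.qos.logback.classic.filter.LevelFilter">
      <level>INFO</level>
      <onMatch>ACCEPT</onMatch>
      <onMismatch>DENY</onMismatch>
    </filter>
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%-5level] %logger{15} - %msg%n%rEx</pattern>
    </encoder>
  </appender>

  <!-- request/response detail from the HTTP layer -->
  <logger name="io.gatling.http" level="TRACE" />

  <root level="DEBUG">
    <appender-ref ref="CONSOLE" />
    <appender-ref ref="FILE" />
  </root>
</configuration>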

With this, we can sometimes see the issue even when there are no timeout exceptions. I suppose something about the load incurred by extra debug logging makes such race conditions more likely to occur.

We’ve seen that:

  1. Gatling sometimes sends a request more than once. We identify a request using a unique attribute that is included in the POST body and also as a custom header, so we can find requests both in debug logs and packet traces. The identifier is created by a Feeder whose next() implementation increments a counter under synchronisation, so we believe this is a duplicated/repeated request rather than two different requests that somehow ended up with the same Feeder attribute.

In these cases, the log always shows “Using cached Channel”, then “Channel Closed”, then “Trying to recover request”, then “Request using non cached Channel”. However, this pattern also occurs where a request is only sent once. Perhaps Gatling recognises that the channel was closed and so resends on a new channel, but occasionally the original request did in fact manage to get sent across the wire.

  2. Gatling sometimes doesn’t send a request at all (as confirmed by the packet trace).

In these cases, the log entries always contain “Using cached Channel”, then “Channel Closed”, but for some reason there is never a subsequent “Trying to recover request”.

We also see several instances of “java.nio.channels.ClosedChannelException: null”, although we see these even when the simulation does correctly end. These come in a variety of forms; one example is:

java.nio.channels.ClosedChannelException: null
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93) ~[netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[netty-3.9.4.Final.jar:na]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) ~[netty-3.9.4.Final.jar:na]
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) ~[netty-3.9.4.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_25]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_25]
at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]

Note that our requests all include a “Connection: Close” header, introduced using a headers(Map("Connection" -> "Close")) entry in the exec(http()…) section.
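
That is, something along these lines (the request name, URL and body are placeholders):

val scn = scenario("connection close example")
  .exec(
    http("post request")
      .post("/endpoint")
      .headers(Map("Connection" -> "Close")) // ask the server to close the socket after each response
      .body(StringBody("payload"))
  )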

Hopefully this may help in tracking down this issue…

Hi Andy,

I’m currently investigating the issues (there’s several of them).

There’s a race condition somewhere for sure, but first, “Using cached Channel” means that the channel (connection/socket) was pooled, which shouldn’t happen if all your requests include a “Connection: Close” header.

That made me realize that AsyncHttpClient was only honoring the response headers when determining whether a channel can be cached, not the request ones. I’ve just fixed this. Until we ship a new version of AHC, you can also disable connection pooling in gatling.conf (set allowPoolingConnections and allowPoolingSslConnections to false; don’t forget to uncomment them).
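
That is, something along these lines in gatling.conf:

gatling {
  http {
    ahc {
      allowPoolingConnections = false      # commented out by default, defaults to true
      allowPoolingSslConnections = false   # likewise for HTTPS connections
    }
  }
}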

Investigating the other issues.

Thanks a lot for your help!
Cheers,

Stéphane

“Gatling sometimes sends a request more than once.” + “Trying to recover request” happens because “maxRetry” is set to 4 in gatling-default.conf.
This is a mistake; this parameter should be set to 0.
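
Until a release ships with the corrected default, it can be overridden in gatling.conf along these lines:

gatling {
  http {
    ahc {
      maxRetry = 0   # stop AsyncHttpClient from silently replaying requests
    }
  }
}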

Hi Andy,

I’ve pushed a new version of AHC (I haven’t found the cause of the “lost” user issue yet, but Gatling should still behave much better for your use case).
Could you give latest Gatling snapshot a try, please? https://oss.sonatype.org/content/repositories/snapshots/io/gatling/highcharts/gatling-charts-highcharts/2.0.0-SNAPSHOT/gatling-charts-highcharts-2.0.0-20140915.123359-59-bundle.zip

Cheers,

Stéphane

Hi Stéphane,

We’ve tested 2.0.0-RC5 and the simulation never fails to complete. Thanks a lot for the fix!

One slight anomaly though: when debug logging is enabled in Gatling, we occasionally see that:

  • some requests (headers only) are only logged once (in c.n.h.c.p.n.r.NettyConnectListener) whereas the vast majority are logged twice (in c.n.h.c.p.n.h.HttpProtocol and c.n.h.c.p.n.r.NettyConnectListener)
  • some responses (headers only) aren’t logged, although the vast majority are

However, a packet trace on the wire shows that all requests and responses were sent. Are there any known issues with debug logging of request/response headers? (This isn’t a major issue for us, just strange)

Cheers,

Andy

That’s great news!
Still, there’s a race condition somewhere; it’s just that now that Connection: close is properly dealt with, it’s much less likely to happen.

In the meantime, I’ve found a likely race condition that causes timeouts to trigger “Remotely closed” exceptions. Not sure if that would fix the user loss issue, though.

Regarding logged headers, yes, there’s room for improvement. Not all requests are logged (e.g. CONNECT ones with a secure proxy) and some headers are missing. The reason is that we can currently only log what’s passed to the HTTP engine (AsyncHttpClient), and that it automatically adds some missing headers. That’s something I intend to improve in a future version.

Hi Stéphane Landelle,

As per the above discussion I tried 2.0.0-RC5, but I am still getting java.util.concurrent.TimeoutException: Request timed out. Can you please suggest a quick solution?

Thanks