Gatling 3.9.3 | j.n.c.ClosedChannelException still occurring

Hello Team,

We are using Gatling open source for our project and keep running into ClosedChannelException at high loads (200-500 requests per second). We do not observe this error, or any other network-related exceptions, at lower loads of 50 to 100 requests per second.

We have run extensive tests but have not been able to identify the root cause of this issue. Along with ClosedChannelException we often see a few other errors, such as:
i.n.h.s.SslHandshakeTimeoutException: handshake timed out after …
Request timeout after 60000 ms
j.i.IOException: Premature close
j.n.NoRouteToHostException: No route to host
etc

We tried to isolate the problem with the following tests:

  1. Manually running tests at a constant 100 users per second from 4 different physical machines; we still ran into ClosedChannelException.
  2. Upgrading our GoCD agent to an AWS c5n.large machine and re-running; we still got the same exceptions.

After trying the above, we discussed this with our DevOps team. They suggested checking in Kibana whether the number of requests sent from Gatling matches the number received by the SUT, and we found that if there are 1000 exceptions in total, Kibana shows 1000 fewer requests. Does that mean Gatling isn’t sending those requests, or are they getting lost somewhere else?

We aren’t able to prove that this isn’t a Gatling issue.

To replicate this issue we even tried sending the same load to a public website (demoblaze.com); the results are below.
400 users per second for 30 seconds - 11 to 42% error rate - run from a MacBook Pro 16-inch (2.6 GHz 6-core Intel Core i7, 16 GB RAM)

Request timeout after 60000 ms                                            3373 (67.03%)
i.n.h.s.SslHandshakeTimeoutException: handshake timed out after 10000ms    741 (14.73%)
j.n.c.ClosedChannelException                                               296 ( 5.88%)
j.i.IOException: Premature close                                           257 ( 5.11%)
Request timeout to demoblaze.com/216.239.38.21:443 after 60000 ms          101 ( 2.01%)
Request timeout to demoblaze.com/216.239.36.21:443 after 60000 ms          100 ( 1.99%)
Request timeout to demoblaze.com/216.239.34.21:443 after 60000 ms           96 ( 1.91%)
Request timeout to demoblaze.com/216.239.32.21:443 after 60000 ms           68 ( 1.35%)

Seeking your help in this regard: please share your wisdom on ClosedChannelException. Why does it still occur in version 3.9.3?

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class BlazeSimulation extends Simulation {

  val httpProtocol =
    http.baseUrl("https://demoblaze.com/")
      .acceptHeader("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
      .acceptLanguageHeader("en-US,en;q=0.5")
      .acceptEncodingHeader("gzip, deflate")
      .userAgentHeader("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20100101 Firefox/16.0")

  val scn = scenario("Scenario")
    .exec(http("Home").get("/"))

  setUp(
    // Open injection: 400 new virtual users per second for 30 seconds,
    // each virtual user opening its own connection (no connection sharing).
    scn.inject(constantUsersPerSec(400).during(30))
  ).protocols(httpProtocol)
}

Ran on Gatling Cloud (load generated from Virginia):

0 failures, as you can see :man_shrugging:

Appreciate your help, but could you please advise whether the ClosedChannelException in my case is due to Gatling open source? What could be the reason for so many ClosedChannelExceptions? Is it Gatling or the SUT?

Also, when we used the shared-connections feature of Gatling we were able to reduce ClosedChannelExceptions by a huge margin (before sharing connections we saw 35 to 40 percent errors, but after enabling it we observed less than 1% ClosedChannelExceptions).
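For reference, a minimal sketch of how connection sharing is enabled in the Scala DSL (the value name is illustrative; the rest mirrors the reproducer above):

import io.gatling.core.Predef._
import io.gatling.http.Predef._

// Share one connection pool across all virtual users instead of opening
// (and TLS-handshaking) a fresh connection per virtual user.
val sharedHttpProtocol =
  http.baseUrl("https://demoblaze.com/")
    .shareConnections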

Kindly guide us so that we can start looking in the correct direction. Also, I have read in various places that ClosedChannelException should not occur with the latest versions of Gatling, but I am still seeing it with 3.9.3… any suggestions here?

ClosedChannelException means the socket gets closed while the request is in flight. This typically happens when a network component (Wi-Fi, routers, etc.) or the SUT is dropping connections under load.

When reusing connections, you greatly reduce the IO burden on the network (far fewer TCP connections and TLS handshakes), so you get way fewer of these errors, but then your test might not match your real-world use case.

As I don’t get any error when running from Gatling Cloud while you do on your side, I suspect your network is at fault.


Well, in my case demoblaze isn’t the system under test; it’s our internal system. I shared demoblaze because I could replicate the exceptions there too.
As you mentioned, it could be the network or the system under test causing these exceptions rather than Gatling itself. Are there any tunings that could help eliminate these errors, or an AWS EC2 instance type you could suggest that would reduce them? We are using c5n.large (network optimized) for now but still see issues. Should we try a larger machine, or are there other ways to help eliminate these exceptions?

Have you seen this:

Did try these, didn’t help.

Also, I read that ClosedChannelException should not occur with the latest Gatling, but I still do see it… Is there retry logic I can use to decrease ClosedChannelExceptions?

Also, I read that ClosedChannelException should not occur with the latest Gatling, but I still do see it…

Either you read crap, or you misunderstood a bug fix for a case where ClosedChannelException would happen in some conditions where it shouldn’t.
ClosedChannelException is still a thing and always will be, as it describes a perfectly expected real-world network state: the client writes a request, and the server closes the connection before sending the complete response.

Is there retry logic I can use to decrease ClosedChannelExceptions?

2 possibilities:

  1. there’s a bug in Gatling
  2. you’re blaming the messenger and your network or SUT does have an issue

As the reproducer you provided works perfectly fine for us, I’m very inclined to think that it’s the latter.
The fact that you don’t experience these failures with other tools doesn’t prove anything: they could be doing something different, such as:

  • sharing and reusing connections (which is also possible with Gatling, but a different test case that might not make sense for you)
  • silently retrying. The only case where Gatling silently retries is when the connection was obtained from the keep-alive pool, as the sequence “server closes the idle connection - client sends a request on it - client then receives the close” is perfectly normal and expected.
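For completeness, if you do want explicit, visible retries in your own scenario (not something recommended above), Gatling’s tryMax block re-runs a failed chain; here is a minimal sketch with an illustrative attempt count:

import io.gatling.core.Predef._
import io.gatling.http.Predef._

// Try the wrapped chain at most 3 times; it is only re-run when a request in it fails.
val scnWithRetry = scenario("Scenario with retry")
  .tryMax(3) {
    exec(http("Home").get("/"))
  }

Note that this retries at the scenario level and inflates request counts, so it hides a network or SUT problem rather than fixing it.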

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.