10k req/s performance issues

Hi.

To try to reach 10k req/s, I've been running Gatling 2.2.1 in parallel, distributed over 8 AWS m4.xlarge instances.
The tests keep failing well before I reach this mark, with multiple timeouts (connect and read timeouts).

Output: "(...) j.n.ConnectException: Connection timed out" (and read timeouts)

All the instances' kernel parameters are tuned according to http://gatling.io/docs/2.2.1/general/operations.html (see below for limits and sysctl configuration I'm using).

Any clues as to what might be causing this limitation? One thing I noticed is that, once Gatling starts reporting a massive number of these exceptions, active connections drop to almost 0 (measured on the server side).

The simulation looks something like this:

import scala.concurrent.duration._
import scala.math.ceil

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.jdbc.Predef._
import io.gatling.core.structure.ScenarioBuilder
import io.gatling.core.structure.PopulationBuilder
import io.gatling.http.request.builder.HttpRequestBuilder

class TestFooBar extends Simulation {
    val httpConfig = http.disableWarmUp.baseURL("https://www.foobar.com")
    val date_format = new java.text.SimpleDateFormat("yyyy-MM-dd")
    val date_string = date_format.format(new java.util.Date())

    val full_duration = 20 minutes
    val rps_scale = 1f

    def createSimpleUrlScenario(url: String, users: Int) : PopulationBuilder = {
      val scn = scenario(url).exec(
        http(url)
          .get(url)
      )
      .inject(
        rampUsersPerSec(1) to(ceil(users * rps_scale).toInt) during(full_duration)
      )
      // throttle caps the effective request rate on top of the injection profile
      .throttle(
        jumpToRps(1),
        reachRps(ceil(users * rps_scale).toInt) in (full_duration)
      )
      .protocols(
        httpConfig
      )

      scn
    }

    def createSimpleUrlScenario(urlBuilder: HttpRequestBuilder, users: Int) : PopulationBuilder = {
      val scn = scenario(urlBuilder.toString).feed(csv("foo/tokens.csv").circular).exec(
        urlBuilder
      )
      .inject(
        rampUsersPerSec(1) to(ceil(users * rps_scale).toInt) during(full_duration)
      )
      .throttle(
        jumpToRps(1),
        reachRps(ceil(users * rps_scale).toInt) in (full_duration)
      )
      .protocols(
        httpConfig
      )

      scn
    }

    setUp(
      createSimpleUrlScenario("/ajax/service1", 5),
      createSimpleUrlScenario("/ajax/service2", 46),
      createSimpleUrlScenario("/ajax/service3", 32),
      createSimpleUrlScenario("/ajax/service4", 29),

      createSimpleUrlScenario(
        http("/ajax/service5")
        .get("/ajax/service5")
        .headers(Map(
          "Cookie" -> "${token}")), 15)

    )
    .maxDuration(120 minutes)
}

sysctl

Use traceroute to figure out which machines are in your network path (I once hit a NAT bottleneck). Use iperf between a load injector and your application to find the maximum network bandwidth between the instances mentioned. Use top to check CPU usage (there is a known problem with the IRQ load balancer on Fedora-based EC2 machines causing an uneven distribution across cores).

I started reading the sysctl documentation (https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt) to learn how to properly configure a Linux box. I began writing some notes (https://github.com/bbc/notes-on-perf-testing/blob/master/ch/tuning/sysctl.md) but lost interest. Would anyone want me to continue with this? I remember reaching 30k RPS using Gatling against a local Go web server (I probably should have documented it).

Aidy

I’ve checked that. Bandwidth (client and server side) and CPU load were well below maximum capacity.

Check your server’s application logs and web server logs. It may be that your server is unable to handle the load you are subjecting it to.

Start with a smallish req/s rate, run for a few minutes, and check for timeouts. Then double the req/s until you start to see timeouts.
Doing this binary search, you can find an rps throughput that your server is capable of handling.
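
A minimal sketch (not from the original thread) of what such a stepped profile could look like in Gatling 2.x; the endpoint, rates and step length are placeholder assumptions, and each step doubles the previous rate so a single run brackets the breaking point:

import scala.concurrent.duration._

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class ThroughputSearch extends Simulation {

  // Hypothetical target; substitute your own base URL and endpoint.
  val httpConfig = http.disableWarmUp.baseURL("https://www.foobar.com")

  val scn = scenario("doubling rps")
    .exec(http("/ajax/service1").get("/ajax/service1"))

  setUp(
    scn.inject(
      // Each step doubles the arrival rate; the first step that
      // produces timeouts brackets your server's capacity.
      constantUsersPerSec(125) during (2 minutes),
      constantUsersPerSec(250) during (2 minutes),
      constantUsersPerSec(500) during (2 minutes),
      constantUsersPerSec(1000) during (2 minutes)
    ).protocols(httpConfig)
  )
}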

One thing worth mentioning: right after I start getting these exceptions, if I try to open a session with the target host / endpoint (using curl, for example), I have no issues at all.

Are you running curl from one of the Gatling hosts, or from a different one?
I suspect the latter. The former shouldn’t work.

You’re most likely running out of local ports, because your scenario involves tons of connections being opened and closed.

Check netstat and you should see tons of sockets in TIME_WAIT (or CLOSE_WAIT).
Note that FrontLine would give you this information in the dashboard.
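
If port exhaustion is indeed the culprit, one possible mitigation (a sketch of mine, not something from the thread) is to let all virtual users share a common keep-alive connection pool via shareConnections, so sockets get reused instead of being opened and closed per user:

// Same protocol as in the simulation above, plus connection sharing.
val httpConfig = http
  .baseURL("https://www.foobar.com")
  .disableWarmUp
  .shareConnections // reuse one keep-alive pool across all virtual users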

Hi,

Are you running curl from one of the Gatling hosts, or from a different one? → From the same host.
Using netstat / lsof I can see that I don’t even reach 3k open sockets.
I also tried curl-loader just to see how it behaved: I was able to achieve a much higher number of connections with a similar test, using a single instance for curl-loader instead of 8 for Gatling.

I’m still open to suggestions.

I suspect that with curl-loader you’re reusing connections (your virtual users are long-lived), which you don’t with your current Gatling set-up, where every virtual user starts, opens a connection, sends a request, gets the response, closes the connection and dies.
If so, it’s like comparing HTTP/1.0 with HTTP/1.1.
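
One way to make the two setups comparable (a sketch under the assumption that the server allows keep-alive; full_duration and httpConfig refer to the simulation above, and the pause and user count are illustrative) is to inject fewer, long-lived users that loop over the request, so each user reuses its HTTP/1.1 connection across many requests:

// Each user lives for the whole test and keeps its connection alive.
val longLived = scenario("long-lived users")
  .during(full_duration) {
    exec(http("/ajax/service2").get("/ajax/service2"))
      .pause(100 milliseconds) // assumed think time; tune to hit the target rps
  }

setUp(longLived.inject(atOnceUsers(200))).protocols(httpConfig)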

most likely…

curl-loader:

FRESH_CONNECT is used to define, on a per-url basis, whether the connection should be re-used or closed and re-connected after each request-response cycle. When 1, the TCP connection will be re-established. The system default is to keep the connection and re-use it as much as the server and protocol allow. Still, the system default can be changed by the command-line option -r.
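
For reference, the rough Gatling counterpart (my sketch, not the poster's code): keep-alive is Gatling's default too, and sending a Connection: close header on a request approximates FRESH_CONNECT=1:

// Default behaviour: HTTP/1.1 keep-alive, the socket is reused by the same user.
val keepAlive = http("/ajax/service1").get("/ajax/service1")

// Rough FRESH_CONNECT=1 equivalent: close the connection after each response.
val freshConnect = http("/ajax/service1")
  .get("/ajax/service1")
  .header("Connection", "close")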

Monitor the connection rate with netstat (on both sides, for completeness [active vs. passive openings]):

my_client $ while true; do date; netstat -s | grep openings; sleep 1; done
Wed 8 Jun 17:08:06 BST 2016
70901 active connections openings
38 passive connection openings
Wed 8 Jun 17:08:08 BST 2016
70901 active connections openings
38 passive connection openings

my_server $ while true; do date; netstat -s | grep openings; sleep 1; done
Wed 8 Jun 16:54:34 BST 2016
169 active connections openings
17524 passive connection openings
Wed 8 Jun 16:54:35 BST 2016
169 active connections openings
17524 passive connection openings

Another cross-check / validation is to reason about how many concurrent/active connections you would expect from your Gatling simulation (see the sketch after this list):

avg response time of the urls: for example, 20 ms

total inject rate: 100 users per second (about what is in your simulation)

average concurrent/active users = 0.02 * 100 = 2, i.e. not a lot

With 1 request per user and an inject rate of around 100 per second, you are only going to reach about 100 rps.
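
As a sanity check of that arithmetic (Little's Law: average concurrency = arrival rate × average time in system), a throwaway sketch using the assumed numbers above:

object ConcurrencyEstimate extends App {
  val injectRatePerSec = 100.0   // total users injected per second, as above
  val avgResponseTimeSec = 0.020 // assumed 20 ms average response time
  // Little's Law: L = lambda * W
  val expectedConcurrentUsers = injectRatePerSec * avgResponseTimeSec
  println(s"expected concurrent users ~= $expectedConcurrentUsers") // ~= 2.0
}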

Also check ss -nat | wc -l against your port range (net.ipv4.ip_local_port_range = 1025 65535).

I think you have done this, but you mentioned open connections only, not all connections (TIME_WAIT sockets count against the port range too).

FWIW, we have a similar scenario for our internal tests on FrontLine:

import scala.concurrent.duration._

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class FastOpenCloseWorkload extends Simulation {

  val maxUps = 10000
  val rampDuration = 1 minute

  val httpProtocol = http
    .baseURL("http://alpha.frontline:8080")
    .acceptEncodingHeader("gzip, deflate")
    .check(status.is(200))

  val scn = scenario("scenario1")
    .exec(http("json").get("/json"))

  setUp(
    scn.inject(
      rampUsersPerSec(20) to (maxUps) during (rampDuration),
      constantUsersPerSec(maxUps) during (2 minutes)
    )
  ).protocols(httpProtocol)
}

The application under test is a Go sample that serves a 1 kB JSON payload.
Load is generated from a single Gatling instance.
App and Gatling injector are on two different hosts on a dedicated switch.
Host specs (bare metal): kernel 4.5.5 64-bit, Fedora 23, Intel Xeon E5-1620 v3 (four cores HT, 10 MB cache, 3.5 GHz Turbo)

As you can see in the attached FrontLine captures, everything’s fine and Gatling has no problem with your load model (10,000 new virtual users per second, each performing 1 request, with connections being opened and closed every time).