For quite a few days I have been trying to get my Gatling test to run against Amazon’s Elastic Load Balancer (ELB). My goal is to run 1000 HTTP requests/second distributed across two EC2 instances, which shouldn’t be too hard.
With HAProxy my tests run fine, but with the ELB I see interruptions once more than 650 requests/second are sent:
After a while, a high number (>4000) of connections gets stuck in state SYN_SENT, until all of them fail with a ConnectException. Afterwards connections succeed again for a few seconds, until the issue occurs again.
First I tried changing the number of users (currently I am ramping up from 200 to 1000 users per second, each making only one request) and changing the keep-alive setting, both without success.
Together with Amazon’s support team, the known ELB problems (DNS refresh issues, ELB scaling / pre-warming) have been ruled out.
Now I have switched to ApacheBench (ab) instead of Gatling, and using this tool with a similar scenario (600 threads, 1 million requests), the test passes fine through the ELB.
Isn’t that strange?
Gatling + HAProxy = OK
Gatling + ELB = ConnectException
ab + ELB = OK
I have been running tcpdump and took a closer look at the TCP connections. For now I can only say that ab seems to act more in parallel (it is using >600 threads, after all), because at the beginning all connections are established with SYNs, whereas Gatling, at least it seems to me, serializes the opening of connections.
Now I have a theory that Gatling might be stuck waiting for a missing SYN-ACK, which is why no further connections are handled. Is this reasonable?
Perhaps someone else has an idea of what I can try to debug this issue. Any help is appreciated.
It’s not a like-for-like test: you are mixing up “open” and “closed” workloads, or at least I don’t have enough information to determine that the two workloads are the same.
open == Gatling usersPerSecond(n) / Tsung / Iago: driven by an arrival-rate input.
closed == driven by a fixed number of concurrent threads/connections/users/etc.: wrk, ab, Gatling atOnceUsers(n) + a looping scenario, and most other tools … in brief.
So for a like-for-like test, try starting with:
wrk with 600 connections: note the requests per second and the connection stats interactively as the test proceeds (see the netstat one-liner in a previous post for counting concurrency in the TCP stack).
Gatling with atOnceUsers(600) and a scenario that loops forever (roughly like the sketch below).
That would test the closed case.
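For illustration, a minimal sketch of the closed case in Gatling 2 syntax could look like the following; the class name, base URL, request path, and the 10-minute duration are placeholders rather than anything from your actual script, and the commented-out line shows what the open-model injection would look like instead:

    import io.gatling.core.Predef._
    import io.gatling.http.Predef._
    import scala.concurrent.duration._

    class ClosedModelSimulation extends Simulation {

      // Placeholder: point this at the ELB's DNS name in the real test.
      val httpConf = http.baseURL("http://my-elb.example.com")

      // Closed model: a fixed population of users, each looping over the same request forever.
      val closedScn = scenario("closed").forever() {
        exec(http("request").get("/"))
      }

      // Open model: each arriving user performs a single request and then terminates.
      val openScn = scenario("open").exec(http("request").get("/"))

      setUp(
        closedScn.inject(atOnceUsers(600)).protocols(httpConf)
        // The open-model equivalent of the original test would instead be something like:
        // openScn.inject(rampUsersPerSec(200) to 1000 during (10 minutes)).protocols(httpConf)
      )
    }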
You should record the rps (requests per second), the response time (average or a percentile), and the counts of TCP connections in each state (SYN_SENT, ESTABLISHED, etc.).
Why? Because Gatling could be creating far more (or fewer) concurrent connections than ab, causing a real or perceived issue/difference.
Can you enable Graphite and set up nc (netcat) with awk as per the previous post for real-time Gatling monitoring? I’ll add in the connection stats and forward it.
I can’t say there isn’t a problem with Gatling, but at the same time, I can’t say there is either.
Thank you very much for the detailed explanation.
I have been able to reproduce ab's (with parameter -k) and wrk's behavior, but only when reusing connections; when recreating connections, only ab succeeds.
But first of all, for reference in case someone else is searching for
this topic, this is what I did:
I have modified my scenario so that the users repeat endlessly with .forever() { }, set allowPoolingConnections = true (which is the default), and used .inject(atOnceUsers(600)) to create all users, roughly as in the sketch below.
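In code, the relevant part of the simulation looks roughly like this (the base URL and the request itself are simplified placeholders; the loop and injection match what I described above):

    import io.gatling.core.Predef._
    import io.gatling.http.Predef._

    class ElbLoopSimulation extends Simulation {

      // Placeholder base URL; the real test points at the ELB's DNS name.
      val httpConf = http.baseURL("http://my-elb.example.com")

      // Each user repeats the same request endlessly; connection pooling
      // (allowPoolingConnections = true) is left at its default for this run.
      val scn = scenario("elb-loop").forever() {
        exec(http("request").get("/"))
      }

      // All 600 users are started at once, giving a fixed, closed population.
      setUp(scn.inject(atOnceUsers(600)).protocols(httpConf))
    }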
Then, while running the test, I continuously printed connection stats with this command:

    while true; do sleep 1; date +"%T"; sudo netstat -na | grep tcp | awk '{print $NF}' | sort | uniq -c | sort -nr; echo; done
This shows that there is a fixed number of 600 open connections.
Btw., how can I set a timeout when using atOnceUsers, or do I have to set the total number of requests instead?
Problems when not reusing connections:
As mentioned above, when I set allowPoolingConnections to false, connections are closed and reopened for every request.
With ab (which by default does not use keep-alive) this is not a problem; I can make 2 million requests (with somewhere between 270 and 430 established connections at a time).
With Gatling, netstat shows around 600 established connections and I run into the same problems as before: after about 65400 requests (not a coincidence that this is close to 65535?) all 600 users are stuck, and netstat shows exactly 600 connections in state SYN_SENT. Could this be some kind of OS bottleneck to which only Gatling is susceptible?
So far I have increased the limit on open files, and I did not get any other error message that would point to, e.g., the number of ports.
For now it seems as if I have to reuse connections for my load testing, unless someone has a suggestion about what else to try.
It looks like after ~65400 requests your target server closes all the connections (probably because the ELB changed IP); Gatling tries to open new connections but somehow fails (the server never ACKs the SYN packets, hence SYN_SENT).
No idea what causes this. Are you sure you disabled DNS caching? Do you get the same behavior when enabling keep-alive on ab?
Then, what I don’t get is why you wouldn’t run out of ephemeral ports with ab but you would with Gatling when not using keep-alive.
I've been working with Amazon's support on that topic and they confirmed that the IPs are correct and that the ELB has scaled correctly.
Also, I traced the DNS lookups to make sure the IPs are not stale and that the lookups actually take place (yes, caching is disabled).
Could it be the ephemeral ports? Is there any way to check that?
I've also been wondering whether I might be blocked by the ELB... but then I would assume Amazon's support would have noticed that. Also, I am running the tests from an EC2 instance; it might make a difference where the requests come from.
Is this still unresolved in terms of determining for sure where the issue is? I.e., it looks like the ELB, but Amazon won’t accept that unless further proof is provided?
There are more options for diagnostics if so.
FYI, I have some users on AsyncHttpClient whose issue is that the ELB replies 503 when a region fails and never closes the connections, so they stay in the pool. ELB is quite a hairy beast…
Yes, the issue is still not resolved. The high number of connections being opened and closed is certainly the cause, but I couldn't find the bottleneck. For now I have switched to a limited number of connections instead.
Which diagnostics do you have in mind?