Spikes at regular 1-minute intervals

Hi

We are running a ~500 TPS test against a mock server (an ELB with a single NodeJS instance) and seeing response time spikes at regular 1-minute intervals (see attached graph). The mock applies a fixed 150 ms delay to each request, yet during the spikes the 99th percentile and max response times are around 1 second. The ELB logs report a maximum delay of ~160 ms, so it looks like the extra latency is coming from the injector.

I’ve looked through the GC logs: pauses are under 50 ms and there are no obvious signs of memory pressure. I ran jHiccup and pauses were also below 50 ms. A DNS trace showed lookups every 30 seconds, so I tried a TTL of 0 and also ran directly against a single ELB instance by IP; neither made any difference to the spikes. I profiled Gatling and ran strace but didn’t find any evidence of the delay.

I’ve attached the test I ran; a simplified sketch of its shape is included below. Interestingly, if I remove the forever loop, the spikes go away. The loop-and-pause design mimics our production clients, so we need to keep it.
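For anyone who doesn’t want to open the attachment, the shape of the test is roughly the following. This is a simplified sketch rather than the attached file verbatim; the base URL, pause, and user counts are placeholders.

```scala
import scala.concurrent.duration._

import io.gatling.core.Predef._
import io.gatling.http.Predef._

// Simplified sketch of the attached SpikeDemo.scala (Gatling 2.2) -- placeholder values.
class SpikeDemo extends Simulation {

  val httpConf = http.baseURL("http://my-mock-elb.example.com") // placeholder ELB hostname

  // Each virtual user loops forever: one request against the 150 ms mock, then a pause,
  // which is how our production clients behave.
  val scn = scenario("SpikeDemo")
    .forever(
      exec(http("mock request").get("/"))
        .pause(2.seconds) // placeholder pause; users x (pause + response time) gives the ~500 rps
    )

  setUp(scn.inject(rampUsers(1000) over (30.seconds))) // placeholder user count and ramp
    .protocols(httpConf)
}
```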

Given all of the above, I’m wondering if it’s something internal to Gatling.

Any thoughts?

Regards
Dom

SpikeDemo.scala (660 Bytes)

And the version… 2.2.2

Couldn’t it be that your ELB’s IP changes every minute, so the hostname has to be resolved again?

And that all your previously kept-alive TCP connections are then dead, so you have to open them again?
FrontLine would tell :wink:

I don’t think so, as I ran the test against a single ELB instance using its IP.

Plus, once scaled, the ELB IPs don’t tend to change.

Is it ok if we run your test hitting your server?

Yes, go ahead.

Hi,

First time seeing this. We are also having trouble reproducing it: AWS thinks we are trying to DoS your server and blocks us…

What response do you get? I may need your injector IP to whitelist it.

A mix of unknown host (90%), timeouts (9.99%), and unknown errors (<0.01%). I won’t be able to test again until tomorrow though.

I can see only successful requests in the ELB logs so you are getting through with some load. It also looks like you are using an AWS instance in Germany for your injector. So I’m assuming the load isn’t being blocked and therefore you may be overloading the injector. We are using a c4.xlarge injector and we needed to tune the network settings to run this load successfully.

Weird, I was running a c4.2xlarge to be sure. I deliberately didn’t fully tune, just to see the impact and check whether any bottlenecks could be avoided by tuning alone. Maybe I tuned too lightly… Still, if I remember correctly, if DoS protection kicks in you wouldn’t be able to see it anyway, as it would happen on our side.

Can I load again tomorrow?

I’ve never seen DoS protection on ELBs in any tests I’ve run myself. This comes from their documentation: ‘ELB accepts only well-formed TCP connections. This means that many common DDoS attacks, like SYN floods or UDP reflection attacks will not be accepted by ELB and will not be passed to your application’. So the test shouldn’t have been blocked, as the connections would have been well-formed.

But if you still think DoS protection is a possibility I can raise a ticket - like you say, if the protection is before the ELB it wouldn’t be in the ELB log.

I think the key things we changed were ulimit, fs.file-max & net.ipv4.ip_local_port_range.
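In concrete terms, something along these lines (illustrative values, not the exact ones we used):

```
# /etc/security/limits.conf -- raise the per-process open file limit (ulimit -n)
*    soft    nofile    65536
*    hard    nofile    65536

# /etc/sysctl.conf -- system-wide file handle limit and a wider ephemeral port range
fs.file-max = 300000
net.ipv4.ip_local_port_range = 1024 65535
```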

You can test tomorrow.
I’ll leave it alone until you’re done.

Well, I’m beginning to think limitations (if any) shouldn’t happen at this rate… so under-tuning is the explanation that makes the most sense.

Had some time to do some testing, here’s what I got using the latest Amazon Linux AMI, Server JRE 1.8.0_92, on a c4.2xlarge.

I was finally able to reproduce with proper tuning:

However, adding “session required pam_limits.so” to /etc/pam.d/sshd killed almost every spike (could also be luck):

So I checked with FrontLine, but only got two spikes, which I then tried to match against the metrics:

Look at the connection metrics:

The two spikes didn’t match the two 1-second connection-opening times.

Looking at the injector profile, I could observe the same behavior:

But then Stéphane told me that Async Http Client flushes its internal connection pool every minute, which explains the huge connection-drop spike we see (-10k). In my case it doesn’t line up with increased connection-opening times, but I suspect it does in yours.

I launched one last test with JAVA_OPTS “-Dgatling.http.ahc.pooledConnectionIdleTimeout=30000” to force the flush every 30 seconds instead:

Looks like the injector is able to breathe more, as the open sockets never exceed 6k (then 10k by the end of the run). Stéphane is looking into AHC for any improvements he can make; in the meantime, you can try reducing the idle timeout.
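If you’d rather not pass a system property, I believe the same key can be set in conf/gatling.conf (the -D flag is just an override of it). Assuming the stock Gatling 2.2 layout, something like:

```
# conf/gatling.conf -- value in milliseconds, down from the 1-minute default
gatling {
  http {
    ahc {
      pooledConnectionIdleTimeout = 30000
    }
  }
}
```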

Great work - thanks for putting the effort in.

I ran the test with -Dgatling.http.ahc.pooledConnectionIdleTimeout=30000 and the spikes now occur every 30 seconds, so it does look like you’ve found the cause. I’m assuming this means there is some internal blocking during the flush that’s adding latency to new connections. Is this your understanding as well? The spikes don’t look any smaller with a 30-second flush, so we’re interested in any possible solution for this.

> internal blocking during the flush that’s adding latency

Good point. Why would flushing connections every minute add latency to the response times?

> adding “session required pam_limits.so” to /etc/pam.d/sshd

Are you finding that your ulimit increases are not being respected without enabling the pam_limits.so module?

> internal blocking during the flush that's adding latency
>
> Good point. Why would flushing connections every minute add latency to the response times?

Nah. Don't assume that every latency spike is caused by something blocking.

What happens here is that, suddenly, Gatling has to process tens of thousands of entries in its connection pool, then emit 10,000 connection-close orders, which means sending 10,000 TCP FIN messages over the wire that have to be processed and answered by the remote peer (the ELB, here). At this point, there's no way to tell where the issue is. It could perfectly well be the ELB being overwhelmed by this sudden FIN flood.

Maybe reducing the cleaner period and the idle timeout to 5 s or 10 s would improve things. Maybe not, as there's probably a good balance to be found between pool-entry processing and FIN emission.
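To make the mechanism concrete, here is a rough sketch of what such a timer-driven pool cleaner does. This is only the general shape, not AHC's actual code.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ListBuffer

// Rough sketch of a timer-driven idle-connection reaper (NOT AHC's actual implementation).
// A single periodic task scans the whole pool and closes every entry that has been idle
// longer than the timeout. When the cleaner period equals the idle timeout, entries pooled
// around the same time all expire in the same tick, so the FINs go out in one burst.
object PoolCleanerSketch {

  final case class PooledConnection(lastUsedMillis: Long) {
    def close(): Unit = () // a real client would send the TCP FIN here
  }

  private val pool = ListBuffer.empty[PooledConnection]

  def startCleaner(cleanerPeriodMillis: Long, idleTimeoutMillis: Long): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit = pool.synchronized {
        val now = System.currentTimeMillis()
        val expired = pool.filter(c => now - c.lastUsedMillis > idleTimeoutMillis)
        expired.foreach(_.close()) // thousands of closes can land in this single tick
        pool --= expired
      }
    }, cleanerPeriodMillis, cleanerPeriodMillis, TimeUnit.MILLISECONDS)
  }
}
```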

Anyway, I don't think we'll have time to investigate this use case ourselves any time soon outside a support contract.
Contribs welcome :slight_smile:

Cheers,

*Stéphane Landelle*
*GatlingCorp CEO*
slandelle@gatling.io

In this case it seems Async Http Client is putting an unrealistic load on the remote server, as real users wouldn’t behave like this: with real users, connections would be closed as evenly as they are established (and used), rather than in bursts.

If we tune the idle timeout down to mitigate the issue, then again we’re not holding connections open on the server for a realistic length of time. If we could decouple the idle timeout from the cleaner period, so we could clean often but time out at a realistic interval, that would be better. Even then the connection-close events wouldn’t be realistically distributed, but it would be closer.

We’ll have a look into it on our side.

> In this case it seems Async Http Client is putting an unrealistic load on the remote server, as real users wouldn't behave like this: with real users, connections would be closed as evenly as they are established (and used), rather than in bursts.

Actually, what's unrealistic is your server's behavior (but maybe it's just a proof of concept and not properly configured), as it trusts/relies on clients to deal with idle keep-alive connection timeouts. That's something you never do for web traffic. A proper server would never let 10,000 connections sit alive and idle for that long.