Handshake timed out after 10000ms under heavy load

Hi, I have an issue when ramping up to a certain number of users, with errors such as this:
i.n.h.s.SslHandshakeTimeoutException: handshake timed out after 10000ms
This is my injection profile:

.inject(
  rampConcurrentUsers(1).to(5000).during(2500),
  constantConcurrentUsers(5000).during(SteadyLength),
  rampConcurrentUsers(5000).to(1).during(500)
)

Usually the error appears around the 900-second mark (about 1800 users reached); the REST API requests that follow then fail, mostly with 500, 503, and 504 errors.

The odd thing is that when I change the target to 2000 users with a 1000-second ramp time, everything runs smoothly without any issues. I checked the logs, and at the same 400-second timestamp the total connections made in the two profiles differed hugely (around 1500 users vs 6000 users). What could have happened in this profile to cause such a difference?
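If the logged figures were concurrent users, the two profiles should actually look almost identical at the 400-second mark, since both ramps grow at roughly 2 users per second. A quick back-of-the-envelope check (plain Java, not Gatling internals; `usersAt` is my own helper, and I'm assuming both ramps start from 1 user):

```java
// Back-of-the-envelope check: concurrent users at time t under a linear
// ramp from `from` to `to` over `duration` seconds, which is what
// rampConcurrentUsers(from).to(to).during(duration) is documented to do.
public class RampMath {
    static double usersAt(double t, double from, double to, double duration) {
        return from + (to - from) * Math.min(t, duration) / duration;
    }

    public static void main(String[] args) {
        // Both profiles ramp at ~2 users/sec, so at t = 400 s they imply
        // almost the same concurrency (~800 users each):
        System.out.println(usersAt(400, 1, 5000, 2500)); // ramp 1 -> 5000 over 2500 s
        System.out.println(usersAt(400, 1, 2000, 1000)); // ramp 1 -> 2000 over 1000 s
    }
}
```

Since both profiles imply roughly 800 live users at that timestamp, the 1500 vs 6000 figures in the log are presumably cumulative connections rather than live users, which is where retried handshakes would show up.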

TLS handshakes are nowadays very expensive CPU-wise, because of the long keys and complex ciphers.
Your server is most likely saturated, Gatling is just the messenger here.

Hi @slandelle , thanks for your perspective on this matter.
Can you also explain the last paragraph I wrote there? I just want to understand more about how Gatling creates virtual users in a thread; hopefully it won't be a hard topic to understand.

when I change the target to 2000 users with a 1000-second ramp time, everything runs smoothly without any issues. I checked the logs, and at the same 400-second timestamp the total connections made in the two profiles differed hugely (around 1500 users vs 6000 users). What could have happened in this profile to cause such a difference?

If you reduce the number of users per second, you’re obviously going to reduce the number of TLS handshakes per second.
Moreover, when you start experiencing failures, depending on how your scenario is designed, the users whose (TCP connect + TLS handshake) failed are going to try again on the next request.
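The retry effect can be illustrated with a toy model (entirely my own sketch, not Gatling internals): suppose the ramp adds 2 users per second and some fraction of handshakes fail and are retried one second later. Cumulative connection attempts then far exceed live users. The 75% failure rate below is an arbitrary illustration, yet it already produces roughly a 4x gap, the same order of magnitude as 1500 vs 6000:

```java
// Toy model of connection retries under handshake failures.
// Assumption (mine, for illustration): each second `rate` new users arrive,
// a fraction `failRate` of handshake attempts fail, and every failed
// attempt is retried the next second.
public class RetryModel {
    // Returns {liveUsers, totalConnectionAttempts} after `seconds` of ramp.
    static double[] simulate(double rate, double failRate, int seconds) {
        double pendingRetries = 0, totalAttempts = 0, liveUsers = 0;
        for (int t = 0; t < seconds; t++) {
            double attempts = rate + pendingRetries; // new users + retries
            totalAttempts += attempts;
            double failed = attempts * failRate;
            liveUsers += attempts - failed;          // only successes go live
            pendingRetries = failed;                 // failures retry next tick
        }
        return new double[]{liveUsers, totalAttempts};
    }

    public static void main(String[] args) {
        double[] r = simulate(2.0, 0.75, 400);
        // Attempts far exceed live users once handshakes start failing.
        System.out.printf("live=%.0f totalConnectionAttempts=%.0f%n", r[0], r[1]);
    }
}
```

In steady state, attempts per second settle at rate / (1 - failRate) = 8, while successful arrivals stay at 2 per second, so the "connections made" counter inflates 4x even though live concurrency matches the healthy run.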


Hi @slandelle , I think this is weird behavior, as I tested with these two scenarios:

2000 Users, ramping rate 2 Users / sec (ramping time 1000 seconds), duration 1 hour
5000 Users, ramping rate 2 Users / sec (ramping time 2500 seconds), duration 1 hour

In the 2000-user case everything is still fine, while the 5000-user run started to crash at 15 minutes.
With logging enabled, the huge difference is still there (1500 users opened vs 6000 users opened at the same timestamp).
Is the total number of users calculated by Gatling beforehand and all pushed to the server at the same time?
Please explain this to me in more detail.

Sorry, but we can’t help without a reproducer. Moreover, if this is about the third-party gRPC plugin, that’s not something we support here.

Hi @slandelle , I only cover REST over HTTP in this topic; gRPC was in the past :blush: After further discussion we figured out where the problem is. I have learnt a lot from this, thank you.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.