[GATLING 2] SSL handshake_failures

After upgrading to gatling 2 (the snapshot kindly created by Stephane - 2.0.0.20130605) we’re seeing quite a few SSL handshake failures.
We used to get som in gatling 1.5.1 too, but we get far more now, so it is probably a client issue.

Stefan

12:38:19.294 [WARN ] i.g.h.a.AsyncHandler - Request ‘Service call’ failed
javax.net.ssl.SSLException: Received fatal alert: handshake_failure
at sun.security.ssl.Alerts.getSSLException(Alerts.java:208) ~[na:1.7.0_04]
at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1639) ~[na:1.7.0_04]
at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1607) ~[na:1.7.0_04]
at sun.security.ssl.SSLEngineImpl.recvAlert(SSLEngineImpl.java:1776) ~[na:1.7.0_04]
at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:1080) ~[na:1.7.0_04]
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:884) ~[na:1.7.0_04]
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:758) ~[na:1.7.0_04]
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624) ~[na:1.7.0_04]
at org.jboss.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1225) [netty-3.6.6.Final.jar:na]
Wrapped by: java.net.ConnectException: Received fatal alert: handshake_failure to https://whereever.no/service
at com.ning.http.client.providers.netty.NettyConnectListener.operationComplete(NettyConnectListener.java:103) ~[async-http-client-1.7.18.20130605.jar:na]
at org.jboss.netty.channel.DefaultChannelFuture.notifyListener(DefaultChannelFuture.java:427) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelFuture.notifyListeners(DefaultChannelFuture.java:413) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelFuture.setFailure(DefaultChannelFuture.java:380) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.handler.ssl.SslHandler.setHandshakeFailure(SslHandler.java:1417) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1293) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.handler.ssl.SslHandler.decode(SslHandler.java:913) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.6.6.Final.jar:na]
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.6.6.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [na:1.7.0_04]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [na:1.7.0_04]
at java.lang.Thread.run(Thread.java:722) [na:1.7.0_04]

As you see with this stacktrace, this is pure regular Netty code. Could you provide some more information about your env so that I can open an issue there, please?
JDK, OS, application under stress and everything you think meaningful.

Have you tried upgrading your JDK?

Did you pass a custom trust or key manager?

Hi,

The certificate that is used by the server is issued using a certificate java doesn’t trust by default. We’ve not set up a custom trust store, however I don’t see any warnings - which is kinda weird, right?

We’ve seen this issue both on windows and linux and different versions of java (1.7.0_19 linux and 1.7.05-b06 windows)

Any ideas?

I think I know why you see more of those: we have disabled AHC’s retry feature by default (see gatling.conf file: https://github.com/excilys/gatling/commit/ce9e33e5f32c9bf3e3c1e5769c310aba94be4a9b).

So I guess the problem was always here but was hidden.

Yep. That helped. Probably a threading issue.

So, no longer handshake failures, nor 505, nor slowdowns?

BTW, do your 505 problems also happen on HTTP?
I realized that zero-copy is disabled over HTTPS.

I’ve only seen it happen using https

When I ran through Charles I used http so that might have been the reason why it worked

Stefan

No more ssl handshake issues. The 505 thing is a different issue.

I’ll have to clean up this part of AHC. Stay tuned.
And thanks for your patience…

I’ve been investigating the ssl issue a bit

I’ve looked at the ahc and netty code, but it’s hard to grasp what’s going on.

Apparently one can get ssl debug info using -Djavax.net.debug=ssl,handshake on both client and server.
I’ll give it a shot within the next few days.

Stefan

Hi Stefan,

Could you sum up the situation, please?

IIRC, you had 2 different kinds of problems:

  • requests being “mixed”, bodies being send from a given user actually being sent from another one
  • SSL handshake issues
    I had you try different strategies: passing a file, and passing a byte array read from the file.

What is the outcome of the different combinations?
Are requests being also mixed over HTTP?

Cheers,

Stéphane

See inline comments

Hi Stefan,

Could you sum up the situation, please?

IIRC, you had 2 different kinds of problems:

Yes

  • requests being “mixed”, bodies being send from a given user actually being sent from another one

No exactly right. It seems like the request is not flushed correctly after passing a raw data to http.post. When the connection is reused by the same user in the next request, we get a 505. However, if we introduce a pause of 1 sec between these two requests, we never get any 505s.

  • SSL handshake issues

We’re seeing quite a few of these, but after we increased the retry count, the situation has improved a lot

I had you try different strategies: passing a file, and passing a byte array read from the file.

What is the outcome of the different combinations?

File->505
Byte array->ok

We ended up introducing the short pause to keep the code clean

Are requests being also mixed over HTTP?

Yes indeed

Cheers,

Stéphane

Sorry for being unclear before. Hth.

Stefan

Btw, requests are not being mixed when I pass them through Charles

AHC body parts code is eminently complex: strategies are different according to the part type (file or string or bytes) and the protocol (in HTTP, it tries to perform zero-copy and write directly into the socket from the file Channel, but it doesn’t over over HTTP).
I’ll try to clean up all this and come up with a prototype.

Thank you for summing up, I’m doing too many things and started losing track.

Stay tuned,

Stéphane

hhm, I tried reproducing the ssl handshake issue using a way simpler simulation without multipart requests - but couldn’t. Maybe the 505 issue is related after all? I’ll keep you posted.

I just produced a new timestamp: 2.0.0.20130628

You should no longer get the NPE.

Perfect. Works like expected. Thanks.