2.0.0-RC2 Increasing heap size issue

Hi Guys,

I am running into memory issues when running my script with version 2.0.0-RC2.

My simulation runs on a Windows Server 2008 virtual machine.

During test execution, heap usage keeps increasing until CPU usage goes through the roof and my test fails.

My scenario looks like this:

setUp(scn.inject(
  constantUsersPerSec(1) during (600), // warm-up of the SUT
  rampUsersPerSec(1) to (150) during (3600),
  constantUsersPerSec(150) during (172800)
)).protocols(httpProtocol)

Some relevant script information:

The script uses ELFileBody request bodies, which are fed with values from a number of feeders.
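For illustration, a request in the script looks roughly like this (a simplified sketch: the feeder file, endpoint and template names here are placeholders, not the real ones):

val passengers = csv("passengers.csv").circular // placeholder feeder

val GetSeatMap_1A = exec(
  http("GetSeatMap_1A")
    .post("/SeatMapService") // placeholder endpoint
    .body(ELFileBody("GetSeatMap_1A.xml")) // SOAP template with ${...} placeholders filled from the feeder
)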

val httpProtocol = http
  .baseURL("https://some-host.com")
  .acceptHeader("application/soap+xml, application/dime, multipart/related, text/*")
  .acceptEncodingHeader("gzip, deflate")
  .userAgentHeader("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)")
  .check(regex("faultstring").notExists)
  .check(regex("[^1185]<").notExists)
  .check(regex("[^1059]<").notExists)
  .check(regex("errorNumber").notExists)
  .check(regex("[^1140]<").notExists)
  .check(regex("HostErrorCode").notExists)
  .check(regex("faultcode").notExists)
  .disableCaching

.randomSwitch(
  12.0 -> exec(GetLocalDateTime_1A),
  8.0 -> randomSwitch(
    90.0 -> exec(GetSeatMap_1A),
    8.0 -> exec(GetSeatMap_COD),
    2.0 -> exec(GetSeatMap_GAT)
  ),
  10.0 -> randomSwitch(
    90.0 -> exec(ListDocument_1A),
    8.0 -> exec(ListDocument_COD),
    2.0 -> exec(ListDocument_GAT)
  ),
  23.0 -> exec(ListOperationalEligibility_1A),
  15.0 -> randomSwitch(
    91.0 -> exec(ListPassenger_1A),
    9.0 -> exec(ListPassenger_COD)
  ),
  2.0 -> exec(ProvidePassengerHandlingInformation_1A),
  3.0 -> exec(ProvidePassengerInformation_1A),
  8.0 -> randomSwitch(
    90.0 -> exec(UpdatePassenger_1A),
    8.0 -> exec(UpdatePassenger_COD),
    2.0 -> exec(UpdatePassenger_GAT)
  ),
  1.0 -> exec(ProvideTravelDocuments_1A),
  3.0 -> exec(GetEligibilityForMobileBP_AF),
  14.0 -> exec(publishConfirmation),
  1.0 -> exec(UpdateDocument_1A)
)

JVM arguments:

-Xms2g
-Xmx6g
-XX:NewSize=1g
-XX:+HeapDumpOnOutOfMemoryError
-XX:+AggressiveOpts
-XX:+OptimizeStringConcat
-XX:+UseFastAccessorMethods
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+CMSClassUnloadingEnabled
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1

I can share a heapdump and the full script with you via dropbox if you like.

Any help is appreciated since I’m trying to convince my management to use Gatling as a replacement for LoadRunner :slight_smile:

Cheers

Daniel

Hi Daniel,

Could you also add -verbose:gc to the JVM args?
Ideally, have all of the following (not everything is always required, but more information upfront gives a better chance of resolving this in one pass):

a heap dump
any exceptions logged (e.g. OOM) in the logs
3 thread dumps taken as the problem kicks off
a verbose GC log (example flags below)
any OS stats (unlikely, but e.g. a 6G heap on a box with only 4G of RAM…)
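For the verbose GC log, flags along these lines should do (standard HotSpot options; the log file name is just an example):

-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:gatling-gc.log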

Can you try to simplify the script, to see if any one part of it is causing the problem and isolate it a bit more?

E.g. maybe it’s the nested switches, the notExists checks, or the type of requests you have.

thanks,
Alex

Hi Alex, thanks for your reply

I will try to deliver as much of what you asked for as possible, but I’m in a bit of a time squeeze because I’m on the critical path for the go-live of the SUT :slight_smile:

For now I have modified the script by putting a .during() loop around the .randomSwitch and changing the scenario to:

setUp(scn.inject(rampUsers(300) over (3600))).protocols(httpProtocol)
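Roughly, the change looks like this (an abbreviated sketch: the scenario name is a placeholder, and the loop body stands in for the full randomSwitch block from my first post):

val scn = scenario("ClosedModelSoak") // placeholder name
  .during(172800) { // each of the 300 users loops for the whole test window
    exec(GetLocalDateTime_1A) // abbreviated: the real body is the full randomSwitch block
  }

setUp(scn.inject(rampUsers(300) over (3600))).protocols(httpProtocol)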

In other words, I created a “closed system” simulation, like LoadRunner would. This setup seems to use a lot less heap and CPU. Would that fact point you in some direction?

I have uploaded a heap dump to Dropbox; which email address can I use to send a share invitation to? I have also added a screenshot of the VisualGC plugin of VisualVM (I couldn’t find a way to export its data to a file); hope that helps.

Thanks for your help!

Cheers

Daniel

Hi Daniel,

… I suspect it may be

constantUsersPerSec(1) during (600), // warm-up of the SUT
rampUsersPerSec(1) to (150) during (3600),
constantUsersPerSec(150) during (172800)

Gatling creates, up front, all the users it needs for the whole test.
Here it’s dominated by the last step: 150 users/s × 172,800 s (2 days) ≈ 26M users sitting in memory waiting to execute.

I saw this a while back when I was looking at doing open-model soak tests.
The Iago load tool, I believe, feeds users into the system only when a buffer of to-be-run users gets low, to avoid this memory consumption (sketched below).
I’ll raise a ticket for this once it’s confirmed in the heap dump.
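The idea, sketched in plain Scala (this is not Gatling’s internals, just an illustration of lazy scheduling for a fixed arrival rate):

// Lazily generate start offsets for constantUsersPerSec(150) during 172800s,
// instead of materializing a ~26M-element collection up front.
val startOffsets: Iterator[Double] =
  Iterator.iterate(0.0)(_ + 1.0 / 150).takeWhile(_ < 172800)

// A scheduler would keep only a small buffer of to-be-started users,
// refilling from the iterator as the buffer drains.
val nextBatch = startOffsets.take(1000).toVector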

Changing to a closed model for now is reasonable, except for my previous comment: the switches won’t provide the expected split in terms of throughput if the different cases have different durations. YMMV.

Else, some other workarounds:
run a shorter test first and get confidence with that.
get a load injector with a larger amount of RAM; a 64-bit JVM should scale up to 20-30 GB. Not sure if that would cover it, though.
as a hack, run multiple simulations back to back, thus spreading the heap usage over time. Not ideal, I know.

My email is ceeaspb@gmail.com; please include Stéphane and Pierre as well.

Thanks,
Alex

Hi Alex,

I have sent you and Stéphane Dropbox invites; I couldn’t find Pierre’s email address.

As for your suspicion: I recently used a similar (open model) scenario for endurance-testing another application with an even higher load (constantUsersPerSec(500)) without facing these issues. That script contained only plain (REST) URLs without ELFileBody, and it did not contain any nested randomSwitches either.

I just remembered another possibly relevant detail: during the failed tests, the SUT has some hiccups resulting in significant numbers of request timeouts. Perhaps these sessions are not cleaned up properly?

Cheers

Daniel

Hi Daniel,
Thanks, I loaded up the heap dump last night and raised a ticket for it. So it’s progressing and will be fixed.
Thanks
Alex

Hi Daniel,

My email is pdalpra at excilys dot com.
Could you send me an invite so that I can have a closer look at your heap dump?

OK, so it pre-schedules 20M sessions immediately up front, but still. I suspect he’s storing too much data in the session.

Yep, that’s what I was thinking too: 20M sessions alone shouldn’t blow through 6 GB of heap, which is why I’d like more details…

It took a while to load the dump as it’s large, so to save you some time:

There are 2 problems:

  1. There’s a fair amount of memory being used by the upfront creation of the users, but clearly the test starts and runs OK for some time, so all that is needed in this case is a larger heap. Having said that, if there are any quick, low-risk wins to reduce memory here, they would be worth taking to avoid carrying that weight for the duration of the test. For example, there are 22M Scenario objects taking half a GB alone.

Class Name | Objects | Shallow Heap | Retained Heap

Hi Daniel,

Sorry, but could you compress your heap dump, please? My Dropbox is only 2.5 GB… :frowning:

Then, 64,319 sun.security.ssl.SSLEngineImpl instances seem crazy to me! Is there a chance that your server doesn’t have a keep-alive timeout and never closes connections?

Cheers,

Stéphane

Hi Stephane, did you have a good holiday?

The compressed heap dump was already in the same Dropbox folder, but it’s still > 1 GB I’m afraid.

The server Gatling is talking to is an IBM DataPower. The hiccups I mentioned earlier seem to be caused by this component; our infrastructure team is currently investigating. The hiccups cause Gatling users to time out (response times increase to > 60 s). Currently I am running my test in “closed system mode”, and it looks like the Old Gen section of the heap increases after each hiccup, as if those timed-out sessions are not properly cleaned up by GC.

As soon as the DataPower issue has been fixed, I will rerun the test in “open system mode” to see if that fixes the leak in Gatling as well.

Cheers

Daniel

> Hi Stephane, did you have a good holiday?

Yep, great, thanks!

> The compressed heap dump was already in the same Dropbox folder, but it's still > 1 GB I'm afraid.

What happens is that you shared the whole folder, so Dropbox wants me to have enough space for the total size, not only for a given file. I have 2.5 GB, so sharing just the compressed file would work.

> The server Gatling is talking to is an IBM DataPower. The hiccups I mentioned earlier seem to be caused by this component; our infrastructure team is currently investigating. The hiccups cause Gatling users to time out (response times increase to > 60 s). Currently I am running my test in "closed system mode", and it looks like the Old Gen section of the heap increases after each hiccup, as if those timed-out sessions are not properly cleaned up by GC.
>
> As soon as the DataPower issue has been fixed, I will rerun the test in "open system mode" to see if that fixes the leak in Gatling as well.

Interesting. That would mean that timed-out channels still linger in the pool?! I'll investigate tomorrow.
How is your keep-alive timeout set up?

Cheers,

Stéphane

OK, just realized you’ve removed the other files.
Downloading right now, thanks!

So, there might be other problems that I will investigate later, but what’s for sure is that our upfront-scheduled users eat between 2 and 5 GB!!!
The good news is that it’s easy to fix.

Stay tuned.

Hey Daniel,

I’ve just pushed a first fix: https://github.com/gatling/gatling/issues/2129

Could you grab a snapshot from Sonatype in about 10 minutes (once Travis has finished building) and give it a try, please?

Hi Stephane,

Thanks for the swift response, as always :slight_smile:

I’ve tried the Sonatype snapshot but it returns this error when I try to start the script:

java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at scala_maven_executions.MainHelper.runMain(MainHelper.java:164)
at scala_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
Caused by: java.lang.NoSuchMethodError: io.gatling.core.util.StringHelper$.checkSupportedJavaVersion()V
at io.gatling.app.Gatling.start(Gatling.scala:88)
at io.gatling.app.Gatling$.fromMap(Gatling.scala:54)
at io.gatling.app.Gatling$.runGatling(Gatling.scala:79)
at io.gatling.app.Gatling$.runGatling(Gatling.scala:58)
at io.gatling.app.Gatling$.main(Gatling.scala:50)
at io.gatling.app.Gatling.main(Gatling.scala)
… 6 more

Can you help me out please? :slight_smile:

Cheers

Daniel

Are you sure you didn’t merge with some old jars when you unpacked?
This method is definitely here: https://github.com/gatling/gatling/blob/master/gatling-core/src/main/scala/io/gatling/core/util/StringHelper.scala#L44

Hi Stephane,

I have removed all jars from my local Maven repo and rebuilt everything, and now it works. I see Gatling is still allocating > 3 GB (of 6 GB max) at the start of the test; is this expected behaviour? I’ll let the test run through the night; let’s see if it’s still alive in the morning :slight_smile:

Cheers

Daniel

No, it’s not expected (your Xms is 2g; I wouldn’t expect more). Could you provide a heap dump from start-up, please?