Large overhead on first request

Hey guys, I just started a new round of load testing and I'm seeing some funky behavior on the first request of a test. It's most pronounced if I run a test with a single user. For instance, if I test one of my endpoints with 1 user and a ramp of 1, I get results something like:

min/max/average = 363ms

Yet when I make the same single request through curl or ab, the response time never exceeds 70ms. In all cases I can repeat the results with only minor fluctuation. I see something similar if I run a longer test with 60 users and a ramp of 60. In that case I get results like:

Min 51
Max 721
Average 85

The graph shows the first request being recorded as taking 721ms, with every subsequent request at ~50ms. I wonder: is the first request being timed from the start of the test rather than from the start of the request itself? I'm not sure if that's possible. I can provide my test if it helps, but it's just doing a single POST. I'm on version 1.1.3.

Chris

Hi Chris,

Good catch, thanks for reporting.

There was a problem with HTTP_CLIENT being initialized after the first request's start time had already been computed:
https://github.com/excilys/gatling/issues/481

Hope this was the only problem though…

Cheers,

Steph

2012/4/17 Chris Carrier <ctcarrier@gmail.com>

I’ll try to do some tests today. I’m not sure the loading of the httpClient explains a 600ms difference.

For example, Netty uses cachedThreadPools, so maybe it takes time to initialize them.

The main problem is actually that there's no hook in AsyncHttpClient for the start of connection establishment. We can only approximate the request start time as the instant the request is passed to the HttpClient. I'll ask the AHC guys if something can be done about this.
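
To illustrate the limitation, the only timing we can capture looks roughly like this (a sketch against the AHC 1.x API; URL and names are made up):

    import com.ning.http.client.{ AsyncHttpClient, AsyncCompletionHandler, Response }

    val client = new AsyncHttpClient()
    val request = client.prepareGet("http://localhost:8080/ping").build()

    // The only observable instant is when the request is handed to the client,
    // so DNS resolution and connection establishment end up counted in the
    // first request's response time.
    val start = System.currentTimeMillis()
    client.executeRequest(request, new AsyncCompletionHandler[Response] {
      override def onCompleted(response: Response): Response = {
        println("measured: " + (System.currentTimeMillis() - start) + " ms")
        response
      }
    })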

Cheers,

Steph

2012/4/17 Stéphane Landelle <slandelle@excilys.com>

Just to let you know: I've started a thread on the AHC list:
https://groups.google.com/forum/?hl=fr#!topic/asynchttpclient/DRMwydDZtDA

2012/4/18 Stéphane Landelle <slandelle@excilys.com>

Some improvement on this matter.
I was wrong about when Scala companion object initialization happens: not when the class is loaded, but on first access.
https://github.com/excilys/gatling/commit/bcd084cff33ee46f21fa5ade2cda66c2860450a9
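
To illustrate (a standalone sketch, nothing Gatling-specific):

    object Expensive {
      // This body runs on the first access to Expensive,
      // not when the class is loaded.
      println("initializing Expensive...")
      val data = (1 to 1000000).toList
    }

    object Demo extends App {
      println("before first access") // prints before the initialization message
      Expensive.data                 // the initialization cost is paid here
      println("after first access")
    }

Touching such objects eagerly before the run starts moves their cost out of the first request.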

On my test case, I was able to reduce the overhead by ~110 ms.
I still have an 80 ms overhead on the first request, but it seems to happen in the HTTP layer (mostly establishing the connection) and I'm not really sure I can do anything about that (still looking, though).

@Chris: Could you build the sources and try it out, please? The overhead you observe is much, much bigger than mine, so I wonder if there's something funky on your network, such as slow DNS resolution.

2012/4/18 Stéphane Landelle <slandelle@excilys.com>

Sure, I'll build the source and check it out, hopefully today.

Chris

Cool, thanks!

2012/4/19 Chris Carrier <ctcarrier@gmail.com>

I'm trying to run off the newest code but am getting this error when I
run the test:

Exception in thread "main" org.fusesource.scalate.NoValueSetException:
The value for 'simulationDate' was not set

I basically built the whole bundle, then copied highcharts in and tried to run the test I've been using against 1.1.3. Is this simulationDate something I'm supposed to define?

Any ideas?

Chris

What version of gatling-charts-highcharts did you drop in?
1.1.3 should still be compatible with 1.1.4-SNAPSHOT, but 'simulationDate' was removed a long time ago.

2012/4/20 Chris Carrier <ctcarrier@gmail.com>

I just copied highcharts 1.1.3 over. Here are the gatling jars I have
in my lib:

gatling-app-1.1.4-SNAPSHOT.jar
gatling-charts-1.1.4-SNAPSHOT.jar
gatling-charts-highcharts-1.1.3.jar
gatling-core-1.1.4-SNAPSHOT.jar
gatling-http-1.1.4-SNAPSHOT.jar
gatling-jdbc-1.1.4-SNAPSHOT.jar
gatling-recorder-1.1.4-SNAPSHOT.jar

I'll paste the complete stack trace at the bottom of this email. It's possible I screwed something up switching from the last release to the new snapshot. I just expanded the tar for 1.1.4 and then copied the highcharts jar over from 1.1.3. Then I swapped my symbolic link to point to the new location. I don't see any old jars hiding in there.

Exception in thread "main" org.fusesource.scalate.NoValueSetException:
The value for 'simulationDate' was not set
        at org.fusesource.scalate.RenderContext$$anonfun$attribute$1.apply(RenderContext.scala:158)
        at org.fusesource.scalate.RenderContext$$anonfun$attribute$1.apply(RenderContext.scala:158)
        at org.fusesource.scalate.RenderContext$class.attributeOrElse(RenderContext.scala:166)
        at org.fusesource.scalate.DefaultRenderContext.attributeOrElse(DefaultRenderContext.scala:30)
        at org.fusesource.scalate.RenderContext$class.attribute(RenderContext.scala:158)
        at org.fusesource.scalate.DefaultRenderContext.attribute(DefaultRenderContext.scala:30)
        at templates.$_scalate_$page_layout_html_ssp$.$_scalate_$render(page_layout.html.ssp.scala:26)
        at templates.$_scalate_$page_layout_html_ssp.render(page_layout.html.ssp.scala:94)
        at org.fusesource.scalate.layout.NullLayoutStrategy$.layout(LayoutStrategy.scala:43)
        at org.fusesource.scalate.TemplateEngine$$anonfun$layout$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(TemplateEngine.scala:551)
        at org.fusesource.scalate.TemplateEngine$$anonfun$layout$1$$anonfun$apply$mcV$sp$1.apply(TemplateEngine.scala:551)
        at org.fusesource.scalate.TemplateEngine$$anonfun$layout$1$$anonfun$apply$mcV$sp$1.apply(TemplateEngine.scala:551)
        at org.fusesource.scalate.RenderContext$class.withUri(RenderContext.scala:446)
        at org.fusesource.scalate.DefaultRenderContext.withUri(DefaultRenderContext.scala:30)
        at org.fusesource.scalate.TemplateEngine$$anonfun$layout$1.apply$mcV$sp(TemplateEngine.scala:550)
        at org.fusesource.scalate.TemplateEngine$$anonfun$layout$1.apply(TemplateEngine.scala:547)
        at org.fusesource.scalate.TemplateEngine$$anonfun$layout$1.apply(TemplateEngine.scala:547)
        at org.fusesource.scalate.RenderContext$.using(RenderContext.scala:46)
        at org.fusesource.scalate.TemplateEngine.layout(TemplateEngine.scala:547)
        at org.fusesource.scalate.TemplateEngine.layout(TemplateEngine.scala:579)
        at org.fusesource.scalate.TemplateEngine.layout(TemplateEngine.scala:588)
        at org.fusesource.scalate.TemplateEngine.layout(TemplateEngine.scala:566)
        at com.excilys.ebi.gatling.charts.template.PageTemplate.getOutput(PageTemplate.scala:44)
        at com.excilys.ebi.gatling.charts.report.ActiveSessionsReportGenerator.generatePage$1(ActiveSessionsReportGenerator.scala:44)
        at com.excilys.ebi.gatling.charts.report.ActiveSessionsReportGenerator.generate(ActiveSessionsReportGenerator.scala:67)
        at com.excilys.ebi.gatling.charts.report.ReportsGenerator$$anonfun$generateFor$2.apply(ReportsGenerator.scala:95)
        at com.excilys.ebi.gatling.charts.report.ReportsGenerator$$anonfun$generateFor$2.apply(ReportsGenerator.scala:95)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
        at scala.collection.immutable.List.foreach(List.scala:76)
        at com.excilys.ebi.gatling.charts.report.ReportsGenerator$.generateFor(ReportsGenerator.scala:95)
        at com.excilys.ebi.gatling.app.Gatling.com$excilys$ebi$gatling$app$Gatling$$generateReports(Gatling.scala:259)
        at com.excilys.ebi.gatling.app.Gatling$$anonfun$start$1.apply(Gatling.scala:120)
        at com.excilys.ebi.gatling.app.Gatling$$anonfun$start$1.apply(Gatling.scala:120)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.collection.immutable.VectorIterator.foreach(Vector.scala:648)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:73)
        at scala.collection.immutable.Vector.foreach(Vector.scala:63)
        at com.excilys.ebi.gatling.app.Gatling.start(Gatling.scala:120)
        at com.excilys.ebi.gatling.app.Gatling$.main(Gatling.scala:72)
        at com.excilys.ebi.gatling.app.Gatling.main(Gatling.scala)

It seems like the template "page_layout.html.ssp" (in gatling-charts) wasn't properly recompiled and you still have an old version in your gatling-charts-1.1.4-SNAPSHOT.jar.
As you can see below, the source file doesn't reference "simulationDate" anymore:
https://github.com/excilys/gatling/blob/1.1.3/gatling-charts/src/main/resources/templates/page_layout.html.ssp

Have you properly done a clean install?

2012/4/20 Chris Carrier <ctcarrier@gmail.com>

Stupid mistake: I did 'mvn package' instead of 'mvn clean package'. I'm a bit rusty on Maven.

So after getting things up and running, I'm honestly having trouble telling whether the newer version is better. Things are fairly easy to understand at a low user count. For instance, if I do 60 users with a ramp of 60, it's obvious that the first request is being recorded as ~200ms while every other request is 5ms. In this case 1.1.4 seems like it might be a little better: the first request is often ~190ms where in 1.1.3 it's ~205ms.

But things get confusing when I crank things up. I ran 4 tests with 25k users over 60 seconds. My service handles the requests fine, but in all tests there are big humps in response times at the beginning, and with 1.1.4-SNAPSHOT there is an even bigger spike on top of the hump. At this point I'm not really sure whether the hump or the spike reflects my actual service performance. I don't see any reason why performance would be so much worse at the beginning of a test. The service hasn't been restarted, so there's no hit for warming up the JVM. But I'm also not ready to completely discount the possibility that there could be some connection.

I uploaded some examples of the response time graphs to show you what
I mean. Here are two different runs with 1.1.4-SNAPSHOT:

http://imgur.com/a/OJorO

And here are results with 1.1.3:

http://imgur.com/a/rPwuv

Obviously both have really big humps up to 2000ms at the beginning of the test, but 1.1.4 also has a huge spike up to 4000ms. The majority of results come in at around 40ms, with one smaller hump near the end of the test that could be some GC overhead or something.

Is there any way I could hack around this issue to separate the actual results from possibly inflated ones? I'm thinking of something like adding a dummy chain that fires some HTTP request to warm things up, and then running my real chain after that. I don't know enough about the inner workings of Gatling to know if that would help or not.
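
Something like this, maybe (just a sketch, I'm guessing at the exact DSL, and I don't know how to make the real chain wait for the warm-up to finish):

    // hypothetical warm-up scenario: one user firing a throwaway POST
    val warmUp = scenario("Warm up")
        .exec(http("warmup_post").post("/my/endpoint"))

    List(
        warmUp.configure.users(1).protocolConfig(httpConf),
        scn.configure.users(25000).ramp(60).protocolConfig(httpConf)
    )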

Chris

One more clue. It does seem like things are slower at the beginning of the test. The weird thing is that while the test is running I see the user distribution being much heavier at the beginning. This could very possibly be some bottleneck in my service. But if the requests were evenly distributed across all 60 seconds, I would expect the performance to be fairly flat, since the test just does a single POST to the same endpoint on every request. Here is what the test output looks like:

Simulation ServiceSimulation started...
5 sec | Users: active=1126/25000 | Requests: OK=1307 KO=0
10 sec | Users: active=957/25000 | Requests: OK=3960 KO=0
15 sec | Users: active=1260/25000 | Requests: OK=6143 KO=0
20 sec | Users: active=969/25000 | Requests: OK=8921 KO=0
25 sec | Users: active=479/25000 | Requests: OK=11900 KO=0
30 sec | Users: active=23/25000 | Requests: OK=14849 KO=0
35 sec | Users: active=24/25000 | Requests: OK=17342 KO=0
40 sec | Users: active=26/25000 | Requests: OK=19836 KO=0
45 sec | Users: active=20/25000 | Requests: OK=22338 KO=0
50 sec | Users: active=25/25000 | Requests: OK=24829 KO=0

I actually hit the endpoint manually while the test was running to see what the real response time is, and things are definitely slow for the first 30 seconds while the concurrent user count is really high. Then around the 30 second mark the response time went from ~2000ms down to ~40ms. I suspect this is something in my code, but I'm not really sure yet what would cause it. So the hump in the test results I sent earlier could be Gatling behaving correctly.

Maybe you should use a longer ramp: launching 25,000 users over 60 secs (over 400 new users per second) is quite a big rush.
Not sure the JIT can handle this correctly…

2012/4/21 Chris Carrier <ctcarrier@gmail.com>

Hi Chris,

Could you post a gist of your scenario, please?

2012/4/21 Stéphane Landelle <slandelle@excilys.com>

Gist here:

https://gist.github.com/2471958

As for the ramp, doesn't Gatling evenly distribute the users over the ramp period? So if I increase the ramp, wouldn't I have to proportionally increase the users to maintain the same load? I'd be fine with a more organic ramp-up where it starts with fewer users and gradually ramps up to some level, but I'm not sure how to accomplish that. Is there a way I can configure things to gradually increase the number of concurrent users? The service I'm testing is pretty fast, so the only way I can really generate load on it is to hit it with a large number of concurrent requests.

Chris

Hi Chris,

The ramp is just the period within which the specified number of users will be launched.

If you set 1000 users with a ramp of 1s there will be one user started each millisecond until the first second has passed.
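
In DSL terms, just as an illustration (the same configure API you already have in your gist):

    scn.configure.users(1000).ramp(1).protocolConfig(httpConf)
    // 1000 users over a 1 second ramp = one new user started every millisecond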

I'm not sure I fully understood your question; I hope this answers it.

If you want more complicated ramps, you can also add several configurations with different ramps and delays.

Cheers,
Romain

Yeah, that's how I understood ramp. In the end I want to push my service to the breaking point so that I understand what breaks first. But I agree that immediately hammering it with 500 users/second is not very realistic. So if I wanted more gradual traffic growth, should I just return more configurations in the final list? Something like:

List(
    scn.configure.users(500).ramp(60).protocolConfig(httpConf),
    scn.configure.users(5000).ramp(60).protocolConfig(httpConf),
    scn.configure.users(25000).ramp(60).protocolConfig(httpConf)
)

Is that the recommended way to do something like this? Will the
different scenarios run sequentially? I'll play with this today to
try to get a more gradual traffic growth.

Chris

Scenarios will start at the same time.

If you want, you can delay the second and third scenarios with the delay() method.

List(
    scn.configure.users(500).ramp(5),
    scn.configure.users(25000).ramp(25).delay(60)
)

For example ;)
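
And for the gradual growth you described, something along these lines could work (a sketch; I'm assuming each delay() counts from the start of the run):

    List(
        scn.configure.users(500).ramp(60).protocolConfig(httpConf),
        scn.configure.users(5000).ramp(60).delay(60).protocolConfig(httpConf),
        scn.configure.users(25000).ramp(60).delay(120).protocolConfig(httpConf)
    )

Each stage then begins roughly when the previous one finishes ramping.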

Cheers,
Romain

Hi there,

I finally gave up on finding the reason for this first-request overhead: it happens somewhere deep in the NIO layer.
So I decided to add the possibility (enabled by default) to perform a warm-up request:
https://github.com/excilys/gatling/issues/532

Cheers,

Stephane

2012/4/23 Romain Sertelon <bluepyth@gmail.com>