Gatling Graphite integration - active sessions

Hello

Today I tried out Gatling's integration with Graphite. I am able to push data to Graphite and have it rendered in Grafana.

Here is the image of Grafana: http://i.imgur.com/FK1eQND.png

My test is inject(rampUsersPerSec(1) to(10) during(30 seconds), constantUsersPerSec(10) during(60 seconds)), so the number of active requests more or less makes sense.

Here is the part of the report that was generated after all the tests were run:

http://i.imgur.com/7aICx4u.png

My goal is to show “Active sessions along the simulation” in Graphite/Grafana. I am load testing an application and I want to show the point where one machine has too much load, which would be represented by active sessions > 10 (in this case, if this load were too big for the machine).

The problem is that I am not sure which metric to use. At first I thought allUsers/active was the one I wanted, because according to the documentation:

  • active : # of users currently running the scenario

But it is certainly not that one: it is 0 most of the time, and I am not sure what that means. Maybe 0 active users is shown because this test does not overload the machine and the mean response time is 19ms, so Gatling somehow does not register them (I am guessing here; my experience with metrics and Graphite is very limited).

Showing requests per second would not work, because when the machine is overloaded requests will still be dispatched while Gatling waits for the responses.

Is there some way to get a graph like “Active sessions along the simulation” in Graphite, then?

That’s weird. Are you sure you’re not displaying the derivative instead of the raw value?

I think so. Here is the Graphite screenshot with active + done + waiting users:


Done and waiting look fine, I think. Here is the one with active: (image on Imgur)

Also, it seems I am missing min and max for requests: I only have count.

I am wondering if this could be connected to my storage-schemas.conf and storage-aggregation.conf so here they are:

storage-schemas.conf

[Gatling stats]
pattern = ^gatling\..*
retentions = 1s:6d
[default]
pattern = .*
retentions = 10s:1d

storage-aggregation.conf

[min-gatling]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min

[max-gatling]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max

[min]
pattern = \.lower$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.upper(_\d+)?$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.sum$
xFilesFactor = 0
aggregationMethod = sum

[count]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

[count_legacy]
pattern = ^stats_counts.*
xFilesFactor = 0
aggregationMethod = sum

[default_average]
pattern = .*
xFilesFactor = 0.3
aggregationMethod = average

On Tuesday, 9 September 2014 09:49:20 UTC+2, Stéphane Landelle wrote:

I think part of the problem is that Gatling sends floating-point values with a comma (37,000000), and I observed that Graphite does not accept such metrics: it wants floating-point values with a dot.

On Tuesday, 9 September 2014 10:51:48 UTC+2, Andrzej Dębski wrote:

The comma problem was on my side: the wrong locale was set on the Gatling machine. As for active users, I created a file with the metrics that are being sent by Gatling:

https://gist.github.com/Adebski/27392aad69487df4e024

and as you can see, .active is 0 all the time.

On Tuesday, 9 September 2014 11:36:45 UTC+2, Andrzej Dębski wrote:

The locale/format problem is an issue on our side: https://github.com/gatling/gatling/issues/2177

Regarding active users being always equal to 0: is there a chance that your users don’t live longer than a second?

Yes, most of my users are short-lived: in this example each user makes one request and mostly receives the response within a few milliseconds.

On Tuesday, 9 September 2014 14:56:36 UTC+2, Stéphane Landelle wrote:

And about the locale: for now I just changed my Linux machine's locale to en_US, which uses a dot as the decimal separator.

> Yes, most of my users are short-lived: in this example each user makes one request and mostly receives the response within a few milliseconds.

So that’s what happens: “active users” are the ones that are still alive at the end of each second.
I can understand that people using very short-lived virtual users get a bit confused. Suggestions welcome.
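
To make the sampling effect concrete, here is a rough Python sketch. It is not Gatling's code; it just assumes the definition above (count the users alive at each whole-second boundary), and the start offsets and lifetimes are made-up values in the spirit of this thread.

```python
def active_at_second_boundaries(start_times_ms, lifetime_ms, duration_s):
    """Count users alive at the end of each second.

    A user is considered alive from its start time (inclusive) until
    start + lifetime (exclusive). All times are in milliseconds.
    """
    samples = []
    for second in range(1, duration_s + 1):
        t = second * 1000
        alive = sum(1 for s in start_times_ms if s <= t < s + lifetime_ms)
        samples.append(alive)
    return samples


# Hypothetical load: 10 users per second for 3 seconds, started at 50ms, 150ms, ...
starts = [50 + 100 * i for i in range(30)]

# 19ms users have always finished before the sampling instant.
short_lived = active_at_second_boundaries(starts, lifetime_ms=19, duration_s=3)

# 1500ms users overlap the sampling instants, so they show up as active.
long_lived = active_at_second_boundaries(starts, lifetime_ms=1500, duration_s=3)

print(short_lived)
print(long_lived)
```

With short-lived users every sample is 0 even though 30 users ran, which matches the flat .active metric above.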

> And about the locale: for now I just changed my Linux machine's locale to en_US, which uses a dot as the decimal separator.

I’ve pinged our Graphite “expert”. I’m not sure Graphite can accept any format other than en_US, so we shouldn’t use Gatling’s locale for this.
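
For reference, Graphite's plaintext protocol takes one "path value timestamp" line per metric. Below is a minimal Python sketch of building such a line with a dot as the decimal separator regardless of the machine's locale. This only illustrates the expected format, it is not how Gatling builds its lines, and the metric path is a made-up example.

```python
def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Build one line of Graphite's plaintext protocol.

    Python's "f" format type ignores the process locale, so the decimal
    separator is always a dot, never a comma.
    """
    return f"{path} {value:.6f} {timestamp}\n"


# Hypothetical metric path and timestamp, for illustration only.
line = graphite_line("gatling.example.users.allUsers.active", 37.0, 1410249000)
print(line, end="")
```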

> > Yes, most of my users are short-lived: in this example each user makes
> > one request and mostly receives the response within a few milliseconds.
>
> So that's what happens: "active users" are the ones that are still alive
> at the end of each second.
> I can understand that people using very short-lived virtual users get a bit
> confused. Suggestions welcome.

I think it makes sense as long as you know how it is measured.
It seems reasonable that if each user only lives for 20ms and one user is
injected every 100ms, then there is only a small chance of sampling 1 active user
at any given point in time during the test (with some assumptions).
--> This should be documented in the measurements/timings page:
http://gatling.io/docs/2.0.0-RC4/general/timings.html
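
As a back-of-the-envelope check of that estimate, Little's law says the average number of users alive at once is arrival rate times lifetime. A small Python sketch, assuming evenly spaced arrivals and a fixed lifetime (both simplifications):

```python
def mean_concurrency(arrivals_per_second: float, lifetime_seconds: float) -> float:
    """Little's law: average number of users alive at a random instant."""
    return arrivals_per_second * lifetime_seconds


# One user every 100ms (10 users/s), each living 20ms.
example = mean_concurrency(10, 0.020)

# The original test: 10 users/s with a mean response time of about 19ms.
original_test = mean_concurrency(10, 0.019)

print(example, original_test)
```

Since the lifetime is shorter than the gap between arrivals, at most one user is alive at a time, so roughly 0.2 is also the chance that a random instant catches an active user.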

Part of the initial question was the difference between the Gatling report
(http://i.imgur.com/7aICx4u.png) and the Graphite chart
(http://i.imgur.com/FK1eQND.png).
That difference doesn't seem to be reconciled yet.
Were both of those charts from the same test?
If so, it is likely that the report is wrong and Graphite (+ console) is correct.
I tested this with cUps(1) and 1 request of <10ms.

The console reported 0 running users:

          waiting: 20 / running: 0 / done:0
          waiting: 15 / running: 0 / done:5
          waiting: 10 / running: 0 / done:10
          waiting: 5 / running: 0 / done:15
          waiting: 0 / running: 0 / done:20

but the reports showed 1 active user for the duration of the test.

With cUps(5), the reports show active=5 and the console shows 0.
This looks like a defect in the reports, where the users get summed. #2178

Thanks
Alex

Yes, both charts are from the same test. Maybe the users in the charts are the users that are active at the start of each second (in other words: the users started during this second plus the users from previous seconds that have not finished yet).

I think that, instead of having:

user(t+1) = user(t) + start(t+1) - end(t+1)

we should have:

user(t+1) = user(t) + start(t+1) - end(t)

This way, we would mark users as dead in the next bucket instead of the current one.
WDYT?

See https://github.com/gatling/gatling/issues/2200
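
To compare the two bucket formulas side by side, here is a rough Python sketch. It is not Gatling's implementation. It assumes per-second counts of started and ended users, and it assumes nothing has ended before the first bucket (that initial condition is my assumption, not something stated above).

```python
def active_same_bucket(starts, ends):
    """user(t+1) = user(t) + start(t+1) - end(t+1):
    users ending in a bucket are removed in that same bucket."""
    active, series = 0, []
    for started, ended in zip(starts, ends):
        active += started - ended
        series.append(active)
    return series


def active_next_bucket(starts, ends):
    """user(t+1) = user(t) + start(t+1) - end(t):
    users ending in a bucket are only removed in the next bucket."""
    active, series, ended_previously = 0, [], 0
    for started, ended in zip(starts, ends):
        active += started - ended_previously
        series.append(active)
        ended_previously = ended
    return series


# cUps(5) with requests shorter than one second: every second,
# 5 users start and the same 5 users finish.
starts = [5, 5, 5, 5]
ends = [5, 5, 5, 5]

print(active_same_bucket(starts, ends))
print(active_next_bucket(starts, ends))
```

For this input the first formula stays at 0 for every bucket, like the console, and the second stays at 5, like the reports in the cUps(5) test.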

OK, so reports already use this algorithm.

Console and Graphite are different from the reports.
They expose 3 metrics:

  1. how many users are waiting to be started

  2. how many users are still alive

  3. how many users are now dead

If users live less than the period (1 sec), it’s expected that they are accounted for as dead, right?

Our goal with Graphite was to provide users with raw metrics so that they can build their own formula-based metrics.

I guess you can calculate (dead + active - waiting) and its derivative.
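
Taking that suggestion literally, here is a small Python sketch of what the combined series and its derivative look like. The input is the console output from the cUps(1) test earlier in the thread (one sample per line, assuming those lines are evenly spaced in time), and derivative here just means the difference between consecutive samples.

```python
def derivative(series):
    """Difference between consecutive samples."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Console samples from the cUps(1) test above.
waiting = [20, 15, 10, 5, 0]
active = [0, 0, 0, 0, 0]
done = [0, 5, 10, 15, 20]

combined = [d + a - w for d, a, w in zip(done, active, waiting)]

print(combined)
print(derivative(combined))
print(derivative(done))  # users finished between consecutive samples
```

The derivative of done alone gives the users finished per sample interval, which may be the more readable building block for a custom graph.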