Thanks for the response Stéphane, I’m another Grouponer chiming in. Some specific responses to your points/questions:
> For example, open and closed models are pretty well documented.
We’re aware of open vs. closed system testing; we’d use this functionality in concert with our other (sometimes closed, commonly open) tests.
> First: what purpose would such feedback based throttling serve? Could you please provide references to some research papers that would explain how it makes sense?
> My first thought is that this model is flawed as it makes your injectors slow down so they don’t break your system under load, and that such limitation where clients play nice is most likely to not exist on your live system.
Responding to these in combination: there are a couple of use cases we apply this functionality to. The primary one is exploring the limits of a given system in preparation for a more concerted “allow things to break or not” test. As an example, a system is created, deployed into a new environment, and we want to see how much throughput it can handle (in that environment) before it starts breaking its SLA.
A finalized SLA would be of the form “200 statuses, returned within 25ms, at T throughput, with an overall success rate of 99.9%”. When we’re exploring a deployment, we don’t necessarily know what value of T it can handle. Additionally, since the SLA specifies a success rate, once the system starts breaking that SLA there’s no point in continuing to throw more load at it.
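To make that SLA shape concrete, here’s a minimal sketch of checking one window of responses against it. The names (`Sample`, `slaHolds`) are illustrative only, not Gatling API, and the thresholds are the example values from above:

```scala
// Hypothetical sketch: check a measurement window against the example SLA
// (status 200, latency <= 25ms, overall success rate >= 99.9%).
final case class Sample(status: Int, latencyMs: Long)

def slaHolds(window: Seq[Sample], minSuccessRate: Double = 0.999): Boolean =
  window.isEmpty || {
    // A request "succeeds" if it returned 200 within the latency budget.
    val ok = window.count(s => s.status == 200 && s.latencyMs <= 25)
    ok.toDouble / window.size >= minSuccessRate
  }
```

A predicate like this is the feedback signal the rest of the discussion assumes: the throttler only needs a yes/no "is the SLA currently holding" answer per window.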
The manual way of doing this exploration would be to kick off a test with a linear ramp, watch for when it falls over, note that point, and then re-run the test at throughputs around that failure point. What our existing tooling does is watch for the system to start breaking its SLA and then adjust the throughput so that we can pinpoint the maximum throughput it can handle.
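The adjustment step can be sketched as a simple ramp-and-back-off rule: keep probing upward while the SLA holds, and drop back below the failure point when it breaks. This is a sketch of the general idea, not our actual tooling; the function name and parameter values are made up for illustration:

```scala
// Hypothetical feedback step: ramp linearly while the SLA holds, back off
// multiplicatively when it breaks, narrowing in on the maximum sustainable
// throughput. Parameter values are illustrative only.
def nextTargetRps(current: Double,
                  slaViolated: Boolean,
                  rampStep: Double = 50.0,
                  backoffFactor: Double = 0.8): Double =
  if (slaViolated) math.max(1.0, current * backoffFactor) // retreat below the failure point
  else current + rampStep                                 // keep probing upward
```

Running this rule once per measurement window converges on a band around the maximum throughput, which is exactly the “pinpointing” the manual ramp-and-rerun procedure approximates by hand.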
This is especially useful/important when these systems are running in a shared environment (whether it be test or production).
A counterpoint one could make is “the SLA should be based on the requirements, not on what the system can handle”, which is true (and usually how it’s done). However, even if you approach the problem that way, you’ll still want to periodically test that the system does meet its SLA. You can do that by kicking off a test at that throughput, watching for it to succeed or explode, and acting accordingly. We’d rather kick off the tests and have them automatically adjust if the SLA is violated.
Ignoring this specific design/use-case, the primary issue we’re running into is that the throttling controls are a sealed trait, which means there’s no flexibility in throttling algorithms. If someone wants to define a custom one, they either need to get sign-off here and have it added, or they must fork Gatling and create their own build. It feels like throttling should be customizable by the tester if so desired.
The solution Philip suggested is one way of allowing for that, without requiring major Gatling implementation changes. Switching ThrottleStep from a sealed trait to a subclassable / extendable type would be another.
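As a rough illustration of the second option, here’s what an open extension point could look like. Gatling’s actual `ThrottleStep` is sealed and its internals differ; the trait and class names below are hypothetical, purely to show the shape of a user-defined step:

```scala
// Hypothetical sketch of an extendable throttle step, if ThrottleStep were
// not sealed. These names do not match Gatling's internal API.
trait CustomThrottleStep {
  /** Target requests-per-second at `elapsed` seconds into the step. */
  def rps(elapsed: Long): Double
  /** Step duration in seconds. */
  def durationSec: Long
}

// Example implementation: hold a target rate, but halve it while a
// caller-supplied SLA check reports a violation.
final class SlaAwareHold(target: Double,
                         val durationSec: Long,
                         slaViolated: () => Boolean) extends CustomThrottleStep {
  def rps(elapsed: Long): Double =
    if (slaViolated()) target / 2 else target
}
```

With an interface like this, a tester could plug feedback-based steps (or any other policy) into the throttle profile without touching Gatling itself.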
> Then, how would such strategy work in a distributed environment? How would it be possible to compute the constraint for each node from the global constraint defined in the simulation? I suspect it would be impossible.
It’s actually fairly straightforward: as with any distributed calculation (testing, rate limiting, …), a global constraint can be handled by analyzing locally. Take the example I defined above, where failure is defined as “more than 0.1% of requests return a non-200”. The idealized solution would look at the responses across all nodes and calculate the error rate. The practical solution is to assume all nodes are equivalent and look at a single node’s responses to calculate the error rate. If node 1 is seeing a 1% error rate, then node N is likely seeing it as well, which implies the service overall has a 1% error rate.
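The local-estimate approach above amounts to a few lines per injector node. This is a sketch under the stated assumption that load is spread evenly across equivalent nodes; the function names are made up:

```scala
// Sketch of the local-estimate approach: each injector node computes its own
// error rate and treats it as the estimate of the global rate, assuming load
// is spread evenly across equivalent nodes.
def localErrorRate(statuses: Seq[Int]): Double =
  if (statuses.isEmpty) 0.0
  else statuses.count(_ != 200).toDouble / statuses.size

// Under the equivalent-nodes assumption, a node seeing 1% local errors
// assumes ~1% globally and makes its throttling decision on that basis.
def globalEstimate(local: Double): Double = local
```

No cross-node coordination is needed: each node independently applies the global SLA threshold to its local estimate, which is why the constraint doesn’t have to be decomposed from the simulation definition.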