When measuring the performance of an LLM, requests per second (RPS) is largely irrelevant, since capacity is measured in tokens per second. The task is to figure out how many tokens per second a GPU can support.
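To be explicit about the metric I'm after, it's just the total number of generated tokens divided by the wall-clock duration of the measurement window:

```
\text{tokens/s} = \frac{\sum_i \text{tokens}_i}{t_{\text{end}} - t_{\text{start}}}
```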
I can extract the number of tokens per request from the response body. This number is nondeterministic and can vary slightly from call to call. I am thinking of creating a separate file to track request name, start/end timestamps, and number of tokens per call (a sketch of what I mean is below), but I am curious whether anybody has run into this before and came up with an approach to actually display this in the report. Thanks.
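Something like this is what I have in mind, as a minimal tool-agnostic sketch. The class name, CSV layout, and the `usage.completion_tokens` response field are all illustrative (the field name follows the OpenAI-style response shape; adjust to whatever your API actually returns):

```python
import csv
import threading
import time

# Minimal sketch of the "separate file" idea: a thread-safe logger that
# records request name, start/end timestamps, and token count per call,
# and keeps a running tokens-per-second figure. All names are made up.
class TokenLog:
    def __init__(self, path="token_log.csv"):
        self._lock = threading.Lock()  # load generators are usually concurrent
        self._file = open(path, "w", newline="")
        self._writer = csv.writer(self._file)
        self._writer.writerow(["request_name", "start_ts", "end_ts", "tokens"])
        self._total_tokens = 0
        self._first_start = None
        self._last_end = None

    def record(self, name, start_ts, end_ts, tokens):
        with self._lock:
            self._writer.writerow([name, start_ts, end_ts, tokens])
            self._total_tokens += tokens
            self._first_start = start_ts if self._first_start is None else min(self._first_start, start_ts)
            self._last_end = end_ts if self._last_end is None else max(self._last_end, end_ts)

    def tokens_per_second(self):
        # Aggregate throughput over the whole measurement window so far.
        with self._lock:
            if self._first_start is None or self._last_end <= self._first_start:
                return 0.0
            return self._total_tokens / (self._last_end - self._first_start)

    def close(self):
        with self._lock:
            self._file.close()


# Usage inside a request callback: parse the token count from the
# response body and record it along with the timing.
log = TokenLog()
start = time.time()
body = {"usage": {"completion_tokens": 237}}  # stand-in for a real response
log.record("chat_completion", start, time.time(), body["usage"]["completion_tokens"])
print(f"{log.tokens_per_second():.1f} tokens/s so far")
log.close()
```

The CSV could then be post-processed or joined against the tool's own report. If the load-testing tool supports custom metrics natively, that would obviously be cleaner than a side file.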