SSE Streaming Latency Gap: Experimental Performance Test of an LLM API (ChatGPT)

Gatling version: 3.11.3
Gatling flavor: java kotlin scala javascript typescript
Gatling build tool: maven gradle sbt bundle npm

I made sure I’ve updated my Gatling version to the latest release
I read the guidelines and the how to ask a question topics.
I provided an SSCCE (or at least, all the information needed to help the community understand my topic)
I copied the output I observe, and explained what I think it should be.

Hi Gatling community :waving_hand:

I’ve been load testing LLM APIs (OpenAI ChatGPT) using SSE streaming and discovered something interesting that I’d like to discuss with the experts here.

My Project: GitHub - rcampos09/load-test-llm-sse-gatling: Example project for performance testing LLM APIs

The Observation

When testing with the official example from the LLM API guide, I noticed a significant gap:

  • Gatling P99 reported: 558ms
  • User-experienced latency: 2,018ms (measured manually)
  • Gap: 1,460ms of streaming time NOT captured (+261%)

What I’m Measuring

Test setup:

  • 610 concurrent requests to OpenAI GPT-3.5-turbo
  • Using the official SSE example with asLongAs() loop
  • Prompt categories: short, medium, long (30 different prompts)

Code structure (official example):

sse("Connect to LLM and get Answer")
  .post("/completions")
  .body(StringBody("{...}"))
  .asJson(),  // ← Timer stops here at HTTP 200 OK

asLongAs("#{stop.isUndefined()}").on(
  sse.processUnmatchedMessages((messages, session) ->
    messages.stream()
      .anyMatch(message -> message.message().contains("[DONE]"))
      ? session.set("stop", true) : session
  )
),  // ← This loop is NOT measured

sse("close").close()

My Understanding (Please Correct Me!)

From my analysis, I believe:

  1. :white_check_mark: Gatling is technically correct: it measures the time up to the initial HTTP response
  2. :white_check_mark: This follows HTTP/SSE semantics: the SSE connection is established at HTTP 200 OK
  3. :warning: But the user experience includes streaming: the asLongAs() loop processing is not captured in the metrics

Timeline:

t=0ms     → POST request sent
t=558ms   → HTTP 200 OK received  ← Gatling timer stops
t=564ms   → First chunk received (TTFT)
t=2,018ms → [DONE] received       ← User sees complete response

The Question for the Community

Is this the expected behavior for SSE streaming in Gatling?

For LLM applications where the streaming phase (chunks 1-100) contains the actual value, should we:

  1. Accept that Gatling measures connection setup only (current)
  2. Implement manual timing for end-to-end latency (our workaround)
  3. Request a feature for optional end-to-end measurement?

Our Current Solution (Workaround)

We implemented manual timing inside the asLongAs() loop:

long requestStartTime = System.currentTimeMillis();   // captured right before the SSE POST
// ... process chunks inside the asLongAs() loop ...
long responseTimeMs = System.currentTimeMillis() - requestStartTime;   // computed when [DONE] arrives
// Export to a custom JSONL file for analysis
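
Wired into the scenario, the idea looks roughly like this (a minimal sketch, not the exact project code; the session attributes llmStart and llmElapsedMs and the JSONL export are just placeholder names I'm using for illustration):

exec(session -> session.set("llmStart", System.currentTimeMillis())),   // start the clock

sse("Connect to LLM and get Answer")
  .post("/completions")
  .body(StringBody("{...}"))
  .asJson(),

asLongAs("#{stop.isUndefined()}").on(
  sse.processUnmatchedMessages((messages, session) -> {
    boolean done = messages.stream()
        .anyMatch(message -> message.message().contains("[DONE]"));
    if (done) {
      long elapsedMs = System.currentTimeMillis() - session.getLong("llmStart");
      // here we append {"elapsedMs": ...} to a JSONL file for later analysis
      return session.set("stop", true).set("llmElapsedMs", elapsedMs);
    }
    return session;
  })
),

sse("close").close()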

This works but:

  • :cross_mark: Loses Gatling’s built-in P99/P95 calculations
  • :cross_mark: Requires custom post-processing (a small sketch of that follows this list)
  • :cross_mark: Not integrated with Gatling reports
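
The post-processing itself is small; here is a sketch that recomputes P95/P99 from the exported JSONL, assuming one object per line with an elapsedMs field and a file named llm-timings.jsonl (both names are only examples):

import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

public class LlmPercentiles {
  public static void main(String[] args) throws Exception {
    // pull the elapsedMs value out of each JSONL line
    Pattern field = Pattern.compile("\"elapsedMs\"\\s*:\\s*(\\d+)");
    List<Long> values = new ArrayList<>();
    for (String line : Files.readAllLines(Path.of("llm-timings.jsonl"))) {
      Matcher m = field.matcher(line);
      if (m.find()) values.add(Long.parseLong(m.group(1)));
    }
    Collections.sort(values);
    // nearest-rank percentiles
    System.out.println("P95: " + values.get((int) Math.ceil(0.95 * values.size()) - 1) + "ms");
    System.out.println("P99: " + values.get((int) Math.ceil(0.99 * values.size()) - 1) + "ms");
  }
}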

Detailed Analysis Document

I’ve written a comprehensive analysis (15 pages) covering:

  • Technical breakdown of SSE measurement in Gatling
  • Code analysis of official example
  • Comparison: Connection time vs End-to-end time
  • Feature request proposal with API design
  • Evidence with real test data (610 requests)

:paperclip: Full document: GATLING_FEATURE_REQUEST_ANALYSIS_EN.md
(Available in this repository: load-test-llm-sse-gatling/docs/sprint1/experiments/gatling-sse-analysis-en.md at main · rcampos09/load-test-llm-sse-gatling · GitHub)

Why This Matters

The official example names the request “Connect to LLM and get Answer” but only measures the “Connect” part. For teams defining SLAs or optimizing UX, this gap can lead to:

  • Incorrect performance baselines
  • Misdirected optimizations
  • Misaligned expectations vs reality

Discussion Points

I’d love to hear from experienced Gatling users:

  1. Is my understanding correct? Am I missing something about how SSE timing works?
  2. How do you handle this? Do you use workarounds or accept connection-only metrics?
  3. Would a feature help? Something like .measureUntilStreamCompletion().completionMarker("[DONE]")
  4. Documentation clarity? Could the docs be clearer about what gets measured vs what gets “waited for”?

Test Environment

  • Gatling: 3.11.3
  • Java: 11
  • Target: OpenAI API (GPT-3.5-turbo with streaming)
  • Protocol: Server-Sent Events (SSE)
  • Load: 610 concurrent requests (10 rampUsers, 10/sec constant)

Real Test Results

Gatling Report:

---- Requests --------------------------------------------------------
> Connect to LLM - short    | P99: 558ms
> close                     | P99: 1ms
                            ← asLongAs loop NOT in report

Our Manual Measurements (exported to JSONL):

---- End-to-End Response Time ----------------------------------------
> Complete response (until [DONE])  | P99: 2,018ms
> TTFT (Time To First Token)        | P99: 6ms
> Streaming duration                | P99: 1,460ms

Gap Analysis:

  • Gatling measures: 558ms (connection only)
  • User experiences: 2,018ms (connection + streaming)
  • Missing in metrics: 1,460ms (72% of total latency)

Evidence from Test Data

Distribution by prompt category:

Category | Gatling P99 | Real End-to-End P99 | Gap
short    | 558ms       | 2,018ms             | +261%
medium   | 558ms       | 9,700ms             | +1,638%
long     | 558ms       | >10,000ms           | +1,692%

Key observation: Gatling P99 stays constant (~558ms) across all categories, but real user latency varies dramatically based on response length.


Proposed Feature (Optional Discussion)

If the community finds this useful, I’d suggest:

Option 1: Extend timer until stream completion

sse("Connect to LLM and get Answer")
  .post("/completions")
  .body(StringBody("{...}"))
  .measureUntilStreamCompletion()        // NEW
  .completionMarker("[DONE]")            // NEW
  .timeout(10, TimeUnit.SECONDS)         // NEW
  .asJson()

Option 2: Separate measurable request for streaming

scenario("Scenario").exec(
  sse("Connect to LLM")
    .post("/completions")
    .asJson(),

  sse("Process Stream")                  // NEW: Measurable
    .measureStreamDuration()
    .completionMarker("[DONE]")
    .asLongAs("#{stop.isUndefined()}").on(
      sse.processUnmatchedMessages(...)
    ),

  sse("close").close()
)

Benefits:

  • :white_check_mark: Backward compatible (opt-in)
  • :white_check_mark: Captures full user-perceived latency
  • :white_check_mark: Integrated with Gatling’s P99/P95 calculations
  • :white_check_mark: Aligned with growing LLM testing use case

I’m Open to Being Wrong!

I’m open to being completely wrong here - that’s why I’m asking the experts! If Gatling’s current behavior is the right approach for SSE streaming, I’d love to understand why.

Maybe there’s:

  • A technical reason I’m not seeing
  • A better way to structure the test
  • A configuration I’m missing
  • A fundamental misunderstanding on my part

Any guidance would be greatly appreciated! :folded_hands:

1 Like

Hi @DonTester

I try to stick to existing Gatling features. Some of my workarounds for similar tasks are based on dummy samplers.

I use them to track TTFB and the full response time.

I uploaded an example project, so you can take a look here:
https://github.com/i-nahornyi/gatling-llm-demo/blob/main/src/test/java/demo/LLMDemoSimulation.java

Anyway, my hat’s off to you — your research is great!

UPD:
Looks like we can use await with the matching check here without the redundant parts.
That gives us the same result.

I’ve added a new example here.
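
Roughly, the await variant would look something like this (a sketch based on the documented SSE checks; the check name, the regex, and the 60-second timeout are placeholders, and I haven't verified this exact chaining):

sse("Connect to LLM and get Answer")
  .post("/completions")
  .body(StringBody("{...}"))
  .asJson()
  // block this request until a message containing [DONE] is matched (or the timeout hits),
  // so the streaming time is attributed to this request
  .await(60).on(
    sse.checkMessage("llm-stream-complete")
      .check(regex("\\[DONE\\]"))
  ),

sse("close").close()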

2 Likes

Have you tried measuring the duration of your asLongAs loop with a group?
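
Something along these lines (a minimal sketch; the group name is arbitrary), so that the streaming loop's duration shows up in the report as the group duration, with its own percentiles:

group("LLM end-to-end").on(
  sse("Connect to LLM and get Answer")
    .post("/completions")
    .body(StringBody("{...}"))
    .asJson(),

  asLongAs("#{stop.isUndefined()}").on(
    sse.processUnmatchedMessages((messages, session) ->
      messages.stream()
        .anyMatch(message -> message.message().contains("[DONE]"))
        ? session.set("stop", true) : session
    )
  )
),

sse("close").close()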

1 Like