SSE Streaming Latency Gap: Experimental Performance Test of an LLM API (ChatGPT)

Gatling version: 3.11.3
Gatling flavor: java kotlin scala javascript typescript
Gatling build tool: maven gradle sbt bundle npm

I made sure I’ve updated my Gatling version to the latest release
I read the guidelines and the how to ask a question topics.
I provided an SSCCE (or at least, all the information needed to help the community understand my topic)
I copied the output I observe, and explained what I think it should be.

Hi Gatling community :waving_hand:

I’ve been load testing LLM APIs (OpenAI ChatGPT) using SSE streaming and discovered something interesting that I’d like to discuss with the experts here.

My Project: GitHub - rcampos09/load-test-llm-sse-gatling: Example project for performance testing LLM APIs

The Observation

When testing with the official example from the LLM API guide, I noticed a significant gap:

  • Gatling P99 reported: 558ms
  • User-experienced latency: 2,018ms (measured manually)
  • Gap: 1,460ms of streaming time NOT captured (+261%)

What I’m Measuring

Test setup:

  • 610 concurrent requests to OpenAI GPT-3.5-turbo
  • Using the official SSE example with asLongAs() loop
  • Prompt categories: short, medium, long (30 different prompts)

Code structure (official example):

sse("Connect to LLM and get Answer")
  .post("/completions")
  .body(StringBody("{...}"))
  .asJson(),  // ← Timer stops here at HTTP 200 OK

asLongAs("#{stop.isUndefined()}").on(
  sse.processUnmatchedMessages((messages, session) ->
    messages.stream()
      .anyMatch(message -> message.message().contains("[DONE]"))
      ? session.set("stop", true) : session
  )
),  // ← This loop is NOT measured

sse("close").close()

My Understanding (Please Correct Me!)

From my analysis, I believe:

  1. :white_check_mark: Gatling is technically correct: it measures the time up to the initial HTTP response
  2. :white_check_mark: This follows HTTP/SSE semantics: the SSE connection is established at HTTP 200 OK
  3. :warning: But the user experience includes streaming: the asLongAs() loop processing is not captured in the metrics

Timeline:

t=0ms     → POST request sent
t=558ms   → HTTP 200 OK received  ← Gatling timer stops
t=564ms   → First chunk received (TTFT)
t=2,018ms → [DONE] received       ← User sees complete response

The Question for the Community

Is this the expected behavior for SSE streaming in Gatling?

For LLM applications where the streaming phase (chunks 1-100) contains the actual value, should we:

  1. Accept that Gatling measures connection setup only (current)
  2. Implement manual timing for end-to-end latency (our workaround)
  3. Request a feature for optional end-to-end measurement?

Our Current Solution (Workaround)

We implemented manual timing inside the asLongAs() loop:

long requestStartTime = System.currentTimeMillis();   // captured right before the SSE POST
// ... process chunks inside the asLongAs() loop ...
long responseTimeMs = System.currentTimeMillis() - requestStartTime;   // computed when [DONE] arrives
// Export to a custom JSONL file for analysis
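
Wired into the scenario, the idea looks roughly like this (a minimal sketch, not the exact project code; the session attributes llmStart and llmElapsedMs and the JSONL export are just placeholder names I'm using for illustration):

exec(session -> session.set("llmStart", System.currentTimeMillis())),   // start the clock

sse("Connect to LLM and get Answer")
  .post("/completions")
  .body(StringBody("{...}"))
  .asJson(),

asLongAs("#{stop.isUndefined()}").on(
  sse.processUnmatchedMessages((messages, session) -> {
    boolean done = messages.stream()
        .anyMatch(message -> message.message().contains("[DONE]"));
    if (done) {
      long elapsedMs = System.currentTimeMillis() - session.getLong("llmStart");
      // here we append {"elapsedMs": ...} to a JSONL file for later analysis
      return session.set("stop", true).set("llmElapsedMs", elapsedMs);
    }
    return session;
  })
),

sse("close").close()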

This works but:

  • :cross_mark: Loses Gatling’s built-in P99/P95 calculations
  • :cross_mark: Requires custom post-processing (a small sketch of that follows this list)
  • :cross_mark: Not integrated with Gatling reports
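
The post-processing itself is small; here is a sketch that recomputes P95/P99 from the exported JSONL, assuming one object per line with an elapsedMs field and a file named llm-timings.jsonl (both names are only examples):

import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

public class LlmPercentiles {
  public static void main(String[] args) throws Exception {
    // pull the elapsedMs value out of each JSONL line
    Pattern field = Pattern.compile("\"elapsedMs\"\\s*:\\s*(\\d+)");
    List<Long> values = new ArrayList<>();
    for (String line : Files.readAllLines(Path.of("llm-timings.jsonl"))) {
      Matcher m = field.matcher(line);
      if (m.find()) values.add(Long.parseLong(m.group(1)));
    }
    Collections.sort(values);
    // nearest-rank percentiles
    System.out.println("P95: " + values.get((int) Math.ceil(0.95 * values.size()) - 1) + "ms");
    System.out.println("P99: " + values.get((int) Math.ceil(0.99 * values.size()) - 1) + "ms");
  }
}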

Detailed Analysis Document

I’ve written a comprehensive analysis (15 pages) covering:

  • Technical breakdown of SSE measurement in Gatling
  • Code analysis of official example
  • Comparison: Connection time vs End-to-end time
  • Feature request proposal with API design
  • Evidence with real test data (610 requests)

:paperclip: Full document: GATLING_FEATURE_REQUEST_ANALYSIS_EN.md
(Available in this repository: load-test-llm-sse-gatling/docs/sprint1/experiments/gatling-sse-analysis-en.md at main · rcampos09/load-test-llm-sse-gatling · GitHub)

Why This Matters

The official example names the request “Connect to LLM and get Answer” but only measures the “Connect” part. For teams defining SLAs or optimizing UX, this gap can lead to:

  • Incorrect performance baselines
  • Misdirected optimizations
  • Misaligned expectations vs reality

Discussion Points

I’d love to hear from experienced Gatling users:

  1. Is my understanding correct? Am I missing something about how SSE timing works?
  2. How do you handle this? Do you use workarounds or accept connection-only metrics?
  3. Would a feature help? Something like .measureUntilStreamCompletion().completionMarker("[DONE]")
  4. Documentation clarity? Could the docs be clearer about what gets measured vs what gets “waited for”?

Test Environment

  • Gatling: 3.11.3
  • Java: 11
  • Target: OpenAI API (GPT-3.5-turbo with streaming)
  • Protocol: Server-Sent Events (SSE)
  • Load: 610 concurrent requests (10 rampUsers, 10/sec constant)

Real Test Results

Gatling Report:

---- Requests --------------------------------------------------------
> Connect to LLM - short    | P99: 558ms
> close                     | P99: 1ms
                            ← asLongAs loop NOT in report

Our Manual Measurements (exported to JSONL):

---- End-to-End Response Time ----------------------------------------
> Complete response (until [DONE])  | P99: 2,018ms
> TTFT (Time To First Token)        | P99: 6ms
> Streaming duration                | P99: 1,460ms

Gap Analysis:

  • Gatling measures: 558ms (connection only)
  • User experiences: 2,018ms (connection + streaming)
  • Missing in metrics: 1,460ms (72% of total latency)

Evidence from Test Data

Distribution by prompt category:

Category | Gatling P99 | Real End-to-End P99 | Gap
short    | 558ms       | 2,018ms             | +261%
medium   | 558ms       | 9,700ms             | +1,638%
long     | 558ms       | >10,000ms           | +1,692%

Key observation: Gatling P99 stays constant (~558ms) across all categories, but real user latency varies dramatically based on response length.


Proposed Feature (Optional Discussion)

If the community finds this useful, I’d suggest:

Option 1: Extend timer until stream completion

sse("Connect to LLM and get Answer")
  .post("/completions")
  .body(StringBody("{...}"))
  .measureUntilStreamCompletion()        // NEW
  .completionMarker("[DONE]")            // NEW
  .timeout(10, TimeUnit.SECONDS)         // NEW
  .asJson()

Option 2: Separate measurable request for streaming

scenario("Scenario").exec(
  sse("Connect to LLM")
    .post("/completions")
    .asJson(),

  sse("Process Stream")                  // NEW: Measurable
    .measureStreamDuration()
    .completionMarker("[DONE]")
    .asLongAs("#{stop.isUndefined()}").on(
      sse.processUnmatchedMessages(...)
    ),

  sse("close").close()
)

Benefits:

  • :white_check_mark: Backward compatible (opt-in)
  • :white_check_mark: Captures full user-perceived latency
  • :white_check_mark: Integrated with Gatling’s P99/P95 calculations
  • :white_check_mark: Aligned with growing LLM testing use case

I’m Open to Being Wrong!

I’m open to being completely wrong here - that’s why I’m asking the experts! If Gatling’s current behavior is the right approach for SSE streaming, I’d love to understand why.

Maybe there’s:

  • A technical reason I’m not seeing
  • A better way to structure the test
  • A configuration I’m missing
  • A fundamental misunderstanding on my part

Any guidance would be greatly appreciated! :folded_hands:

1 Like

Hi @DonTester

I try to stick to existing Gatling features. Some of my workarounds for similar tasks are based on dummy samplers.

I use them to track TTFB and the full response time.

I uploaded an example project, so you can take a look here:
https://github.com/i-nahornyi/gatling-llm-demo/blob/main/src/test/java/demo/LLMDemoSimulation.java

Anyway, my hat’s off to you — your research is great!

UPD:
Looks like we can use await with the matching check here without the redundant parts.
That gives us the same result.

I’ve added a new example here.
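
Roughly, the await variant would look something like this (a sketch based on the documented SSE checks; the check name, the regex, and the 60-second timeout are placeholders, and I haven't verified this exact chaining):

sse("Connect to LLM and get Answer")
  .post("/completions")
  .body(StringBody("{...}"))
  .asJson()
  // block this request until a message containing [DONE] is matched (or the timeout hits),
  // so the streaming time is attributed to this request
  .await(60).on(
    sse.checkMessage("llm-stream-complete")
      .check(regex("\\[DONE\\]"))
  ),

sse("close").close()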

2 Likes

Have you tried measuring the duration of your asLongAs loop with a group?
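
Something along these lines (a minimal sketch; the group name is arbitrary), so that the streaming loop's duration shows up in the report as the group duration, with its own percentiles:

group("LLM end-to-end").on(
  sse("Connect to LLM and get Answer")
    .post("/completions")
    .body(StringBody("{...}"))
    .asJson(),

  asLongAs("#{stop.isUndefined()}").on(
    sse.processUnmatchedMessages((messages, session) ->
      messages.stream()
        .anyMatch(message -> message.message().contains("[DONE]"))
        ? session.set("stop", true) : session
    )
  )
),

sse("close").close()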

1 Like