API Gateway response streaming is now available for REST APIs. That matters because it lowers time to first byte for interactive apps, lifts the old 10 MB response ceiling, and lets responses flow as they’re produced rather than waiting for the backend to finish. If you’ve been juggling S3 pre‑signed URLs, bolting on WebSockets, or redirecting users to function URLs just to stream, this is your chance to simplify.
What changed and why it matters
Until recently, API Gateway REST APIs buffered responses and enforced a 29‑second integration timeout plus a 10 MB payload limit. With response streaming enabled, API Gateway starts transmitting the body as soon as your integration returns headers and begins producing content. You can keep the stream open for up to 15 minutes and exceed the 10 MB limit, which is a big deal for AI chat, reporting/exports, and media delivery.
Three practical impacts:
First, faster perceived performance. Users see words appear as your model thinks, charts render progressively, and long CSV exports start downloading immediately. Second, simpler architectures. You can drop the “generate to S3, then return a link” pattern for many endpoints. Third, better headroom for complex backends. When the integration takes a while—think agentic AI, big joins, or long queries—you’re not forced to give up on REST.
How API Gateway response streaming works
This feature is for REST APIs only. It supports AWS_PROXY (Lambda proxy) and HTTP_PROXY (services on ECS/EKS/ALB/NLB or private integrations). You flip the integration’s responseTransferMode to STREAM; the default is BUFFERED. That’s it for API Gateway. Your backend should actually stream—flush buffers frequently and emit chunks early—or you’ll only mimic the old behavior.
With Lambda, you use the streaming invocation endpoint and return two parts: a small JSON metadata header (status, headers, cookies) followed by the body stream. With container or VM backends, frameworks like FastAPI, Express, Spring WebFlux, and Node’s http module already expose streaming responses. Once API Gateway receives headers, it forwards them to the client and relays chunks as they arrive, up to 15 minutes.
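For the Lambda side, here is a minimal sketch of a streaming handler in Node.js. The awslambda.streamifyResponse and awslambda.HttpResponseStream globals are injected by the Lambda Node.js runtime; the payload and pacing below are illustrative.

```typescript
// Globals injected by the Lambda Node.js runtime; declared here for TypeScript.
declare const awslambda: {
  streamifyResponse(
    handler: (event: unknown, stream: NodeJS.WritableStream) => Promise<void>
  ): unknown;
  HttpResponseStream: {
    from(stream: NodeJS.WritableStream, metadata: object): NodeJS.WritableStream;
  };
};

export const handler = awslambda.streamifyResponse(async (event, responseStream) => {
  // Part one: the metadata (status, headers, cookies), sent before any body bytes.
  const stream = awslambda.HttpResponseStream.from(responseStream, {
    statusCode: 200,
    headers: { "Content-Type": "text/plain" },
  });

  // Part two: the body, written in chunks as it is produced.
  for (let i = 0; i < 5; i++) {
    stream.write(`chunk ${i}\n`);
    await new Promise((resolve) => setTimeout(resolve, 200)); // stand-in for real work
  }
  stream.end();
});
```

The habit is the same regardless of runtime: send metadata once, then write and flush body chunks as soon as they exist.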
Timeouts, bandwidth, and endpoints
There are two clocks to watch. The overall integration timeout can be configured up to 15 minutes for streaming. Separately, there’s an idle timeout: 5 minutes for Regional and Private endpoints and 30 seconds for Edge‑optimized. If no bytes move for longer than the idle timeout, the connection closes. If you front Regional APIs with your own CloudFront distribution, you can raise its response timeout to tolerate longer idle periods—handy when an upstream model pauses between tokens.
Bandwidth shaping also kicks in on large responses: the first 10 MB are unthrottled; additional bytes are limited to roughly 2 MB/s. At that rate, a 70 MB response takes about 30 seconds to finish after the free first 10 MB clears. That's fast enough for most text, CSV, JSON, and many images, but not a fit for high‑bitrate video. For media-heavy use cases, consider signed CloudFront URLs or a dedicated streaming stack for the heavy content path, and use API Gateway streaming primarily for control and metadata.
What you can’t do in streaming mode
Because API Gateway no longer buffers the entire response, three features are disabled on streamed methods: response caching, response transformation with Velocity templates, and content encoding. If you relied on GZIP handled by API Gateway, move compression into the backend. If you rely on response mapping, move that logic upstream.
Pricing math that actually matters
Streaming doesn’t change the base “per request” pricing model for REST APIs—but there’s a twist: each 10 MB of streamed response (rounded up) counts as one billable request unit. Tiny responses still cost one request. A 65 MB streamed file counts as seven request units. You also pay the usual data transfer out per GB and anything you use downstream (Lambda duration, PrivateLink, etc.). The takeaway: small streamed responses cost the same as before; large binary responses get more expensive on the API Gateway line item. If you regularly stream >50 MB per request, run a quick model to see where CloudFront or direct S3 links cross over on cost.
Tip: set up a dashboard that multiplies your REST API request count by average streamed payload in 10 MB increments. Costs can creep as product managers add “export all” buttons.
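If you want the rounding made explicit, a tiny helper like this (a sketch; the names are ours, the math is from the pricing model above) is enough to drive that dashboard:

```typescript
const UNIT_BYTES = 10 * 1024 * 1024; // 10 MB billable increment

// Each streamed response bills as ceil(bytes / 10 MB) request units,
// with a minimum of one unit per request.
function billableRequestUnits(responseBytes: number): number {
  return Math.max(1, Math.ceil(responseBytes / UNIT_BYTES));
}

billableRequestUnits(2 * 1024 * 1024);  // 2 MB  -> 1 unit
billableRequestUnits(65 * 1024 * 1024); // 65 MB -> 7 units
```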
Who should adopt API Gateway response streaming now?
If you’re building:
AI chat and agents. Token‑by‑token output in chat UIs and code assistants feels dramatically faster. You’ll also appreciate the 15‑minute window for complex chains or function‑calling agents. For broader context on AWS’s push into AI platforms and infra, see our take on AWS’s $50B AI build‑out.
Long‑running reports and exports. Let users download as the file is generated rather than waiting on a “Your report is ready” email. You still need to guard against idle timeouts—emit a harmless progress heartbeat every few seconds.
Large JSON/CSV responses. When you previously had to page or chunk, you can now stream the whole result. Just watch client memory behavior; streamed JSON that’s concatenated in memory by the client can defeat the point.
Media previews and progressively rendered images. For full media delivery, you’ll likely want a CDN and stream-optimized formats. If you’re using CloudFront already, you’ll be interested in recent CloudFront flat‑rate pricing plans and how they compare to per‑GB economics for your mix.
API Gateway response streaming vs alternatives
HTTP APIs
HTTP APIs still don't offer streaming like this. If you standardized on HTTP APIs for cost and simplicity, consider selectively migrating the endpoints that benefit from streaming to a REST API, or use function URLs/ALB for those paths.
WebSockets
WebSockets are great for bidirectional push and long‑lived connections, but they add client complexity and operational overhead. If your pattern is “client sends a request, server streams back the answer,” response streaming is simpler. Keep WebSockets for true server push, multiplayer collaboration, or multi‑party events.
Lambda Function URLs
Function URLs with Lambda response streaming are lightweight and cheap for single‑function services, but they skip WAF, custom authorizers, usage plans, and consistent API design. Response streaming through API Gateway lets you keep your existing auth, throttling, logging, and domain structure intact.
Gotchas you’ll hit (and how to avoid them)
Edge‑optimized endpoints have a 30‑second idle timeout. If you need global edge and long idle times, use a Regional endpoint behind your own CloudFront distribution where you can raise the response timeout. While you’re there, evaluate whether your traffic shape benefits from the new CloudFront flat‑rate plans; we break down the tradeoffs here: Switch or stay with CloudFront flat‑rate?
Caching disappears for streamed methods. For endpoints that sometimes need to stream and sometimes don’t, split routes or offer a query flag that switches to a cached, buffered variant.
Compression isn’t automatic. Set Content-Encoding: gzip on your backend and stream compressed bytes. Test chunk sizes to keep latency low; overly large compression buffers delay first bytes.
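In Node, for example, the built-in zlib module can compress on the fly; the Z_SYNC_FLUSH option pushes compressed bytes out on every write instead of letting them pool in the compressor. The Express route below is illustrative.

```typescript
import express from "express";
import zlib from "node:zlib";

const app = express();

app.get("/export.csv", (req, res) => {
  res.setHeader("Content-Type", "text/csv");
  res.setHeader("Content-Encoding", "gzip"); // compression lives in the backend now

  // Z_SYNC_FLUSH emits compressed bytes on each write rather than
  // waiting for an internal buffer to fill, keeping first bytes fast.
  const gzip = zlib.createGzip({ flush: zlib.constants.Z_SYNC_FLUSH });
  gzip.pipe(res);

  // A real handler would respect backpressure; this sketch keeps it simple.
  for (let row = 0; row < 100_000; row++) {
    gzip.write(`${row},value-${row}\n`);
  }
  gzip.end();
});

app.listen(3000);
```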
Server‑Sent Events (SSE) headers. If you stream text events, set Content-Type: text/event-stream and Cache-Control: no-cache. Send a keep‑alive comment (e.g., :\n\n) every 10–15 seconds to avoid idle timeouts on slow phases.
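A minimal SSE endpoint with that heartbeat, again sketched with Express (the interval and payloads are illustrative):

```typescript
import express from "express";

const app = express();

app.get("/events", (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.flushHeaders(); // get headers to the client immediately

  // SSE comment lines (starting with ':') are ignored by clients
  // but keep bytes moving so the idle timeout never trips.
  const heartbeat = setInterval(() => res.write(":\n\n"), 15_000);

  // Real events go out as they are produced.
  res.write('data: {"status":"started"}\n\n');

  req.on("close", () => clearInterval(heartbeat));
});

app.listen(3000);
```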
Client behavior varies. The browser Fetch API streams fine, but some frameworks buffer by default. Verify clients render partial content before claiming victory on TTFB.
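To check, read the body incrementally with the Streams API and log when the first chunk lands; this browser-side sketch assumes a text response:

```typescript
// Consumes a streamed response chunk by chunk and reports when the
// first bytes arrive, so you can verify TTFB gains end to end.
async function consumeStream(url: string, onChunk: (text: string) => void): Promise<void> {
  const started = performance.now();
  const response = await fetch(url);
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  let first = true;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (first) {
      console.log(`first bytes after ${Math.round(performance.now() - started)} ms`);
      first = false;
    }
    onChunk(decoder.decode(value, { stream: true })); // append to the UI as chunks land
  }
}
```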
The migration playbook
1) Pick the right endpoints
Start where streaming changes user experience or cost structure: chat endpoints, “export CSV/PDF,” “render report,” or large JSON queries. Rank candidates by traffic and support burden.
2) Confirm your endpoint type
Response streaming works with REST APIs only. If you’re on HTTP APIs, clone the routes into a new REST API first. Keep the same auth and rate limits to avoid surprises.
3) Enable streaming
Set responseTransferMode: STREAM on the integration. In Infrastructure‑as‑Code, that’s a one‑line change. For Lambda proxy, ensure you’re using the streaming invoke path. For HTTP proxy, switch your app code to write and flush in chunks.
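In CDK, for instance, you can reach a brand-new property through the standard escape hatch before first-class support lands. The override path below assumes CloudFormation mirrors the API's responseTransferMode field; treat it as a sketch, not gospel.

```typescript
import * as apigateway from "aws-cdk-lib/aws-apigateway";

// An existing Method on your REST API.
declare const method: apigateway.Method;

// addPropertyOverride is CDK's escape hatch for properties newer than
// the construct library; the property path below is an assumption based
// on the integration's responseTransferMode field.
const cfnMethod = method.node.defaultChild as apigateway.CfnMethod;
cfnMethod.addPropertyOverride("Integration.ResponseTransferMode", "STREAM");
```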
4) Stream early, stream often
Flush first bytes as soon as you can—headers, then a short intro payload or progress events. Don’t build the whole payload in memory. For AI, forward model tokens as they arrive; for long reports, emit a progress SSE every few seconds even if you have nothing substantive yet.
5) Budget idle time
On Regional/Private endpoints, assume a 5‑minute idle cap. For Edge‑optimized, assume 30 seconds. If you need more, front Regional with your own CloudFront distribution and raise its response timeout. Test with mobile networks and flaky Wi‑Fi—those are the real world.
6) Update observability
Add new log fields to your access logs: response transfer mode, time to first content, and time to all headers. Create CloudWatch dashboards that chart TTFB and stream durations by route. Alert when idle timeouts spike; they usually indicate a backend stall or client network issues.
7) Rethink pricing and SLAs
Model cost in 10 MB increments per request. For media over ~50–100 MB, compare API Gateway streaming versus CloudFront or direct S3 links. Update SLAs to include stream duration and acceptable idle gaps. Communicate clearly to customers when an export could take minutes but starts instantly.
People also ask
Does API Gateway response streaming work with HTTP APIs?
No—only REST APIs. If you chose HTTP APIs for their lower cost, lift critical endpoints that need streaming into a REST API or use function URLs/ALB for those few routes.
Will my authorizers, WAF, and usage plans still work?
Yes. Response streaming lives at the integration layer. You keep custom authorizers or Cognito, usage plans, API keys, throttling, access logging, and WAF protections.
What’s the max size I can stream?
You’re no longer bound by the 10 MB cap, but practical limits still exist: the 15‑minute stream window and the ~2 MB/s shaping after the first 10 MB. For very large binary payloads, a CDN or object storage link is still the right tool.
How do I handle JSON clients that expect a complete object?
Prefer newline‑delimited JSON or SSE. If the client must assemble a single object, stream array items and close with the final bracket. In browsers, use the Streams API to process chunks incrementally.
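Newline-delimited JSON is the easiest to consume incrementally. A browser-side sketch (the endpoint and row type are hypothetical):

```typescript
// Parses newline-delimited JSON from a streamed response, yielding
// one object at a time instead of buffering the whole payload.
async function* streamNdjson<T>(url: string): AsyncGenerator<T> {
  const response = await fetch(url);
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    let newline: number;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line) yield JSON.parse(line) as T;
    }
  }
  if (buffer.trim()) yield JSON.parse(buffer) as T; // trailing object without a newline
}

// Usage: render rows as they arrive.
// for await (const row of streamNdjson<Row>("/api/rows")) { addRow(row); }
```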
Implementation checklist (copy/paste)
• Confirm the API is REST, not HTTP. If not, create a REST API for candidate routes.
• Flip the integration’s response transfer mode to STREAM; redeploy the stage.
• For Lambda: enable the streaming invoke and emit metadata JSON then body stream.
• For containers/EC2: return a streaming response from your framework, flush early.
• Set headers correctly (SSE, CORS, cache, compression).
• Emit a heartbeat every 10–15 seconds for long gaps.
• Adjust timeouts: integration up to 15 minutes; build around idle timeouts.
• Update observability: add time‑to‑first‑content and transfer‑mode fields to logs.
• Model costs in 10 MB increments; set alerting if payloads exceed expectations.
• Load test with slow networks and mobile devices; verify real TTFB gains.
A quick performance test you can run today
Spin up two versions of a report endpoint: one buffered, one streaming. For both, emit the same 25 MB CSV in 250 KB chunks, sleeping 100 ms between chunks to mimic processing. Measure the following (a client-side harness sketch follows the list):
• Time to first byte at the client (target: streaming starts in < 400 ms on LAN).
• Time to complete (should be similar, but streaming feels faster).
• Idle timeouts under network jitter (stream variant should survive with heartbeats).
• Memory use on the client (ensure chunks are processed and not buffered).
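Here is a minimal client-side harness for the TTFB comparison; the endpoints and query flags are illustrative:

```typescript
// Measures time to first byte: resolves on the first chunk, not on completion.
async function timeToFirstByteMs(url: string): Promise<number> {
  const started = performance.now();
  const response = await fetch(url);
  const reader = response.body!.getReader();
  await reader.read(); // first chunk arrives here
  const ttfb = performance.now() - started;
  await reader.cancel(); // we only care about the first byte in this test
  return ttfb;
}

async function main() {
  const buffered = await timeToFirstByteMs("/report?mode=buffered");
  const streamed = await timeToFirstByteMs("/report?mode=stream");
  console.log(`buffered TTFB: ${buffered.toFixed(0)} ms, streamed: ${streamed.toFixed(0)} ms`);
}

main();
```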
If the streaming version lowers perceived latency by multiple seconds and survives induced jitter, you’re ready to ship it to a subset of users behind a feature flag.
Security and compliance notes
All the usual rules still apply. Protect endpoints with authorizers or Cognito. Keep PII out of logs—even streamed responses land in access logs as sizes/durations. If you stream over custom domains with mTLS, validate cert rotation. For private APIs, remember PrivateLink costs and idle timeouts still apply; plan heartbeats for long internal tasks.
What to do next
• Identify two endpoints where streaming unlocks UX or removes an S3 detour.
• Ship a canary version behind a header flag; measure TTFB and idle timeouts for a week.
• Decide on a standard: SSE for text, chunked transfer for files, gzip at the origin.
• For globally distributed apps, evaluate a Regional API behind CloudFront and review the latest CloudFront pricing moves here: flat‑rate CloudFront options.
• If you’re pushing into agentic apps, pair streaming with the newest AWS AI capabilities; our take: what builders should do now.
• Need a partner to plan, test, and deploy? See what we do or talk to us.
Zooming out
Response streaming doesn’t replace every pattern. You’ll still use WebSockets for push, CloudFront for bulky media, and S3 for truly massive artifacts. But for a large class of web and AI endpoints, this closes a long‑standing gap: modern streaming UX on the same REST rails your teams already know. Start small, measure TTFB, and promote the pattern route by route. Your users will feel the difference long before your architecture diagrams catch up.
