Link to Part 4: Intel AI Inference Platform MVP 2 with llm-d
Disclaimer: This is not production guidance and it is not sponsored. It documents what actually ran in my homelab. Please double-check before you roll it into your own setup.
Table of Contents Link to heading
- Table of Contents
- 00 I came back from vacation to a broken routing layer
- 01 Why Ingress couldn’t follow
- 02 The migration I didn’t plan for
- 03 When timeouts become product behavior
- 04 Where the stack landed
- 05 Three lessons from three migrations
- References
00 I came back from vacation to a broken routing layer Link to heading
So I went on vacation for a week. Before I left I had queued a minor version bump on the gateway provider. Came back, checked Flux, and nothing was routing. All inference traffic was dead. Turns out the provider had deprecated the AI Gateway path I was using and I just hadn’t noticed the release notes.
That was fun to debug on a Monday morning. But it also made me realize this whole thing was not one migration but three, and I should probably write down how I got here.
In Part 4, I introduced llm-d. It replaced my original assumption that inference on Kubernetes could follow a "standard" path:
- Get a model running locally
- Run it in a container
- Run it in Kubernetes
- Expose it with Ingress
Step 4 was the original plan. I had Ingress in place for all my other apps, so it felt natural to just add another route.
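For context, step 4 would have looked like any other app route. A hypothetical sketch of that original plan (the names, hostname, and port here are placeholders, not my actual manifests):

```yaml
# Hypothetical step-4 Ingress: route one model Service like any other app.
# Hostname, service name, and port are illustrative placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm
spec:
  rules:
    - host: infer.example.intra
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm        # Ingress can only point at a Service
                port:
                  number: 8000
```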
With one model pod, Ingress looks fine. With two replicas serving the same model, the question changes from “which Service” to “which replica should take this request right now.” Ingress only sees HTTP endpoints. It does not know anything about decode pods, cache locality, or inference-specific backends, while llm-d’s scheduler does.
Once llm-d became the center of inference, Ingress stopped fitting naturally.
The routing question was no longer “which Service” but “which inference pool backend.” Inference pools are a new kind of backend that encode model-serving semantics, and they are only supported in Gateway API with the Inference Extension.
Kubernetes docs now explicitly recommend Gateway API over Ingress, and the Ingress API is marked as frozen (stable, but no new feature development). Around the same time, ingress-nginx retirement was announced on November 11, 2025, with best-effort maintenance through March 2026.
So the path became clear: keep existing Ingress where needed, but invest new routing work in Gateway API.
One dependency bump later, I learned this was not one migration but three:
- Ingress -> Gateway API, because of llm-d
- kgateway -> agentgateway, because the provider path changed
- Default timeouts -> explicit timeout policies, because LLM traffic is long-lived
I originally chose kgateway for its early Gateway API support, but the provider ecosystem is still evolving. When agentgateway emerged with a more focused vision on AI workloads, it made sense to follow that path.
01 Why Ingress couldn’t follow Link to heading
llm-d depends on Gateway API Inference Extension CRDs, so Gateway API became a hard dependency in this repo.
```yaml
# infrastructure/gateway-api/gateway-api-inference-extension.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: gateway-api-inference-extension
  namespace: gateway-system
spec:
  interval: 10m
  url: https://github.com/kubernetes-sigs/gateway-api-inference-extension
  ref:
    tag: v1.4.0
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: gateway-api-inference-extension
  namespace: gateway-system
spec:
  interval: 10m
  path: ./config/crd
  sourceRef:
    kind: GitRepository
    name: gateway-api-inference-extension
  dependsOn:
    - name: gateway-api
```
Then the route itself stopped targeting a plain Service and started targeting an InferencePool.
```yaml
# deployments/llm-d/inference-scheduling/httproute.yaml
rules:
  - backendRefs:
      - name: llm-d-inferencepool
        kind: InferencePool
        group: inference.networking.k8s.io
        port: 8200
```
And the pool encodes model-backend semantics (v1 API, decode label selection, target port):
```yaml
# deployments/llm-d/inference-scheduling/inferencepool/inferencepool-values.yaml
inferencePool:
  apiVersion: inference.networking.k8s.io/v1
  targetPortNumber: 8200
  modelServerType: vllm
  modelServers:
    matchLabels:
      llm-d.ai/role: "decode"
```
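That selector only matters if the decode pods actually carry the label. A minimal sketch of the pod metadata the pool ends up selecting, assuming the labeling the llm-d charts use in my setup (pod name and image are placeholders):

```yaml
# Hypothetical decode pod the InferencePool selector would match.
# Only the label and port matter here; name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: llm-d-decode-0
  labels:
    llm-d.ai/role: "decode"    # matched by the pool's matchLabels
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      ports:
        - containerPort: 8200  # must line up with targetPortNumber
```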
At that point I could have just left Ingress in place for the non-inference apps and only used Gateway API for llm-d. I thought about it for maybe a day. Running two routing stacks in parallel sounded like the kind of decision I’d regret every time something broke and I had to check both. Since I already had to learn Gateway API anyway, I just moved everything over.
The part I didn’t plan for came next.
02 The migration I didn’t plan for Link to heading
A concrete timeline from my infra repo:
- `f01d2cf` (2026-02-07): migrate app ingress routes to Gateway API
- `0a1415d` (2026-02-07): migrate infra ingress routes (Flux, MinIO, Grafana, Prometheus, VictoriaLogs)
- `829d7a4` (2026-02-13): split OpenWebUI route behavior and add long-stream policy
- `7b9a137` -> `4fde0b6` (2026-04-07): Gateway API dependency bump to v1.5.1, then revert to v1.4.1
- `1904628` + `8397a51` (2026-04-07): kgateway -> agentgateway refactor plus listener/route cleanup
The revert in step 4 was where I learned about the provider migration. I had been following kgateway releases, but I missed the deprecation notice for AI Gateway support. When I bumped to v1.5.1, all my inference routes stopped working and I had to dig into release notes and code to understand why.
The key nuance: kgateway was not dead. The AI/inference path moved. In kgateway 2.1 release notes, AI Gateway and Gateway API Inference Extension support on Envoy-based proxies was marked deprecated in favor of agentgateway proxy support, with removal planned in 2.2. Then 2.2 introduced dedicated agentgateway.dev APIs and a separate chart/controller split (release notes, 2.2 breaking changes).
The obvious diff looked small:
```diff
# commit 1904628 (llm-d infra values)
- provider: kgateway
+ provider: agentgateway
- gatewayClassName: kgateway
+ gatewayClassName: agentgateway
```
But there was more surface area:
```diff
# commit 1904628 (route parent refs)
- namespace: kgateway-system
+ namespace: agentgateway-system

# commit 8397a51 (listener + route cleanup)
- - name: infer-https
+ - name: inference-gateway-https
- - name: agtw-https
+ - name: agentgateway-admin-ui-https
- # HTTP to HTTPS redirect routes
+ # removed and consolidated around HTTPS listeners
```
After I stabilized these bindings, the next bottleneck was connection behavior.
03 When timeouts become product behavior Link to heading
The long-connection behavior of LLM inference and UI sessions forced me to treat timeout config as architecture, not optional tuning.
For external inference traffic:
```yaml
# deployments/llm-d/inference-scheduling/httproute-external.yaml
rules:
  - matches:
      - path:
          type: PathPrefix
          value: /
    timeouts:
      backendRequest: "3600s"
      request: "0s"
```
For OpenWebUI:
```yaml
# deployments/openwebui/httproute.yaml
rules:
  - matches:
      - path:
          type: PathPrefix
          value: /
    timeouts:
      backendRequest: "3600s"
      request: "0s"
```
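The two fields are easy to conflate. As I read the Gateway API spec, `request` bounds the entire request lifetime as the gateway sees it (and `"0s"` disables that bound), while `backendRequest` bounds a single attempt from the gateway to the backend. An annotated version of the same policy:

```yaml
# Same timeout policy, annotated (Gateway API HTTPRouteRule.timeouts).
timeouts:
  # Client-facing end-to-end request timeout; "0s" disables it so a
  # long-running token stream is never cut off by the gateway itself.
  request: "0s"
  # Per-attempt timeout for the gateway -> backend request; one hour
  # is my upper bound for a single slow model turn.
  backendRequest: "3600s"
```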
If you serve short request/response APIs, the defaults are often fine. If you serve token streams and slow model turns, default timeout assumptions can quietly wreck UX. I started seeing strange "connection reset" errors in OpenWebUI and llm-d, and it took a while to connect the dots: they were not random network issues but default timeout policies kicking in.
04 Where the stack landed Link to heading
The current gateway shape is one shared main-gateway in agentgateway-system, with explicit HTTPS listeners per hostname.
```yaml
# infrastructure/agentgateway/gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: agentgateway-system
spec:
  gatewayClassName: agentgateway
  listeners:
    - name: openwebui-https
      hostname: chat.sonda.red.intra
      port: 443
      protocol: HTTPS
    - name: inference-gateway-https
      hostname: infer.sonda.red.intra
      port: 443
      protocol: HTTPS
```
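Routes bind to a specific listener on that shared gateway through `parentRefs` with a `sectionName`. A sketch of how the OpenWebUI route attaches (route name and namespace are illustrative; the listener name matches the gateway above):

```yaml
# Hypothetical HTTPRoute binding to one named listener on main-gateway.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: openwebui
  namespace: openwebui
spec:
  parentRefs:
    - name: main-gateway
      namespace: agentgateway-system
      sectionName: openwebui-https  # attach to exactly this listener
  hostnames:
    - chat.sonda.red.intra
```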
This ended up cleaner than what I had before:
- Gateway API is now a stable dependency of the inference stack
- llm-d routing semantics are explicit in manifests
- The provider migration is complete and aligned with the current control plane
- Long-lived connection behavior is handled in route policy, not left to defaults
05 Three lessons from three migrations Link to heading
What I expected to be one migration taught me three separate lessons:
- Inference routing is not generic web routing. I think I’m spending the bulk of my time on these issues because of the unique semantics of LLM workloads, not just because of the newness of Gateway API.
- Gateway provider lifecycle matters as much as application lifecycle. I missed one release, went on vacation, came back to a dead routing layer. Things are moving fast in this space and nobody is going to wait for you to catch up.
- Timeout policy is part of product behavior when LLMs are in the loop. A connection is more akin to a session than a request, and the “request” can be arbitrarily long.