ModelRouter
inference.llmkube.dev / v1alpha1
apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata:
  name: example
apiVersion
string
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
kind
string
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
metadata
object
spec object required
spec defines the desired state of ModelRouter
backends []object required
Backends are the candidate destinations the router can dispatch to.
Order is not significant; selection is rule-driven. At least one
backend must be declared.
minItems:
1
capabilities
[]string
Capabilities advertised by this backend. Rules can require
capabilities (e.g. ["tools", "vision", "long-context"]) to filter
candidates.
costPerMillionTokens object
CostPerMillionTokens is informational. Used for cost-aware routing
metrics and audit-log enrichment. Values are USD.
completionUSD
string
CompletionUSD is the cost per million completion (output) tokens,
in USD.
pattern:
^[0-9]+(\.[0-9]+)?$
promptUSD
string
PromptUSD is the cost per million prompt (input) tokens, in USD.
pattern:
^[0-9]+(\.[0-9]+)?$
displayName
string
DisplayName is an optional freeform label published as the model id
on /v1/models and used by BackendNameMatch to resolve a request's
model field to a backend. When unset, Name is used for both
purposes (current behavior). This lets the k8s-safe Name differ
from the user-facing model identifier (e.g. Name "claude-opus-4"
with DisplayName "claude-opus-4-20250514").
external object
External describes an out-of-cluster provider (Anthropic, OpenAI,
or a LiteLLM proxy). Mutually exclusive with InferenceServiceRef.
credentialsSecretRef object
CredentialsSecretRef points to a Kubernetes Secret containing the
provider credentials. Conventional keys: ANTHROPIC_API_KEY,
OPENAI_API_KEY, LITELLM_MASTER_KEY. The router-proxy reads these as
environment variables.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
model
string required
Model is the upstream model identifier passed through to the
provider (e.g. "claude-opus-4-7", "gpt-5", a LiteLLM model alias).
provider
string required
Provider identifies the upstream API surface. For "litellm", URL must
point at a running LiteLLM proxy speaking OpenAI-compatible API.
For first-party providers, URL is optional (provider defaults apply).
enum:
anthropic, openai, bedrock, vertex_ai, litellm
url
string
URL is the base URL for the provider. Required for "litellm";
optional for first-party providers, which use their published default.
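As an illustration of the External fields above, the following fragment of spec.backends sketches one first-party backend and one LiteLLM backend. The backend names, Secret names, LiteLLM alias, and in-cluster URL are hypothetical placeholders, not values defined by LLMKube.

```yaml
# Sketch only: names, Secrets, and the LiteLLM URL are illustrative assumptions.
backends:
  - name: claude-opus
    tier: cloud
    external:
      provider: anthropic
      model: claude-opus-4-7
      credentialsSecretRef:
        name: anthropic-credentials   # assumed to hold the key ANTHROPIC_API_KEY
      # url omitted: first-party providers fall back to their published default
  - name: litellm-pool
    tier: cloud
    external:
      provider: litellm
      model: team-default             # hypothetical LiteLLM model alias
      url: http://litellm.llm-infra.svc.cluster.local:4000   # required for litellm
      credentialsSecretRef:
        name: litellm-credentials     # assumed to hold the key LITELLM_MASTER_KEY
```

Both entries omit inferenceServiceRef, since External and InferenceServiceRef are mutually exclusive.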
healthCheck object
HealthCheck overrides the default health probe applied to this
backend by the router-proxy.
intervalSeconds
integer
IntervalSeconds is how often the router-proxy probes the backend.
format:
int32
minimum:
1
path
string
Path is the HTTP path probed for health. Defaults to "/health" for
local backends and to the provider's documented health route for
external providers.
timeoutSeconds
integer
TimeoutSeconds is the maximum time a single probe may take.
format:
int32
minimum:
1
inferenceServiceRef object
InferenceServiceRef references an in-cluster InferenceService.
Mutually exclusive with External.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
name
string required
Name is the stable identifier used by rules and observability labels.
Must be lowercase alphanumeric or '-'.
pattern:
^[a-z0-9][a-z0-9-]{0,62}$
tier
string
Tier classifies the backend for rule matching. "local" backends are
served from inside the cluster; "cloud" backends egress the cluster
boundary. Fail-closed rules can only route to local-tier backends.
enum:
local, cloud
timeout
string
Timeout caps how long the proxy waits for this backend to begin
sending response headers. When set it overrides the proxy
default for dispatches that target this backend. Resolution
order at dispatch time: rule.timeout || backend.timeout ||
proxy default (ModelRouter.spec.proxy.responseHeaderTimeout).
Useful when backends in the same router have wildly different
P99 envelopes (in-cluster vLLM vs Anthropic global LB).
weight
integer
Weight is used for the "weighted" routing strategy. Higher values
receive proportionally more traffic. Ignored for other strategies.
Default 1 when unset.
format:
int32
minimum:
0
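A local backend that uses most of the per-backend fields might look like the sketch below. The InferenceService name, display name, cost figures, and probe settings are hypothetical, and the timeout value assumes Go-style duration strings, which the "15s" and "120s" defaults elsewhere in this schema suggest but do not confirm.

```yaml
# Sketch only: every value here is an illustrative assumption.
spec:
  backends:
    - name: qwen3-local                 # k8s-safe identifier used by rules
      displayName: qwen3-32b-instruct   # published on /v1/models
      tier: local
      inferenceServiceRef:
        name: qwen3-32b                 # hypothetical in-cluster InferenceService
      capabilities: ["tools", "long-context"]
      weight: 3                         # only consulted by the "weighted" strategy
      timeout: 30s                      # overrides the proxy default for this backend
      healthCheck:
        path: /health
        intervalSeconds: 10
        timeoutSeconds: 2
      costPerMillionTokens:             # strings, to satisfy the decimal pattern
        promptUSD: "0.10"
        completionUSD: "0.30"
```

The cost values are quoted so that YAML keeps them as strings, which is the declared type of promptUSD and completionUSD.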
dataPlane
string
DataPlane selects how this ModelRouter serves traffic.
"Proxy" (default) provisions the managed router-proxy Deployment +
Service and routes in-process per the rules below (the existing behavior,
fully backward compatible).
"Gateway" compiles the backends and rules onto a pre-installed Envoy AI
Gateway: a Backend + AIServiceBackend per InferenceServiceRef backend, a
multi-rule AIGatewayRoute, and a retry/failover BackendTrafficPolicy. In
Gateway mode the router-proxy is NOT provisioned. Requires the aigw CRDs
to be installed; when they are absent the gateway resources are not
generated and a condition explains why.
enum:
Proxy, Gateway
defaultRoute
string
DefaultRoute names a backend used when no rule matches.
Must reference the Name of an entry in Backends.
defaultRouteStrategy
string
DefaultRouteStrategy decides what happens when no rule matches.
"Static" (default) routes to DefaultRoute. "BackendNameMatch" first
tries to match the request's model to a backend Name, falling back to
DefaultRoute only if none matches. BackendNameMatch makes every backend,
including cloud-tier ones, directly addressable by name; sensitive-data
protection still relies on a matching failClosed rule, which is evaluated
first and therefore continues to gate before any name match.
enum:
Static, BackendNameMatch
endpoint object
Endpoint defines the Kubernetes Service the router-proxy is exposed
through. Mirrors the shape used by InferenceService.
gateway object
Gateway opts this InferenceService into Envoy AI Gateway exposure. When
set and Enabled, the operator generates the Backend / AIServiceBackend /
AIGatewayRoute resources that front this service through a pre-installed
Envoy AI Gateway. nil (the default) preserves today's behavior (no
gateway resources). The Envoy AI Gateway stack and the referenced Gateway
are a documented prerequisite; LLMKube does not install or own them.
enabled
boolean
Enabled is the opt-in switch. When false (or when Gateway is nil), the
operator generates no gateway resources for this InferenceService.
gatewayRef object required
GatewayRef identifies the pre-installed Gateway (gateway.networking.k8s.io)
the generated AIGatewayRoute attaches to. The Gateway typically lives in a
dedicated gateway namespace; cross-namespace attachment requires the
Gateway listener's allowedRoutes.namespaces to permit this
InferenceService's namespace (a documented prerequisite for the MVP; the
operator does not generate ReferenceGrants or touch the listener).
name
string required
Name is the Gateway's name.
namespace
string
Namespace is the Gateway's namespace. Empty means the InferenceService's
own namespace.
modelName
string
ModelName is the OpenAI "model" string clients send, matched by the
generated route rule (the x-ai-eg-model header the gateway's ext_proc
populates from the request body). Defaults to ModelRef, falling back to
the InferenceService name when ModelRef is empty.
nodePort
integer
NodePort is the specific NodePort to pin when endpoint.type is NodePort.
If set, the Service will use this exact port instead of auto-assigning
from the 30000-32767 range. This provides a stable external endpoint
across redeployments.
format:
int32
minimum:
30000
maximum:
32767
path
string
Path is the HTTP path for the inference endpoint.
port
integer
Port is the service port.
format:
int32
minimum:
1
maximum:
65535
type
string
Type is the Kubernetes service type (ClusterIP, NodePort, LoadBalancer)
enum:
ClusterIP, NodePort, LoadBalancer
gatewayRef object
GatewayRef identifies the pre-installed Gateway (gateway.networking.k8s.io)
the generated AIGatewayRoute attaches to when DataPlane is "Gateway".
Required in Gateway mode; ignored in Proxy mode. The Gateway and the Envoy
AI Gateway stack are a documented prerequisite; LLMKube does not install
or own them. Cross-namespace attachment requires the Gateway listener's
allowedRoutes.namespaces to permit this ModelRouter's namespace.
name
string required
Name is the Gateway's name.
namespace
string
Namespace is the Gateway's namespace. Empty means the namespace of the
resource that declares this reference (here, the ModelRouter).
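A minimal Gateway-mode manifest, sketched under the assumption that an Envoy AI Gateway stack and a Gateway named ai-gateway already exist, could look like this. The namespaces, Gateway name, and InferenceService name are hypothetical.

```yaml
# Sketch only: the Gateway, namespaces, and InferenceService are assumed to exist.
apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata:
  name: example
  namespace: llm-apps
spec:
  dataPlane: Gateway            # the router-proxy is not provisioned in this mode
  gatewayRef:
    name: ai-gateway
    namespace: gateway-system   # listener allowedRoutes must permit llm-apps
  backends:
    - name: qwen3-local
      tier: local
      inferenceServiceRef:
        name: qwen3-32b
  defaultRoute: qwen3-local
```

Because the Gateway lives in a different namespace, the attachment only works if the listener's allowedRoutes.namespaces admits the ModelRouter's namespace.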
mcpServer object
MCPServer optionally exposes this router as a Model Context Protocol
endpoint. Inactive until the Phase 3 MCP feature lands; the field is
reserved in the schema for forward compatibility.
enabled
boolean
Enabled toggles MCP exposure. Default false. When true (after Phase
3 lands), the router-proxy serves an MCP endpoint at /mcp using
Streamable HTTP transport and OAuth 2.1.
policy object
Policy holds cross-cutting controls (budgets, classification, audit).
auditLog object
AuditLog controls structured audit emission. Auditing is always on;
this field tunes the destination and verbosity.
filePath
string
FilePath is the destination when Sink=file. Must be writable inside
the router-proxy container. Defaults to "/var/log/mlx-router/audit.log".
includeRequestBody
boolean
IncludeRequestBody, when true, includes the OpenAI request body in
every audit entry. Disabled by default for size and privacy.
sink
string
Sink selects the audit-log destination.
"stdout" (default) emits one JSON object per line to the proxy
container stdout, where it can be collected by the cluster log
stack.
"file" writes to FilePath inside the proxy container.
"otlp" forwards entries to an OTel collector as log records.
enum:
stdout, file, otlp
auth object
Auth configures request authentication. In dataPlane: Gateway mode it
compiles to an Envoy AI Gateway SecurityPolicy that validates inbound JWTs
and maps a verified claim onto a trusted header before any model dispatch.
nil means no authentication is enforced. Authentication only; per-team
model allowlists (authorization) are a separate surface.
allowlists []object
Allowlists restricts which verified team may reach which model
(authorization), as a sibling of JWT (authentication). Each entry grants a
team the models it may reach. JWT proves identity; Allowlists decide what
that identity is permitted to do.
Empty or nil means NO authorization is enforced: any authenticated request
reaches any model (the authentication-only behavior of slice 2d-core, so
adding this field cannot retroactively lock out an existing router). A
non-empty Allowlists flips the generated SecurityPolicy to default-Deny:
only the named teams (and, per entry, only their listed models) are
allowed, and every other verified team is rejected with HTTP 403.
Authorization requires authentication: Allowlists set without JWT is
rejected fail-loud (you cannot authorize on an unverified claim), as is an
entry with an empty Team or a duplicate Team. In dataPlane: Gateway mode
these compile to the authorization block of the SAME SecurityPolicy JWT
generates: one Allow rule per entry whose principal matches the verified
TeamClaim value (and, when the entry lists models, the resolved
x-ai-eg-model header).
models
[]string
Models is the set of model names this team may reach. Empty means the
team may reach all models (identity-only allow). Each value matches the
resolved model name (the x-ai-eg-model header), the same value
spec.rules[].match.models route on.
team
string required
Team is the verified teamClaim value this entry grants access to.
minLength:
1
jwt object
JWT enables JSON Web Token validation. When set (in dataPlane: Gateway
mode) the gateway rejects requests without a valid token with HTTP 401
before any model dispatch, and maps the configured claim onto a trusted
header.
headerKey
string
HeaderKey is the request header the verified TeamClaim value lands in.
Downstream team-scoped budgets key on this header. Defaults to
"x-llmkube-team", matching the budget default.
issuer
string required
Issuer is the OIDC issuer URL that must match the token's "iss" claim.
minLength:
1
jwksURI
string required
JWKSURI is the remote JWKS endpoint the gateway fetches signing keys from
to verify token signatures.
minLength:
1
provider
string required
Provider is a short name for the JWT provider (e.g. "keycloak"). It labels
the provider in the generated SecurityPolicy.
minLength:
1
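To show how Allowlists and JWT fit together, here is a sketch of a policy.auth block for Gateway mode. The issuer and JWKS URLs, provider name, team names, and model names are hypothetical; teamClaim is described in the next entry.

```yaml
# Sketch only: URLs, teams, and model names are illustrative assumptions.
spec:
  dataPlane: Gateway
  policy:
    auth:
      jwt:                          # authentication: missing or invalid token gets 401
        provider: keycloak
        issuer: https://keycloak.example.com/realms/llm
        jwksURI: https://keycloak.example.com/realms/llm/protocol/openid-connect/certs
        teamClaim: team             # claim whose verified value identifies the tenant
        headerKey: x-llmkube-team   # header the verified claim value is copied into
      allowlists:                   # authorization: non-empty list means default-Deny
        - team: research
          models: ["qwen3-32b", "claude-opus-4"]
        - team: platform            # no models listed: may reach every model
```

With this configuration, a verified token whose team claim is neither research nor platform would be rejected with HTTP 403, and removing the jwt block while keeping allowlists would be rejected at validation.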
teamClaim
string required
TeamClaim is the JWT claim that identifies the tenant (e.g. "team"). Its
verified value is copied into HeaderKey.
minLength:
1
budgets []object
Budgets caps token and dollar consumption per scope over a rolling
window. Empty list means no budget enforcement.
headerKey
string
HeaderKey is the request header carrying the team identifier when
Scope=team. Defaults to "x-llmkube-team".
maxTokens
integer
MaxTokens caps total tokens (prompt + completion) over the window.
Either MaxTokens or MaxUSD (or both) must be set.
format:
int64
minimum:
1
maxUSD
string
MaxUSD caps total estimated cost in USD over the window. Cost is
computed from RouterBackend.CostPerMillionTokens.
pattern:
^[0-9]+(\.[0-9]+)?$
name
string required
Name identifies this budget for metrics, status, and audit logs.
pattern:
^[a-z0-9][a-z0-9-]{0,62}$
ruleName
string
RuleName is required when Scope=rule. References a RouterRule.Name.
scope
string required
Scope determines what the budget applies to.
"router" caps all traffic through this ModelRouter.
"rule" caps traffic matching a single named rule (see RuleName).
"team" caps traffic identified by a request header (see HeaderKey).
enum:
router, rule, team
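The three scopes can be combined in one budgets list. The sketch below uses hypothetical budget names, limits, and a hypothetical rule name; windowSeconds is described in the next entry.

```yaml
# Sketch only: names and limits are illustrative assumptions.
spec:
  policy:
    budgets:
      - name: router-daily
        scope: router               # all traffic through this ModelRouter
        maxUSD: "250"               # string, matching the decimal pattern
        windowSeconds: 86400        # 24 h rolling window
      - name: pii-rule-hourly
        scope: rule
        ruleName: pii-local-only    # must reference an existing rule name
        maxTokens: 2000000
        windowSeconds: 3600
      - name: per-team-daily
        scope: team
        headerKey: x-llmkube-team   # the default, shown for clarity
        maxTokens: 5000000
        maxUSD: "50.00"
        windowSeconds: 86400
```

Each entry sets at least one of maxTokens or maxUSD, as the schema requires.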
windowSeconds
integer
WindowSeconds is the rolling window over which the cap is evaluated.
format:
int32
minimum:
1
classification object
Classification configures how the router determines the data
classification of an inbound request.
headerKey
string
HeaderKey is the request header carrying the classification.
Defaults to "x-llmkube-classification".
mode
string
Mode selects how the router derives a request's classification.
"header-only" (default) trusts the request header
(HeaderKey, defaults to x-llmkube-classification).
"detector" runs the bundled in-proxy detector.
"hybrid" prefers the header, falling back to the detector when no
header is present.
enum:
header-only, detector, hybrid
sensitiveClassifications
[]string
SensitiveClassifications are the classification values that trigger
fail-closed validation: any rule matching one of these values must
have FailClosed=true and reference only local-tier backends.
Defaults to ["pii", "phi"].
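The fail-closed validation described above ties the classification policy to the rules list. A sketch of a configuration that should satisfy it follows; the rule and backend names are hypothetical, and the backend is assumed to be declared with tier: local.

```yaml
# Sketch only: rule and backend names are illustrative assumptions.
spec:
  policy:
    classification:
      mode: hybrid                          # header first, detector as fallback
      headerKey: x-llmkube-classification
      sensitiveClassifications: ["pii", "phi"]
  rules:
    - name: pii-local-only
      failClosed: true                      # required: the rule matches sensitive values
      match:
        dataClassification: ["pii", "phi"]
      route:
        backends: ["qwen3-local"]           # must reference only local-tier backends
```

Dropping failClosed or adding a cloud-tier backend to this rule would violate the constraint stated for SensitiveClassifications.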
proxy object
Proxy configures the managed router-proxy Deployment (replicas,
image override for air-gapped sites, resources). Sensible defaults
apply when omitted.
image
string
Image overrides the default router-proxy container image. Useful
for air-gapped clusters that pin to an internal registry digest.
imagePullSecrets []object
ImagePullSecrets are passed through to the router-proxy pod spec.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
quarantineDuration
string
QuarantineDuration controls how long the proxy keeps a backend in
the "skip" state after a 5xx or network error before allowing a
half-open probe. Default 15s when unset. Shorter windows make the
proxy recover faster from transient blips; longer windows reduce
probe load on genuinely-down upstreams. Tests can shrink this to
sub-second to verify recovery without sleeping the full default.
replicas
integer
Replicas is the desired number of router-proxy pods. Defaults to 1.
The proxy is stateless for routing decisions; budget and SLO
counters live in memory and reset on pod restart until the
persistence feature lands.
format:
int32
minimum:
1
maximum:
10
resources object
Resources sets the pod's compute resource requests and limits.
claims []object
Claims lists the names of resources, defined in spec.resourceClaims,
that are used by this container.
This field depends on the
DynamicResourceAllocation feature gate.
This field is immutable. It can only be set for containers.
name
string required
Name must match the name of one entry in pod.spec.resourceClaims of
the Pod where this field is used. It makes that resource available
inside a container.
request
string
Request is the name chosen for a request in the referenced claim.
If empty, everything from the claim is made available, otherwise
only the result of this request.
limits
object
Limits describes the maximum amount of compute resources allowed.
More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
requests
object
Requests describes the minimum amount of compute resources required.
If Requests is omitted for a container, it defaults to Limits if that is explicitly specified,
otherwise to an implementation-defined value. Requests cannot exceed Limits.
More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
responseHeaderTimeout
string
ResponseHeaderTimeout caps how long the proxy waits for the
upstream to begin sending response headers. For non-streaming
chat completions this is effectively a max-generation-time
cap; for streaming dispatches the first SSE chunk arrives well
inside the window so the cap is invisible. Default 120s when
unset. Per-rule and per-backend timeouts (see RouterRule.Timeout
and RouterBackend.Timeout) tighten this on a per-request basis
but cannot extend it beyond this cap.
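A proxy block for an air-gapped cluster might be sketched as follows. The registry, image tag, pull-secret name, and resource figures are hypothetical, and the duration values assume Go-style duration strings.

```yaml
# Sketch only: registry, tag, Secret name, and sizes are illustrative assumptions.
spec:
  proxy:
    replicas: 2
    image: registry.internal.example.com/llmkube/router-proxy:v0.0.0
    imagePullSecrets:
      - name: internal-registry
    quarantineDuration: 5s          # shorter than the 15s default: faster recovery
    responseHeaderTimeout: 180s     # raises the 120s default cap
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```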
revisionHistoryLimit
integer
RevisionHistoryLimit caps how many old ReplicaSets the proxy Deployment
keeps for rollback. Unset uses the Kubernetes default (10); 0 keeps none.
Useful to bound ReplicaSet buildup, since the proxy re-rolls on every
config change.
format:
int32
minimum:
0
rules []object
Rules are evaluated in declaration order. The first matching rule wins.
If no rule matches, the fallback is governed by DefaultRouteStrategy,
ending at DefaultRoute. If nothing resolves, the request is rejected
with HTTP 503.
failClosed
boolean
FailClosed: when true, if no backend in Route.Backends is healthy
or otherwise eligible, the router rejects the request with HTTP 503
rather than falling through to DefaultRoute or subsequent rules.
This is the regulated-data gate: a fail-closed rule guarantees that
matched requests are never served by any other route.
match object
Match groups all match conditions. All declared conditions must be
true for the rule to fire (AND semantics). If Match is omitted the
rule always matches (useful as a catch-all before DefaultRoute).
dataClassification
[]string
DataClassification matches if the inbound request carries any of
these classifications. The classification source depends on
Policy.Classification.Mode: a request header
(x-llmkube-classification by default), the bundled detector, or
both. Common values: "public", "internal", "confidential", "pii",
"phi".
headers
object
Headers performs exact-match equality on inbound HTTP headers
(case-insensitive header name comparison).
latencySLOMs
integer
LatencySLOMs is a P95 first-token-latency target in milliseconds.
When set, if the rolling P95 for the primary backend exceeds this
value the rule promotes its declared fallback. Honored only by the
"primary-fallback" strategy.
format:
int32
minimum:
1
models
[]string
Models matches against the OpenAI-style "model" field in the
request body. Glob patterns are supported (e.g. "qwen3-*").
requiredCapabilities
[]string
RequiredCapabilities filters backends. The rule only matches if at
least one backend in Route.Backends advertises every listed
capability.
taskComplexity
string
TaskComplexity matches the inbound complexity hint (header
x-llmkube-task-complexity).
enum:
simple, moderate, complex
name
string required
Name is used in audit logs and metrics labels.
pattern:
^[a-z0-9][a-z0-9-]{0,62}$
route object required
Route is the action taken when this rule matches.
backends
[]string required
Backends is an ordered list of RouterBackend.Name values. For the
"primary-fallback" strategy, the first entry is the primary and
subsequent entries are tried in order on failure. For "weighted",
traffic is distributed across all entries by Backend.Weight. For
"shadow", the first entry serves the response and subsequent entries
receive mirrored requests for evaluation only.
minItems:
1
strategy
string
Strategy selects how multiple backends are used.
enum:
primary-fallback, weighted, shadow
timeout
string
Timeout caps how long the proxy waits for the upstream to begin
sending response headers on dispatches matched by this rule.
When set it overrides RouterBackend.Timeout and the proxy
default. Resolution order at dispatch time:
rule.timeout || backend.timeout || proxy default.
Useful for tightening regulated-data rules (sub-10s strict
fail-fast) or extending long-reasoning rules (120s+).
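The sketch below shows how declaration order, strategies, and the timeout resolution order interact. All rule names, backend names, header names, and values are hypothetical, and the referenced backends are assumed to be declared in spec.backends.

```yaml
# Sketch only: names, headers, and thresholds are illustrative assumptions.
spec:
  defaultRouteStrategy: Static
  defaultRoute: qwen3-local          # used when no rule below matches
  rules:
    # Evaluated first. All three match conditions must hold (AND semantics).
    - name: tools-with-fallback
      match:
        models: ["qwen3-*"]          # glob against the request's "model" field
        requiredCapabilities: ["tools"]
        latencySLOMs: 800            # P95 first-token target for the primary
      route:
        strategy: primary-fallback
        backends: ["qwen3-local", "claude-opus"]   # primary, then fallback
      timeout: 20s                   # rule.timeout wins over backend.timeout
    # Evaluated second. Traffic is split by each backend's weight.
    - name: simple-weighted
      match:
        taskComplexity: simple
        headers:
          x-tenant-tier: standard    # exact value match, case-insensitive name
      route:
        strategy: weighted
        backends: ["qwen3-local", "litellm-pool"]
    # No match block: always matches, so it acts as a catch-all.
    - name: catch-all-shadow
      route:
        strategy: shadow
        backends: ["qwen3-local", "claude-opus"]   # first serves, second is mirrored
```

Since the last rule has no match block, DefaultRoute would only be reached here if the catch-all rule itself could not resolve a backend.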
status object
status defines the observed state of ModelRouter
activeRules
integer
ActiveRules is the count of rules that successfully validated
against current backend state.
format:
int32
backends []object
Backends reports the resolved address and current health of every
declared backend.
address
string
Address is the resolved upstream URL the router-proxy dispatches to.
For local backends this is the InferenceService's cluster URL; for
external backends it is the provider's base URL.
healthy
boolean
Healthy reflects the most recent probe result.
lastProbeTime
string
LastProbeTime is when the proxy last completed a health probe for
this backend.
format:
date-time
message
string
Message provides extra context, especially when Healthy is false
(e.g. "InferenceService not Ready", "Secret missing key
ANTHROPIC_API_KEY").
name
string required
Name matches RouterBackend.Name.
tier
string
Tier mirrors RouterBackend.Tier for convenience.
budgetUtilization []object
BudgetUtilization summarises current budget consumption.
name
string required
Name matches BudgetSpec.Name.
tokensUsed
integer
TokensUsed is the rolling-window token count.
format:
int64
usdUsed
string
USDUsed is the rolling-window estimated cost in USD.
utilization
string
Utilization is the fraction of the budget consumed, 0.0 to 1.0.
When both MaxTokens and MaxUSD are set this is the maximum of the
two utilizations.
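As a worked example of the utilization rule, assume a budget with maxTokens 5000000 and maxUSD "50.00". The status entry below is a sketch; the exact string formatting of usdUsed and utilization is an assumption, as the schema only declares them as strings.

```yaml
# Sketch only: figures and string formatting are illustrative assumptions.
status:
  budgetUtilization:
    - name: per-team-daily
      tokensUsed: 1250000     # 1250000 / 5000000 = 0.25
      usdUsed: "20.00"        # 20.00 / 50.00 = 0.40
      utilization: "0.40"     # max(0.25, 0.40)
```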
conditions []object
conditions represent the current state of the ModelRouter resource.
Standard condition types:
- "Validated": the spec passed static validation
- "BackendsReady": all referenced backends are reachable and healthy
- "Available": the router-proxy is serving traffic
- "Degraded": at least one backend is unhealthy but the router
can still serve other routes
- "GatewayReady": (dataPlane: Gateway) the gateway resources reconciled
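For orientation, a conditions list for a router with one unhealthy backend might look like the sketch below. The reason strings, messages, and timestamps are hypothetical; only the condition types come from the list above.

```yaml
# Sketch only: reasons, messages, and timestamps are illustrative assumptions.
status:
  conditions:
    - type: Validated
      status: "True"
      reason: SpecValid
      message: ""
      lastTransitionTime: "2025-01-01T00:00:00Z"
    - type: BackendsReady
      status: "False"
      reason: BackendUnhealthy
      message: "backend claude-opus: Secret missing key ANTHROPIC_API_KEY"
      lastTransitionTime: "2025-01-01T00:05:00Z"
    - type: Degraded
      status: "True"
      reason: BackendUnhealthy
      message: "1 of 2 backends unhealthy; remaining routes still served"
      lastTransitionTime: "2025-01-01T00:05:00Z"
    - type: Available
      status: "True"
      reason: ProxyServing
      message: ""
      lastTransitionTime: "2025-01-01T00:01:00Z"
```

The status values are quoted because the field is a string enum (True, False, Unknown), not a YAML boolean.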
lastTransitionTime
string required
lastTransitionTime is the last time the condition transitioned from one status to another.
This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format:
date-time
message
string required
message is a human readable message indicating details about the transition.
This may be an empty string.
maxLength:
32768
observedGeneration
integer
observedGeneration represents the .metadata.generation that the condition was set based upon.
For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
with respect to the current state of the instance.
format:
int64
minimum:
0
reason
string required
reason contains a programmatic identifier indicating the reason for the condition's last transition.
Producers of specific condition types may define expected values and meanings for this field,
and whether the values are considered a guaranteed API.
The value should be a CamelCase string.
This field may not be empty.
pattern:
^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
minLength:
1
maxLength:
1024
status
string required
status of the condition, one of True, False, Unknown.
enum:
True, False, Unknown
type
string required
type of condition in CamelCase or in foo.example.com/CamelCase.
pattern:
^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
maxLength:
316
endpoint
string
Endpoint is the in-cluster URL clients should hit. Populated once
the router-proxy Service is ready.
gateway object
Gateway reports the observed state of dataPlane: Gateway exposure: whether
the AIGatewayRoute (and its backing Backend / AIServiceBackend /
BackendTrafficPolicy) reconciled, and the resolved gateway endpoint. nil
in Proxy mode. Also surfaced via the GatewayReady condition.
authEnabled
boolean
AuthEnabled indicates a SecurityPolicy enforcing JWT authentication was
compiled for this route (ModelRouter policy.auth.jwt). Set by the
ModelRouter dataPlane: Gateway path; false when no auth is configured.
endpoint
string
Endpoint is the gateway address clients send OpenAI requests to. Set by
the ModelRouter dataPlane: Gateway path (resolved from the referenced
Gateway); empty for the InferenceService path.
modelName
string
ModelName is the resolved model-name match value clients send as the
OpenAI "model" string to reach this InferenceService through the gateway.
Set by the InferenceService path; empty for ModelRouter (which fronts
many model names).
routeReady
boolean
RouteReady indicates the AIGatewayRoute (and its backing Backend +
AIServiceBackend) were reconciled successfully against the gateway.
lastUpdated
string
LastUpdated is the timestamp of the last status reconciliation.
format:
date-time
phase
string
Phase is a coarse summary of the router's state.
Possible values: Pending, Provisioning, Ready, Degraded, Failed.
enum:
Pending, Provisioning, Ready, Degraded, Failed