
ModelRouter

inference.llmkube.dev / v1alpha1

apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata:
  name: example
apiVersion string
APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
kind string
Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
metadata object
spec object required
spec defines the desired state of ModelRouter
backends []object required
Backends are the candidate destinations the router can dispatch to. Order is not significant; selection is rule-driven. At least one backend must be declared.
minItems: 1
capabilities []string
Capabilities advertised by this backend. Rules can require capabilities (e.g. ["tools", "vision", "long-context"]) to filter candidates.
costPerMillionTokens object
CostPerMillionTokens is informational. Used for cost-aware routing metrics and audit-log enrichment. Values are USD.
completionUSD string
CompletionUSD is the cost per million completion (output) tokens, in USD.
pattern: ^[0-9]+(\.[0-9]+)?$
promptUSD string
PromptUSD is the cost per million prompt (input) tokens, in USD.
pattern: ^[0-9]+(\.[0-9]+)?$
displayName string
DisplayName is an optional freeform label published as the model id on /v1/models and used by BackendNameMatch to resolve a request's model field to a backend. When unset, Name is used for both purposes (current behavior). This lets the k8s-safe Name differ from the user-facing model identifier (e.g. Name "claude-opus-4" with DisplayName "claude-opus-4-20250514").
external object
External describes an out-of-cluster provider (Anthropic, OpenAI, or a LiteLLM proxy). Mutually exclusive with InferenceServiceRef.
credentialsSecretRef object
CredentialsSecretRef points to a Kubernetes Secret containing the provider credentials. Conventional keys: ANTHROPIC_API_KEY, OPENAI_API_KEY, LITELLM_MASTER_KEY. The router-proxy reads these as environment variables.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
model string required
Model is the upstream model identifier passed through to the provider (e.g. "claude-opus-4-7", "gpt-5", a LiteLLM model alias).
provider string required
Provider identifies the upstream API surface. For "litellm", URL must point at a running LiteLLM proxy speaking OpenAI-compatible API. For first-party providers, URL is optional (provider defaults apply).
enum: anthropic, openai, bedrock, vertex_ai, litellm
url string
URL is the base URL for the provider. Required for "litellm"; optional for first-party providers, which use their published default.
healthCheck object
HealthCheck overrides the default health probe applied to this backend by the router-proxy.
intervalSeconds integer
IntervalSeconds is how often the router-proxy probes the backend.
format: int32
minimum: 1
path string
Path is the HTTP path probed for health. Defaults to "/health" for local backends and to the provider's documented health route for external providers.
timeoutSeconds integer
TimeoutSeconds is the maximum time a single probe may take.
format: int32
minimum: 1
inferenceServiceRef object
InferenceServiceRef references an in-cluster InferenceService. Mutually exclusive with External.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
name string required
Name is the stable identifier used by rules and observability labels. Must be lowercase alphanumeric or '-'.
pattern: ^[a-z0-9][a-z0-9-]{0,62}$
tier string
Tier classifies the backend for rule matching. "local" backends are served from inside the cluster; "cloud" backends egress the cluster boundary. Fail-closed rules can only route to local-tier backends.
enum: local, cloud
timeout string
Timeout caps how long the proxy waits for this backend to begin sending response headers. When set it overrides the proxy default for dispatches that target this backend. Resolution order at dispatch time: rule.timeout || backend.timeout || proxy default (ModelRouter.spec.proxy.responseHeaderTimeout). Useful when backends in the same router have wildly different P99 envelopes (in-cluster vLLM vs Anthropic global LB).
weight integer
Weight is used for the "weighted" routing strategy. Higher values receive proportionally more traffic. Ignored for other strategies. Default 1 when unset.
format: int32
minimum: 0
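As an illustration of the backend fields above, the following fragment sketches one local and one external backend. All names, the Secret, and the cost figures are hypothetical placeholders; the "60s" timeout assumes duration strings in the style of the defaults quoted in this reference (e.g. "120s").

```yaml
# Sketch only: a spec.backends fragment with illustrative values.
spec:
  backends:
    - name: local-qwen                  # hypothetical; must match ^[a-z0-9][a-z0-9-]{0,62}$
      tier: local
      inferenceServiceRef:
        name: qwen3-8b                  # hypothetical in-cluster InferenceService
      capabilities: ["tools"]
      weight: 3                         # only consulted by the "weighted" strategy
    - name: claude-opus-4
      displayName: claude-opus-4-20250514
      tier: cloud
      external:                         # mutually exclusive with inferenceServiceRef
        provider: anthropic
        model: claude-opus-4-7
        credentialsSecretRef:
          name: anthropic-credentials   # hypothetical Secret holding ANTHROPIC_API_KEY
      costPerMillionTokens:
        promptUSD: "15.00"              # placeholder figure, not real pricing
        completionUSD: "75.00"          # placeholder figure, not real pricing
      timeout: 60s                      # assumed duration format
      healthCheck:
        intervalSeconds: 30
        timeoutSeconds: 5
```

The cost values are quoted because the schema types them as strings validated by a decimal pattern.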
dataPlane string
DataPlane selects how this ModelRouter serves traffic. "Proxy" (default) provisions the managed router-proxy Deployment + Service and routes in-process per the rules below (today's behavior, fully back-compat). "Gateway" compiles the backends and rules onto a pre-installed Envoy AI Gateway: a Backend + AIServiceBackend per InferenceServiceRef backend, a multi-rule AIGatewayRoute, and a retry/failover BackendTrafficPolicy. In Gateway mode the router-proxy is NOT provisioned. Requires the aigw CRDs to be installed; when they are absent the gateway resources are not generated and a condition explains why.
enum: Proxy, Gateway
defaultRoute string
DefaultRoute names a backend used when no rule matches. Must reference the Name of an entry in Backends.
defaultRouteStrategy string
DefaultRouteStrategy decides what happens when no rule matches. "Static" (default) routes to DefaultRoute. "BackendNameMatch" first tries to match the request's model to a backend Name (or to its DisplayName, when one is set), falling back to DefaultRoute only if none matches. BackendNameMatch makes every backend, including cloud-tier ones, directly addressable by name; sensitive-data protection still relies on a matching failClosed rule, which is evaluated first and therefore continues to gate before any name match.
enum: Static, BackendNameMatch
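A minimal sketch of the two fallback fields together, assuming a backend named "local-qwen" (a hypothetical name) is declared in spec.backends:

```yaml
# Sketch only: requests whose model names a backend go to that backend;
# everything else that matches no rule lands on local-qwen.
spec:
  defaultRoute: local-qwen              # hypothetical backend Name
  defaultRouteStrategy: BackendNameMatch
```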
endpoint object
Endpoint defines the Kubernetes Service the router-proxy is exposed through. Mirrors the shape used by InferenceService.
gateway object
Gateway opts this InferenceService into Envoy AI Gateway exposure. When set and Enabled, the operator generates the Backend / AIServiceBackend / AIGatewayRoute resources that front this service through a pre-installed Envoy AI Gateway. nil (the default) preserves today's behavior (no gateway resources). The Envoy AI Gateway stack and the referenced Gateway are a documented prerequisite; LLMKube does not install or own them.
enabled boolean
Enabled is the opt-in switch. When false (or when Gateway is nil), the operator generates no gateway resources for this InferenceService.
gatewayRef object required
GatewayRef identifies the pre-installed Gateway (gateway.networking.k8s.io) the generated AIGatewayRoute attaches to. The Gateway typically lives in a dedicated gateway namespace; cross-namespace attachment requires the Gateway listener's allowedRoutes.namespaces to permit this InferenceService's namespace (a documented prerequisite for the MVP; the operator does not generate ReferenceGrants or touch the listener).
name string required
Name is the Gateway's name.
namespace string
Namespace is the Gateway's namespace. Empty means the InferenceService's own namespace.
modelName string
ModelName is the OpenAI "model" string clients send, matched by the generated route rule (the x-ai-eg-model header the gateway's ext_proc populates from the request body). Defaults to ModelRef, falling back to the InferenceService name when ModelRef is empty.
nodePort integer
NodePort is the specific NodePort to pin when endpoint.type is NodePort. If set, the Service will use this exact port instead of auto-assigning from the 30000-32767 range. This provides a stable external endpoint across redeployments.
format: int32
minimum: 30000
maximum: 32767
path string
Path is the HTTP path for the inference endpoint.
port integer
Port is the Service port.
format: int32
minimum: 1
maximum: 65535
type string
Type is the Kubernetes Service type (ClusterIP, NodePort, or LoadBalancer).
enum: ClusterIP, NodePort, LoadBalancer
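The endpoint fields above can be combined to pin a stable external port. In this sketch the port numbers are arbitrary illustrative choices; only the 30000-32767 NodePort range comes from the schema.

```yaml
# Sketch only: expose the router-proxy Service on a fixed NodePort.
spec:
  endpoint:
    type: NodePort
    port: 8080        # illustrative Service port (valid range 1-65535)
    nodePort: 30080   # must fall within 30000-32767
```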
gatewayRef object
GatewayRef identifies the pre-installed Gateway (gateway.networking.k8s.io) the generated AIGatewayRoute attaches to when DataPlane is "Gateway". Required in Gateway mode; ignored in Proxy mode. The Gateway and the Envoy AI Gateway stack are a documented prerequisite; LLMKube does not install or own them. Cross-namespace attachment requires the Gateway listener's allowedRoutes.namespaces to permit this ModelRouter's namespace.
name string required
Name is the Gateway's name.
namespace string
Namespace is the Gateway's namespace. Empty means the namespace of the resource that declares this reference (here, the ModelRouter's own namespace).
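Switching to the Gateway data plane might look like the fragment below. The Gateway name and namespace are hypothetical, and the sketch assumes the Envoy AI Gateway stack and that Gateway are already installed, as the descriptions above require.

```yaml
# Sketch only: compile backends and rules onto a pre-installed Envoy AI Gateway.
spec:
  dataPlane: Gateway
  gatewayRef:
    name: ai-gateway            # hypothetical Gateway name
    namespace: gateway-system   # hypothetical; the listener must allow routes from this ModelRouter's namespace
```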
mcpServer object
MCPServer optionally exposes this router as a Model Context Protocol endpoint. Inactive until the Phase 3 MCP feature lands; the field is reserved in the schema for forward compatibility.
enabled boolean
Enabled toggles MCP exposure. Default false. When true (after Phase 3 lands), the router-proxy serves an MCP endpoint at /mcp using Streamable HTTP transport and OAuth 2.1.
policy object
Policy holds cross-cutting controls (budgets, classification, audit).
auditLog object
AuditLog controls structured audit emission. Auditing is always on; this field tunes the destination and verbosity.
filePath string
FilePath is the destination when Sink=file. Must be writable inside the router-proxy container. Defaults to "/var/log/mlx-router/audit.log".
includeRequestBody boolean
IncludeRequestBody, when true, includes the OpenAI request body in every audit entry. Disabled by default for size and privacy.
sink string
Sink selects the audit-log destination. "stdout" (default) emits one JSON object per line to the proxy container stdout, where it can be collected by the cluster log stack. "file" writes to FilePath inside the proxy container. "otlp" forwards entries to an OTel collector as log records.
enum: stdout, file, otlp
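For example, a router that writes audit entries to a file rather than stdout could be sketched as follows. The path shown is the documented default; whether the location is writable depends on how the proxy container is configured.

```yaml
# Sketch only: file-based audit log without request bodies.
spec:
  policy:
    auditLog:
      sink: file
      filePath: /var/log/mlx-router/audit.log   # documented default for sink=file
      includeRequestBody: false                 # the default; shown for clarity
```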
auth object
Auth configures request authentication. In dataPlane: Gateway mode it compiles to an Envoy AI Gateway SecurityPolicy that validates inbound JWTs and maps a verified claim onto a trusted header before any model dispatch. nil means no authentication is enforced. Authentication only; per-team model allowlists (authorization) are a separate surface.
allowlists []object
Allowlists restricts which verified team may reach which model (authorization), as a sibling of JWT (authentication). Each entry grants a team the models it may reach. JWT proves identity; Allowlists decide what that identity is permitted to do. Empty or nil means NO authorization is enforced: any authenticated request reaches any model (the authentication-only behavior of slice 2d-core, so adding this field cannot retroactively lock out an existing router). A non-empty Allowlists flips the generated SecurityPolicy to default-Deny: only the named teams (and, per entry, only their listed models) are allowed, and every other verified team is rejected with HTTP 403. Authorization requires authentication: Allowlists set without JWT is rejected fail-loud (you cannot authorize on an unverified claim), as is an entry with an empty Team or a duplicate Team. In dataPlane: Gateway mode these compile to the authorization block of the SAME SecurityPolicy JWT generates: one Allow rule per entry whose principal matches the verified TeamClaim value (and, when the entry lists models, the resolved x-ai-eg-model header).
models []string
Models is the set of model names this team may reach. Empty means the team may reach all models (identity-only allow). Each value matches the resolved model name (the x-ai-eg-model header), the same value that spec.rules[].match.models matches against.
team string required
Team is the verified teamClaim value this entry grants access to.
minLength: 1
jwt object
JWT enables JSON Web Token validation. When set (in dataPlane: Gateway mode) the gateway rejects requests without a valid token with HTTP 401 before any model dispatch, and maps the configured claim onto a trusted header.
headerKey string
HeaderKey is the request header the verified TeamClaim value lands in. Downstream team-scoped budgets key on this header. Defaults to "x-llmkube-team", matching the budget default.
issuer string required
Issuer is the OIDC issuer URL that must match the token's "iss" claim.
minLength: 1
jwksURI string required
JWKSURI is the remote JWKS endpoint the gateway fetches signing keys from to verify token signatures.
minLength: 1
provider string required
Provider is a short name for the JWT provider (e.g. "keycloak"). It labels the provider in the generated SecurityPolicy.
minLength: 1
teamClaim string required
TeamClaim is the JWT claim that identifies the tenant (e.g. "team"). Its verified value is copied into HeaderKey.
minLength: 1
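The relationship between JWT (authentication) and Allowlists (authorization) can be sketched as below. The issuer, JWKS URL, team names, model name, and Gateway name are all hypothetical; the example assumes Gateway mode, since that is where the descriptions above say these fields compile to a SecurityPolicy.

```yaml
# Sketch only: validate JWTs, then restrict which team reaches which model.
spec:
  dataPlane: Gateway
  gatewayRef:
    name: ai-gateway                     # hypothetical Gateway
  policy:
    auth:
      jwt:
        provider: keycloak
        issuer: https://idp.example.com/realms/llm           # hypothetical issuer
        jwksURI: https://idp.example.com/realms/llm/certs    # hypothetical JWKS endpoint
        teamClaim: team
        headerKey: x-llmkube-team        # the default; shown for clarity
      allowlists:
        - team: research                 # may reach only the listed model
          models: ["qwen3-8b"]
        - team: platform                 # empty models: may reach every model
```

With this non-empty allowlist, a verified token whose team claim is neither "research" nor "platform" would be rejected with HTTP 403, per the default-Deny behavior described above.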
budgets []object
Budgets caps token and dollar consumption per scope over a rolling window. Empty list means no budget enforcement.
headerKey string
HeaderKey is the request header carrying the team identifier when Scope=team. Defaults to "x-llmkube-team".
maxTokens integer
MaxTokens caps total tokens (prompt + completion) over the window. Either MaxTokens or MaxUSD (or both) must be set.
format: int64
minimum: 1
maxUSD string
MaxUSD caps total estimated cost in USD over the window. Cost is computed from RouterBackend.CostPerMillionTokens.
pattern: ^[0-9]+(\.[0-9]+)?$
name string required
Name identifies this budget for metrics, status, and audit logs.
pattern: ^[a-z0-9][a-z0-9-]{0,62}$
ruleName string
RuleName is required when Scope=rule. References a RouterRule.Name.
scope string required
Scope determines what the budget applies to. "router" caps all traffic through this ModelRouter. "rule" caps traffic matching a single named rule (see RuleName). "team" caps traffic identified by a request header (see HeaderKey).
enum: router, rule, team
windowSeconds integer
WindowSeconds is the rolling window over which the cap is evaluated.
format: int32
minimum: 1
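The three scopes can be illustrated with one budget each. Budget names, caps, and the rule name are hypothetical; the rule-scoped entry assumes a rule called "pii-local-only" exists in spec.rules.

```yaml
# Sketch only: one budget per scope, each over a rolling window.
spec:
  policy:
    budgets:
      - name: router-daily
        scope: router
        windowSeconds: 86400       # 24 hours
        maxUSD: "50.00"            # string, per the schema pattern
      - name: team-hourly
        scope: team
        headerKey: x-llmkube-team  # the default; shown for clarity
        windowSeconds: 3600
        maxTokens: 2000000
      - name: pii-rule-cap
        scope: rule
        ruleName: pii-local-only   # hypothetical RouterRule.Name
        windowSeconds: 3600
        maxTokens: 500000
```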
classification object
Classification configures how the router determines the data classification of an inbound request.
headerKey string
HeaderKey is the request header carrying the classification. Defaults to "x-llmkube-classification".
mode string
Mode selects how the router determines a request's classification. "header-only" (default) trusts the request header (HeaderKey, defaults to x-llmkube-classification). "detector" runs the bundled in-proxy detector. "hybrid" prefers the header, falling back to the detector when no header is present.
enum: header-only, detector, hybrid
sensitiveClassifications []string
SensitiveClassifications are the classification values that trigger fail-closed validation: any rule matching one of these values must have FailClosed=true and reference only local-tier backends. Defaults to ["pii", "phi"].
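A classification block that prefers the header but falls back to the detector might be sketched like this. The values shown for headerKey and sensitiveClassifications are the documented defaults, written out explicitly.

```yaml
# Sketch only: hybrid classification with the default header and sensitive set.
spec:
  policy:
    classification:
      mode: hybrid
      headerKey: x-llmkube-classification
      sensitiveClassifications: ["pii", "phi"]
```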
proxy object
Proxy configures the managed router-proxy Deployment (replicas, image override for air-gapped sites, resources). Sensible defaults apply when omitted.
image string
Image overrides the default router-proxy container image. Useful for air-gapped clusters that pin to an internal registry digest.
imagePullSecrets []object
ImagePullSecrets are passed through to the router-proxy pod spec.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
quarantineDuration string
QuarantineDuration controls how long the proxy keeps a backend in the "skip" state after a 5xx or network error before allowing a half-open probe. Default 15s when unset. Shorter windows make the proxy recover faster from transient blips; longer windows reduce probe load on genuinely-down upstreams. Tests can shrink this to sub-second to verify recovery without sleeping the full default.
replicas integer
Replicas is the desired number of router-proxy pods. Defaults to 1. The proxy is stateless for routing decisions; budget and SLO counters live in memory and reset on pod restart until the persistence feature lands.
format: int32
minimum: 1
maximum: 10
resources object
Resources sets the pod's compute resource requests and limits.
claims []object
Claims lists the names of resources, defined in spec.resourceClaims, that are used by this container. This field depends on the DynamicResourceAllocation feature gate. This field is immutable. It can only be set for containers.
name string required
Name must match the name of one entry in pod.spec.resourceClaims of the Pod where this field is used. It makes that resource available inside a container.
request string
Request is the name chosen for a request in the referenced claim. If empty, everything from the claim is made available, otherwise only the result of this request.
limits object
Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
requests object
Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
responseHeaderTimeout string
ResponseHeaderTimeout caps how long the proxy waits for the upstream to begin sending response headers. For non-streaming chat completions this is effectively a max-generation-time cap; for streaming dispatches the first SSE chunk arrives well inside the window so the cap is invisible. Default 120s when unset. Per-rule and per-backend timeouts (see RouterRule.Timeout and RouterBackend.Timeout) tighten this on a per-request basis but cannot extend it beyond this cap.
revisionHistoryLimit integer
RevisionHistoryLimit caps how many old ReplicaSets the proxy Deployment keeps for rollback. Unset uses the Kubernetes default (10); 0 keeps none. Useful to bound ReplicaSet buildup, since the proxy re-rolls on every config change.
format: int32
minimum: 0
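The proxy settings above could be combined as in the following sketch. The image reference, pull-secret name, and resource figures are hypothetical, and the duration strings assume the same format as the quoted defaults ("15s", "120s").

```yaml
# Sketch only: a two-replica proxy pinned to an internal registry.
spec:
  proxy:
    replicas: 2                    # allowed range 1-10
    image: registry.internal.example.com/llmkube/router-proxy:v0.0.0   # hypothetical image
    imagePullSecrets:
      - name: internal-registry    # hypothetical Secret
    quarantineDuration: 5s         # shorter than the 15s default
    responseHeaderTimeout: 120s    # the default; shown for clarity
    revisionHistoryLimit: 3
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        memory: 256Mi
```

Because budget and SLO counters live in memory, running more than one replica presumably means each pod keeps its own counters; the schema text above does not say how they are reconciled.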
rules []object
Rules are evaluated in declaration order. The first matching rule wins. If no rule matches, the fallback is governed by DefaultRouteStrategy, ending at DefaultRoute. If nothing resolves, the request is rejected with HTTP 503.
failClosed boolean
When FailClosed is true and no backend in Route.Backends is healthy or otherwise eligible, the router rejects the request with HTTP 503 rather than falling through to DefaultRoute or subsequent rules. This is the regulated-data gate: a fail-closed rule guarantees that matched requests are never served by any other route.
match object
Match groups all match conditions. All declared conditions must be true for the rule to fire (AND semantics). If Match is omitted the rule always matches (useful as a catch-all before DefaultRoute).
dataClassification []string
DataClassification matches if the inbound request carries any of these classifications. The classification source depends on Policy.Classification.Mode: a request header (x-llmkube-classification by default), the bundled detector, or both. Common values: "public", "internal", "confidential", "pii", "phi".
headers object
Headers performs exact-match equality on inbound HTTP headers (case-insensitive header name comparison).
latencySLOMs integer
LatencySLOMs is a P95 first-token-latency target in milliseconds. When set, if the rolling P95 for the primary backend exceeds this value the rule promotes its declared fallback. Honored only by the "primary-fallback" strategy.
format: int32
minimum: 1
models []string
Models matches against the OpenAI-style "model" field in the request body. Glob patterns are supported (e.g. "qwen3-*").
requiredCapabilities []string
RequiredCapabilities filters backends. The rule only matches if at least one backend in Route.Backends advertises every listed capability.
taskComplexity string
TaskComplexity matches the inbound complexity hint (header x-llmkube-task-complexity).
enum: simple, moderate, complex
name string required
Name is used in audit logs and metrics labels.
pattern: ^[a-z0-9][a-z0-9-]{0,62}$
route object required
Route is the action taken when this rule matches.
backends []string required
Backends is an ordered list of RouterBackend.Name values. For the "primary-fallback" strategy, the first entry is the primary and subsequent entries are tried in order on failure. For "weighted", traffic is distributed across all entries by Backend.Weight. For "shadow", the first entry serves the response and subsequent entries receive mirrored requests for evaluation only.
minItems: 1
strategy string
Strategy selects how multiple backends are used.
enum: primary-fallback, weighted, shadow
timeout string
Timeout caps how long the proxy waits for the upstream to begin sending response headers on dispatches matched by this rule. When set it overrides RouterBackend.Timeout and the proxy default. Resolution order at dispatch time: rule.timeout || backend.timeout || proxy default. Useful for tightening regulated-data rules (sub-10s strict fail-fast) or extending long-reasoning rules (120s+).
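First-match evaluation and the three strategies are easier to see in a concrete rule list. The sketch below assumes backends named "local-qwen", "local-llama", and "claude-opus-4" are declared in spec.backends (all hypothetical names), and that timeouts use the same duration format as the quoted defaults.

```yaml
# Sketch only: rules are evaluated top to bottom; the first match wins.
spec:
  rules:
    - name: pii-local-only
      failClosed: true                    # reject with 503 instead of falling through
      match:
        dataClassification: ["pii", "phi"]
      route:
        strategy: primary-fallback
        backends: ["local-qwen"]          # sensitive rules may reference only local-tier backends
      timeout: 10s                        # tighter fail-fast cap for this rule
    - name: complex-tools
      match:                              # all conditions must hold (AND semantics)
        taskComplexity: complex
        requiredCapabilities: ["tools"]
        latencySLOMs: 1500                # promote the fallback if rolling P95 exceeds 1.5 s
      route:
        strategy: primary-fallback
        backends: ["claude-opus-4", "local-qwen"]
    - name: qwen-weighted
      match:
        models: ["qwen3-*"]               # glob on the request's "model" field
      route:
        strategy: weighted                # split by each backend's weight
        backends: ["local-qwen", "local-llama"]
```

Placing the fail-closed rule first means a request classified as "pii" never reaches the later rules, even when it also carries a complex-task hint.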
status object
status defines the observed state of ModelRouter
activeRules integer
ActiveRules is the count of rules that successfully validated against current backend state.
format: int32
backends []object
Backends reports the resolved address and current health of every declared backend.
address string
Address is the resolved upstream URL the router-proxy dispatches to. For local backends this is the InferenceService's cluster URL; for external backends it is the provider's base URL.
healthy boolean
Healthy reflects the most recent probe result.
lastProbeTime string
LastProbeTime is when the proxy last completed a health probe for this backend.
format: date-time
message string
Message provides extra context, especially when Healthy is false (e.g. "InferenceService not Ready", "Secret missing key ANTHROPIC_API_KEY").
name string required
Name matches RouterBackend.Name.
tier string
Tier mirrors RouterBackend.Tier for convenience.
budgetUtilization []object
BudgetUtilization summarises current budget consumption.
name string required
Name matches BudgetSpec.Name.
tokensUsed integer
TokensUsed is the rolling-window token count.
format: int64
usdUsed string
USDUsed is the rolling-window estimated cost in USD.
utilization string
Utilization is the fraction of the budget consumed, 0.0 to 1.0. When both MaxTokens and MaxUSD are set this is the maximum of the two utilizations.
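As a worked example of the "maximum of the two" rule, suppose a hypothetical budget named "router-daily" sets maxTokens to 1000000 and maxUSD to "10.00". The status entry might then read as follows; the exact string formatting of usdUsed and utilization is an assumption.

```yaml
# Sketch only: utilization when both caps are set.
status:
  budgetUtilization:
    - name: router-daily
      tokensUsed: 250000     # 250000 / 1000000 = 0.25
      usdUsed: "4.00"        # 4.00 / 10.00 = 0.40
      utilization: "0.40"    # max(0.25, 0.40)
```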
conditions []object
conditions represent the current state of the ModelRouter resource. Standard condition types:
- "Validated": the spec passed static validation
- "BackendsReady": all referenced backends are reachable and healthy
- "Available": the router-proxy is serving traffic
- "Degraded": at least one backend is unhealthy but the router can still serve other routes
- "GatewayReady": (dataPlane: Gateway) the gateway resources reconciled
lastTransitionTime string required
lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format: date-time
message string required
message is a human readable message indicating details about the transition. This may be an empty string.
maxLength: 32768
observedGeneration integer
observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance.
format: int64
minimum: 0
reason string required
reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty.
pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
minLength: 1
maxLength: 1024
status string required
status of the condition, one of True, False, Unknown.
enum: True, False, Unknown
type string required
type of condition in CamelCase or in foo.example.com/CamelCase.
pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
maxLength: 316
endpoint string
Endpoint is the in-cluster URL clients should hit. Populated once the router-proxy Service is ready.
gateway object
Gateway reports the observed state of dataPlane: Gateway exposure: whether the AIGatewayRoute (and its backing Backend / AIServiceBackend / BackendTrafficPolicy) reconciled, and the resolved gateway endpoint. nil in Proxy mode. Also surfaced via the GatewayReady condition.
authEnabled boolean
AuthEnabled indicates a SecurityPolicy enforcing JWT authentication was compiled for this route (ModelRouter policy.auth.jwt). Set by the ModelRouter dataPlane: Gateway path; false when no auth is configured.
endpoint string
Endpoint is the gateway address clients send OpenAI requests to. Set by the ModelRouter dataPlane: Gateway path (resolved from the referenced Gateway); empty for the InferenceService path.
modelName string
ModelName is the resolved model-name match value clients send as the OpenAI "model" string to reach this InferenceService through the gateway. Set by the InferenceService path; empty for ModelRouter (which fronts many model names).
routeReady boolean
RouteReady indicates the AIGatewayRoute (and its backing Backend + AIServiceBackend) were reconciled successfully against the gateway.
lastUpdated string
LastUpdated is the timestamp of the last status reconciliation.
format: date-time
phase string
Phase is a coarse summary of the router's state. Possible values: Pending, Provisioning, Ready, Degraded, Failed.
enum: Pending, Provisioning, Ready, Degraded, Failed
