Model

inference.llmkube.dev / v1alpha1

apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example

apiVersion string

APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources

kind string

Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds

metadata object

spec object required

spec defines the desired state of Model

files []string

Files lists model weight artifacts to stage from Source. Entries are repo-relative for repository sources. The first entry is the primary model file passed to the runtime.

format string

Format specifies the model file format. "gguf" is used with the llama-server runtime; "mlx" is used with the oMLX runtime; "safetensors", "pytorch", and "custom" are used with the generic runtime.

enum: gguf, mlx, safetensors, pytorch, custom

hardware object

Hardware specifies hardware acceleration preferences

accelerator string

Accelerator specifies the type of hardware acceleration. "vulkan" covers AMD and Intel GPUs using the Vulkan runtime (gpu.vendor: amd/intel + gpu.runtime: vulkan). When set to "vulkan" the readiness-check path uses devic.es/dri-render as the GPU resource name instead of amd.com/gpu or nvidia.com/gpu.

enum: cpu, metal, cuda, rocm, intel, vulkan

gpu object

GPU specifies GPU device requirements

count integer

Count specifies the number of GPUs required. Supports multi-GPU for model sharding (future feature).

format: int32

minimum: 0

maximum: 8

enabled boolean

Enabled indicates whether GPU acceleration is enabled

layers integer

Layers specifies layer offloading configuration for multi-GPU. Format: number of layers to offload to GPU (e.g., 32 for full offload on a 7B model). -1 means auto-detect the optimal layer split.

format: int32

minimum: -1

memory string

Memory specifies minimum GPU memory required per GPU (e.g., "8Gi", "16Gi")

resourceClaims []object

ResourceClaims defines DRA (Dynamic Resource Allocation) claims for GPU devices. Uses resource.k8s.io/v1 PodResourceClaim format. Each claim must have exactly one of resourceClaimName or resourceClaimTemplateName set. Mutually exclusive with resourceName.

maxItems: 16

name string required

Name uniquely identifies this resource claim inside the pod. This must be a DNS_LABEL.

resourceClaimName string

ResourceClaimName is the name of a ResourceClaim object in the same namespace as this pod. Exactly one of ResourceClaimName and ResourceClaimTemplateName must be set.

resourceClaimTemplateName string

ResourceClaimTemplateName is the name of a ResourceClaimTemplate object in the same namespace as this pod. The template will be used to create a new ResourceClaim, which will be bound to this pod. When this pod is deleted, the ResourceClaim will also be deleted. The pod name and resource name, along with a generated component, will be used to form a unique name for the ResourceClaim, which will be recorded in pod.status.resourceClaimStatuses. When the DRAWorkloadResourceClaims feature gate is enabled and the pod belongs to a PodGroup that defines a PodGroupResourceClaim with the same Name and ResourceClaimTemplateName, this PodResourceClaim resolves to the ResourceClaim generated for the PodGroup. All pods in the group that define an equivalent PodResourceClaim matching the PodGroupResourceClaim's Name and ResourceClaimTemplateName share the same generated ResourceClaim. ResourceClaims generated for a PodGroup are owned by the PodGroup and their lifecycles are tied to the PodGroup instead of any individual pod. This field is immutable and no changes will be made to the corresponding ResourceClaim by the control plane after creating the ResourceClaim. Exactly one of ResourceClaimName and ResourceClaimTemplateName must be set.
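As an illustration of the resourceClaims fields above, the following is a minimal sketch of a Model that requests its GPU through a DRA claim template. The metadata name, the claim name, and the ResourceClaimTemplate name are hypothetical, and the template is assumed to already exist in the same namespace. resourceName is left unset because the two are mutually exclusive.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-dra                    # hypothetical name
spec:
  source: /mnt/models/model.gguf
  format: gguf
  hardware:
    gpu:
      enabled: true
      resourceClaims:
        - name: gpu                    # must be a DNS_LABEL, unique inside the pod
          # Exactly one of resourceClaimName / resourceClaimTemplateName may be set.
          resourceClaimTemplateName: example-gpu-template   # hypothetical template
```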

resourceName string

ResourceName overrides the extended resource the operator requests for this Model's pods. Defaults are derived from Vendor:

- nvidia -> nvidia.com/gpu
- amd -> amd.com/gpu
- intel -> gpu.intel.com/i915

Set this for non-default device plugins (e.g. squat/generic-device-plugin advertising `squat.ai/dri-render`, NVIDIA MIG slices). When set, this value also drives the GPU toleration unless TolerationKey is provided explicitly.

pattern: ^[a-z0-9.\-]+/[a-z0-9._\-]+$
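A minimal sketch of a resourceName override, assuming a cluster where squat/generic-device-plugin advertises `squat.ai/dri-render` as described above. The metadata name is hypothetical. tolerationKey is omitted, so the toleration key should follow resourceName.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-custom-resource        # hypothetical name
spec:
  source: /mnt/models/model.gguf
  format: gguf
  hardware:
    gpu:
      enabled: true
      count: 1
      # Replaces the vendor-derived default (e.g. amd -> amd.com/gpu).
      # Must match ^[a-z0-9.\-]+/[a-z0-9._\-]+$
      resourceName: squat.ai/dri-render
```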

runtime string

Runtime selects the GPU compute backend the operator schedules for this Model, independent of the Vendor field. It exists so `vendor: amd` is not overloaded to mean both "ROCm" and "Vulkan". For the llama.cpp inference backend with `vendor: amd`:

- "vulkan": schedule LLMKube's Vulkan llama.cpp image and request the generic-device-plugin resource `devic.es/dri-render` (unless ResourceName overrides it). The plugin injects /dev/dri; the non-root container still needs the host render group, supplied via InferenceService.spec.podSecurityContext.supplementalGroups.
- "rocm": the historical behavior (amd -> amd.com/gpu, stock image).
- "" (empty): back-compatible, identical to "rocm".

Ignored for non-AMD vendors and non-llama.cpp backends.

enum: vulkan, rocm
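A minimal sketch of selecting the Vulkan backend on an AMD GPU, combining the accelerator, vendor, and runtime fields as the descriptions above suggest. The metadata name and source path are illustrative. The render-group requirement is configured on the InferenceService, not here, so it does not appear in this manifest.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-vulkan                 # hypothetical name
spec:
  source: /mnt/models/model.gguf
  format: gguf
  hardware:
    accelerator: vulkan
    gpu:
      enabled: true
      count: 1
      vendor: amd
      runtime: vulkan                  # omit or use "rocm" for the historical behavior
```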

sharding object

Sharding defines how to shard the model across multiple GPUs. Only applicable when Count > 1.

layerSplit []string

LayerSplit defines custom layer splits per GPU. Example: [0-15, 16-31] for a 2-GPU split of a 32-layer model. If empty, an even split is calculated automatically.

strategy string

Strategy defines the sharding approach for multi-GPU model execution.

- "layer" (default): shard by transformer layers. llama.cpp --split-mode layer.
- "tensor" (alias: "row"): true tensor parallelism. llama.cpp --split-mode row. Splits each tensor operation across GPUs rather than assigning whole layers to each. Performance varies by workload; typically better on compute-bound ops.
- "none": disable multi-GPU sharding (single GPU). llama.cpp --split-mode none.
- "pipeline": accepted for forward compatibility but currently falls back to "layer" with a reconciler warning; llama.cpp has no pipeline split-mode.

enum: layer, tensor, row, pipeline, none
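A minimal sketch of a two-GPU layer split, assuming a 32-layer model as in the LayerSplit example above. The metadata name, source path, and the choice of NVIDIA/CUDA are illustrative. Leaving layerSplit empty should yield an even split.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-sharded                # hypothetical name
spec:
  source: /mnt/models/model.gguf
  format: gguf
  hardware:
    accelerator: cuda
    gpu:
      enabled: true
      count: 2                         # sharding only applies when count > 1
      vendor: nvidia
      sharding:
        strategy: layer                # maps to llama.cpp --split-mode layer
        layerSplit:                    # one range per GPU, quoted so each stays a string
          - "0-15"
          - "16-31"
```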

tolerationKey string

TolerationKey overrides the taint key the operator tolerates when scheduling GPU pods. Defaults to ResourceName (or the vendor default resource name when ResourceName is unset), so in most cases this can be left empty.

pattern: ^[a-z0-9.\-]+/[a-z0-9._\-]+$

vendor string

Vendor specifies the GPU vendor preference (nvidia, amd, intel). Future-proof for multi-vendor support.

enum: nvidia, amd, intel

memoryBudget string

MemoryBudget is an absolute memory limit for the model process (e.g., "24Gi", "8192Mi"). When set, it takes precedence over MemoryFraction and the agent-level --memory-fraction flag. Parsed via resource.ParseQuantity().

memoryFraction number

MemoryFraction is the fraction of total system memory to budget for this model's inference process (0.0–1.0). Takes precedence over the agent-level --memory-fraction flag but not MemoryBudget.

mmproj string

Mmproj is an optional multimodal projector file to stage from Source and pass to runtimes that support projector arguments.

quantization string

Quantization describes the quantization level (e.g., Q4_0, Q5_K_M, F16)

refreshPolicy string

RefreshPolicy controls whether a cached model file is re-fetched when the upstream source changes.

- "IfNotPresent" (default): download only if the cached file is missing. Upstream changes are still detected and surfaced via the SourceDrifted condition, but the cached file is never re-fetched on its own. This preserves the historical behavior so an operator upgrade triggers no surprise re-pulls.
- "OnChange": re-download when the upstream bytes differ from what was cached (HTTP ETag/Content-Length for remote sources, file size/mtime for local sources). The re-download overwrites the file in the existing cache directory; the cache key is unchanged.

enum: IfNotPresent, OnChange
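A minimal sketch of opting in to automatic re-downloads for a remote source. The metadata name is hypothetical and the URL is the placeholder from the Source examples. With the default IfNotPresent, the same upstream change would only be reported through the SourceDrifted condition.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-refresh                # hypothetical name
spec:
  source: https://huggingface.co/org/repo/resolve/main/model.gguf
  format: gguf
  # Re-fetch when the upstream ETag/Content-Length differs from the cached copy.
  refreshPolicy: OnChange
```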

resources object

Resources defines resource requirements for running the model

cpu string

CPU specifies CPU requirements (e.g., "2" or "2000m")

memory string

Memory specifies memory requirements (e.g., "4Gi")

sha256 string

SHA256 is the expected SHA256 hash of the model file for integrity verification. When set, the controller verifies the downloaded/copied file matches this hash.

pattern: ^[a-fA-F0-9]{64}$

source string required

Source defines where to obtain the model. For GGUF models: URL or path to a .gguf file. For MLX models: local directory path containing the model (config.json, weights). Supported schemes: http://, https://, file://, pvc://, or absolute paths.

Examples:

- https://huggingface.co/org/repo/resolve/main/model.gguf
- file:///mnt/models/model.gguf
- /mnt/models/model.gguf (air-gapped deployments)
- pvc://my-models-pvc/path/to/model.gguf (pre-staged on a PersistentVolumeClaim)
- /mnt/models/Llama-3.2-3B-Instruct-4bit (MLX model directory)

file:// caveat for hybrid topologies: the controller pod must be able to read the path. In Mac kind / k3s / GKE deployments where the metal-agent runs on the host and the controller runs inside a container, /Users/... and other host paths are invisible to the controller and will fail to fetch. The controller marks the Model Failed and backs off to a 5-minute requeue rather than retrying tightly (#405). Workaround: pre-stage on a pvc://, or use the equivalent https://huggingface.co/.../<filename>.gguf URL which the runtime/init container resolves at deploy time.

pattern: ^(https?|file|pvc|hf)://.*|^/[^\s]+$|^[a-zA-Z0-9][\w\-\.\/]+$
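A minimal sketch of a Model whose weights are pre-staged on a PersistentVolumeClaim, with integrity verification and resource requirements set. The metadata name is hypothetical and the PVC path is the placeholder from the examples above. The sha256 value shown is only a well-formed placeholder (the SHA-256 of empty input) and must be replaced with the real hash of the model file.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-pvc                    # hypothetical name
spec:
  source: pvc://my-models-pvc/path/to/model.gguf
  format: gguf
  quantization: Q5_K_M
  # Placeholder: 64 hex characters; replace with the hash of the actual file.
  sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
  resources:
    cpu: "2"
    memory: 4Gi
```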

status object

status defines the observed state of Model

acceleratorReady boolean

AcceleratorReady indicates if hardware acceleration is configured and ready

cacheKey string

CacheKey is the SHA256 hash prefix of the source URL, used for cache storage. Models with the same source URL share the same cache entry.

conditions []object

conditions represent the current state of the Model resource. Each condition has a unique type and reflects the status of a specific aspect of the resource. Standard condition types include:

- "Available": the model is downloaded and ready for use
- "Progressing": the model is being downloaded or processed
- "Degraded": the model download or setup failed
- "SourceDrifted": the upstream source bytes differ from the cached copy

The status of each condition is one of True, False, or Unknown.

lastTransitionTime string required

lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.

format: date-time

message string required

message is a human readable message indicating details about the transition. This may be an empty string.

maxLength: 32768

observedGeneration integer

observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance.

format: int64

minimum: 0

reason string required

reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty.

pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$

minLength: 1

maxLength: 1024

status string required

status of the condition, one of True, False, Unknown.

enum: True, False, Unknown

type string required

type of condition in CamelCase or in foo.example.com/CamelCase.

pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$

maxLength: 316

gguf object

GGUF contains metadata extracted from the GGUF file header

architecture string

Architecture is the model architecture (e.g., "llama", "mistral", "phi")

contextLength integer

ContextLength is the maximum context length (tokens)

format: int64

embeddingSize integer

EmbeddingSize is the embedding dimension size

format: int64

fileVersion integer

FileVersion is the GGUF file format version

format: int32

headCount integer

HeadCount is the number of attention heads

format: int64

layerCount integer

LayerCount is the number of transformer layers/blocks

format: int64

license string

License is the license identifier extracted from the GGUF file metadata

modelName string

ModelName is the model name as stored in the GGUF file

quantization string

Quantization is the quantization type (e.g., "Q4_K_M", "Q5_K_M")

tensorCount integer

TensorCount is the number of tensors in the model

format: int64

lastRevalidated string

LastRevalidated is the timestamp of the last upstream revalidation check. Revalidation is cadence-gated so the controller does not issue a HEAD on every reconcile.

format: date-time

lastUpdated string

LastUpdated is the timestamp of the last status update

format: date-time

path string

Path represents the local path where the model is stored

phase string

Phase represents the current lifecycle phase of the model. Possible values: Pending, Downloading, Copying, Ready, Failed.

enum: Pending, Downloading, Copying, Ready, Failed

sha256 string

SHA256 is the computed SHA256 hash of the model file. Populated after download/copy for integrity tracking.

size string

Size represents the size of the downloaded model file

sourceContentLength integer

SourceContentLength is the upstream size recorded at the last revalidation. For http/https sources it is the Content-Length reported by a HEAD request; for local sources it is the file size on disk. Used together with SourceETag (or mtime for local sources) to detect upstream changes.

format: int64

sourceETag string

SourceETag is the HTTP ETag recorded for the upstream source at the last revalidation. Used to detect upstream changes for http/https sources (HuggingFace serves the blob SHA as the ETag, so a moved branch is caught).

stagedFiles []string

StagedFiles lists the repo-relative paths staged in the model cache. Populated when spec.files or spec.mmproj are set.
