API Reference
Packages
inference.networking.x-k8s.io/v1alpha2
Package v1alpha2 contains API Schema definitions for the inference.networking.x-k8s.io API group.
Resource Types
Criticality
Underlying type: string
Criticality defines how important it is to serve the model compared to other models. Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional (use a pointer) and must set no default. This allows us to union this with a oneOf field in the future should we wish to adjust or extend this behavior.
Validation:
- Enum: [Critical Standard Sheddable]
Appears in:
- InferenceModelSpec
Field | Description |
---|---|
Critical | Critical defines the highest level of criticality. Requests to this band will be shed last. |
Standard | Standard defines the base criticality level and is more important than Sheddable but less important than Critical. Requests in this band will be shed before critical traffic. Most models are expected to fall within this band. |
Sheddable | Sheddable defines the lowest level of criticality. Requests to this band will be shed before all other bands. |
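The shedding order described in the table can be sketched as a simple ranking. This is only an illustration: the numeric ranks and the helper names `effective` and `shed_order` are assumptions, not part of the API. Mapping an unset value to Standard follows the allowance in the InferenceModelSpec `criticality` description, which says implementations may treat an unset value this way.

```python
from enum import IntEnum
from typing import Optional


class Criticality(IntEnum):
    """The three bands; the numeric ranks are illustrative, lower is shed earlier."""
    SHEDDABLE = 0
    STANDARD = 1
    CRITICAL = 2


def effective(value: Optional[Criticality]) -> Criticality:
    """Treat an unset criticality as Standard (permitted, not required, by the spec)."""
    return Criticality.STANDARD if value is None else value


def shed_order(values: list[Optional[Criticality]]) -> list[Criticality]:
    """Return the bands in the order they would be shed, least critical first."""
    return sorted(effective(v) for v in values)


if __name__ == "__main__":
    for band in shed_order([Criticality.CRITICAL, None, Criticality.SHEDDABLE]):
        print(band.name)
```

Keeping the field optional in the sketch mirrors the requirement that references to Criticality use a pointer and set no default.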
EndpointPickerConfig
EndpointPickerConfig specifies the configuration needed by the proxy to discover and connect to the endpoint picker extension. This type is intended to be a union of mutually exclusive configuration options that we may add in the future.
Appears in:
- InferencePoolSpec
Field | Description | Default | Validation |
---|---|---|---|
extensionRef Extension | Extension configures an endpoint picker as an extension service. | | Required: {} |
Extension
Extension specifies how to configure an extension that runs the endpoint picker.
Appears in:
- EndpointPickerConfig
- InferencePoolSpec
Field | Description | Default | Validation |
---|---|---|---|
group Group | Group is the group of the referent. The default value is "", representing the Core API group. | | MaxLength: 253 Pattern: ^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$ |
kind Kind | Kind is the Kubernetes resource kind of the referent. For example "Service". Defaults to "Service" when not specified. ExternalName services can refer to CNAME DNS records that may live outside of the cluster and as such are difficult to reason about in terms of conformance. They also may not be safe to forward to (see CVE-2021-25740 for more information). Implementations MUST NOT support ExternalName Services. | Service | MaxLength: 63 MinLength: 1 Pattern: ^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$ |
name ObjectName | Name is the name of the referent. | | MaxLength: 253 MinLength: 1 Required: {} |
portNumber PortNumber | The port number on the service running the extension. When unspecified, implementations SHOULD infer a default value of 9002 when the Kind is Service. | | Maximum: 65535 Minimum: 1 |
failureMode ExtensionFailureMode | Configures how the gateway handles the case when the extension is not responsive. Defaults to FailClose. | FailClose | Enum: [FailOpen FailClose] |
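The defaulting rules in the table can be sketched as follows. The `Extension` dataclass and the `effective_port` helper are hypothetical names used for illustration; only the default values (`""`, `Service`, `9002`, `FailClose`) and the port range come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Extension:
    """Illustrative mirror of the Extension fields, with the documented defaults."""
    name: str                          # required
    group: str = ""                    # "" represents the Core API group
    kind: str = "Service"
    port_number: Optional[int] = None  # unset unless the user provides one
    failure_mode: str = "FailClose"


def effective_port(ext: Extension) -> Optional[int]:
    """Resolve the port an implementation would use for the extension."""
    if ext.port_number is not None:
        if not 1 <= ext.port_number <= 65535:
            raise ValueError("portNumber must be in the range 1 to 65535")
        return ext.port_number
    # SHOULD-level default: 9002 is inferred only when the Kind is Service.
    return 9002 if ext.kind == "Service" else None


if __name__ == "__main__":
    print(effective_port(Extension(name="my-endpoint-picker")))
```

Returning `None` for a non-Service kind is a choice made for this sketch; the reference does not say what implementations should do in that case.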
ExtensionConnection
ExtensionConnection encapsulates options that configure the connection to the extension.
Appears in:
- Extension
Field | Description | Default | Validation |
---|---|---|---|
failureMode ExtensionFailureMode | Configures how the gateway handles the case when the extension is not responsive. Defaults to FailClose. | FailClose | Enum: [FailOpen FailClose] |
ExtensionFailureMode
Underlying type: string
ExtensionFailureMode defines the options for how the gateway handles the case when the extension is not responsive.
Validation:
- Enum: [FailOpen FailClose]
Appears in:
- Extension
- ExtensionConnection
Field | Description |
---|---|
FailOpen | FailOpen specifies that the proxy should not drop the request and should forward it to an endpoint of its picking. |
FailClose | FailClose specifies that the proxy should drop the request. |
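The difference between the two modes can be illustrated with a small sketch. The `route` function, its callbacks, and the use of `TimeoutError` to model an unresponsive extension are all assumptions made for the example, not how any particular gateway implements this.

```python
from typing import Callable, Optional


def route(
    pick_endpoint: Callable[[], str],
    proxy_fallback: Callable[[], str],
    failure_mode: str = "FailClose",
) -> Optional[str]:
    """Return the endpoint to forward to, or None when the request is dropped."""
    try:
        return pick_endpoint()
    except TimeoutError:
        if failure_mode == "FailOpen":
            # FailOpen: keep the request and let the proxy choose an endpoint.
            return proxy_fallback()
        # FailClose: drop the request.
        return None


def unresponsive_extension() -> str:
    """Stand-in for an endpoint picker that does not answer in time."""
    raise TimeoutError("endpoint picker did not respond")


if __name__ == "__main__":
    print(route(unresponsive_extension, lambda: "10.0.0.7:8000", "FailOpen"))
    print(route(unresponsive_extension, lambda: "10.0.0.7:8000", "FailClose"))
```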
ExtensionReference
ExtensionReference is a reference to the extension deployment.
Appears in:
- Extension
Field | Description | Default | Validation |
---|---|---|---|
group Group | Group is the group of the referent. The default value is "", representing the Core API group. | | MaxLength: 253 Pattern: ^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$ |
kind Kind | Kind is the Kubernetes resource kind of the referent. For example "Service". Defaults to "Service" when not specified. ExternalName services can refer to CNAME DNS records that may live outside of the cluster and as such are difficult to reason about in terms of conformance. They also may not be safe to forward to (see CVE-2021-25740 for more information). Implementations MUST NOT support ExternalName Services. | Service | MaxLength: 63 MinLength: 1 Pattern: ^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$ |
name ObjectName | Name is the name of the referent. | | MaxLength: 253 MinLength: 1 Required: {} |
portNumber PortNumber | The port number on the service running the extension. When unspecified, implementations SHOULD infer a default value of 9002 when the Kind is Service. | | Maximum: 65535 Minimum: 1 |
Group
Underlying type: string
Group refers to a Kubernetes Group. It must either be an empty string or an RFC 1123 subdomain.
This validation is based off of the corresponding Kubernetes validation: https://github.com/kubernetes/apimachinery/blob/02cfb53916346d085a6c6c7c66f882e3c6b0eca6/pkg/util/validation/validation.go#L208
Valid values include:
- "" - empty string implies core Kubernetes API group
- "gateway.networking.k8s.io"
- "foo.example.com"
Invalid values include:
- "example.com/bar" - "/" is an invalid character
Validation:
- MaxLength: 253
- Pattern: ^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$
Appears in:
- Extension
- ExtensionReference
- PoolObjectReference
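As a quick check of the rules above, the sketch below applies the listed pattern and length limit to the example values. The helper name `is_valid_group` is hypothetical; the pattern and the 253-character limit are taken from the Validation list.

```python
import re

# Pattern and length limit copied from the Validation list for Group.
GROUP_PATTERN = re.compile(
    r"^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$"
)
GROUP_MAX_LENGTH = 253


def is_valid_group(value: str) -> bool:
    """Return True when value is empty or looks like an RFC 1123 subdomain."""
    return (
        len(value) <= GROUP_MAX_LENGTH
        and GROUP_PATTERN.fullmatch(value) is not None
    )


if __name__ == "__main__":
    for candidate in ["", "gateway.networking.k8s.io", "foo.example.com", "example.com/bar"]:
        print(repr(candidate), is_valid_group(candidate))
```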
InferenceModel
InferenceModel is the Schema for the InferenceModels API.
Field | Description | Default | Validation |
---|---|---|---|
apiVersion string | inference.networking.x-k8s.io/v1alpha2 | | |
kind string | InferenceModel | | |
metadata ObjectMeta | Refer to Kubernetes API documentation for fields of metadata. | | |
spec InferenceModelSpec | | | |
status InferenceModelStatus | | | |
InferenceModelSpec
InferenceModelSpec represents the desired state of a specific model use case. This resource is managed by the "Inference Workload Owner" persona.
The Inference Workload Owner persona is someone that trains, verifies, and leverages a large language model from a model frontend, drives the lifecycle and rollout of new versions of those models, and defines the specific performance and latency goals for the model. These workloads are expected to operate within an InferencePool sharing compute capacity with other InferenceModels, defined by the Inference Platform Admin.
InferenceModel's modelName (not the ObjectMeta name) is unique for a given InferencePool. If the name is reused, an error will be shown on the status of the InferenceModel that attempted to reuse it. The oldest InferenceModel, based on creation timestamp, will be selected to remain valid. In the event of a race condition, one will be selected at random.
Appears in:
- InferenceModel
Field | Description | Default | Validation |
---|---|---|---|
modelName string | ModelName is the name of the model as it will be set in the "model" parameter for an incoming request. ModelNames must be unique for a referencing InferencePool (names can be reused for a different pool in the same cluster). The modelName with the oldest creation timestamp is retained, and the incoming InferenceModel's Ready status is set to false with a corresponding reason. In the rare case of a race condition, one Model will be selected randomly to be considered valid, and the other rejected. Names can be reserved without an underlying model configured in the pool. This can be done by specifying a target model and setting the weight to zero; an error will be returned specifying that no valid target model is found. | | MaxLength: 256 Required: {} |
criticality Criticality | Criticality defines how important it is to serve the model compared to other models referencing the same pool. Criticality impacts how traffic is handled in resource-constrained situations. It handles this by queuing or rejecting requests of lower criticality. InferenceModels of an equivalent Criticality will fairly share resources over throughput of tokens. In the future, the metric used to calculate fairness, and the proportionality of fairness, will be configurable. Default values for this field will not be set, to allow for future additions of new fields that may 'one of' with this field. Any implementations that may consume this field may treat an unset value as the 'Standard' range. | | Enum: [Critical Standard Sheddable] |
targetModels TargetModel array | TargetModels allow multiple versions of a model for traffic splitting. If not specified, the target model name is defaulted to the modelName parameter. modelName is often in reference to a LoRA adapter. | | MaxItems: 10 |
poolRef PoolObjectReference | PoolRef is a reference to the inference pool; the pool must exist in the same namespace. | | Required: {} |
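The modelName uniqueness rule can be sketched as a small conflict-resolution pass. The `ModelRecord` type and `resolve_conflicts` function are hypothetical and do not describe how any controller is implemented. One deliberate simplification: the reference says a tie on creation timestamp is broken at random, whereas this sketch keeps whichever record appears first in the input.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ModelRecord:
    """Hypothetical flattened view of an InferenceModel for this sketch."""
    object_name: str   # ObjectMeta name
    model_name: str    # spec.modelName
    pool: str          # spec.poolRef.name
    created: datetime  # creation timestamp


def resolve_conflicts(records: list[ModelRecord]) -> tuple[list[ModelRecord], list[ModelRecord]]:
    """Split records into (accepted, rejected); the oldest wins per (pool, modelName)."""
    winners: dict[tuple[str, str], ModelRecord] = {}
    # sorted() is stable, so equal timestamps keep input order (the API picks randomly).
    for rec in sorted(records, key=lambda r: r.created):
        winners.setdefault((rec.pool, rec.model_name), rec)
    accepted = [r for r in records if winners[(r.pool, r.model_name)] is r]
    rejected = [r for r in records if winners[(r.pool, r.model_name)] is not r]
    return accepted, rejected


if __name__ == "__main__":
    older = datetime(2024, 1, 1, tzinfo=timezone.utc)
    newer = datetime(2024, 1, 2, tzinfo=timezone.utc)
    accepted, rejected = resolve_conflicts([
        ModelRecord("chat-new", "chat", "pool-a", newer),
        ModelRecord("chat-old", "chat", "pool-a", older),
        ModelRecord("chat-other-pool", "chat", "pool-b", newer),
    ])
    print([r.object_name for r in accepted], [r.object_name for r in rejected])
```

The key is the pair of pool and modelName because the same modelName may be reused in a different pool.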
InferenceModelStatus
InferenceModelStatus defines the observed state of InferenceModel.
Appears in:
- InferenceModel
Field | Description | Default | Validation |
---|---|---|---|
conditions Condition array | Conditions track the state of the InferenceModel. Known condition types are: "Accepted". | [map[lastTransitionTime:1970-01-01T00:00:00Z message:Waiting for controller reason:Pending status:Unknown type:Ready]] | MaxItems: 8 |
InferencePool
InferencePool is the Schema for the InferencePools API.
Field | Description | Default | Validation |
---|---|---|---|
apiVersion string | inference.networking.x-k8s.io/v1alpha2 | | |
kind string | InferencePool | | |
metadata ObjectMeta | Refer to Kubernetes API documentation for fields of metadata. | | |
spec InferencePoolSpec | | | |
status InferencePoolStatus | | | |
InferencePoolSpec
InferencePoolSpec defines the desired state of InferencePool.
Appears in:
- InferencePool
Field | Description | Default | Validation |
---|---|---|---|
selector object (keys:LabelKey, values:LabelValue) | Selector defines a map of labels to watch model server pods that should be included in the InferencePool. In some cases, implementations may translate this field to a Service selector, so this matches the simple map used for Service selectors instead of the full Kubernetes LabelSelector type. If specified, it will be applied to match the model server pods in the same namespace as the InferencePool. Cross-namespace selectors are not supported. | | Required: {} |
targetPortNumber integer | TargetPortNumber defines the port number to access the selected model servers. The number must be in the range 1 to 65535. | | Maximum: 65535 Minimum: 1 Required: {} |
extensionRef Extension | Extension configures an endpoint picker as an extension service. | | Required: {} |
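The selector semantics (a plain label map, matched only within the pool's own namespace) can be sketched as below. The pod dictionaries and the `select_pods` helper are hypothetical shapes chosen for the example, not a Kubernetes client API.

```python
from typing import Mapping, Sequence


def matches(selector: Mapping[str, str], labels: Mapping[str, str]) -> bool:
    """Equality-based match: every selector entry must be present on the pod."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_pods(
    selector: Mapping[str, str],
    pool_namespace: str,
    pods: Sequence[Mapping],
) -> list[str]:
    """Return names of pods in the pool's namespace whose labels match the selector."""
    return [
        pod["name"]
        for pod in pods
        # Cross-namespace selection is not supported, so filter on namespace first.
        if pod["namespace"] == pool_namespace and matches(selector, pod["labels"])
    ]


if __name__ == "__main__":
    pods = [
        {"name": "vllm-0", "namespace": "serving", "labels": {"app": "vllm", "tier": "gpu"}},
        {"name": "vllm-1", "namespace": "other", "labels": {"app": "vllm"}},
        {"name": "web-0", "namespace": "serving", "labels": {"app": "web"}},
    ]
    print(select_pods({"app": "vllm"}, "serving", pods))
```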
InferencePoolStatus
InferencePoolStatus defines the observed state of InferencePool.
Appears in:
- InferencePool
Field | Description | Default | Validation |
---|---|---|---|
parent PoolStatus array | Parents is a list of parent resources (usually Gateways) that are associated with the route, and the status of the InferencePool with respect to each parent. A maximum of 32 Gateways will be represented in this list. An empty list means the route has not been attached to any Gateway. | | MaxItems: 32 |
Kind
Underlying type: string
Kind refers to a Kubernetes Kind.
Valid values include:
- "Service"
- "HTTPRoute"
Invalid values include:
- "invalid/kind" - "/" is an invalid character
Validation:
- MaxLength: 63
- MinLength: 1
- Pattern: ^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$
Appears in:
- Extension
- ExtensionReference
- PoolObjectReference
LabelKey
Underlying type: string
LabelKey was originally copied from: https://github.com/kubernetes-sigs/gateway-api/blob/99a3934c6bc1ce0874f3a4c5f20cafd8977ffcb4/apis/v1/shared_types.go#L694-L731 It is duplicated so as not to take an unexpected dependency on the Gateway API.
LabelKey is the key of a label. This is used for validation of maps. This matches the Kubernetes "qualified name" validation that is used for labels. Labels are case sensitive, so my-label and My-Label are considered distinct.
Valid values include:
- example
- example.com
- example.com/path
- example.com/path.html
Invalid values include:
- example~ - "~" is an invalid character
- example.com. - cannot start or end with "."
Validation:
- MaxLength: 253
- MinLength: 1
- Pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?([A-Za-z0-9][-A-Za-z0-9_.]{0,61})?[A-Za-z0-9]$
Appears in:
- InferencePoolSpec
LabelValue
Underlying type: string
LabelValue is the value of a label. This is used for validation of maps. This matches the Kubernetes label validation rules:
- must be 63 characters or less (can be empty),
- unless empty, must begin and end with an alphanumeric character ([a-z0-9A-Z]),
- could contain dashes (-), underscores (_), dots (.), and alphanumerics between.
Valid values include:
- MyValue
- my.name
- 123-my-value
Validation:
- MaxLength: 63
- MinLength: 0
- Pattern: ^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$
Appears in:
- InferencePoolSpec
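Since LabelKey and LabelValue exist to validate the keys and values of the InferencePoolSpec selector map, a combined check might look like the sketch below. The `selector_errors` helper is hypothetical. The LabelKey pattern here uses a single backslash before the dot, on the assumption that the doubled backslash in the LabelKey listing is an escaping artifact; with a literal doubled backslash, the documented valid example `example.com/path` would not match.

```python
import re

# Patterns and length limits taken from the LabelKey and LabelValue Validation lists.
# Assumption: "\." (single backslash) is the intended form of the LabelKey prefix separator.
LABEL_KEY = re.compile(
    r"^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?"
    r"([A-Za-z0-9][-A-Za-z0-9_.]{0,61})?[A-Za-z0-9]$"
)
LABEL_VALUE = re.compile(r"^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$")


def selector_errors(selector: dict[str, str]) -> list[str]:
    """Return one message per invalid key or value; an empty list means valid."""
    errors = []
    for key, value in selector.items():
        if not (1 <= len(key) <= 253 and LABEL_KEY.fullmatch(key)):
            errors.append(f"invalid label key: {key!r}")
        if not (len(value) <= 63 and LABEL_VALUE.fullmatch(value)):
            errors.append(f"invalid label value: {value!r}")
    return errors


if __name__ == "__main__":
    print(selector_errors({"example.com/path": "MyValue", "example~": "my.name"}))
```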
ObjectName
Underlying type: string
ObjectName refers to the name of a Kubernetes object. Object names can have a variety of forms, including RFC 1123 subdomains, RFC 1123 labels, or RFC 1035 labels.
Validation:
- MaxLength: 253
- MinLength: 1
Appears in:
- Extension
- ExtensionReference
- PoolObjectReference
PoolObjectReference
PoolObjectReference identifies an API object within the namespace of the referrer.
Appears in:
- InferenceModelSpec
Field | Description | Default | Validation |
---|---|---|---|
group Group | Group is the group of the referent. | inference.networking.x-k8s.io | MaxLength: 253 Pattern: ^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$ |
kind Kind | Kind is the kind of the referent. For example "InferencePool". | InferencePool | MaxLength: 63 MinLength: 1 Pattern: ^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$ |
name ObjectName | Name is the name of the referent. | | MaxLength: 253 MinLength: 1 Required: {} |
PoolStatus
PoolStatus defines the observed state of InferencePool from a Gateway.
Appears in:
- InferencePoolStatus
Field | Description | Default | Validation |
---|---|---|---|
parentRef ObjectReference | GatewayRef indicates the gateway that observed the state of the InferencePool. | | |
conditions Condition array | Conditions track the state of the InferencePool. Known condition types are: "Accepted", "ResolvedRefs". | [map[lastTransitionTime:1970-01-01T00:00:00Z message:Waiting for controller reason:Pending status:Unknown type:Accepted]] | MaxItems: 8 |
PortNumber
Underlying type: integer
PortNumber defines a network port.
Validation:
- Maximum: 65535
- Minimum: 1
Appears in:
- Extension
- ExtensionReference
TargetModel
TargetModel represents a deployed model or a LoRA adapter. The Name field is expected to match the name of the LoRA adapter (or base model) as it is registered within the model server. Inference Gateway assumes that the model exists on the model server, and it is the responsibility of the user to validate a correct match. Should a model fail to exist at request time, the error is processed by the Inference Gateway and emitted on the appropriate InferenceModel object.
Appears in:
- InferenceModelSpec
Field | Description | Default | Validation |
---|---|---|---|
name string | Name is the name of the adapter or base model, as expected by the ModelServer. | | MaxLength: 253 Required: {} |
weight integer | Weight is used to determine the proportion of traffic that should be sent to this model when multiple target models are specified. Weight defines the proportion of requests forwarded to the specified model. This is computed as weight/(sum of all weights in this TargetModels list). For non-zero values, there may be some epsilon from the exact proportion defined here depending on the precision an implementation supports. Weight is not a percentage and the sum of weights does not need to equal 100. If a weight is set for any targetModel, it must be set for all targetModels. Conversely, weights are optional, so long as ALL targetModels do not specify a weight. | | Maximum: 1e+06 Minimum: 1 |
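The weight rule, weight/(sum of all weights in the TargetModels list), can be worked through with a short sketch. The `traffic_split` helper and the model names are hypothetical. Two behaviours are assumptions of this sketch rather than documented API behaviour: an even split when no weights are set, and raising `ValueError` for the "no valid target model" case when the weights sum to zero.

```python
from typing import Optional


def traffic_split(weights: dict[str, Optional[int]]) -> dict[str, float]:
    """Map each target model name to its share of traffic."""
    unset = [name for name, weight in weights.items() if weight is None]
    if unset and len(unset) != len(weights):
        # A weight set on any targetModel must be set on all of them.
        raise ValueError("weight must be set for all targetModels or for none")
    if unset:
        # Assumption: with no weights at all, split evenly across targets.
        return {name: 1 / len(weights) for name in weights}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no valid target model")
    # Weights are proportions, not percentages: share = weight / sum of weights.
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    print(traffic_split({"adapter-v1": 3, "adapter-v2": 1}))
```

With weights 3 and 1 the shares are 3/4 and 1/4, which shows that the weights need not sum to 100.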