Add Resource Driver concept

byako · windsonsea · elezar · byako · commit 77421cdb8965 · 2024-02-11T11:22:28.000+02:00
Co-authored-by: Michael <haifeng.yao@daocloud.io>
Co-authored-by: Evan Lezar <evanlezar@gmail.com>
Co-authored-by: Tim Bannister <tim@scalefactory.com>
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Dipesh Rawat <rawat.dipesh@gmail.com>
diff --git a/content/en/docs/concepts/extend-kubernetes/compute-storage-net/_index.md b/content/en/docs/concepts/extend-kubernetes/compute-storage-net/_index.md
@@ -42,3 +42,9 @@ fabric that links Pods together.
   Kubernetes {{< skew currentVersion >}} is compatible with {{< glossary_tooltip text="CNI" term_id="cni" >}}
   network plugins.
 
+* [Resource drivers](/docs/concepts/extend-kubernetes/compute-storage-net/resource-drivers/)
+
+  Resource drivers allow custom allocation logic for non-native cluster resources that are
+  difficult to represent with scalar values. They relieve the scheduler of the burden of
+  understanding these resources and of planning how Pods use them through ResourceClaims.
+
diff --git a/content/en/docs/concepts/extend-kubernetes/compute-storage-net/resource-drivers.md b/content/en/docs/concepts/extend-kubernetes/compute-storage-net/resource-drivers.md
@@ -0,0 +1,230 @@
+---
+title: DRA Resource Drivers
+description: Resource drivers provide non-trivial allocation logic and management for devices or resources that require vendor-specific or otherwise complex setup, such as GPUs, NICs, and FPGAs.
+content_type: concept
+weight: 10
+---
+
+<!-- overview -->
+{{< feature-state for_k8s_version="v1.27" state="alpha" >}}
+
+Kubernetes provides a
+[Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) (DRA)
+mechanism
+that can be leveraged to provide more complex hardware resources to workloads with custom resource
+accounting.
+
+Similarly to {{< glossary_tooltip term_id="device-plugin" text="device plugins">}}, instead of
+customizing the code for Kubernetes itself, vendors can implement a _resource driver_ that you deploy
+into the cluster to account for and control the allocation of GPUs, high-performance NICs, FPGAs,
+InfiniBand adapters, and other similar computing resources that may require vendor-specific
+initialization and setup.
+
+With device plugins, the scheduler is given a trivial, numerical representation of the resources
+available on a node, exposed as an extended resource, to consider during scheduling.
+
+With DRA, the scheduler offloads the task of allocating and accounting for non-native resources
+to the resource driver, which manages such resources in the cluster.
+
+A resource driver consists of two main components:
+
+- a _controller_ (one per cluster) that manages hardware resource allocation for
+  {{< glossary_tooltip term_id="resource-claim" text="ResourceClaims">}}
+- a _kubelet plugin_ (one per node that has or can access the associated resource) that:
+  - discovers the supported hardware
+  - announces the discovered hardware to the resource driver controller
+  - prepares the hardware allocated to a ResourceClaim when the kubelet prepares to create the Pod
+  - unprepares the hardware allocated for a Pod when the Pod has reached a final state or is being deleted
+
+There are two common ways to implement communication between the controller and a kubelet plugin:
+
+- through custom resource objects that use
+  {{< glossary_tooltip term_id="CustomResourceDefinition" text="CustomResourceDefinitions">}}
+  provided by the vendor or project behind the resource driver
+- through a ResourceHandle, which is part of the `AllocationResult` that the controller provides after a
+  successful allocation
+
+General recommendations:
+
+- resource driver name pattern: `<HW type>.resource.<companyname>.<companydomain>`. For example,
+  `gpu.resource.example.com`
+
+<!-- body -->
+
+## Implementing controller
+
+The scheduler and a resource driver communicate through the
+{{< glossary_tooltip term_id="pod-scheduling-context" text="PodSchedulingContext">}} API object. It is
+recommended to use the [controller helper library](/docs/reference/helper-libraries/dra-driver-controller/),
+which implements all the operations related to PodSchedulingContext that are common to all resource
+drivers.
+
+When using the DRA helper code, a resource driver controller has to implement the
+[driver interface](https://pkg.go.dev/k8s.io/dynamic-resource-allocation@v0.28.1/controller#Driver),
+which can then be used with a
+[controller helper instance](https://pkg.go.dev/k8s.io/dynamic-resource-allocation@v0.28.1/controller#New).
+
+See the [example resource driver controller](https://github.com/kubernetes-sigs/dra-example-driver/blob/151b7c8da2e620c47a3591e1a937f8d0297b0c25/cmd/dra-example-controller/main.go#L208) for details.
+
+A resource driver controller's main responsibility is to allocate and deallocate resources for
+{{< glossary_tooltip term_id="resource-claim" text="ResourceClaims">}}.
+
+A ResourceClaim can use one of two allocation modes:
+
+- `WaitForFirstConsumer` (default), which you could think of as meaning _delayed_.
+  In this mode, resources are only requested
+  for the ResourceClaim when a Pod that needs it is being scheduled.
+- `Immediate`: the resource has to be allocated to the ResourceClaim as soon as possible, and
+  is retained until the ResourceClaim is deleted.
+
+If `allocationMode` is not set explicitly, the mode is `WaitForFirstConsumer`.
+
+### Delayed allocation
+
+The controller helper code first calls _UnsuitableNodes_ so that the driver can report which of the candidate
+Nodes chosen by the scheduler are not suitable for allocating all the ResourceClaims that this resource driver
+handles. If no Node is suitable, the scheduler selects another batch of Node names, and _UnsuitableNodes_ is
+called again until a suitable Node is found.
+
+When at least one Node is found to be suitable for all ResourceClaims, the scheduler
+evaluates the suitable Nodes against the other Pod scheduling constraints (native resource
+requests, affinity, selectors, etc.), and picks exactly one Node.
+
+After the Node is selected, the controller helper code invokes the _Allocate_ call of the driver
+to do the actual resource allocation for the needed ResourceClaims on the selected Node.
+
+If the _Allocate_ call returns an error for any of the ResourceClaims, the helper code repeats the
+same call periodically until it succeeds.
+
+### Immediate allocation
+
+Immediate allocation has no selected Node, and it is up to the resource driver controller
+to select the most suitable Node based on the ResourceClaim, the ResourceClass, and their parameters.
+Therefore, in this scenario the helper library only calls `Allocate`, without calling `UnsuitableNodes`
+first.
+
+### Common calls for both allocation modes
+
+In both modes, allocation is preceded by retrieving the parameters objects for the ResourceClaims and
+ResourceClasses, to ensure that the resource driver can get these objects and understand them.
+
+## Sharing resources
+
+There are two main ways of sharing resources between Pods:
+- by using the same ResourceClaim in multiple Pods
+- by using the same underlying resource for different ResourceClaims
+
+### Shared ResourceClaims
+
+If the `Shareable` field is set to `true` in the AllocationResult for a ResourceClaim, the scheduler
+allows the same ResourceClaim to be used by up to 32 Pods by automatically updating the
+`Claim.Status.ReservedFor` field, without consulting the resource driver that allocated the resource
+for this ResourceClaim.
+
+### Internal accounting in resource driver
+
+The other way of sharing the same resource is to implement the sharing logic in the resource driver.
+This can be based on, for instance, a ResourceClass parameters field that specifies whether the
+resource driver should allocate the resource exclusively to one ResourceClaim, or whether the same
+resource can also be allocated to other ResourceClaims.
+
+## Implementing kubelet plugin
+
+It is recommended to use the
+[kubelet plugin helper library](/docs/reference/helper-libraries/dra-driver-kubelet-plugin/).
+
+The main purpose of a resource driver's kubelet plugin is to ensure that the ResourceClaims the Pod
+uses on a Node have all their resources ready for the Pod to use.
+
+### Example {#example-pod}
+
+Suppose a Kubernetes cluster is running a resource driver `gpu.resource.example.com` with the
+ResourceClass `gpu.example.com`. Here is an example of a Pod requesting this resource to run a demo
+workload:
+
+```yaml
+# gpu.resource.example.com GpuClaimParameters is an example extension API for parameters
+apiVersion: gpu.resource.example.com/v1alpha1
+kind: GpuClaimParameters
+metadata:
+  name: single-gpu
+spec:
+  count: 1
+---
+apiVersion: resource.k8s.io/v1alpha2
+kind: ResourceClaim
+metadata:
+  name: gpu-test
+spec:
+  resourceClassName: gpu.example.com
+  parametersRef:
+    apiGroup: gpu.resource.example.com
+    kind: GpuClaimParameters
+    name: single-gpu
+---
+apiVersion: v1
+kind: Pod
+metadata:
+  namespace: gpu-test4
+  name: pod0
+  labels:
+    app: pod
+spec:
+  containers:
+  - name: container1
+    image: ubuntu:22.04
+    command: ["bash", "-c"]
+    args: ["export; sleep 9999"]
+    resources:
+      claims:
+      - name: gpus
+  resourceClaims:
+  - name: gpus
+    source:
+      resourceClaimName: gpu-test
+# This Pod wants to use ResourceClaim gpu-test that needs 1 device of ResourceClass
+# gpu.example.com, handled by the gpu.resource.example.com resource driver.
+#
+# The resource driver allocates the resources required for that ResourceClaim and ensures that they are
+# ready to use; only then does the Pod start.
+```
+
+## Good practice for resource driver deployment {#resource-driver-deploy-tips}
+
+The recommended way to deploy a resource driver is as a Deployment for the controller part and a DaemonSet
+for the kubelet plugin part. It is also possible to deploy it as a package for your node's
+operating system, or manually.
+
+The kubelet uses a gRPC interface to interact with a resource driver's kubelet plugin. On the Kubernetes side,
+no special permissions are required for resource drivers.
+
+When you deploy a resource driver, you typically also define at least one ResourceClass using that driver.
+
+## API compatibility
+
+Kubernetes Dynamic Resource Allocation support is in alpha. The API may change in incompatible ways
+before stabilization. As a project, Kubernetes recommends that resource driver developers:
+
+* Watch for changes in future releases.
+* Support multiple versions of the resource driver API for backward/forward compatibility.
+
+If you enable the `DynamicResourceAllocation` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) and run associated kubelet plugins on nodes
+that need to be upgraded to a Kubernetes release with a newer DRA API version, upgrade your
+resource drivers to support both versions before upgrading these nodes. Taking that approach will
+ensure the continuous functioning of the device allocations during the upgrade.
+
+## DRA resource driver examples {#examples}
+
+{{% thirdparty-content %}}
+
+Here are some examples of resource driver implementations:
+
+* The [example resource driver](https://github.com/kubernetes-sigs/dra-example-driver)
+* The [Intel GPU resource driver](https://github.com/intel/intel-resource-drivers-for-kubernetes)
+* The [NVIDIA GPU resource driver](https://github.com/NVIDIA/k8s-dra-driver)
+
+
+## {{% heading "whatsnext" %}}
+
+* Learn about [creating your own DRA resource driver](https://www.youtube.com/watch?v=_fi9asserLE)
+* Discover the [example DRA resource driver](https://github.com/kubernetes-sigs/dra-example-driver)
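
The delayed allocation flow described in `resource-drivers.md` (filter out unsuitable Nodes, then allocate on the selected Node and repeat on error) can be sketched as follows. This is a minimal illustration, not the helper library: the `Driver` interface, `SuitableNodes`, `AllocateWithRetry`, and `fakeDriver` are hypothetical names that do not match the real `k8s.io/dynamic-resource-allocation` signatures, and the retry loop is bounded and omits the delay between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// Driver is a hypothetical, simplified stand-in for a resource driver
// controller. It is NOT the interface from k8s.io/dynamic-resource-allocation.
type Driver interface {
	// UnsuitableNodes returns the candidates that cannot satisfy the claims.
	UnsuitableNodes(claims, candidates []string) []string
	// Allocate reserves resources for the claims on the selected node.
	Allocate(claims []string, node string) error
}

// SuitableNodes keeps the candidates that the driver did not report as unsuitable.
func SuitableNodes(d Driver, claims, candidates []string) []string {
	unsuitable := make(map[string]bool)
	for _, n := range d.UnsuitableNodes(claims, candidates) {
		unsuitable[n] = true
	}
	var suitable []string
	for _, n := range candidates {
		if !unsuitable[n] {
			suitable = append(suitable, n)
		}
	}
	return suitable
}

// AllocateWithRetry repeats Allocate until it succeeds or maxAttempts is reached.
// It returns the number of attempts made and the last error, if any.
func AllocateWithRetry(d Driver, claims []string, node string, maxAttempts int) (int, error) {
	err := errors.New("no allocation attempt was made")
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = d.Allocate(claims, node); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}

// fakeDriver tracks free GPUs per node and fails the first `failures` Allocate calls.
type fakeDriver struct {
	freeGPUs map[string]int
	failures int
}

func (f *fakeDriver) UnsuitableNodes(claims, candidates []string) []string {
	var out []string
	for _, n := range candidates {
		if f.freeGPUs[n] < len(claims) { // one GPU per claim in this sketch
			out = append(out, n)
		}
	}
	return out
}

func (f *fakeDriver) Allocate(claims []string, node string) error {
	if f.failures > 0 {
		f.failures--
		return errors.New("transient allocation error")
	}
	if f.freeGPUs[node] < len(claims) {
		return errors.New("not enough free GPUs on " + node)
	}
	f.freeGPUs[node] -= len(claims)
	return nil
}

func main() {
	d := &fakeDriver{freeGPUs: map[string]int{"node-a": 0, "node-b": 2}, failures: 1}
	claims := []string{"gpu-test"}
	nodes := SuitableNodes(d, claims, []string{"node-a", "node-b"})
	fmt.Println("suitable nodes:", nodes)
	attempts, err := AllocateWithRetry(d, claims, nodes[0], 3)
	fmt.Println("attempts:", attempts, "error:", err)
}
```

In this sketch the scheduler's role (choosing one Node among the suitable ones) is reduced to taking the first suitable Node.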
diff --git a/content/en/docs/reference/_index.md b/content/en/docs/reference/_index.md
@@ -59,7 +59,7 @@ client libraries:
   a set of back-ends.
 * [kube-scheduler](/docs/reference/command-line-tools-reference/kube-scheduler/) -
   Scheduler that manages availability, performance, and capacity.
-  
+
   * [Scheduler Policies](/docs/reference/scheduling/policies)
   * [Scheduler Profiles](/docs/reference/scheduling/config#profiles)
 
@@ -92,7 +92,7 @@ operator to use or manage a cluster.
 * [kube-controller-manager configuration (v1alpha1)](/docs/reference/config-api/kube-controller-manager-config.v1alpha1/)
 * [kube-proxy configuration (v1alpha1)](/docs/reference/config-api/kube-proxy-config.v1alpha1/)
 * [`audit.k8s.io/v1` API](/docs/reference/config-api/apiserver-audit.v1/)
-* [Client authentication API (v1beta1)](/docs/reference/config-api/client-authentication.v1beta1/) and 
+* [Client authentication API (v1beta1)](/docs/reference/config-api/client-authentication.v1beta1/) and
   [Client authentication API (v1)](/docs/reference/config-api/client-authentication.v1/)
 * [WebhookAdmission configuration (v1)](/docs/reference/config-api/apiserver-webhookadmission.v1/)
 * [ImagePolicy API (v1alpha1)](/docs/reference/config-api/imagepolicy.v1alpha1/)
@@ -117,3 +117,9 @@ An archive of the design docs for Kubernetes functionality. Good starting points
 [Kubernetes Architecture](https://git.k8s.io/design-proposals-archive/architecture/architecture.md) and
 [Kubernetes Design Overview](https://git.k8s.io/design-proposals-archive).
 
+## Helper libraries
+
+### Dynamic resource allocation
+
+* [Resource driver controller](/docs/reference/helper-libraries/dra-driver-controller/)
+* [Resource driver kubelet plugin](/docs/reference/helper-libraries/dra-driver-kubelet-plugin/)
diff --git a/content/en/docs/reference/glossary/pod-scheduling-context.md b/content/en/docs/reference/glossary/pod-scheduling-context.md
@@ -0,0 +1,23 @@
+---
+id: pod-scheduling-context
+title: Pod Scheduling Context
+full_link: /docs/concepts/scheduling-eviction/dynamic-resource-allocation/
+date: 2023-11-28
+short_description: >
+ A short-lived object that kube-scheduler creates to coordinate with resource drivers on the
+ selection of a Node for a Pod that uses one or more ResourceClaims.
+
+related:
+ - kube-scheduler
+ - resource-claim
+ - resource-driver
+---
+
+ A [Pod Scheduling Context](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) is used
+ by kube-scheduler when a Pod needs ResourceClaims in order to be scheduled.
+
+<!--more-->
+
+Resource drivers and kube-scheduler communicate through ResourceClaim and PodSchedulingContext objects
+during scheduling. The Pod only gets a Node name assigned when all the ResourceClaims listed in the
+PodSchedulingContext are in status `Allocated` and are reserved (`ReservedFor`) for the Pod that is being scheduled.
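
The condition in the glossary entry above (a Pod gets a Node only when every ResourceClaim is allocated and reserved for it) can be expressed as a small predicate. This is a sketch under assumptions: `Claim` and `ReadyForNode` are hypothetical, simplified names and are not the `resource.k8s.io` API types.

```go
package main

import "fmt"

// Claim is a hypothetical, simplified view of a ResourceClaim's status.
type Claim struct {
	Name        string
	Allocated   bool
	ReservedFor []string // names of the Pods the claim is reserved for
}

// reservedFor reports whether the claim lists the given Pod as a consumer.
func reservedFor(c Claim, pod string) bool {
	for _, p := range c.ReservedFor {
		if p == pod {
			return true
		}
	}
	return false
}

// ReadyForNode reports whether every claim is allocated and reserved for the Pod.
func ReadyForNode(pod string, claims []Claim) bool {
	for _, c := range claims {
		if !c.Allocated || !reservedFor(c, pod) {
			return false
		}
	}
	return true
}

func main() {
	claims := []Claim{
		{Name: "gpu-test", Allocated: true, ReservedFor: []string{"pod0"}},
		{Name: "nic-test", Allocated: true}, // allocated, but not reserved yet
	}
	fmt.Println("before reservation:", ReadyForNode("pod0", claims))
	claims[1].ReservedFor = []string{"pod0"}
	fmt.Println("after reservation:", ReadyForNode("pod0", claims))
}
```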
diff --git a/content/en/docs/reference/glossary/resource-claim-parameters.md b/content/en/docs/reference/glossary/resource-claim-parameters.md
@@ -0,0 +1,19 @@
+---
+title: Resource Claim Parameters
+id: resource-claim-parameters
+date: 2023-10-16
+full_link: /docs/concepts/scheduling-eviction/dynamic-resource-allocation/
+short_description: >
+  Specification of which resources, and how much of them, the Resource Claim needs.
+aka:
+tags:
+- extension
+---
+ {{< glossary_tooltip term_id="resource-driver" text="Resource Driver">}}-specific object, subject
+to vendor implementation. Optional. Typically contains quantity and characteristics of the requested
+resources.
+
+<!--more-->
+
+Not part of core Kubernetes. Referenced in `ParametersRef` field of
+{{< glossary_tooltip term_id="resource-claim" text="Resource Claim">}}.
diff --git a/content/en/docs/reference/glossary/resource-claim.md b/content/en/docs/reference/glossary/resource-claim.md
@@ -0,0 +1,20 @@
+---
+title: Resource Claim
+id: resource-claim
+date: 2023-10-16
+full_link: /docs/concepts/scheduling-eviction/dynamic-resource-allocation/
+short_description: >
+  Defines what kind of resource is needed and what the parameters for it are.
+aka:
+tags:
+- core-object
+- fundamental
+---
+ Additional parameters are provided by a cluster admin in a
+{{< glossary_tooltip text="Resource Class" term_id="resource-class" >}}.
+
+<!--more-->
+
+Can reference
+{{< glossary_tooltip term_id="resource-claim-parameters" text="Resource Claim Parameters">}}
+with {{< glossary_tooltip term_id="resource-driver" text="Resource Driver">}}-specific details.
diff --git a/content/en/docs/reference/glossary/resource-class-parameters.md b/content/en/docs/reference/glossary/resource-class-parameters.md
@@ -0,0 +1,14 @@
+---
+title: Resource Class Parameters
+id: resource-class-parameters
+date: 2023-10-16
+full_link: /docs/concepts/scheduling-eviction/dynamic-resource-allocation/
+short_description: >
+  Details for Resource Driver on how to allocate resources.
+aka:
+tags:
+- extension
+---
+ {{< glossary_tooltip term_id="resource-driver" text="Resource Driver">}}-specific object that,
+when referenced in {{< glossary_tooltip term_id="resource-class" text="Resource Class">}}, provides
+details about how to allocate resources.
diff --git a/content/en/docs/reference/glossary/resource-class.md b/content/en/docs/reference/glossary/resource-class.md
@@ -0,0 +1,30 @@
+---
+title: Resource Class
+id: resource-class
+date: 2023-10-16
+full_link: /docs/concepts/scheduling-eviction/dynamic-resource-allocation/
+short_description: >
+  Describes the type of resources the Resource Driver can allocate.
+aka:
+tags:
+- core-object
+- fundamental
+---
+ Abstract object that links {{< glossary_tooltip term_id="resource-claim" text="Resource Claims">}}
+and {{< glossary_tooltip term_id="resource-driver" text="Resource Drivers">}}.
+
+<!--more-->
+
+When a Resource Claim needs resources allocated, its `resourceClassName` field indicates which
+Resource Class is used to initiate the allocation. The Resource Class contains the name of the driver
+that performs the allocation, in its `driverName` field, and optionally a reference to
+{{< glossary_tooltip term_id="resource-class-parameters" text="Resource Class Parameters">}}
+that further customize the allocation process for the Resource Driver.
+
+The same Resource Driver can be referenced in many Resource Classes. Typically, in such a case, the Resource
+Classes have different
+{{< glossary_tooltip term_id="resource-class-parameters" text="Resource Class Parameters">}}
+that tell the driver to do the allocation differently for each of them. For instance, one class can be
+used to allocate shared resources, and another to allocate resources exclusively.
+
+Typically managed by the cluster admin.
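
One way a driver might act on class parameters that distinguish shared from exclusive allocation is to keep its own per-device accounting. The sketch below assumes a hypothetical vendor-defined `ClassParameters` object with a `Shared` field; `Accounting` and its methods are likewise illustrative names and not part of any Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// ClassParameters is a hypothetical vendor-defined parameters object that a
// Resource Class could reference.
type ClassParameters struct {
	Shared bool // true: a device may back several ResourceClaims
}

// Accounting records which claims use each device and whether a device is
// currently held exclusively.
type Accounting struct {
	claims    map[string][]string
	exclusive map[string]bool
}

// NewAccounting returns an empty accounting table.
func NewAccounting() *Accounting {
	return &Accounting{claims: map[string][]string{}, exclusive: map[string]bool{}}
}

// Allocate assigns the device to the claim. A device that is already in use is
// refused if it is held exclusively or if the new request wants exclusivity.
func (a *Accounting) Allocate(device, claim string, p ClassParameters) error {
	users := a.claims[device]
	if len(users) > 0 && (a.exclusive[device] || !p.Shared) {
		return errors.New("device " + device + " is not available for " + claim)
	}
	a.claims[device] = append(users, claim)
	a.exclusive[device] = !p.Shared
	return nil
}

func main() {
	shared := ClassParameters{Shared: true}
	exclusive := ClassParameters{Shared: false}
	a := NewAccounting()
	fmt.Println(a.Allocate("gpu-0", "claim-a", shared))    // first user of gpu-0
	fmt.Println(a.Allocate("gpu-0", "claim-b", shared))    // shares gpu-0
	fmt.Println(a.Allocate("gpu-0", "claim-c", exclusive)) // refused: gpu-0 is in use
	fmt.Println(a.Allocate("gpu-1", "claim-c", exclusive)) // free device
	fmt.Println(a.Allocate("gpu-1", "claim-d", shared))    // refused: held exclusively
}
```

Two Resource Classes that name the same driver could then differ only in whether their parameters set `Shared`.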
diff --git a/content/en/docs/reference/glossary/resource-driver.md b/content/en/docs/reference/glossary/resource-driver.md
diff --git a/content/en/docs/reference/helper-libraries/_index.md b/content/en/docs/reference/helper-libraries/_index.md
diff --git a/content/en/docs/reference/helper-libraries/dra-driver-controller.md b/content/en/docs/reference/helper-libraries/dra-driver-controller.md
diff --git a/content/en/docs/reference/helper-libraries/dra-driver-kubelet-plugin.md b/content/en/docs/reference/helper-libraries/dra-driver-kubelet-plugin.md