Pod Placement

StorageOS can influence Kubernetes Pod placement so that Pods are scheduled on the same nodes as their data. This functionality is known as Pod Locality.

StorageOS grants access to data by presenting the devices used in a Pod’s VolumeMounts on local or remote nodes. However, it is often required or preferred to place the Pod on the node where the StorageOS Primary Volume is located, because IO operations are fastest there: network traffic and its associated latency are minimized, read operations are served locally, and writes require fewer round trips to the volume’s replicas.

StorageOS automatically enables the use of a custom scheduler for any Pod using StorageOS Volumes. Check out the Admission Controller reference for more information.

Locality modes

There are two modes available for setting Pod locality for StorageOS Volumes.

Preferred

The Pod SHOULD be placed alongside its data, i.e. on the node holding the master volume, if possible. Otherwise, it will be placed alongside a volume replica. If neither is possible, the Pod will start on another node and StorageOS will grant access to the data over the network.

Preferred mode is the default behaviour when using the StorageOS scheduler.
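
No configuration is needed for Preferred mode: any Pod that mounts a StorageOS-backed PersistentVolumeClaim is picked up by the StorageOS scheduler automatically. A minimal sketch, with illustrative names:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app                          # illustrative name
    spec:
      # schedulerName is set to storageos-scheduler automatically by the
      # StorageOS admission controller (see below); no field is needed here.
      containers:
        - name: app
          image: nginx
          volumeMounts:
            - name: data
              mountPath: /var/lib/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: storageos-pvc     # a PVC provisioned by StorageOS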

Strict

The Pod MUST be placed alongside its data, i.e. on a node with the master volume or a replica. If that is not possible, the Pod will remain in the Pending state until the constraint can be satisfied.

The aim of strict mode is to give the user the ability to guarantee best performance for applications. Some applications must deliver a certain level of performance, and for such applications strict co-location of application and data is essential.

For instance, when running Kafka Pods under heavy load, it may be better to leave a Pod unscheduled than to run it against a remote volume and have clients direct traffic at a cluster member that exhibits degraded performance.
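
As an illustration only, strict mode is requested through Pod metadata. The label key below is hypothetical; the exact syntax is documented on the examples reference page:

    apiVersion: v1
    kind: Pod
    metadata:
      name: kafka-0
      labels:
        # Hypothetical label key, for illustration only; see the examples
        # reference page for the real syntax.
        storageos.com/locality: strict
    spec:
      containers:
        - name: kafka
          image: kafka              # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: kafka-data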

For examples of how to set a mode for your Pods, check out the examples reference page.

StorageOS Kubernetes Scheduler

StorageOS achieves Pod locality by implementing a Kubernetes scheduler extender. The Kubernetes standard scheduler interacts with the StorageOS scheduler when placement decisions need to be made.

The Kubernetes standard scheduler selects a set of nodes for a placement decision based on nodeSelectors, affinity rules, etc. This list of nodes is sent to the StorageOS scheduler, which returns the target node on which the Pod will be placed.

The StorageOS scheduler logic is provided by a Pod in the Namespace where StorageOS Pods are running.
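
For context, this is how a scheduler extender is registered with a kube-scheduler instance. The snippet below is a sketch of the upstream KubeSchedulerConfiguration format with an illustrative service address; StorageOS sets this up for you as part of the storageos-scheduler deployment, so there is nothing to apply manually:

    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    extenders:
      - urlPrefix: "http://storageos-scheduler.storageos.svc"  # illustrative address
        filterVerb: filter            # POSTed to <urlPrefix>/filter
        prioritizeVerb: prioritize    # POSTed to <urlPrefix>/prioritize
        weight: 1
        nodeCacheCapable: false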

Scheduling process

When a Pod needs to be scheduled, the scheduler collects information about all available nodes and the requirements of the Pod. The collected data is then passed through the Filter phase, during which the scheduler predicates are applied to the node data to decide whether each node is compatible with the Pod’s requirements. The result of the Filter phase is a list of nodes that are compatible with the given Pod and a list of nodes that are not.

The list of compatible nodes is then passed to the Prioritize phase, in which the nodes are scored based on attributes such as node state. The result of the Prioritize phase is a list of nodes with their respective scores; more favorable nodes receive higher scores. The scheduler then uses this list to decide the final node on which to schedule the Pod.

Once a node has been selected, the third phase, Bind, binds the Pod to the chosen node through the Kubernetes apiserver. Once bound, the kubelet on that node starts the Pod.

The StorageOS scheduler implements the Filter and Prioritize phases and leaves binding to the default Kubernetes scheduler.

    Available         +------------------+                     +------------------+
  NodeList & Pod      |                  |  Filtered NodeList  |                  |    Scored
   Information        |                  |  & Pod Information  |                  |   NodeList
+-------------------->+      Filter      +-------------------->+    Prioritize    |--------------->
                      |   (Predicates)   |                     |   (Priorities)   |
                      |                  |                     |                  |
                      +------------------+                     +------------------+
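
Concretely, the two phases are exposed by the extender as HTTP endpoints. A sketch of the exchange, shaped after the upstream scheduler extender API (JSON on the wire, shown here as YAML with illustrative values):

    # Filter request: the Pod plus the nodes that passed the standard
    # scheduler's own predicates.
    pod: { metadata: { name: app, namespace: default } }
    nodenames: [node-1, node-2, node-3]

    # Filter response: the nodes compatible with the Pod, plus reasons
    # for the nodes that were rejected.
    nodenames: [node-1, node-2]
    failedNodes:
      node-3: "node is not running StorageOS"

    # Prioritize response: one score per remaining node, following the
    # scoring rules listed below.
    - host: node-1    # holds the master volume
      score: 15
    - host: node-2    # holds a replica
      score: 10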

Scheduling Rules

The StorageOS scheduler filters nodes so that the remaining subset fulfills the following prerequisites:

  • The node is running StorageOS
  • The node is healthy
  • The node is not StorageOS Cordoned
  • The node is not in a StorageOS Drained state
  • The node is not a StorageOS compute-only node (see the example after this list)
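
For example, a compute-only node (one that consumes StorageOS Volumes but stores no data) is marked with a node label, which the filter then uses to exclude it. A sketch, assuming the storageos.com/computeonly label; check the node configuration docs for your release:

    apiVersion: v1
    kind: Node
    metadata:
      name: node-3
      labels:
        # Assumed label for a StorageOS compute-only node; such a node
        # holds no volume data, so the StorageOS scheduler filters it out.
        storageos.com/computeonly: "true"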

Once the nodes are filtered, they are scored as follows:

  1. Node with master volume - 15 points
  2. Node with replica volume - 10 points
  3. Node with no master or replica volume - 5 points
  4. Node with unhealthy volume or unsynced replica - 1 point

Admission Controller

StorageOS implements an admission controller that ensures that any Pod using StorageOS Volumes is scheduled by the StorageOS Scheduler. This makes the use of the scheduler transparent to the user. Check the reference page to see how to alter this behaviour.

The Admission Controller is based on admission webhooks, so no custom admission plugins need to be enabled when bootstrapping your Kubernetes cluster. Admission webhooks are HTTP callbacks that receive admission requests and act on them. The StorageOS Cluster Operator serves the admission webhook: when a Pod is being created, the Operator mutates the Pod’s spec.schedulerName field, setting it to storageos-scheduler.
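
The net effect on a Pod spec is a single field. After admission, a Pod that uses a StorageOS Volume carries the following (all other details illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: app
    spec:
      schedulerName: storageos-scheduler   # injected by the admission webhook
      containers:
        - name: app
          image: nginx
          volumeMounts:
            - name: data
              mountPath: /var/lib/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: storageos-pvc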