Troubleshooting

Pod in pending because of mount error

Issue:

The output of oc describe pod $POD_ID contains the message "no such file or directory" and references the StorageOS volume device file.

root@node:~# oc describe pod $POD_ID
(...)
Events:
  (...)
  Normal   Scheduled         11s                default-scheduler  Successfully assigned default/d1 to node3
  Warning  FailedMount       4s (x4 over 9s)    kubelet, node3     MountVolume.SetUp failed for volume "pvc-f2a49198-c00c-11e8-ba01-0800278dc04d" : stat /var/lib/storageos/volumes/d9df3549-26c0-4cfc-62b4-724b443069a1: no such file or directory

Reason:

Mount propagation is not enabled.

Doublecheck:

SSH into one of the nodes and check whether /var/lib/storageos/volumes is empty. If it is, exec into any StorageOS pod and check the same directory.

root@node:~# ls /var/lib/storageos/volumes/
root@node:~# 
root@node:~# oc exec $POD_ID -c storageos -- ls /var/lib/storageos/volumes
bst-196004
d529b340-0189-15c7-f8f3-33bfc4cf03fa
ff537c5b-e295-e518-a340-0b6308b69f74

If the volume device files are visible inside the container but the directory on the host is empty, disabled mount propagation is the cause.

Solution:

Enable mount propagation for both OpenShift and Docker, following the prerequisites page.
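As a quick host-side check, the sketch below classifies Docker's systemd MountFlags value. The helper name is illustrative, and the commented-out commands assume a default systemd/Docker install:

```shell
# Helper (hypothetical): decide from Docker's "MountFlags" systemd property
# whether mount propagation is blocked for containers.
mountflags_ok() {
  case "$1" in
    *MountFlags=slave*|*MountFlags=private*) echo "blocked" ;;
    *)                                       echo "ok" ;;
  esac
}

# On a node you would feed it the live value, e.g.:
#   mountflags_ok "$(systemctl show docker --property=MountFlags)"
# and also confirm the data directory sits in a shared mount:
#   findmnt -o TARGET,PROPAGATION /var/lib/storageos/volumes
mountflags_ok "MountFlags=slave"   # -> blocked
```

If the helper reports "blocked", clear MountFlags in the Docker unit and restart Docker as described on the prerequisites page.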

PVC pending state - Failed to dial StorageOS

A newly created PVC remains in the Pending state, so pods that need to mount it cannot start.

Issue:

root@node:~# oc get pvc
NAME      STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
vol-1     Pending                                                                            fast           7s

oc describe pvc $PVC
(...)
Events:
  Type     Reason              Age               From                         Message
  ----     ------              ----              ----                         -------
  Warning  ProvisioningFailed  7s (x2 over 18s)  persistentvolume-controller  Failed to provision volume with StorageClass "fast": Get http://storageos-cluster/version: failed to dial all known cluster members, (10.233.59.206:5705)

Reason:

For non-CSI installations of StorageOS, OpenShift uses the StorageOS API endpoint to communicate. If that communication fails, actions such as creating or mounting a volume cannot be transmitted to StorageOS, so the PVC will remain in the Pending state. StorageOS never received the request, so it never sent back an acknowledgement.

In this case, the Event message indicates that the StorageOS API is not responding, which implies that StorageOS is not running. For OpenShift to mark the StorageOS pods as Ready, their health checks must pass.

Doublecheck:

Check the status of StorageOS pods.

root@node:~# oc -n storageos get pod --selector app=storageos # for CSI add --selector kind=daemonset
NAME              READY     STATUS    RESTARTS   AGE
storageos-qrqkj   0/1       Running   0          1m
storageos-s4bfv   0/1       Running   0          1m
storageos-vcpfx   0/1       Running   0          1m
storageos-w98f5   0/1       Running   0          1m

If the pods are not READY, the Service will not forward traffic to the API they serve, so the PVC will remain in the Pending state until the StorageOS pods become available.
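The READY column can also be checked mechanically. The ready_pods helper below is hypothetical (not part of oc); it counts pods whose ready-container count matches their total:

```shell
# Count pods whose READY column reads n/n (all containers ready).
ready_pods() {
  awk 'NR > 1 { split($2, a, "/"); if (a[1] != "" && a[1] == a[2]) n++ }
       END { print n + 0 }'
}

# Example with output like the listing above:
printf '%s\n' \
  'NAME              READY     STATUS    RESTARTS   AGE' \
  'storageos-qrqkj   0/1       Running   0          1m' \
  'storageos-s4bfv   1/1       Running   0          1m' |
  ready_pods   # -> 1

# Live usage: oc -n storageos get pod --selector app=storageos | ready_pods
```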

OpenShift keeps retrying the action until it succeeds, so if a PVC is created before StorageOS finishes starting, the PVC will eventually be provisioned.
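Another double-check is to probe the API endpoint from the Event message directly. This is a sketch: the address and port 5705 come from the error above, while the /v1/health path is an assumption — substitute whichever endpoint your install exposes:

```shell
# Report whether an HTTP endpoint answers within 2 seconds.
api_reachable() {
  if curl -fs --max-time 2 "$1" > /dev/null 2>&1; then
    echo up
  else
    echo down
  fi
}

# Endpoint taken from the Event above; /v1/health is an assumed health path:
api_reachable "http://10.233.59.206:5705/v1/health"
```

"down" here is consistent with the ProvisioningFailed event: the API is not reachable, so provisioning cannot proceed.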

Solution:

  • The StorageOS health check allows a 60-second grace period before pods report as READY. If StorageOS is starting properly, the volume will be created once StorageOS finishes its bootstrap.
  • If StorageOS is not running or is not starting properly, the solution is to make StorageOS start. Check the troubleshooting installation section or follow the installation procedures.

PVC pending state - Secret Missing

A newly created PVC remains in the Pending state, so pods that need to mount it cannot start.

Issue:

oc describe pvc $PVC
(...)
Events:
  Type     Reason              Age                From                         Message
  ----     ------              ----               ----                         -------
  Warning  ProvisioningFailed  13s (x2 over 28s)  persistentvolume-controller  Failed to provision volume with StorageClass "fast": failed to get secret from ["storageos"/"storageos-api"]

Reason:

For non-CSI installations of StorageOS, OpenShift uses the StorageOS API endpoint to communicate. If that communication fails, actions such as creating or mounting a volume cannot be transmitted to StorageOS, so the PVC will remain in the Pending state. StorageOS never received the request, so it never sent back an acknowledgement.

The StorageClass provisioned for StorageOS references a Secret from which it retrieves the API endpoint and the authentication parameters. If that Secret is incorrect or missing, the connection won't be established. It is common for the Secret to have been deployed in a different namespace than the one the StorageClass expects, or with a different name.

Doublecheck:

  1. Check the StorageClass parameters to know where the Secret is expected to be found.

     oc get storageclass fast -o yaml
     apiVersion: storage.k8s.io/v1
     kind: StorageClass
     metadata:
       creationTimestamp: 2018-09-25T08:44:57Z
       labels:
         app: storageos
       name: fast
       resourceVersion: "108853"
       selfLink: /apis/storage.k8s.io/v1/storageclasses/fast
       uid: 48490a9b-c09f-11e8-ba01-0800278dc04d
     parameters:
       adminSecretName: storageos-api
       adminSecretNamespace: storageos
       description: Kubernetes volume
       fsType: ext4
       pool: default
     provisioner: kubernetes.io/storageos
     reclaimPolicy: Delete
    

    Note that the parameters specify adminSecretName and adminSecretNamespace.

  2. Check if the secret exists according to those parameters

     oc -n storageos get secret storageos-api
     No resources found.
     Error from server (NotFound): secrets "storageos-api" not found
    

    If no resources are found, the Secret doesn't exist or is not deployed in the expected location.
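The two checks above can be chained so the lookup always uses whatever the StorageClass declares. The secret_ref helper is illustrative, and the jsonpath expressions assume the non-CSI adminSecret* parameters shown above:

```shell
# Join a namespace and a name into the ns/name form seen in the error message.
secret_ref() {
  printf '%s/%s\n' "$1" "$2"
}

# Live usage against the cluster:
#   ns=$(oc get storageclass fast -o jsonpath='{.parameters.adminSecretNamespace}')
#   name=$(oc get storageclass fast -o jsonpath='{.parameters.adminSecretName}')
#   echo "StorageClass fast expects Secret $(secret_ref "$ns" "$name")"
#   oc -n "$ns" get secret "$name"
secret_ref storageos storageos-api   # -> storageos/storageos-api
```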

Solution:

Deploy StorageOS following the installation procedures. If you are using the manifests provided for OpenShift to deploy StorageOS rather than using automated provisioners, make sure that the StorageClass parameters and the Secret reference match.
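For reference, a minimal Secret sketch for a non-CSI install is shown below. The key names (apiAddress, apiUsername, apiPassword) and all values are assumptions — take the real keys and credentials from the installation procedure, and make sure the name and namespace match the StorageClass parameters:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: storageos-api    # must match adminSecretName in the StorageClass
  namespace: storageos   # must match adminSecretNamespace
  labels:
    app: storageos
stringData:              # assumed keys and example values; see the install docs
  apiAddress: tcp://storageos:5705
  apiUsername: storageos
  apiPassword: storageos
```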