Troubleshooting
This section is aimed to help you troubleshoot issues in your cluster, whether they are related to the StorageOS installation, integration with orchestrators or common misconfigurations.
Tools
To be able to troubleshoot issues the StorageOS cli is required.
Pod in pending because of mount error
Issue:
The output of kubectl describe pod $POD_ID
contains no such file or directory
and references the StorageOS volume device file.
root@node1:~# kubectl -n kube-system describe $POD_ID
(...)
Events:
(...)
Normal Scheduled 11s default-scheduler Successfully assigned default/d1 to node3
Warning FailedMount 4s (x4 over 9s) kubelet, node3 MountVolume.SetUp failed for volume "pvc-f2a49198-c00c-11e8-ba01-0800278dc04d" : stat /var/lib/storageos/volumes/d9df3549-26c0-4cfc-62b4-724b443069a1: no such file or directory
Reason:
There are two main reasons this issue may arise:
- The StorageOS
DEVICE_DIR
location is wrongly configured when using Kubelet as a container - Mount Propagation is not enabled
(Option 1) Misconfiguration of the DeviceDir/SharedDir
Some Kubernetes distributions such as Rancher, DockerEE or some installations of OpenShift deploy the Kubelet as a container, because of this, the device files that StorageOS creates to mount into the containers need to be visible to the kubelet. StorageOS can be configured to share the device directory.
Modern installations use CSI, which handles the complexity internally.
Assert:
root@node1:~# kubectl -n default describe stos | grep "Shared Dir"
Shared Dir: # <-- Shouldn't be blank
Solution:
The Cluster Operator Custom Definition should specify the SharedDir option as follows.
spec:
sharedDir: '/var/lib/kubelet/plugins/kubernetes.io~storageos' # Needed when Kubelet as a container
...
See example on how to configure the StorageOS Custom Resource.
(Option 2) Mount propagation is not enabled.
Applies only if Option 1 is configured properly.
Assert:
If not using the Kubelet as a container, SSH into one of the nodes and check if
/var/lib/storageos/volumes
is empty. If so, exec into any StorageOS pod and
check the same directory.
root@node1:~# ls /var/lib/storageos/volumes/
root@node1:~# # <-- Shouldn't be blank
root@node1:~# kubectl exec $POD_ID -c storageos -- ls -l /var/lib/storageos/volumes
bst-196004
d529b340-0189-15c7-f8f3-33bfc4cf03fa
ff537c5b-e295-e518-a340-0b6308b69f74
If the directory inside the container and the device files are visible, disabled mount propagation is the cause.
If using the Kubelet as a container, SSH into one of the nodes and check if
/var/lib/kubelet/plugins/kubernetes.io~storageos/devices
is empty. If
so, exec into any StorageOS pod and check the same directory.
root@node1:~# ls /var/lib/kubelet/plugins/kubernetes.io~storageos/devices
root@node1:~# # <-- Shouldn't be blank
root@node1:~# kubectl exec $POD_ID -c storageos -- ls -l /var/lib/kubelet/plugins/kubernetes.io~storageos/devices
bst-196004
d529b340-0189-15c7-f8f3-33bfc4cf03fa
ff537c5b-e295-e518-a340-0b6308b69f74
If the directory inside the container and the device files are visible, disabled mount propagation is the cause.
Solution:
Older versions of Kubernetes need to enable mount propagation as it is not enabled by default. Most Kubernetes distributions allow MountPropagation to be enabled using FeatureGates. Rancher specifically, needs to enable it in the “View in API” section of your cluster. You need to edit the section “rancherKubernetesEngineConfig” to enable the Kubelet feature gate.
PVC pending state - Failed to dial StorageOS
A created PVC remains in pending state making pods that need to mount that PVC unable to start.
Issue:
root@node1:~/# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
vol-1 Pending fast 7s
kubectl describe pvc $PVC
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 7s (x2 over 18s) persistentvolume-controller Failed to provision volume with StorageClass "fast": Get http://storageos-cluster/version: failed to dial all known cluster members, (10.233.59.206:5705)
Reason:
For non CSI installations of StorageOS, Kubernetes uses the StorageOS API endpoint to communicate. If that communication fails, relevant actions such as create or mount volume can’t be transmitted to StorageOS, hence the PVC will remain in pending state. StorageOS never received the action to perform, so it never sent back an acknowledgement.
In this case, the Event message indicates that StorageOS API is not responding, implying that StorageOS is not running. For Kubernetes to define StorageOS pods ready, the health check must pass.
Assert:
Check the status of StorageOS pods.
root@node1:~/# kubectl -n kube-system get pod --selector app=storageos # for CSI add --selector kind=daemonset
NAME READY STATUS RESTARTS AGE
storageos-qrqkj 0/1 Running 0 1m
storageos-s4bfv 0/1 Running 0 1m
storageos-vcpfx 0/1 Running 0 1m
storageos-w98f5 0/1 Running 0 1m
If the pods are not READY, the service will not forward traffic to the API they serve hence PVC will remain in pending state until StorageOS pods are available.
Kubernetes keeps trying to execute the action until it succeeds. If a PVC is created before StorageOS finish starting, the PVC will be created eventually.
Solution:
- StorageOS health check takes 60 seconds of grace before reporting as READY. If StorageOS is starting properly after that period, the volume will be created when StorageOS finishes its bootstrap.
- If StorageOS is not running or is not starting properly, the solution would be to troubleshoot the installation.
PVC pending state - Secret Missing
A created PVC remains in pending state making pods that need to mount that PVC unable to start.
Issue:
kubectl describe pvc $PVC
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 13s (x2 over 28s) persistentvolume-controller Failed to provision volume with StorageClass "fast": failed to get secret from ["storageos"/"storageos-api"]
Reason:
For non CSI installations of StorageOS, Kubernetes uses the StorageOS API endpoint to communicate. If that communication fails, relevant actions such as create or mount a volume can’t be transmitted to StorageOS, and the PVC will remain in pending state. StorageOS never received the action to perform, so it never sent back an acknowledgement.
The StorageClass provisioned for StorageOS references a Secret from where it retrieves the API endpoint and the authentication parameters. If that secret is incorrect or missing, the connections won’t be established. It is common to see that the Secret has been deployed in a different namespace where the StorageClass expects it or that is has been deployed with a different name.
Assert:
-
Check the StorageClass parameters to know where the Secret is expected to be found.
kubectl get storageclass fast -o yaml apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: creationTimestamp: 2018-09-25T08:44:57Z labels: app: storageos name: fast resourceVersion: "108853" selfLink: /apis/storage.k8s.io/v1/storageclasses/fast uid: 48490a9b-c09f-11e8-ba01-0800278dc04d parameters: adminSecretName: storageos-api adminSecretNamespace: storageos description: Kubernetes volume fsType: ext4 provisioner: kubernetes.io/storageos reclaimPolicy: Delete
Note that the parameters specify
adminSecretName
andadminSecretNamespace
. -
Check if the secret exists according to those parameters
kubectl -n kube-system get secret storageos-api No resources found. Error from server (NotFound): secrets "storageos-api" not found
If no resources are found, it is clear that the Secret doesn’t exist or it is not deployed in the right location.
Solution:
Deploy StorageOS following the installation procedures. If you are using the manifests provided for Kubernetes to deploy StorageOS rather than using automated provisioners, make sure that the StorageClass parameters and the Secret reference match.
Node name different from Hostname
Issue:
StorageOS nodes can’t join the cluster showing the following log entries.
time="2018-09-24T13:47:02Z" level=error msg="failed to start api" error="error verifying UUID: UUID aed3275f-846b-1f75-43a1-adbfec8bf974 has already been registered and has hostname 'debian-4', not 'node4'" module=command
Reason:
The StorageOS registration process to start the cluster uses the hostname of the node where the StorageOS container is running, provided by the Kubelet. However, StorageOS verifies the network hostname of the OS as a prestart check to make sure it can communicate with other nodes. If those names don’t match, StorageOS will be unable to start.
Solution:
Make sure the hostnames match with the Kubernetes advertised names. If you have changed the hostname of your nodes, make sure that you restart the nodes to apply the change.
Peer discovery - Networking
Issue:
StorageOS nodes can’t join the cluster showing the following logs after one minute of container uptime.
time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address=172.28.128.5 category=etcd host=node3 module=cp target=172.28.128.6
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="172.28.128.3,172.28.128.4,172.28.128.5,172.28.128.6" error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp
Reason:
StorageOS uses a gossip protocol to discover nodes in the cluster. When
StorageOS starts, one or more nodes can be referenced so new nodes can query
existing nodes for the list of members. This error indicates that the node
can’t connect to any of the nodes in the known list. The known list is defined
in the JOIN
variable.
Assert:
It is likely that ports are block by a firewall.
SSH into one of your nodes and check connectivity to the rest of the nodes.
# Successfull execution:
[root@node06 ~]# nc -zv node04 5705
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.1.166:5705.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
StorageOS exposes network diagnostics in its API, viewable from the CLI. To use this feature, the CLI must query the API of a running node. The diagnostics show information from all known cluster members. If all the ports are blocked during the first bootstrap of the cluster, the diagnostics won’t show any data as nodes couldn’t register.
StorageOS networks diagnostics are available for storageos-rc5 and storageos-cli-rc3 and above.
# Example:
root@node1:~# storageos cluster connectivity
SOURCE NAME ADDRESS LATENCY STATUS MESSAGE
node4 node2.nats 172.28.128.4:5708 1.949275ms OK
node4 node3.api 172.28.128.5:5705 3.070574ms OK
node4 node3.nats 172.28.128.5:5708 2.989238ms OK
node4 node2.directfs 172.28.128.4:5703 2.925707ms OK
node4 node3.etcd 172.28.128.5:5707 2.854726ms OK
node4 node3.directfs 172.28.128.5:5703 2.833371ms OK
node4 node1.api 172.28.128.3:5705 2.714467ms OK
node4 node1.nats 172.28.128.3:5708 2.613752ms OK
node4 node1.etcd 172.28.128.3:5707 2.594159ms OK
node4 node1.directfs 172.28.128.3:5703 2.601834ms OK
node4 node2.api 172.28.128.4:5705 2.598236ms OK
node4 node2.etcd 172.28.128.4:5707 16.650625ms OK
node3 node4.nats 172.28.128.6:5708 1.304126ms OK
node3 node4.api 172.28.128.6:5705 1.515218ms OK
node3 node2.directfs 172.28.128.4:5703 1.359827ms OK
node3 node1.api 172.28.128.3:5705 1.185535ms OK
node3 node4.directfs 172.28.128.6:5703 1.379765ms OK
node3 node1.etcd 172.28.128.3:5707 1.221176ms OK
node3 node1.nats 172.28.128.3:5708 1.330122ms OK
node3 node2.api 172.28.128.4:5705 1.238541ms OK
node3 node1.directfs 172.28.128.3:5703 1.413574ms OK
node3 node2.etcd 172.28.128.4:5707 1.214273ms OK
node3 node2.nats 172.28.128.4:5708 1.321145ms OK
node1 node4.directfs 172.28.128.6:5703 1.140797ms OK
node1 node3.api 172.28.128.5:5705 1.089252ms OK
node1 node4.api 172.28.128.6:5705 1.178439ms OK
node1 node4.nats 172.28.128.6:5708 1.176648ms OK
node1 node2.directfs 172.28.128.4:5703 1.529612ms OK
node1 node2.etcd 172.28.128.4:5707 1.165681ms OK
node1 node2.api 172.28.128.4:5705 1.29602ms OK
node1 node2.nats 172.28.128.4:5708 1.267454ms OK
node1 node3.nats 172.28.128.5:5708 1.485657ms OK
node1 node3.etcd 172.28.128.5:5707 1.469429ms OK
node1 node3.directfs 172.28.128.5:5703 1.503015ms OK
node2 node4.directfs 172.28.128.6:5703 1.484ms OK
node2 node1.directfs 172.28.128.3:5703 1.275304ms OK
node2 node4.nats 172.28.128.6:5708 1.261422ms OK
node2 node4.api 172.28.128.6:5705 1.465532ms OK
node2 node3.api 172.28.128.5:5705 1.252768ms OK
node2 node3.nats 172.28.128.5:5708 1.212332ms OK
node2 node3.directfs 172.28.128.5:5703 1.192792ms OK
node2 node3.etcd 172.28.128.5:5707 1.270076ms OK
node2 node1.etcd 172.28.128.3:5707 1.218522ms OK
node2 node1.api 172.28.128.3:5705 1.363071ms OK
node2 node1.nats 172.28.128.3:5708 1.349383ms OK
Solution:
Open ports following the prerequisites page.
Peer discovery - Pod allocation
Issue:
StorageOS nodes can’t join the cluster and show the following log entries.
time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address=172.28.128.5 category=etcd host=node3 module=cp target=172.28.128.6
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="172.28.128.3,172.28.128.4,172.28.128.5,172.28.128.6" error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp
Reason:
StorageOS uses a gossip protocol to discover the nodes in the cluster. When
StorageOS starts, one or more active nodes must be referenced so new nodes can
query existing nodes for the list of members. This error indicates that the node
can’t connect to any of the nodes in the known list. The known list is defined
in the JOIN
variable.
If there are no active StorageOS nodes, the bootstrap process will elect the
first node in the JOIN
variable as master, and the rest will try to
discover from it. In case of that node not starting, the whole cluster will
remain unable to bootstrap.
Installations of StorageOS use a DaemonSet, and by default do not schedule
StorageOS pods to master nodes, due to the presence of the
node-role.kubernetes.io/master:NoSchedule
taint that is typically present. In
such cases the JOIN
variable must not contain master nodes or the StorageOS
cluster will remain unable to start.
Assert:
Check that the first node of the JOIN
variable started properly.
root@node1:~/# kubectl -n kube-system describe ds/storageos | grep JOIN
JOIN: 172.28.128.3,172.28.128.4,172.28.128.5
root@node1:~/# kubectl -n kube-system get pod -o wide | grep 172.28.128.3
storageos-8zqxl 1/1 Running 0 2m 172.28.128.3 node1
Solution:
Make sure that the JOIN
variable doesn’t specify the master nodes. In case
you are using the discovery service, it is necessary to ensure that the
DaemonSet won’t allocate Pods on the masters. This can be achieved with taints,
node selectors or labels.
For installations with the StorageOS operator you can specify which nodes to deploy StorageOS on using nodeSelectors. See examples in the Cluster Operator Examples page.
For more advanced installations using compute-only and storage nodes, check the
storageos.com/deployment=computeonly
label that can be added to the nodes
through Kubernetes node labels, or StorageOS in the Labels page.
LIO Init:Error
Issue:
StorageOS pods not starting with Init:Error
kubectl -n kube-system get pod
NAME READY STATUS RESTARTS AGE
storageos-2kwqx 0/3 Init:Err 0 6s
storageos-cffcr 0/3 Init:Err 0 6s
storageos-d4f69 0/3 Init:Err 0 6s
storageos-nhq7m 0/3 Init:Err 0 6s
Reason:
This indicates that since the Linux open source SCSI drivers are not enabled, StorageOS cannot start. The StorageOS DaemonSet enables the required kernel modules on the host system. If you are seeing these errors it is because that container couldn’t load the modules.
Assert
Check the logs of the init container.
kubectl -n kube-system logs $ANY_STORAGEOS_POD -c storageos-init
In case of failure, it will show the following output, indicating which kernel modules couldn’t be loaded or that they are not properly configured:
Checking configfs
configfs mounted on sys/kernel/config
Module target_core_mod is not running
executing modprobe -b target_core_mod
Module tcm_loop is not running
executing modprobe -b tcm_loop
modprobe: FATAL: Module tcm_loop not found.
Solution:
Install the required kernel modules (usually found in the
linux-image-extra-$(uname -r)
package of your distribution) on your nodes
following this prerequisites page and delete StorageOS
pods, allowing the DaemonSet to create the pods again.
LIO not enabled
Issue:
StorageOS node can’t start and shows the following log entries.
time="2018-09-24T14:34:40Z" level=error msg="liocheck returned error" category=liocheck error="exit status 1" module=dataplane stderr="Sysfs root '/sys/kernel/config/target' is missing, is kernel configfs present and target_core_mod loaded? category=fslio level=warn\nRuntime error checking stage 'target_core_mod': SysFs root missing category=fslio level=warn\nliocheck: FAIL (lio_capable_system() returns failure) category=fslio level=fatal\n" stdout=
time="2018-09-24T14:34:40Z" level=error msg="failed to start dataplane services" error="system dependency check failed: exit status 1" module=command
Reason:
This indicates that one or more kernel modules required for StorageOS are not loaded.
Assert
The following kernel modules must be enabled in the host.
lsmod | egrep "^tcm_loop|^target_core_mod|^target_core_file|^configfs"
Solution:
Install the required kernel modules (usually found in the
linux-image-extra-$(uname -r)
package of your distribution) on your nodes
following this prerequisites page and restart the container.
(OpenShift) StorageOS pods missing – DaemonSet error
StorageOS DaemonSet doesn’t have any pod replicas. The DaemonSet couldn’t allocate any Pod due to security issues.
Issue:
[root@master02 standard]# oc get pod
No resources found.
[root@master02 standard]# oc describe daemonset storageos
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 0s (x12 over 10s) daemonset-controller Error creating: pods "storageos-" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used provider restricted: .spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed capabilities.add: Invalid value: "SYS_ADMIN": capability may not be added spec.initContainers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.initContainers[0].securityContext.containers[0].hostPort: Invalid value: 5705: Host ports are not allowed to be used spec.initContainers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed capabilities.add: Invalid value: "SYS_ADMIN": capability may not be added spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.containers[0].hostPort: Invalid value: 5705: Host ports are not allowed to be used spec.containers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used]
Reason:
The OpenShift cluster has security context constraint policies enabled that forbid any pod, without an explicitly set policy for the service account, to be allocated.
Assert:
Check if the StorageOS ServiceAccount can create pods with enough permissions
oc get scc privileged -o yaml # Or custom scc with enough privileges
(...)
users:
- system:admin
- system:serviceaccount:openshift-infra:build-controller
- system:serviceaccount:management-infra:management-admin
- system:serviceaccount:management-infra:inspector-admin
- system:serviceaccount:storageos:storageos <--
- system:serviceaccount:tiller:tiller
If the StorageOS sa system:serviceaccount:storageos:storageos is in the privileged scc it will be able to create pods.
Solution:
Add the ServiceAccount system:serviceaccount:storageos:storageos to a scc with enough privileges.
oc adm policy add-scc-to-user privileged system:serviceaccount:storageos:storageos
Getting Help
If our troubleshooting guides do not help resolve your issue, please see our support section for details on how to get in touch with us.