Troubleshooting

Node name different from Hostname

Issue:

StorageOS nodes can’t join the cluster and show the following log entries.

time="2018-09-24T13:47:02Z" level=error msg="failed to start api" error="error verifying UUID: UUID aed3275f-846b-1f75-43a1-adbfec8bf974 has already been registered and has hostname 'debian-4', not 'node4'" module=command

Reason:

The StorageOS registration process that starts the cluster uses the hostname of the node where the StorageOS container is running, as provided by Kubernetes. However, as a prestart check, StorageOS verifies the network hostname of the OS to make sure it can communicate with other nodes. If those two names don’t match, StorageOS cannot start.

Solution:

Make sure the node hostnames match the names advertised by Kubernetes. If you have changed the hostname of a node, make sure that you restart the node to apply the change.
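A quick way to compare the two names on a node is a check like the following (the node name `node4` is a placeholder; use the name shown by `kubectl get nodes` for that machine):

```shell
# Compare the OS hostname with the Kubernetes node name.
# K8S_NODE_NAME is a placeholder; substitute the name from `kubectl get nodes`.
K8S_NODE_NAME="node4"
OS_HOSTNAME="$(hostname)"
if [ "$OS_HOSTNAME" != "$K8S_NODE_NAME" ]; then
    echo "Mismatch: OS hostname is '$OS_HOSTNAME', Kubernetes expects '$K8S_NODE_NAME'"
fi
```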

One node clusters

Issue:

StorageOS nodes have started creating multiple clusters of one node, rather than one cluster of many nodes.

root@node1:~# storageos -H node1 node ls
NAME                ADDRESS             HEALTH                   SCHEDULER           VOLUMES             TOTAL
node1               172.28.128.3        Healthy About a minute   true                M: 0, R: 0          8.699GiB
root@node1:~# storageos -H node2 node ls
NAME                ADDRESS             HEALTH                   SCHEDULER           VOLUMES             TOTAL
node2               172.28.128.4        Healthy About a minute   true                M: 0, R: 0          8.699GiB
root@node1:~# storageos -H node3 node ls
NAME                ADDRESS             HEALTH                   SCHEDULER           VOLUMES             TOTAL
node3               172.28.128.5        Healthy About a minute   true                M: 0, R: 0          8.699GiB
root@node1:~# storageos -H node4 node ls
NAME                ADDRESS             HEALTH                   SCHEDULER           VOLUMES             TOTAL
node4               172.28.128.6        Healthy About a minute   true                M: 0, R: 0          8.699GiB

Reason:

The JOIN variable has been misconfigured. A common mistake is to set it to localhost or to the value of ADVERTISE_IP.

Installations with Helm might cause this behaviour unless the JOIN parameter is explicitly defined.

StorageOS uses the JOIN variable to discover other nodes in the cluster during the node bootstrapping process. It must be set to one or more active nodes.

You don’t need to specify all the nodes: once a new StorageOS node can connect to any member of the cluster, the gossip protocol discovers the full list of members. However, for high availability during bootstrap it is recommended to list as many nodes as possible, so that if one node is unavailable the next in the list will be queried.

Solution:

Define the JOIN variable according to the discovery documentation.
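As a sketch, the JOIN environment variable in the StorageOS DaemonSet might look like the following (the addresses are examples; list the IPs or names of nodes that will run StorageOS):

```yaml
# Hypothetical excerpt from the StorageOS DaemonSet container spec.
env:
- name: JOIN
  value: "172.28.128.3,172.28.128.4,172.28.128.5"
```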

Peer discovery - Networking

Issue:

StorageOS nodes can’t join the cluster and show the following log entries after one minute of container uptime.

time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address=172.28.128.5 category=etcd host=node3 module=cp target=172.28.128.6
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="172.28.128.3,172.28.128.4,172.28.128.5,172.28.128.6" error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp

Reason:

StorageOS uses a gossip protocol to discover the nodes in the cluster. When StorageOS starts, one or more nodes can be referenced so new nodes can query existing ones for the list of members. This error indicates that the node can’t connect to any of the nodes in the known list. The known list is defined in the JOIN variable.

Doublecheck:

It is likely that ports are blocked by a firewall.

SSH into one of your nodes and check connectivity to the rest of the nodes.

# Successful execution:
[root@node1 ~]# nc -zv node04 5705
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.1.166:5705.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
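To check every peer and every StorageOS port in one go, a small loop can be used. The node names below are examples; the ports are those used by StorageOS (5703 directfs, 5705 api, 5707 etcd, 5708 nats):

```shell
# Probe every StorageOS port on every peer from this node.
# NODES is an example list; adjust it to your cluster.
NODES="node1 node2 node3 node4"
PORTS="5703 5705 5707 5708"
for node in $NODES; do
    for port in $PORTS; do
        # -z: scan without sending data, -w 2: two-second timeout
        nc -z -w 2 "$node" "$port" || echo "BLOCKED: $node:$port"
    done
done
```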

StorageOS exposes network diagnostics in its API, viewable from the CLI. To use this feature, the CLI must query the API of a running node. The diagnostics show information from all known cluster members. If all the ports were blocked during the first bootstrap of the cluster, the diagnostics won’t show any data, because the nodes couldn’t register.

StorageOS network diagnostics are available from storageos-rc5 and storageos-cli-rc3 onwards.

# Example:
root@node1:~# storageos cluster connectivity
SOURCE  NAME            ADDRESS            LATENCY      STATUS  MESSAGE
node4   node2.nats      172.28.128.4:5708  1.949275ms   OK
node4   node3.api       172.28.128.5:5705  3.070574ms   OK
node4   node3.nats      172.28.128.5:5708  2.989238ms   OK
node4   node2.directfs  172.28.128.4:5703  2.925707ms   OK
node4   node3.etcd      172.28.128.5:5707  2.854726ms   OK
node4   node3.directfs  172.28.128.5:5703  2.833371ms   OK
node4   node1.api       172.28.128.3:5705  2.714467ms   OK
node4   node1.nats      172.28.128.3:5708  2.613752ms   OK
node4   node1.etcd      172.28.128.3:5707  2.594159ms   OK
node4   node1.directfs  172.28.128.3:5703  2.601834ms   OK
node4   node2.api       172.28.128.4:5705  2.598236ms   OK
node4   node2.etcd      172.28.128.4:5707  16.650625ms  OK
node3   node4.nats      172.28.128.6:5708  1.304126ms   OK
node3   node4.api       172.28.128.6:5705  1.515218ms   OK
node3   node2.directfs  172.28.128.4:5703  1.359827ms   OK
node3   node1.api       172.28.128.3:5705  1.185535ms   OK
node3   node4.directfs  172.28.128.6:5703  1.379765ms   OK
node3   node1.etcd      172.28.128.3:5707  1.221176ms   OK
node3   node1.nats      172.28.128.3:5708  1.330122ms   OK
node3   node2.api       172.28.128.4:5705  1.238541ms   OK
node3   node1.directfs  172.28.128.3:5703  1.413574ms   OK
node3   node2.etcd      172.28.128.4:5707  1.214273ms   OK
node3   node2.nats      172.28.128.4:5708  1.321145ms   OK
node1   node4.directfs  172.28.128.6:5703  1.140797ms   OK
node1   node3.api       172.28.128.5:5705  1.089252ms   OK
node1   node4.api       172.28.128.6:5705  1.178439ms   OK
node1   node4.nats      172.28.128.6:5708  1.176648ms   OK
node1   node2.directfs  172.28.128.4:5703  1.529612ms   OK
node1   node2.etcd      172.28.128.4:5707  1.165681ms   OK
node1   node2.api       172.28.128.4:5705  1.29602ms    OK
node1   node2.nats      172.28.128.4:5708  1.267454ms   OK
node1   node3.nats      172.28.128.5:5708  1.485657ms   OK
node1   node3.etcd      172.28.128.5:5707  1.469429ms   OK
node1   node3.directfs  172.28.128.5:5703  1.503015ms   OK
node2   node4.directfs  172.28.128.6:5703  1.484ms      OK
node2   node1.directfs  172.28.128.3:5703  1.275304ms   OK
node2   node4.nats      172.28.128.6:5708  1.261422ms   OK
node2   node4.api       172.28.128.6:5705  1.465532ms   OK
node2   node3.api       172.28.128.5:5705  1.252768ms   OK
node2   node3.nats      172.28.128.5:5708  1.212332ms   OK
node2   node3.directfs  172.28.128.5:5703  1.192792ms   OK
node2   node3.etcd      172.28.128.5:5707  1.270076ms   OK
node2   node1.etcd      172.28.128.3:5707  1.218522ms   OK
node2   node1.api       172.28.128.3:5705  1.363071ms   OK
node2   node1.nats      172.28.128.3:5708  1.349383ms   OK

Solution:

Open the required ports by following the prerequisites page.

Peer discovery - Pod allocation

Issue:

StorageOS nodes can’t join the cluster and show the following log entries.

time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address=172.28.128.5 category=etcd host=node3 module=cp target=172.28.128.6
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="172.28.128.3,172.28.128.4,172.28.128.5,172.28.128.6" error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp

Reason:

StorageOS uses a gossip protocol to discover the nodes in the cluster. When StorageOS starts, one or more active nodes must be referenced so new nodes can query existing ones for the list of members. This error indicates that the node can’t connect to any of the nodes in the known list. The known list is defined in the JOIN variable.

If there are no active StorageOS nodes, the bootstrap process elects the first node in the JOIN variable as master, and the rest try to discover the cluster from it. If that node fails to start, the whole cluster will be unable to bootstrap.

Installations of StorageOS on Kubernetes use a DaemonSet, which by default does not schedule StorageOS pods on master nodes, because typical installations taint masters with node-role.kubernetes.io/master:NoSchedule. In such cases the JOIN variable must not contain master nodes, or the StorageOS cluster will be unable to start.

Doublecheck:

Check that the first node of the JOIN variable started properly.

root@node1:~/# kubectl -n storageos describe ds/storageos | grep JOIN
    JOIN:          172.28.128.3,172.28.128.4,172.28.128.5
root@node1:~/# kubectl -n storageos get pod -o wide | grep 172.28.128.3
storageos-8zqxl   1/1       Running   0          2m        172.28.128.3   node1

Solution:

Make sure that the JOIN variable doesn’t specify master nodes. If you are using the discovery service, ensure that the DaemonSet won’t schedule Pods on the masters. This can be achieved with taints, node selectors or labels.
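For instance, a nodeSelector in the DaemonSet pod spec can restrict StorageOS pods to labelled worker nodes. The label key and value below are assumptions for illustration; use whatever label you apply to your storage nodes:

```yaml
# Hypothetical DaemonSet excerpt: schedule StorageOS pods only on
# nodes carrying the (example) label storageos-node=true.
spec:
  template:
    spec:
      nodeSelector:
        storageos-node: "true"
```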

A deployment example is available showing how to run StorageOS with node labels.

LIO Init:Error

Issue:

StorageOS pods do not start and show an Init:Error status.

kubectl -n storageos get pod
NAME              READY     STATUS              RESTARTS   AGE
storageos-2kwqx   0/1       Init:Error           0          6s
storageos-cffcr   0/1       Init:Error           0          6s
storageos-d4f69   0/1       Init:Error           0          6s
storageos-nhq7m   0/1       Init:Error           0          6s

Reason:

This indicates that StorageOS cannot start because the open source Linux SCSI drivers are not enabled. The StorageOS DaemonSet runs an init container that enables the required kernel modules on the host system. These errors mean that the init container couldn’t load the modules.

Doublecheck:

Check the logs of the init container.

kubectl -n storageos logs $ANY_STORAGEOS_POD -c enable-lio

In case of failure, it will show output like the following, indicating which kernel modules couldn’t be loaded or are not properly configured:

Checking configfs
configfs mounted on sys/kernel/config
Module target_core_mod is not running
executing modprobe -b target_core_mod
Module tcm_loop is not running
executing modprobe -b tcm_loop
modprobe: FATAL: Module tcm_loop not found.

Solution:

Install the required kernel modules (usually found in the linux-image-extra-$(uname -r) package of your distribution) on your nodes following the prerequisites page, then delete the StorageOS pods so that the DaemonSet creates them again.
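Deleting the pods can be done in one command with a label selector. The label `app=storageos` below is an assumption for illustration; check the labels your DaemonSet actually applies to its pods first:

```shell
# Delete all StorageOS pods so the DaemonSet recreates them.
# The label selector is hypothetical; verify it with:
#   kubectl -n storageos get pod --show-labels
kubectl -n storageos delete pod --selector app=storageos
```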

LIO not enabled

Issue:

A StorageOS node can’t start and shows the following log entries.

time="2018-09-24T14:34:40Z" level=error msg="liocheck returned error" category=liocheck error="exit status 1" module=dataplane stderr="Sysfs root '/sys/kernel/config/target' is missing, is kernel configfs present and target_core_mod loaded? category=fslio level=warn\nRuntime error checking stage 'target_core_mod': SysFs root missing category=fslio level=warn\nliocheck: FAIL (lio_capable_system() returns failure) category=fslio level=fatal\n" stdout=
time="2018-09-24T14:34:40Z" level=error msg="failed to start dataplane services" error="system dependency check failed: exit status 1" module=command

Reason:

This indicates that one or more kernel modules required for StorageOS are not loaded.

Doublecheck:

The following kernel modules must be loaded on the host.

lsmod  | egrep "^tcm_loop|^target_core_mod|^target_core_file|^configfs"

Solution:

Install the required kernel modules (usually found in the linux-image-extra-$(uname -r) package of your distribution) on your nodes following the prerequisites page, then restart the container.
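Once the packages are installed, loading the modules can be sketched as a loop over the same list used in the lsmod check (run as root; on most kernels configfs is built in, in which case modprobe still succeeds):

```shell
# Attempt to load each module StorageOS needs; report failures.
for mod in target_core_mod tcm_loop target_core_file configfs; do
    modprobe -b "$mod" || echo "failed to load: $mod"
done
```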