Troubleshooting
- Node name different from Hostname
- One node clusters
- Peer discovery - Networking
- Peer discovery - Pod allocation
- LIO Init:Error
- LIO not enabled
Node name different from Hostname
Issue:
StorageOS nodes can’t join the cluster and show the following log entries.
time="2018-09-24T13:47:02Z" level=error msg="failed to start api" error="error verifying UUID: UUID aed3275f-846b-1f75-43a1-adbfec8bf974 has already been registered and has hostname 'debian-4', not 'node4'" module=command
Reason:
The StorageOS cluster registration process uses the hostname of the node provided by Kubernetes. However, as a prestart check StorageOS verifies the network hostname of the OS to make sure it can communicate with other nodes. If the two names don’t match, StorageOS cannot start.
Solution:
Make sure the hostnames match the names advertised by Kubernetes. If you have changed the hostname of a node, restart the node to apply the change.
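As a quick check, the two names can be compared from a shell; this is a minimal sketch, and the node name node4 and the kubectl step are illustrative:

```shell
# Compare the OS hostname with the name Kubernetes advertises for this node.
os_name=$(hostname)
echo "OS hostname: $os_name"

# From a machine with cluster access (node name is illustrative):
#   kubectl get node node4 -o jsonpath='{.metadata.name}'

# If the names differ, set the hostname and reboot so the change is applied:
#   hostnamectl set-hostname node4
#   reboot
```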
One node clusters
Issue:
StorageOS nodes have started creating multiple clusters of one node, rather than one cluster of many nodes.
root@node1:~# storageos -H node1 node ls
NAME ADDRESS HEALTH SCHEDULER VOLUMES TOTAL
node1 172.28.128.3 Healthy About a minute true M: 0, R: 0 8.699GiB
root@node2:~# storageos -H node2 node ls
NAME ADDRESS HEALTH SCHEDULER VOLUMES TOTAL
node2 172.28.128.4 Healthy About a minute true M: 0, R: 0 8.699GiB
root@node3:~# storageos -H node3 node ls
NAME ADDRESS HEALTH SCHEDULER VOLUMES TOTAL
node3 172.28.128.5 Healthy About a minute true M: 0, R: 0 8.699GiB
root@node4:~# storageos -H node4 node ls
NAME ADDRESS HEALTH SCHEDULER VOLUMES TOTAL
node4 172.28.128.6 Healthy About a minute true M: 0, R: 0 8.699GiB
Reason:
The JOIN variable has been misconfigured. One common mistake is to set the variable to localhost or to the value of ADVERTISE_IP. Installations with Helm might cause this behaviour unless the JOIN parameter is explicitly defined.
StorageOS uses the JOIN variable to discover other nodes in the cluster during the node bootstrapping process. It must be set to one or more active nodes. You don’t need to list every node: once a new StorageOS node can connect to any member of the cluster, the gossip protocol discovers the full list of members. For high availability during bootstrap, however, it is recommended to list as many nodes as possible, so that if one node is unavailable the next in the list is queried.
Solution:
Define the JOIN variable according to the discovery documentation.
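As a sketch, a JOIN list can be built from the node IPs and applied to the DaemonSet; the IPs, namespace and DaemonSet name below are assumptions taken from the examples on this page:

```shell
# Build a comma-separated JOIN list from the cluster node IPs
# (IPs are illustrative; replace with the addresses of your own nodes).
ips="172.28.128.3 172.28.128.4 172.28.128.5"
JOIN=$(echo "$ips" | tr ' ' ',')
echo "JOIN=$JOIN"

# Apply it to the DaemonSet (namespace and name assumed from this page):
#   kubectl -n storageos set env ds/storageos JOIN="$JOIN"
```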
Peer discovery - Networking
Issue:
StorageOS nodes can’t join the cluster and show the following logs after one minute of container uptime.
time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address=172.28.128.5 category=etcd host=node3 module=cp target=172.28.128.6
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="172.28.128.3,172.28.128.4,172.28.128.5,172.28.128.6" error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp
Reason:
StorageOS uses a gossip protocol to discover the nodes in the cluster. When StorageOS starts, one or more nodes can be referenced so new nodes can query existing ones for the list of members. This error indicates that the node can’t connect to any of the nodes in the known list. The known list is defined in the JOIN variable.
Assert:
It is likely that ports are blocked by a firewall.
SSH into one of your nodes and check connectivity to the rest of the nodes.
# Successful execution:
[root@node01 ~]# nc -zv node04 5705
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.1.166:5705.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
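To probe every peer in one pass, a small loop over the ports used by StorageOS can be used: 5703 (directfs), 5705 (api), 5707 (etcd) and 5708 (nats), as seen in the diagnostics on this page. Node names are illustrative:

```shell
# Probe the StorageOS ports on every peer. A BLOCKED result points at a
# firewall rule (or a node that is down) between this host and the peer.
check_ports() {
  node=$1
  for port in 5703 5705 5707 5708; do
    if nc -z -w 2 "$node" "$port" 2>/dev/null; then
      echo "$node:$port open"
    else
      echo "$node:$port BLOCKED"
    fi
  done
}

for n in node1 node2 node3 node4; do
  check_ports "$n"
done
```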
StorageOS exposes network diagnostics through its API, viewable with the CLI. To use this feature, the CLI must query the API of a running node. The diagnostics show information from all known cluster members. If all the ports are blocked during the first bootstrap of the cluster, the diagnostics won’t show any data, as the nodes couldn’t register.
StorageOS network diagnostics are available in storageos-rc5, storageos-cli-rc3 and above.
# Example:
root@node4:~# storageos cluster connectivity
SOURCE NAME ADDRESS LATENCY STATUS MESSAGE
node4 node2.nats 172.28.128.4:5708 1.949275ms OK
node4 node3.api 172.28.128.5:5705 3.070574ms OK
node4 node3.nats 172.28.128.5:5708 2.989238ms OK
node4 node2.directfs 172.28.128.4:5703 2.925707ms OK
node4 node3.etcd 172.28.128.5:5707 2.854726ms OK
node4 node3.directfs 172.28.128.5:5703 2.833371ms OK
node4 node1.api 172.28.128.3:5705 2.714467ms OK
node4 node1.nats 172.28.128.3:5708 2.613752ms OK
node4 node1.etcd 172.28.128.3:5707 2.594159ms OK
node4 node1.directfs 172.28.128.3:5703 2.601834ms OK
node4 node2.api 172.28.128.4:5705 2.598236ms OK
node4 node2.etcd 172.28.128.4:5707 16.650625ms OK
node3 node4.nats 172.28.128.6:5708 1.304126ms OK
node3 node4.api 172.28.128.6:5705 1.515218ms OK
node3 node2.directfs 172.28.128.4:5703 1.359827ms OK
node3 node1.api 172.28.128.3:5705 1.185535ms OK
node3 node4.directfs 172.28.128.6:5703 1.379765ms OK
node3 node1.etcd 172.28.128.3:5707 1.221176ms OK
node3 node1.nats 172.28.128.3:5708 1.330122ms OK
node3 node2.api 172.28.128.4:5705 1.238541ms OK
node3 node1.directfs 172.28.128.3:5703 1.413574ms OK
node3 node2.etcd 172.28.128.4:5707 1.214273ms OK
node3 node2.nats 172.28.128.4:5708 1.321145ms OK
node1 node4.directfs 172.28.128.6:5703 1.140797ms OK
node1 node3.api 172.28.128.5:5705 1.089252ms OK
node1 node4.api 172.28.128.6:5705 1.178439ms OK
node1 node4.nats 172.28.128.6:5708 1.176648ms OK
node1 node2.directfs 172.28.128.4:5703 1.529612ms OK
node1 node2.etcd 172.28.128.4:5707 1.165681ms OK
node1 node2.api 172.28.128.4:5705 1.29602ms OK
node1 node2.nats 172.28.128.4:5708 1.267454ms OK
node1 node3.nats 172.28.128.5:5708 1.485657ms OK
node1 node3.etcd 172.28.128.5:5707 1.469429ms OK
node1 node3.directfs 172.28.128.5:5703 1.503015ms OK
node2 node4.directfs 172.28.128.6:5703 1.484ms OK
node2 node1.directfs 172.28.128.3:5703 1.275304ms OK
node2 node4.nats 172.28.128.6:5708 1.261422ms OK
node2 node4.api 172.28.128.6:5705 1.465532ms OK
node2 node3.api 172.28.128.5:5705 1.252768ms OK
node2 node3.nats 172.28.128.5:5708 1.212332ms OK
node2 node3.directfs 172.28.128.5:5703 1.192792ms OK
node2 node3.etcd 172.28.128.5:5707 1.270076ms OK
node2 node1.etcd 172.28.128.3:5707 1.218522ms OK
node2 node1.api 172.28.128.3:5705 1.363071ms OK
node2 node1.nats 172.28.128.3:5708 1.349383ms OK
Solution:
Open ports following the prerequisites page.
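As an illustrative sketch for firewalld-based hosts, the ports observed in the diagnostics above can be opened as follows; consult the prerequisites page for the complete port list:

```shell
# Ports seen in the connectivity diagnostics on this page; the full list
# is in the prerequisites documentation.
PORTS="5703 5705 5707 5708"

if command -v firewall-cmd >/dev/null 2>&1; then
  for p in $PORTS; do
    firewall-cmd --permanent --add-port="${p}/tcp" || echo "could not add ${p}/tcp"
  done
  firewall-cmd --reload || true
else
  echo "firewall-cmd not found; open TCP ports $PORTS with your firewall tool"
fi
```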
Peer discovery - Pod allocation
Issue:
StorageOS nodes can’t join the cluster and show the following log entries.
time="2018-09-24T13:40:20Z" level=info msg="not first cluster node, joining first node" action=create address=172.28.128.5 category=etcd host=node3 module=cp target=172.28.128.6
time="2018-09-24T13:40:20Z" level=error msg="could not retrieve cluster config from api" status_code=503
time="2018-09-24T13:40:20Z" level=error msg="failed to join existing cluster" action=create category=etcd endpoint="172.28.128.3,172.28.128.4,172.28.128.5,172.28.128.6" error="503 Service Unavailable" module=cp
time="2018-09-24T13:40:20Z" level=info msg="retrying cluster join in 5 seconds..." action=create category=etcd module=cp
Reason:
StorageOS uses a gossip protocol to discover the nodes in the cluster. When StorageOS starts, one or more active nodes must be referenced so new nodes can query existing ones for the list of members. This error indicates that the node can’t connect to any of the nodes in the known list. The known list is defined in the JOIN variable.
If there are no active StorageOS nodes, the bootstrap process elects the first node in the JOIN variable as master, and the rest try to discover the cluster from it. If that node doesn’t start, the whole cluster remains unable to bootstrap.
Installations of StorageOS on Kubernetes use a DaemonSet which, by default, does not schedule StorageOS pods to master nodes, due to the node-role.kubernetes.io/master:NoSchedule taint present in typical installations. In such cases the JOIN variable must not contain master nodes, or the StorageOS cluster will be unable to start.
Assert:
Check that the first node in the JOIN variable started properly.
root@node1:~# kubectl -n storageos describe ds/storageos | grep JOIN
JOIN: 172.28.128.3,172.28.128.4,172.28.128.5
root@node1:~# kubectl -n storageos get pod -o wide | grep 172.28.128.3
storageos-8zqxl 1/1 Running 0 2m 172.28.128.3 node1
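The first entry can also be extracted directly from the JOIN value; the value below is taken from the describe output above and is illustrative:

```shell
# Take the first entry of the JOIN list; that node bootstraps the cluster.
JOIN="172.28.128.3,172.28.128.4,172.28.128.5"
first=${JOIN%%,*}
echo "first JOIN node: $first"

# Confirm a StorageOS pod is running on that node:
#   kubectl -n storageos get pod -o wide | grep "$first"
```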
Solution:
Make sure that the JOIN variable doesn’t specify the master nodes. If you are using the discovery service, you must also ensure that the DaemonSet won’t schedule pods on the masters. This can be achieved with taints, node selectors or labels.
For installations with the StorageOS operator you can specify which nodes to deploy StorageOS on using node labels.
For more advanced installations using compute-only and storage nodes, see our example deployment of StorageOS with node labels.
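One way to keep pods off the masters is a node label plus a matching nodeSelector; the label key/value and resource names in this sketch are assumptions, not names required by the operator:

```shell
# Label the worker nodes that should run StorageOS (names illustrative).
#   kubectl label node node1 node2 node3 storageos=enabled --overwrite

# Patch the DaemonSet so it only schedules onto labelled nodes.
PATCH='{"spec":{"template":{"spec":{"nodeSelector":{"storageos":"enabled"}}}}}'
#   kubectl -n storageos patch ds/storageos --type merge -p "$PATCH"
echo "$PATCH"
```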
LIO Init:Error
Issue:
StorageOS pods fail to start with an Init:Error status.
kubectl -n storageos get pod
NAME READY STATUS RESTARTS AGE
storageos-2kwqx 0/1 Init:Err 0 6s
storageos-cffcr 0/1 Init:Err 0 6s
storageos-d4f69 0/1 Init:Err 0 6s
storageos-nhq7m 0/1 Init:Err 0 6s
Reason:
This indicates that StorageOS cannot start because the open source Linux SCSI target drivers (LIO) are not enabled. The StorageOS DaemonSet runs an init container that enables the required kernel modules on the host system. These errors mean that container couldn’t load the modules.
Assert
Check the logs of the init container.
kubectl -n storageos logs $ANY_STORAGEOS_POD -c enable-lio
In case of failure, it will show output like the following, indicating which kernel modules couldn’t be loaded or are not properly configured:
Checking configfs
configfs mounted on sys/kernel/config
Module target_core_mod is not running
executing modprobe -b target_core_mod
Module tcm_loop is not running
executing modprobe -b tcm_loop
modprobe: FATAL: Module tcm_loop not found.
Solution:
Install the required kernel modules (usually found in the linux-image-extra-$(uname -r) package of your distribution) on your nodes following the prerequisites page, then delete the StorageOS pods so the DaemonSet creates them again.
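The modules can also be loaded by hand to confirm the packages installed correctly; this sketch needs root privileges, and a FAILED line means the module is missing from the installed kernel packages:

```shell
# Try to load each kernel module StorageOS needs, mirroring what the
# enable-lio init container does.
load_modules() {
  for m in target_core_mod tcm_loop target_core_file configfs; do
    if modprobe -b "$m" 2>/dev/null; then
      echo "$m loaded"
    else
      echo "$m FAILED"
    fi
  done
}

load_modules
```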
LIO not enabled
Issue:
StorageOS node can’t start and shows the following log entries.
time="2018-09-24T14:34:40Z" level=error msg="liocheck returned error" category=liocheck error="exit status 1" module=dataplane stderr="Sysfs root '/sys/kernel/config/target' is missing, is kernel configfs present and target_core_mod loaded? category=fslio level=warn\nRuntime error checking stage 'target_core_mod': SysFs root missing category=fslio level=warn\nliocheck: FAIL (lio_capable_system() returns failure) category=fslio level=fatal\n" stdout=
time="2018-09-24T14:34:40Z" level=error msg="failed to start dataplane services" error="system dependency check failed: exit status 1" module=command
Reason:
This indicates that one or more kernel modules required for StorageOS are not loaded.
Assert
The following kernel modules must be loaded on the host.
lsmod | egrep "^tcm_loop|^target_core_mod|^target_core_file|^configfs"
Solution:
Install the required kernel modules (usually found in the linux-image-extra-$(uname -r) package of your distribution) on your nodes following the prerequisites page, then restart the container.