Disk Full
When the physical disks in your cluster get full, or start using disk space too fast, Kubernetes automatically applies the node.kubernetes.io/disk-pressure:NoSchedule taint to the affected node.
The StorageOS DaemonSet has a toleration for this taint, so its pods continue to run on that node.
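To confirm that a node is under disk pressure, you can inspect its taints and conditions directly; <NODE_NAME> is a placeholder for the affected node:
# List the taints on every node to spot disk-pressure
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check the DiskPressure condition on a specific node
kubectl describe node <NODE_NAME> | grep -i diskpressure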
Detecting a full disk
You will see the following when viewing the StorageOS container logs on the affected node with kubectl logs storageos-daemonset-xxxx -c storageos:
"msg": "StartAsyncFallocate: insufficient free space on file system free_space=1048305664 required_free_space=1073741824",
"msg": "Write: write failed volid=155051 error=BlobStorage::PrepWrite encountered a previous IO error preventing future IO for safety",
"msg": "Write: write failed volid=155051 error=all blob files are full - can not complete write",
"msg": "Write: write failed volid=155051 error=BlobStorage::PrepWrite encountered a previous IO error preventing future IO for safety",
"msg": "SCSI command failed type=write error=FATAL retries=0 time_to_deadline_secs=89",
When trying to attach a volume to a node with a full disk, you will see the following when running kubectl describe pod $POD_USING_VOLUME:
AttachVolume.Attach failed for volume "pvc-xxx" : rpc error: code = Internal desc = internal error: rpc error: code = Internal desc = rpc error: code = Internal desc = fs: STATUS_FORBIDDEN: create failed in Notify handler error=Failed to create LUN for FsConfigVolume{volume_id=
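When this error appears, it is worth checking which node the pod was scheduled to, since that is the node whose disk needs attention; this is a simple check using standard kubectl output:
# The NODE column shows where the pod was scheduled
kubectl get pod $POD_USING_VOLUME -o wide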
Consequences of a full disk
Volume Provisioning
When provisioning a new volume, the dataplane checks that there is at least 1GiB of space left on the node's underlying filesystem for the blob files located at /var/lib/storageos/data/dev[0-9]+/vol.xxxxxx.0.blob and /var/lib/storageos/data/dev[0-9]+/vol.xxxxxx.1.blob.
If there is insufficient space for both of the blob files that StorageOS uses to store data, the dataplane fails the volume create request.
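To check whether a node still has the 1GiB of headroom the dataplane requires, you can inspect the backing filesystem on the node itself; this is a sketch assuming the default data path, and the device directory names may differ:
# Free space on the filesystem backing the StorageOS data directory
df -h /var/lib/storageos/data
# Actual disk usage of the existing blob files
du -sh /var/lib/storageos/data/dev*/vol.*.blob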
Runtime Access
At runtime, if an attempted write to a blob file returns an ENOSPC error, the dataplane marks the file as full. Once both StorageOS blob files in a volume are marked as full, the dataplane marks the deployment with an error flag, and all subsequent read/write operations will return an I/O error.
This flag is only stored in memory; to clear it, the StorageOS DaemonSet pod on the affected node must be restarted after remediating the disk space issue.
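Once the disk space issue has been remediated, restarting the StorageOS pod on the affected node clears the in-memory flag; the storageos namespace and app=storageos label below are assumptions about a typical installation:
# Delete the StorageOS pod on the affected node; the DaemonSet controller recreates it
kubectl -n storageos delete pod -l app=storageos --field-selector spec.nodeName=<NODE_NAME>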
Replica Failover
A replica on a full disk can’t be failed over successfully.
Recovering from a full disk
To recover from a disk full error, you can either add new storage space to the affected node, or delete existing persistent volume claims.
N.B. However you choose to resolve the issue, the StorageOS DaemonSet pod must be restarted.
This is because I/O on the affected volumes is blocked at runtime by the previously discussed error flag, which is held only in memory. The flag does not survive a pod restart, so once the pod comes back up the volumes can operate normally again.
Adding New Storage Space
If you choose to recover by expanding your capacity, you have two main options:
- Add new storage devices at /var/lib/storageos/data/dev[0-9]+ as described here (see the sketch below).
- Expand the underlying filesystem that StorageOS is using as described here.
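As a rough illustration of the first option, the following sketch adds a new backing device; it assumes a new, empty block device at /dev/sdb, that dev2 is the next unused device directory, and that ext4 is an acceptable filesystem. Follow the linked documentation for the authoritative procedure.
# Format the new device (assumes /dev/sdb is empty and dedicated to StorageOS)
mkfs.ext4 /dev/sdb
# Mount it at the next unused device directory
mkdir -p /var/lib/storageos/data/dev2
mount /dev/sdb /var/lib/storageos/data/dev2
# Persist the mount across reboots (adjust for your environment)
echo '/dev/sdb /var/lib/storageos/data/dev2 ext4 defaults 0 2' >> /etc/fstab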
After expanding capacity, remember to restart the StorageOS DaemonSet pod.
Deleting existing PVCs
If you choose to recover by deleting existing persistent volume claims, simply run:
kubectl delete pvc <PVCOnFullNodeName>
Then restart the StorageOS daemonset pod.