Etcd

StorageOS uses etcd to store cluster metadata. Because of the strong consistency model that etcd enforces, StorageOS metadata operations are guaranteed to be atomic and consistent.

Installation options

Before installing StorageOS, an etcd cluster needs to be prepared. There are different topologies that fulfil this prerequisite.

External etcd (Production)
etcd as Pods (Testing)

External Etcd

The following topology is designed to provide the highest stability for the etcd cluster. It is necessary for normal StorageOS function to have a reliable metadata cluster. Otherwise, central operations such as provisioning, attachment or failover of volumes cannot be performed. In the event that etcd becomes unavailable, StorageOS clusters become read only, allowing access to data but preventing metadata changes.

It is recommended to install etcd outside the scope of the orchestrator wherever possible. Following CoreOS best practices, a minimum of 3 independent nodes should be dedicated to etcd. StorageOS doesn’t require a high performance etcd cluster, as the throughput of metadata to the cluster is low. Depending on the level of redundancy you feel comfortable with, you can install etcd on the Kubernetes Master nodes. When doing so, take extreme care to avoid collisions between the StorageOS etcd installation and the Kubernetes etcd. Precautions include changing the default client and peer ports and using a different etcd data directory. The Ansible playbook below defaults the etcd installation directory to /var/lib/storageos-etcd.

Installation

If you are familiar with etcd, you can proceed with the CoreOS instructions to install etcd. Otherwise, this section lays out an example installation using Ansible.

Clone StorageOS Helper repository

git clone https://github.com/storageos/deploy.git
cd deploy/k8s/deploy-storageos/etcd-helpers/etcd-ansible-systemd

Edit the inventory file

List the nodes on which etcd will be installed in the inventory. The file hosts serves as an example.

$ cat hosts
[nodes]
centos-1 ip="10.64.10.228"  fqdn="ip-10-64-10-228.eu-west-2.compute.internal"
centos-2 ip="10.64.14.233"  fqdn="ip-10-64-14-233.eu-west-2.compute.internal"
centos-3 ip="10.64.12.111"  fqdn="ip-10-64-12-111.eu-west-2.compute.internal"

# Edit the inventory file
$ vi hosts # Or your own inventory file
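Before running the playbook, it can help to confirm that the inventory lists enough nodes. The following is a minimal sketch, assuming the INI layout shown above with a [nodes] group. The helper names count_nodes and check_node_count are hypothetical and not part of the helper repository.

```shell
# Count the hosts listed under the [nodes] group of an Ansible INI inventory.
count_nodes() {
  awk '
    /^\[/ { in_nodes = ($0 == "[nodes]"); next }     # track the current group
    in_nodes && NF > 0 && $0 !~ /^[#;]/ { n++ }      # skip blank and comment lines
    END { print n + 0 }
  ' "$1"
}

# Fail below the 3-node minimum; warn on even member counts.
check_node_count() {
  if [ "$1" -lt 3 ]; then
    echo "expected at least 3 etcd nodes, found $1" >&2
    return 1
  fi
  if [ $(( $1 % 2 )) -eq 0 ]; then
    echo "warning: $1 nodes; an even member count adds no extra fault tolerance" >&2
  fi
  return 0
}
```

For example, check_node_count "$(count_nodes hosts)" returns a non-zero status when fewer than 3 nodes are listed.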

Edit the etcd configuration

If targeting Kubernetes Master nodes, you must change etcd_port_client and etcd_port_peers so that they do not collide with the ports used by the Kubernetes etcd.

$ cat group_vars/all
etcd_version: "3.4.9"
etcd_port_client: "2379"
etcd_port_peers: "2380"
etcd_quota_bytes: 8589934592  # 8 GB
etcd_auto_compaction_mode: "revision"
etcd_auto_compaction_retention: "12000"
members: "{{ groups['nodes'] }}"
installation_dir: "/var/lib/storageos-etcd"
advertise_format: 'fqdn' # fqdn || ip
backup_file: "/tmp/backup.db"

tls:
  enabled: false
  ca_common_name: "eu-west-2.compute.internal"
  etcd_common_name: "*.eu-west-2.compute.internal"
  cert_dir: "/etc/etcdtls"
  ca_cert_file: "etcd-ca.pem"
  etcd_server_cert_file: "server.pem"
  etcd_server_key_file: "server-key.pem"
  etcd_client_cert_file: "etcd-client.crt"
  etcd_client_key_file: "etcd-client.key"

$ vi group_vars/all
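As an alternative to editing the file by hand, the two port settings can be rewritten with sed. This is a sketch only: the function name set_etcd_ports and the ports 12379 and 12380 are illustrative assumptions, and any free ports can be used.

```shell
# Rewrite the client and peer ports in an etcd-ansible vars file.
# Usage: set_etcd_ports FILE CLIENT_PORT PEER_PORT
set_etcd_ports() {
  sed -e "s/^etcd_port_client:.*/etcd_port_client: \"$2\"/" \
      -e "s/^etcd_port_peers:.*/etcd_port_peers: \"$3\"/" \
      "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example (hypothetical ports):
#   set_etcd_ports group_vars/all 12379 12380
```

If the client port is changed, the same port has to be used later in etcdctl --endpoints and in the StorageOS kvBackend.address.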

Choose between IP addressing and FQDN in the advertise_format parameter. It determines how etcd advertises its address to clients. The format becomes very relevant when using TLS, because the advertised addresses have to match the names the certificates are issued for.

If enabling TLS, it is recommended to generate your own CA certificate and key. An example CA certificate and key can be found in roles/tls_cert/files/ca.pem and roles/tls_cert/files/ca-key.pem, which are generated using cfssl from the roles/tls_cert/files/ca-config.json file. The playbook will generate and distribute the keys and certificates for client authentication on all etcd nodes. The certificates are signed by that CA.

Install
```
ansible-playbook -i hosts install.yaml
```

Verify installation

The playbook installs the etcdctl binary on the nodes, at /usr/local/bin.

$ ssh $NODE # Any node running the new etcd
$ ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 member list
66946cff1224bb5, started, etcd-b94bqkb9rf,  http://172.28.0.1:2380, http://172.28.0.1:2379
17e7256953f9319b, started, etcd-gjr25s4sdr, http://172.28.0.2:2380, http://172.28.0.2:2379
8b698843a4658823, started, etcd-rqdf9thx5p, http://172.28.0.3:2380, http://172.28.0.3:2379
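The member list can also be checked in a script. The sketch below counts members in the started state, assuming the comma-separated output format shown above. The helper name count_started_members is hypothetical.

```shell
# Count members reported as "started" in `etcdctl member list` output (read from stdin).
count_started_members() {
  awk -F', *' '$2 == "started" { n++ } END { print n + 0 }'
}

# Example, run on an etcd node:
#   ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 member list | count_started_members
```

For a 3 node cluster the expected count is 3. Running etcdctl endpoint health against the same endpoint gives an additional per-endpoint check.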

Managed Services

When running StorageOS on managed Kubernetes services, it may not be possible to deploy the Production etcd topology exactly as described above. It is still recommended to deploy etcd on its own as far as possible, even if that means deploying 3 independent VMs for etcd to run on.

As managed services treat nodes as ephemeral resources, if the orchestration deletes the 3 nodes hosting etcd, the result will be catastrophic and a restore from a backup will be needed.

If it is not possible to deploy independent VMs for etcd, etcd can be deployed as pods, inside the cluster. This configuration requires an awareness of the stability that etcd requires. You can use the etcd-as-pods installation option, but be aware of the precautions that need to be taken.

Why External Etcd

etcd is a distributed key-value store database focused on strong consistency. That means that etcd nodes perform operations across the cluster to ensure quorum. If quorum is lost, an etcd node stops and marks its contents as read-only, because it can no longer guarantee that the data it holds is valid: another peer might have a newer version that has not been delivered. Quorum is fundamental for etcd operations.

In a Kubernetes environment, applications are scheduled across nodes, and in some scenarios such as “DiskPressure” they may be evicted from a node and scheduled onto a different one. With an application such as etcd, this scenario can result in quorum being lost, making the cluster unable to recover automatically. Usually a 3 node etcd cluster can survive losing one node and recover. However, losing a second node at the same time, or even a network partition between the nodes, will result in quorum being lost.

Bind Etcd IPs to Kubernetes Service

Kubernetes external services use a DNS name to reference external endpoints. You can use the example from the helper GitHub repository to deploy the external Service. This can be useful when monitoring etcd from Prometheus.
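One common way to bind external IPs to a Service is a Service without a selector plus an Endpoints object of the same name. The sketch below generates such a manifest. It is not the example from the helper repository, and the function name, Service name and namespace used here are assumptions.

```shell
# Print a selector-less Service plus a matching Endpoints object for external etcd.
# Usage: etcd_service_manifest NAME NAMESPACE PORT IP [IP...]
etcd_service_manifest() (
  name=$1; namespace=$2; port=$3
  shift 3
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: $name
  namespace: $namespace
spec:
  ports:
    - name: client
      port: $port
      targetPort: $port
---
apiVersion: v1
kind: Endpoints
metadata:
  name: $name
  namespace: $namespace
subsets:
  - addresses:
EOF
  # One address entry per etcd node IP.
  for ip in "$@"; do
    printf '      - ip: %s\n' "$ip"
  done
  cat <<EOF
    ports:
      - name: client
        port: $port
EOF
)
```

The output can be reviewed and then piped to kubectl apply -f -, for example etcd_service_manifest storageos-etcd-external etcd 2379 10.64.10.228 10.64.14.233 10.64.12.111.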

Etcd as Pods

etcd can be deployed in Kubernetes using the official etcd-operator.

Deploying etcd in Kubernetes makes the etcd installation very easy. However, be aware that even though the official etcd-operator is maintained by Red Hat, it hasn’t been under active development since 2019, so it may be considered an archived project. For an actively maintained etcd operator, you might want to check the Improbable etcd Operator.

Examples of deploying etcd clusters using the etcd-operator on Kubernetes and OpenShift are available.

Since Kubernetes 1.16, the Deployment API uses “apps/v1”. Once you have cloned the CoreOS etcd-operator repository, you will need to change the apiVersion in the file “examples/deployment.yaml” from extensions/v1beta1 to apps/v1.
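The apiVersion change can be made with a one-line substitution. This is a minimal sketch, and the function name use_apps_v1 is hypothetical.

```shell
# Switch a Deployment manifest from extensions/v1beta1 to apps/v1 in place.
# Usage: use_apps_v1 FILE
use_apps_v1() {
  sed 's|^apiVersion: extensions/v1beta1$|apiVersion: apps/v1|' "$1" > "$1.tmp" \
    && mv "$1.tmp" "$1"
}
```

Note that apps/v1 Deployments also require a spec.selector that matches the pod template labels, so the manifest may need that field added if it is missing.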

The official etcd-operator repository also has a backup deployment operator that can help backup etcd data. Make sure you take frequent backups of the etcd cluster as it holds all the StorageOS cluster metadata.

Known etcd-operator issues

This topology is only recommended for deployments where isolated nodes cannot be used.

etcd is a distributed key-value store database focused on strong consistency. That means that etcd nodes perform operations across the cluster to ensure quorum. If quorum is lost, etcd nodes stop and etcd marks its contents as read-only. This is because it cannot guarantee that new data will be valid. Quorum is fundamental for etcd operations. When running etcd in pods it is therefore important to consider that a loss of quorum could arise from etcd pods being evicted from nodes.

Operations such as Kubernetes Upgrades with rolling node pools could cause a total failure of the etcd cluster as nodes are discarded in favor of new ones.

A 3 node etcd cluster can survive losing one node and recover, and a 5 node cluster can survive the loss of two nodes. Loss of further nodes will result in quorum being lost.
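These figures follow from the majority rule: a cluster of n members needs floor(n/2) + 1 of them to form quorum. A short sketch of the arithmetic, with hypothetical function names:

```shell
# Members required for quorum in a cluster of $1 members.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# Members that can be lost while quorum is still held.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 3 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n")"
done
# prints:
#   members=3 quorum=2 tolerates=1
#   members=5 quorum=3 tolerates=2
```

A 4 node cluster tolerates only one failure, the same as a 3 node cluster, which is why odd member counts are generally preferred.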

The etcd-operator doesn’t support a full stop of the cluster. Once all members have been stopped, the etcd cluster cannot be brought back unless a backup is restored.

StorageOS and Etcd

When installing StorageOS, the etcd endpoints are passed in a StorageOSCluster Custom Resource.

For instance:

apiVersion: "storageos.com/v1"
kind: StorageOSCluster
metadata:
  name: "storageos"
spec:
  secretRefName: "storageos-api" # Reference from the Secret created in the previous step
  secretRefNamespace: "default"  # Namespace of the Secret

  (...)

  kvBackend:
    address: 'storageos-etcd-client.etcd:2379' # Example address, change for your etcd endpoint
   #address: '10.42.15.23:2379,10.42.12.22:2379,10.42.13.16:2379' # You can set etcd server ips
    backend: 'etcd'

Note the kvBackend.address section.

For full Custom Resource documentation check StorageOSCluster resource definition.
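Since kvBackend.address may hold either a single endpoint or a comma-separated list, a quick format check before applying the resource can catch typos. The following sketch only validates the host:port shape and does not contact etcd. Both helper names are hypothetical.

```shell
# Print one endpoint per line from a comma-separated address value.
split_endpoints() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Succeed only if every endpoint looks like host:port with a numeric port.
validate_endpoints() {
  split_endpoints "$1" | while IFS= read -r ep; do
    host=${ep%:*}
    port=${ep##*:}
    case $ep in *:*) ;; *) exit 1 ;; esac         # must contain a colon
    [ -n "$host" ] || exit 1                      # host part must not be empty
    case $port in ''|*[!0-9]*) exit 1 ;; esac     # port must be numeric
  done
}
```

For example, validate_endpoints '10.42.15.23:2379,10.42.12.22:2379,10.42.13.16:2379' succeeds, while an address without a port fails. The same comma-separated value can be passed to etcdctl --endpoints to check that the endpoints respond.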

Best practices

StorageOS uses etcd as a service, whether it is deployed following the above instructions or as a custom installation. It is expected that the user maintains the availability and integrity of the etcd cluster.

It is highly recommended to keep the cluster backed up and ensure high availability of its data. It is also important to keep the latency between StorageOS nodes and the etcd replicas low. Deploying an etcd cluster in a different data center or region can make StorageOS detect etcd nodes as unavailable due to latency. A 10ms latency between StorageOS and etcd would be the maximum threshold for proper functioning of the system.
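For an external etcd cluster, a backup can be taken with etcdctl snapshot save. The sketch below is one possible shape for such a job. The function names, the file naming scheme and the target directory are assumptions, and scheduling (for example via cron) is left out.

```shell
# Build a timestamped snapshot path inside directory $1.
snapshot_path() {
  printf '%s/etcd-%s.db\n' "$1" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Save a snapshot from a single endpoint.
# Usage: backup_etcd ENDPOINT DIRECTORY
backup_etcd() {
  ETCDCTL_API=3 etcdctl --endpoints="$1" snapshot save "$(snapshot_path "$2")"
}

# Example, run on an etcd node (hypothetical directory):
#   backup_etcd 127.0.0.1:2379 /var/backups
```

Snapshots should be copied off the etcd nodes so that they survive the loss of the nodes themselves.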

Monitoring

It is highly recommended to add monitoring to the etcd cluster. etcd serves Prometheus metrics on the client port, at http://etcd-url:2379/metrics.
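Before wiring up Prometheus, the metrics endpoint can be spot-checked from the command line. The sketch below extracts a single un-labelled metric from the Prometheus text format. The helper name metric_value is hypothetical, and etcd_server_has_leader is used as an example metric.

```shell
# Read Prometheus text-format metrics on stdin and print the value of one un-labelled metric.
# Usage: metric_value METRIC_NAME
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example, run on an etcd node without TLS:
#   curl -s http://127.0.0.1:2379/metrics | metric_value etcd_server_has_leader
```

A value of 1 for etcd_server_has_leader indicates that the member currently sees a leader.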

You can use the Grafana dashboards developed by StorageOS for etcd. For a production etcd, use the etcd-cluster-as-service dashboard. For etcd deployed by the operator, use the etcd-cluster-as-pod dashboard.

Defragmentation

etcd uses revisions to store multiple versions of keys. Compaction removes all key revisions prior to a certain revision from etcd. Typically the etcd configuration enables automatic compaction of keys to prevent performance degradation and limit the storage required. Compaction of revisions can create fragmentation, meaning that space on disk is available for use by etcd but unavailable for use by the file system. In order to reclaim this space, etcd can be defragmented.

Reclaiming space is important because when the etcd database file grows over the “DB_BACKEND_BYTES” parameter, the cluster triggers an alarm and switches to a mode that only allows reads and deletes. To avoid hitting the db backend bytes limit, both compaction and defragmentation are required. How often defragmentation is required depends on the churn of key revisions in etcd.
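To judge how close a node is to the limit, the database size can be compared with the quota and with the space actually in use. The sketch below is plain arithmetic with hypothetical function names. The input values are assumed to come from etcd itself, for example from the etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes metrics if your etcd version exposes them.

```shell
# Percentage of the backend quota consumed by the database file.
# Usage: quota_used_percent DB_SIZE_BYTES QUOTA_BYTES
quota_used_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Percentage of the database file that defragmentation could reclaim.
# Usage: reclaimable_percent DB_SIZE_BYTES IN_USE_BYTES
reclaimable_percent() {
  if [ "$1" -le 0 ]; then
    echo 0
  else
    echo $(( ($1 - $2) * 100 / $1 ))
  fi
}
```

For instance, a 4 GiB database file under the 8589934592 byte quota shown earlier gives quota_used_percent 4294967296 8589934592, which is 50.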

The Grafana Dashboards mentioned above indicate when nodes require defragmentation. Be aware that defragmentation is a blocking operation that is performed per node, hence the etcd node will be locked for the duration of the defragmentation. Defragmentation usually takes a few milliseconds to complete.
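Because defragmentation blocks the node it runs on, one cautious approach is to defragment the members one at a time instead of all at once. The following is a sketch under that assumption. The function name defrag_all and the DRY_RUN switch are hypothetical, while etcdctl defrag is the standard etcdctl subcommand.

```shell
# Defragment each endpoint of a comma-separated list in turn.
# With DRY_RUN=1 the commands are printed instead of executed.
# Usage: defrag_all ENDPOINTS
defrag_all() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r ep; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "etcdctl --endpoints=$ep defrag"
    else
      # Stop at the first failure so a problem is noticed before moving on.
      ETCDCTL_API=3 etcdctl --endpoints="$ep" defrag </dev/null || exit 1
    fi
  done
}

# Example (dry run):
#   DRY_RUN=1 defrag_all '172.28.0.1:2379,172.28.0.2:2379,172.28.0.3:2379'
```

If the space alarm has already been raised, it remains set after space is reclaimed until it is cleared, which can be checked with etcdctl alarm list and cleared with etcdctl alarm disarm.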