# Migrate Etcd
This procedure replaces the current nodes of your Etcd cluster with new ones. Migrating Etcd is a delicate operation, so we recommend a careful assessment of the required steps before starting.

The procedure is scoped to Etcd clusters that are provisioned externally to Kubernetes, with members that run on their own dedicated nodes.

Please note that this procedure involves full downtime of StorageOS and access to its data.
## Migration
To perform Etcd migrations safely, we recommend adding the new Etcd nodes to the existing Etcd cluster first, then editing the StorageOS custom resource, and finally deleting the old Etcd nodes one by one. The full procedure is as follows:
1. Prepare all the nodes that will host Etcd. The Etcd provisioning steps can be found in the Etcd section of the StorageOS prerequisites page.

    On each new node, edit the `/etc/etcd.conf` file to contain the IP addresses of all old and new nodes. In the `/etc/systemd/system/etcd3.service` file, ensure that the `--initial-cluster` argument includes all old and new nodes, and that the `--initial-cluster-state` argument is set to `existing`.

    For example, adding 3 further nodes to an initially 3 node cluster would involve populating `/etc/etcd.conf` on each of the fourth, fifth and sixth nodes with the IP addresses of all 6 nodes intended for the cluster, as well as environment variables for the client port and peers port:

    ```bash
    # NODE_IP is the IP of the node where this file resides.
    NODE_IP=192.168.152.142
    # Node 1 IP
    NODE1_IP=192.168.195.168
    # Node 2 IP
    NODE2_IP=192.168.202.40
    # Node 3 IP
    NODE3_IP=192.168.174.117
    # Node 4 IP
    NODE4_IP=192.168.152.142
    # Node 5 IP
    NODE5_IP=192.168.198.252
    # Node 6 IP
    NODE6_IP=192.168.150.122

    CLIENT_PORT=2379
    PEERS_PORT=2380
    ```
    The executable section of `/etc/systemd/system/etcd3.service` on each of the new nodes should then be edited to reflect these environment variables. For example, the service file for the second new node to be added to an initially 3 node cluster would refer to the three old nodes and two new nodes. Note that in the `--initial-cluster` argument we do not include members that have not yet been added to the cluster, apart from the member we are currently adding.

    ```bash
    ...
    ExecStart=/usr/local/sbin/etcd3 --name etcd-${NODE_IP} \
    ...
        --initial-cluster-state existing \
    ...
        --initial-cluster etcd-${NODE1_IP}=http://${NODE1_IP}:${PEERS_PORT},etcd-${NODE2_IP}=http://${NODE2_IP}:${PEERS_PORT},etcd-${NODE3_IP}=http://${NODE3_IP}:${PEERS_PORT},etcd-${NODE4_IP}=http://${NODE4_IP}:${PEERS_PORT},etcd-${NODE5_IP}=http://${NODE5_IP}:${PEERS_PORT}
    ...
    ```

    At runtime, the value of `--initial-cluster` will resolve to the following:

    ```
    "etcd-192.168.195.168=http://192.168.195.168:2380,etcd-192.168.202.40=http://192.168.202.40:2380,etcd-192.168.174.117=http://192.168.174.117:2380,etcd-192.168.152.142=http://192.168.152.142:2380,etcd-192.168.198.252=http://192.168.198.252:2380"
    ```
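    For reference, a complete unit file for one of the new nodes might look like the following sketch. Only the `--initial-cluster` and `--initial-cluster-state` arguments are prescribed by this procedure; the remaining flags and paths (such as the data directory) are assumptions and should match the unit files already running on your existing members.

    ```bash
    # /etc/systemd/system/etcd3.service -- illustrative sketch only.
    # Flags beyond --initial-cluster and --initial-cluster-state are
    # assumptions; copy the real values from an existing member's unit file.
    [Unit]
    Description=etcd3
    After=network-online.target

    [Service]
    EnvironmentFile=/etc/etcd.conf
    ExecStart=/usr/local/sbin/etcd3 --name etcd-${NODE_IP} \
        --data-dir /var/lib/etcd \
        --listen-client-urls http://${NODE_IP}:${CLIENT_PORT} \
        --advertise-client-urls http://${NODE_IP}:${CLIENT_PORT} \
        --listen-peer-urls http://${NODE_IP}:${PEERS_PORT} \
        --initial-advertise-peer-urls http://${NODE_IP}:${PEERS_PORT} \
        --initial-cluster-state existing \
        --initial-cluster etcd-${NODE1_IP}=http://${NODE1_IP}:${PEERS_PORT},etcd-${NODE2_IP}=http://${NODE2_IP}:${PEERS_PORT},etcd-${NODE3_IP}=http://${NODE3_IP}:${PEERS_PORT},etcd-${NODE4_IP}=http://${NODE4_IP}:${PEERS_PORT},etcd-${NODE5_IP}=http://${NODE5_IP}:${PEERS_PORT}
    Restart=always
    RestartSec=5

    [Install]
    WantedBy=multi-user.target
    ```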
2. Connect to an Etcd node that already belongs to the cluster, and add one of the new nodes to the cluster using the `etcdctl member add --learner` command, per the Etcd documentation. The output includes environment variables with the name of the new member, and the state and constituents of the cluster now that the new node has been added:

    ```bash
    # Name of the new member of Etcd
    ETCD_NEW_MEMBER="etcd-192.168.152.142"
    # Peer URL for the new member of Etcd, including the port
    ETCD_NEW_MEMBER_PEER="http://192.168.152.142:2380"
    # Client URL of an existing member, including the port; etcdctl must
    # connect to a member that is already part of the cluster
    ETCD_EXISTING_MEMBER_CLIENT="http://192.168.195.168:2379"

    # Add the new member to the cluster as a learner
    ETCDCTL_API=3 etcdctl member add $ETCD_NEW_MEMBER \
        --learner \
        --peer-urls="$ETCD_NEW_MEMBER_PEER" \
        --endpoints="$ETCD_EXISTING_MEMBER_CLIENT"
    ```

    ```
    Member 4a5820690b65300c added to cluster 97f788d9ca9b3357

    ETCD_NAME="etcd-192.168.152.142"
    ETCD_INITIAL_CLUSTER="etcd-192.168.202.40=http://192.168.202.40:2380,etcd-192.168.174.117=http://192.168.174.117:2380,etcd-192.168.152.142=http://192.168.152.142:2380,etcd-192.168.195.168=http://192.168.195.168:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.152.142:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    ```
3. Start the Etcd service on the new node:

    ```bash
    systemctl daemon-reload
    systemctl enable etcd3.service
    systemctl start etcd3.service
    ```

    List the members in the cluster and confirm that the new node is present and has started, and has therefore been added successfully. The final field indicates whether a member is a learner; note that it is `true` for the new node:

    ```bash
    $ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 member list
    aa9dec86cc9c2d1, started, etcd-192.168.202.40, http://192.168.202.40:2380, http://192.168.202.40:2379, false
    48823f29326a24f7, started, etcd-192.168.174.117, http://192.168.174.117:2380, http://192.168.174.117:2379, false
    4a5820690b65300c, started, etcd-192.168.152.142, http://192.168.152.142:2380, http://192.168.152.142:2379, true
    8211f1d0f64f3269, started, etcd-192.168.195.168, http://192.168.195.168:2380, http://192.168.195.168:2379, false
    ```
4. Promote the learner to a full voting member of the cluster with the `etcdctl member promote <member id>` command, for example as shown below. As per the Etcd documentation, the learner will fail to be promoted if it is not yet ready, that is, if its log is not yet in sync with the leader's.
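    For example, promoting the learner added above uses the member ID reported by `member add`; the endpoint here is an assumption and can be any existing voting member:

    ```bash
    # Promote the learner using the member ID from the earlier output
    $ ETCDCTL_API=3 etcdctl --endpoints=http://192.168.195.168:2379 \
        member promote 4a5820690b65300c
    Member 4a5820690b65300c promoted in cluster 97f788d9ca9b3357
    ```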
5. Repeat the previous two steps for each of the remaining nodes onto which you wish to migrate Etcd, for example as sketched below. Per the Etcd documentation, when adding more than one member to a cluster, it is best practice to configure each member one at a time, verifying that it starts successfully before moving on to the next.
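    The following is an illustrative sketch of that loop for the two remaining new nodes, not a drop-in script. It assumes SSH access as root to the new nodes, that each already has its `/etc/etcd.conf` and `etcd3.service` files in place, and that an existing member is reachable at `$EXISTING`:

    ```bash
    # Illustrative only: add, start, verify and promote each remaining node in turn.
    EXISTING="http://192.168.195.168:2379"

    for NODE in 192.168.198.252 192.168.150.122; do
        # Add the node to the cluster as a learner
        ETCDCTL_API=3 etcdctl --endpoints="$EXISTING" member add "etcd-${NODE}" \
            --learner --peer-urls="http://${NODE}:2380"

        # Start the service on the new node (assumes root SSH access)
        ssh "root@${NODE}" 'systemctl daemon-reload && systemctl enable --now etcd3.service'

        # Wait until the new member answers health checks
        until etcdctl --endpoints="http://${NODE}:2379" endpoint health; do sleep 5; done

        # Look up the member ID by name, then promote; promotion is retried
        # because it fails until the learner is in sync with the leader
        ID=$(ETCDCTL_API=3 etcdctl --endpoints="$EXISTING" member list |
            awk -F', ' -v name="etcd-${NODE}" '$3 == name {print $1}')
        until ETCDCTL_API=3 etcdctl --endpoints="$EXISTING" member promote "$ID"; do
            sleep 5
        done
    done
    ```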
6. Delete the StorageOS Custom Resource, per the instructions on the uninstall operations page.

    Although deleting the StorageOSCluster Custom Resource will stop StorageOS, the data will be safe. When StorageOS starts again, the Volumes will be available.
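    For illustration, deletion might look like the following; the resource name and namespace are assumptions, so list the resources first to find yours:

    ```bash
    # Find the StorageOSCluster resource (name and namespace vary per install)
    kubectl get storageoscluster --all-namespaces

    # Delete it; the volume data on the nodes remains intact
    kubectl delete storageoscluster storageos -n storageos
    ```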
7. Connect to one of the newly-added nodes, and use `etcdctl` to remove the old Etcd nodes one by one:

    ```bash
    $ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 member remove 48823f29326a24f7
    Member 48823f29326a24f7 removed from cluster 97f788d9ca9b3357
    ```

    After removing a node, check that Etcd is healthy and quorum has been maintained with the following commands:

    ```bash
    export ETCD_ENDPOINTS="http://192.168.202.40:2379,http://192.168.174.117:2379,..."
    etcdctl --endpoints $ETCD_ENDPOINTS endpoint health -w table
    etcdctl --endpoints $ETCD_ENDPOINTS endpoint status -w table
    ```
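    When the cluster is healthy, `endpoint health` prints a table along the following lines (the timings shown are placeholders):

    ```
    +-----------------------------+--------+-------------+-------+
    |          ENDPOINT           | HEALTH |    TOOK     | ERROR |
    +-----------------------------+--------+-------------+-------+
    | http://192.168.202.40:2379  | true   | 9.872195ms  |       |
    | http://192.168.174.117:2379 | true   | 10.124510ms |       |
    +-----------------------------+--------+-------------+-------+
    ```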
8. Edit the `kvBackend.address` parameter of the StorageOS Custom Resource to reflect the locations of the new Etcd nodes, removing the references to the old nodes, as sketched below.
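    The following is a minimal sketch of a `StorageOSCluster` resource with the new Etcd addresses; the metadata and the rest of the `spec` are assumptions and should be taken from your existing resource:

    ```yaml
    apiVersion: storageos.com/v1
    kind: StorageOSCluster
    metadata:
      name: storageos          # assumption: use your existing resource's name
      namespace: storageos     # assumption: use your existing namespace
    spec:
      # Comma-separated client URLs of the new Etcd members only
      kvBackend:
        address: "192.168.152.142:2379,192.168.198.252:2379,192.168.150.122:2379"
      # ...the rest of your original spec remains unchanged
    ```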
9. Recreate the StorageOS Custom Resource with the newly-edited `kvBackend.address` parameter.
10. Perform a `member list` again, to confirm that only the new machines are still present in the cluster:

    ```bash
    $ ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 member list
    4a5820690b65300c, started, etcd-192.168.152.142, http://192.168.152.142:2380, http://192.168.152.142:2379, false
    b1780933f495adb3, started, etcd-192.168.150.122, http://192.168.150.122:2380, http://192.168.150.122:2379, false
    ee47ff85984afcc0, started, etcd-192.168.198.252, http://192.168.198.252:2380, http://192.168.198.252:2379, false
    ```
At this stage Etcd has been successfully migrated.
## Etcd Learners
As of v3.4, Etcd includes the functionality to add a new member as a “learner” (a non-voting member of the cluster). This makes the process of adding new members safer, as it allows a new member to synchronise fully with the cluster before it participates in quorum, reducing the risk to cluster availability while it catches up.
If you are using an earlier version of Etcd, make the following substitutions in the procedure above:

- In step 2, replace the `etcdctl member add --learner` command with `etcdctl member add` (see the sketch after this list)
- In step 4, skip the `etcdctl member promote <member id>` command
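Under that substitution, the add command from step 2 reduces to the following sketch, reusing the variables defined earlier; without `--learner`, the new member joins with voting rights immediately:

```bash
# Pre-v3.4 variant: no learner phase, the member votes as soon as it joins
ETCDCTL_API=3 etcdctl member add $ETCD_NEW_MEMBER \
    --peer-urls="$ETCD_NEW_MEMBER_PEER" \
    --endpoints="$ETCD_EXISTING_MEMBER_CLIENT"
```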
Using Etcd learners provides added resilience, so consider upgrading to a version of Etcd that supports them as soon as possible.