Kubernetes: Calico (Canal?)
A while back I started diving in to using Kubernetes for my home server, and I’ve been really enjoying it. Everything new I try to do in it is quite a learning experience, and there’s inevitably a lot of yak-shaving involved; this weekend was no exception.
I want to free myself from needing to run containers on the same node the storage is on (currently I’m using the local-storage storage class). Since my file server is ZFS-based, iSCSI is possible, but doing that for things like Plex would mean moving a heap of files from a ZFS dataset to an EXT4 filesystem on a ZFS volume, and frankly I don’t have the space to do that except extremely slowly.
So what about using NFS? I haven’t used NFS in a long time, but it looks pretty reasonable. I’ve never actually used Kerberos auth for it, but even if I can’t make that work, my needs are modest, and if someone gets access to the NFS server I’ll have bigger issues.
Actually, speaking of that, let’s sort that out… At present, my work machine and my personal Linux laptop have access to the hosts on the raw LAN. Everything else sits a step above that and cannot reach them: there’s the home VLAN, which cannot access either the hardware LAN or the work VLAN, and then the IoT VLAN, which cannot access any of them. However, none of that isolation applies from inside the containers, so let’s fix that with NetworkPolicy!
It looked pretty easy, but after a quick search it seemed it was not to be: when I set things up I used Flannel for the network, which is dead easy but doesn’t support anything beyond dumb networking, so no NetworkPolicy support! Supposedly you can use “Canal”, which uses Calico for network policy and other clever stuff while Flannel still does the heavy lifting, but try as I might I couldn’t make it work on my cluster. I then tried migrating from Flannel to Calico, which is supposed to be mostly automatic, and that didn’t work either.
I ended up tearing everything down and rebuilding it, which only made things worse. It turned out that the root cause of my grief was the Alpine Linux control plane. The first issue appeared when launching the Calico containers:
Warning Failed 4s kubelet Error: failed to generate container "281e52d2012d5ac98ddd07128db0ff483415787e97febc67cbf04250a807fbd7" spec: failed to generate spec: path "/sys/fs/" is mounted on "/sys" but it is not a shared mount
This is easy enough to solve: just make /sys a shared mount with:
mount --make-shared /sys
I’m not sure how to make this permanent; I couldn’t work out whether it’s doable with fstab, and I moved on pretty quickly. I should return to that so my cluster comes up correctly after a reboot.
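One approach I haven’t tried yet would be an OpenRC local.d script rather than fstab, since Alpine runs anything in /etc/local.d/*.start at boot once the local service is enabled. Purely a sketch, and the script name is arbitrary:
# Untested: re-apply the shared propagation flag on /sys at every boot.
cat > /etc/local.d/sys-shared.start <<'EOF'
#!/bin/sh
mount --make-shared /sys
EOF
chmod +x /etc/local.d/sys-shared.start
rc-update add local default   # make sure local.d scripts actually run at boot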
Next, Calico’s installer puts the CNI binaries in the wrong place for Alpine. I downloaded them by hand following the instructions in “Calico the hard way”, though as it turned out I could have just symlinked them. Here’s how I downloaded them:
wget -O /usr/libexec/cni/calico https://github.com/projectcalico/cni-plugin/releases/download/v3.14.0/calico-amd64
chmod 755 /usr/libexec/cni/calico
wget -O /usr/libexec/cni/calico-ipam https://github.com/projectcalico/cni-plugin/releases/download/v3.14.0/calico-ipam-amd64
chmod 755 /usr/libexec/cni/calico-ipam
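The symlink alternative would presumably look something like the following, assuming Calico’s install step drops its binaries in the conventional /opt/cni/bin while Alpine’s kubelet looks in /usr/libexec/cni (that /opt/cni/bin path is my assumption, not something I verified):
# Sketch only: point Alpine's CNI plugin directory at Calico's installed binaries.
ln -s /opt/cni/bin/calico /usr/libexec/cni/calico
ln -s /opt/cni/bin/calico-ipam /usr/libexec/cni/calico-ipam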
Once everything came up, DNS resolution was not working… I had to run:
kubectl delete pod -n kube-system -l k8s-app=kube-dns
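A quick way to check that DNS really is back is a throwaway busybox pod (the pod name here is arbitrary):
kubectl run dnstest --rm -it --image=busybox --restart=Never -- nslookup kubernetes.default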
This gave me working DNS, and we’re away! My first NetworkPolicy worked correctly, and I can now lock down all the containers to block access to the hosts, so a compromised container can’t drive-by my NFS servers.
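For reference, a policy along these lines is roughly the shape of what I mean: allow egress anywhere except the LAN the hosts sit on. The namespace and CIDR here are placeholders rather than my real layout:
# Sketch only: namespace and CIDR are hypothetical.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-lan-egress
  namespace: default
spec:
  podSelector: {}          # every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 192.168.1.0/24   # placeholder for the raw LAN
EOF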
Oh yeah, those… another day…
Update 2022-07-04: It turned out, though I didn’t realise it at the time, that I’d left my cluster in a broken-but-working-by-accident state, because Calico and Flannel were running at the same time. Calico was trying to use BIRD, but BIRD was already running due to MetalLB… The reason it still worked regardless was that Flannel’s VXLAN was still configured, but since I was using Calico and not Canal, that wouldn’t survive a full reboot. Which is exactly what happened when we did the switchboard upgrade.
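In hindsight, the quickest way to spot that state would probably have been checking what was actually deployed and what CNI configs were left lying around on the nodes; something like this, though the exact names and paths depend on how each was installed:
# Both CNIs present at once is the giveaway.
kubectl get daemonsets -n kube-system | grep -E 'calico|flannel'
ls /etc/cni/net.d/   # run on a node: leftover Flannel config alongside Calico's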
I later fixed this by using Canal, which took some further effort.