Hello Longhorn!

After the disk server upgrades, which also somewhat underscored the need for it, I decided to finally “shit or get off the pot” on looking at some durable storage for things. At the moment, when the disk server goes down, pretty much everything else does too. For some things, like Plex, this is unavoidable. For others, like Home Assistant, it’s very avoidable, I’m just lazy.

So I pointed the latest version of my “Put Kubernetes on the Thing” Ansible playbook at the first two compute nodes, then one at a time drained them, powered them off, fitted a 500GB SSD to each (one Samsung 850 EVO and one 860), and fired them back up.

I installed Longhorn to the cluster, deleted the default disks (I don’t want them on the boot SSDs), created new ext4 partitions on the new SSDs, pointed the two compute nodes at them, and then for completeness I created a ZFS dataset on the disk server and pointed Longhorn at it too. My theory was that if I used a taint to keep things on the compute nodes, with a third, extra replica on the disk machine, I should be fine.

To test things out, I figured Traefik is probably a good candidate: its needs for persistent storage are just to stop it from needing to get a new ACME certificate on every start-up, so the consequences for shit going wrong are not particularly dire. So I reasoned that I’d add a second volumeMount to it, and then I could copy the acme.json file across. This was not to be, because I got the error message:

code = Unknown desc = file extent is unsupported: operation not supported

A quick check of old Googs showed a pretty obvious explanation: ZFS-backed data is not supported. Ahh well. So I deleted that disk, set the replicas to 2, and off it went.
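
For reference, one place a replica count of 2 can live is a Longhorn StorageClass (the post doesn't say whether it was set there or in Longhorn's global settings, so treat this as an assumption). A minimal sketch in Python that builds such a manifest as JSON, which kubectl accepts as well as YAML. The name `longhorn-two-replicas` and the helper `longhorn_storage_class` are made up for illustration:

```python
import json


def longhorn_storage_class(name: str, replicas: int) -> dict:
    """Build a StorageClass manifest for the Longhorn CSI provisioner.

    Hypothetical helper; the name and replica count are supplied by the caller.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "driver.longhorn.io",
        "parameters": {
            # StorageClass parameters must be strings, hence str().
            "numberOfReplicas": str(replicas),
        },
    }


# Two replicas: one per SSD-equipped compute node.
manifest = longhorn_storage_class("longhorn-two-replicas", 2)
print(json.dumps(manifest, indent=2))
```

The output could be piped into `kubectl apply -f -`, and a PersistentVolumeClaim would then reference the class by name.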

I then struggled on a restart because my Traefik pod is rootless with an immutable root partition, and the permissions were coming up wrong. I can use fsGroup to set the group on the volume’s mount point to the gid that the container runs as, and that lets it create acme.json once. After that it fails to do anything, because it complains that the permissions are too wide - 0660 instead of 0600, which makes sense as the file and dir are owned by root.
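
To make the 0660 versus 0600 complaint concrete, here is a rough sketch of the kind of check involved. This is not Traefik's actual code, just an assumption about its shape: reject the file if the group or anyone else has any permission bits set. The function name `too_open` is made up for illustration. Kubernetes documents that fsGroup ORs the permission bits of volume contents with rw-rw----, which is how a file ends up at 0660.

```python
import os
import stat
import tempfile


def too_open(path: str) -> bool:
    """Return True if group or other have any permission bits on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0o077 masks the group and other bits; only the owner bits may be set.
    return (mode & 0o077) != 0


with tempfile.TemporaryDirectory() as tmp:
    acme = os.path.join(tmp, "acme.json")
    with open(acme, "w") as f:
        f.write("{}")

    os.chmod(acme, 0o660)  # what the file looks like once group rw is added
    print("0660 too open?", too_open(acme))

    os.chmod(acme, 0o600)  # owner-only, which is what the check wants
    print("0600 too open?", too_open(acme))
```

With a check like this, any group-writable file fails no matter who owns it, so fsGroup on its own cannot satisfy it.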

I ended up giving up for the night by relaxing the policy on that namespace and just running Traefik as root. I was able to point Longhorn at the former dataset on the disk server, via NFS, for backups though, so that’s nice.

So yeah, I think this will do the job, I just really need a third SSD for the other compute node. I have a 240GB one out of Sabriena’s desktop but I’d kinda like to try to find another 500GB… I just don’t want to pay the price of it.

Horsham, VIC, Australia fwaggle
