Replicating encrypted child dataset + change-key + incremental receive overwrites master key of replica, causes permission denied on remount #12614
I can reproduce this issue on a somewhat current master. In short, the problem is that the incremental receive overwrites the master key of the replica/encrypted/a dataset with the received key, which is encrypted with the wrapping key from src/encrypted. Since the unencrypted master key is cached in memory, this goes unnoticed until it is unloaded by the unmount. A subsequent mount tries to decrypt the master key with the wrapping key from replica/encrypted, which obviously fails. Please see #12000 (comment) for the terminology used above. For incremental receives we need to detect whether the encryption root on the receiving side changed since the last receive and refuse to receive if so. This would break replication from this point on but keep the existing data intact. Not sure how to accomplish this though, but I'll have a look.
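A user-space guard along those lines (not the in-kernel check being discussed, just a sketch; the com.example:last_encroot user property, the snapshot names, and the exact dataset paths are made up for illustration) could look like this:
# remember the replica's encryption root at every receive and refuse the
# next incremental if it changed in the meantime
SRC=src/encrypted/a
DST=replica/encrypted/a
last="$(zfs get -H -o value com.example:last_encroot "$DST")"
now="$(zfs get -H -o value encryptionroot "$DST")"
if [ "$last" != "-" ] && [ "$last" != "$now" ]; then
    echo "encryption root of $DST changed since last receive, refusing" >&2
    exit 1
fi
zfs send -w -i @prev "$SRC@new" | zfs recv "$DST" \
    && zfs set com.example:last_encroot="$now" "$DST"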
For incremental receives I think there should be no need to update/change the wrapped master key as is done now (the wrapped key hasn't changed in the source in the example above but is updated in the replica?), so there won't be a problem with decrypting the replica. There is only one master key, so we should be able to decrypt it whether the encryption root is inherited or not, as in the change-key-only case.
Hey all, I hit this issue today. I was trying to send incremental snapshots to my backup pool
After this, I could not mount. Since this is my backup pool, I tried to fix it by changing the key, and that gave me the panic and stack trace mentioned in this issue. I can't tell if my issue is #12000 or not, seeing how I got into this state. While I'm in this state, I'm happy to provide anything that helps resolve the issue. Also, if someone with a better understanding of ZFS encryption can help me resolve the issue, I'd really appreciate it. I reeeaaaally don't want to have to rebuild my backup pool if I can help it. I can't even say for sure right now that I haven't lost data, though I don't think I have, luckily. My setup is something like:
I'm pretty sure I got into this mess by doing a
Ah, I think I understand the situation from @AttilaFueloep's comment in the other issue:
Has anyone built such a tool? Is the master key accessible via
The plaintext from which the key was derived is the same on both of my systems, but I'm guessing that due to a salt or IV the generated key must be different, or else the problem does not make sense to me. A fix feels so close, yet so far right now...
@endotronic I tried everything to rescue my 10TB of backup data but did not succeed. I had to start over.
@brenc I really appreciate the response here! Real bummer. I'm going to learn from your efforts (thank you so much) and just rebuild my backup pool then. I'm so glad I didn't actually lose anything.
I have a really nasty workaround that prevents this that I'd like to refine further. (In this specific case, the problem is that it resets the pbkdf2salt, and only the pbkdf2salt, on the child, while it's still the same encryptionroot, and without actually rewrapping the key.) (It should also be trivial to work around, I think, if someone is burned by this... I have another hacky branch that would allow that.)
So in summary, we can't trust native (raw) send / receive with ZFS encryption. And it's been an open issue for two years so far. Lovely.
I haven't had any issues sending and receiving encrypted datasets as long as I don't try to do it recursively.
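For what it's worth, a per-dataset raw workflow along those lines would look roughly like this (pool and dataset names are placeholders):
# initial full raw send of a single dataset, no recursion
zfs snapshot pool/enc/data@snap1
zfs send -w pool/enc/data@snap1 | zfs recv -u backuppool/enc/data
# later raw incrementals, still one dataset at a time
zfs snapshot pool/enc/data@snap2
zfs send -w -i @snap1 pool/enc/data@snap2 | zfs recv -u backuppool/enc/data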
I documented some testing with -R raw sends back and forth here and didn't encounter any issues: #12123 (comment)
I just fell for the same trap. Luckily, I noticed this quite quickly. My backup of a 2.4 TB dataset was simply unmountable. I wonder if my zvols (another TB) suffer from the same problem. They look fine so far. I will try to mount them soon to see if they are readable or utterly garbage.
[Edit] The zvols are also affected. The full backup is completely garbage now. I really wonder why this has not been solved after all these years. Sounds like a major breaking bug.
@systemofapwne I'm planning an encrypted zfs system and am trying to figure out if I will run into this problem. My testing so far did not result in any issues (per my last comment, and more details in this FR: #15687). Did you do a
One mitigation I'm considering is to create all datasets in advance (e.g. 100-1000) so I never have to add new ones and thus never need to
First: my zfs version is
Since I am using TrueNAS and not ZFS directly via the CLI, I will try to answer your questions as well as possible. The Tank pool's root is encrypted, as are all child datasets/zvols, inheriting from the pool's root dataset.
The Backup pool's root is encrypted too, but it is distinct from the Tank pool's (hence: different keying / encryptionroot).
Child datasets have been replicated from Tank -> Backup. I don't remember if I had recursive snapshots on at first, but it has definitely been off since the first replication went fine and the system went into production. The snapshots sent to Backup were encrypted and could be unlocked via the password that I used on Tank. I then enabled inheritance via the UI (equivalent to zfs change-key -i) on all datasets and zvols on Backup. I then locked the Backup pool. Replicated snapshots kept flying in. Today, a few weeks later, I randomly decided to unlock the backup pool and bam: datasets couldn't be mounted and the zvols seem to be unusable too.
Can you elaborate a bit on this?
I'll have maybe 4 systems I want to keep in sync with ZFS replication. Let
Initially, the other 3 servers do not have this filesystem structure. I'll need to replicate this to the other 3 servers.
And now whenever a snapshot occurs on any of my datasets, my replication can push incremental snapshots to the servers.
But now in the future, all of a sudden we have a new project! So on the main server we create
So that's where I understand I might use
And aside from any data loss risk, it's extra complexity to have to do these extra operations. And some servers you might not want to ever load-key on (that's a major benefit of ZFS encryption: having an "untrusted" server that never sees the keys). So my workaround idea is to create, say, 1000 datasets.
and then only do the initial replication once. I would use zfs properties or maybe a privileged text file containing the map of dataset names to actual filesystem names, and use some script to sort out mountpoints. (No need to mount the empty slots.)
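A rough sketch of that pre-created-slots idea (all names are hypothetical, and it assumes everything below tank/enc inherits its encryption):
# one encryption root; children inherit its key
echo "examplepass" | zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt tank/enc
for i in $(seq -w 1 1000); do zfs create -o canmount=off -o mountpoint=none "tank/enc/slot$i"; done
# single initial raw replication of the whole tree, so no later zfs create / full send is needed
zfs snapshot -r tank/enc@init
zfs send -Rw tank/enc@init | zfs recv -u backup/enc
# afterwards, only raw incrementals
zfs snapshot -r tank/enc@next
zfs send -Rw -i @init tank/enc@next | zfs recv -u backup/enc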
Sorry for spam @systemofapwne, FYI I accidentally smashed Send midway through typing my last response and had to spend another 5 mins editing it. So if you read it via email it won't make sense.
I understand. That is almost my situation, but for a subtle difference:
I see what you suggest here. TBH, I myself wouldn't mind so much if I had to use
@systemofapwne thanks for explaining your situation. I re-read the original issue, which I realize is more like your situation. I think what I missed/confused when reading it before was: yes, they are using raw sends, but the replica encryption root is not the same (different IV set), and they are doing the
I.e. here the src and replica encryption roots are created differently:
So I appreciate this is still a valid issue, but I wonder why make the encryption roots different initially, and why not raw send
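A quick way to spot the mismatch described here is to compare the encryption roots on both sides (dataset names taken from the example above; encryptionroot is a standard read-only property):
zfs get -r -H -o name,value encryptionroot src/encrypted
zfs get -r -H -o name,value encryptionroot replica/encrypted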
Exactly!
I completely understand how you avoid this problem now and why your strategy works. The problem here is that people without an in-depth insight into ZFS encryption inheritance (and replication messing with it) easily fall into this trap, like I and others did. My current mitigation strategy is similar to this one, which basically is what you stated: use the same encryptionroot. Luckily, I can easily change my production pool to this setup and redo a replication to my backups, which should then not break. My current pool setup that breaks encryption when replicating:
My future setup:
On a test system, all datasets replicated from Tank/zroot/* -> Backup/zroot/* now seem to remain functional, since they share the same encryptionroot of Tank/zroot and Backup/zroot respectively.
NOTE:
NOTE2:
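One way to double-check that layout after replication (using the dataset names above) is to list the encryption roots of everything under Backup/zroot; per the observation above, each replicated dataset should report Backup/zroot here:
zfs get -r -H -o name,value encryptionroot Backup/zroot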
For the original "permission denied" issue, I think I have a couple of solutions that work so the data can be mounted again.

Setup to reproduce the bug:
dd if=/dev/zero of=/root/zpool bs=1M count=4096
zpool create testpool /root/zpool -m /mnt/testpool
echo "12345678" | zfs create -o canmount=off -o encryption=on -o keylocation=prompt -o keyformat=passphrase testpool/enc
zfs create testpool/enc/data
zfs snapshot -r testpool/enc@1
the output is
Oops, this may give
Now we have the
still permission denied.

Solution 1

This involves sending an incremental change from the source encryption root after the key has changed, which may not be possible if for some reason you didn't keep common snapshots on the source.
zfs snapshot testpool/enc@newkey
zfs send -wi @1 testpool/enc@newkey | zfs recv testpool/enc_copy
zfs mount -a
# still denied. reload keys
zfs unload-key testpool/enc_copy
echo "87654321" | zfs load-key testpool/enc_copy
zfs mount -a
zfs mount
This results in everything mounting:
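An optional sanity check at this point (encryptionroot and keystatus are standard zfs properties; dataset names are the ones from the setup above): both datasets should now report an available key and, in this setup, share testpool/enc_copy as their encryption root.
zfs get -H -o name,property,value encryptionroot,keystatus testpool/enc_copy testpool/enc_copy/data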
Solution 2

Send a completely new copy of the encryption root with the new key. Luckily this is probably an empty dataset.
zfs send -w testpool/enc@1 | zfs recv testpool/enc_copy2
We need to transplant the broken dataset over to this, and then inherit the encryption root. The following doesn't work:
"cannot rename 'testpool/enc_copy/data': cannot move encrypted child outside of its encryption root" We could always zfs send it, but that is costly. The solution I found is to use the crypt command Current encryptionroots:
Now force new key
Now we can move it
Final step is to force inherit
Now if we load the key, mounting works
output:
the bad one can be deleted
Maybe there are some other solutions based on similar ideas. Note the
Same issue here, with kernel 5.15.143-1-pve (Proxmox) and zfsutils-linux 2.1.14-pve1: the dataset fails to mount with permission denied after reboot. This dataset had a somewhat complicated history, as described above:
I suspected an encryption key issue and tried to change the key on the non-mounting replica, and got this nice OOPS (this one was with kernel 5.15.126, but the same happens on all versions I tested: 5.15.107, 5.15.126, 5.15.131, 5.15.143). @digitalsignalperson, am I right in understanding that your solution 2 only works if you still have both the source and the replica? So if I am left with only the replica, I have no option to recover my data?
Coincidentally, I also just ran into this issue and thought I lost a bunch of data. Managed to recover thanks to @digitalsignalperson's advice, and I figured it's worth noting that Solution 2 also works for partially broken streams. In my case, I still had the original source of the dataset, but only intermittent access (it would crash every couple of minutes). However, the dataset created by even the first couple of MBs of the stream was enough. I think that this might also come in handy for folks that do not have an empty encryption root available.
This was posted in a comment to #12000. I was asked to open up a new bug report.
Just started using ZoL with native encryption and think I have hit the same or a similar bug (related to #6624 as well).
All good at this point. Everything works as expected. Now, do an incremental send:
Again, all good. Now unmount/mount:
Yikes! This appears to have corrupted 10TB of backup filesystems. I've been trying to recover from this but no luck so far.
If I don't run change-key, then I can send incrementals, unmount, and mount with no problem (I just have to enter the password twice). If I run change-key, then unmount/mount is still no problem. It's when I run change-key and then send an incremental snapshot that the filesystem seems to be rendered unmountable.

After running change-key and sending an incremental, once the filesystem is unmounted it can't be mounted again. It looks like the encryption root absolutely has to be replicated to prevent this from happening. If I replicate the encryption root then everything works as expected.

I may have also uncovered another bug while trying to recover from this. If I run
zfs change-key -o keylocation=prompt -o keyformat=passphrase replica/encrypted/a
then, after entering the new passwords, the command hangs forever due to a panic. I have to completely reset the system.

I've tested this on (all x86_64):
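The original command blocks are not preserved in this rendering; a hedged reconstruction of the sequence described in this report (dataset names taken from the report, flags and the change-key target assumed) would be roughly:
# initial raw recursive replication of the encrypted tree
zfs snapshot -r src/encrypted@1
zfs send -Rw src/encrypted@1 | zfs recv replica/encrypted
# re-wrap the replica under its own passphrase (assumed to target the replica's encryption root)
zfs change-key -o keylocation=prompt -o keyformat=passphrase replica/encrypted
# raw incremental of the child after further snapshots on the source
zfs snapshot -r src/encrypted@2
zfs send -w -i @1 src/encrypted/a@2 | zfs recv replica/encrypted/a
# everything still works until the key is unloaded...
zfs unmount replica/encrypted/a
zfs unload-key -r replica/encrypted
zfs load-key replica/encrypted
zfs mount replica/encrypted/a    # fails with permission denied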