RFE: document why lvm2 has to revalidate the metadata for every command #74
We have seen too many problems over the history of Linux & lvm2 development that having this validation always ON was never a big deal. It protects us from releasing bad code, and has caught various bugs at various stages (kernel, virtualization, users providing us with invalid hand-made data...). Of course, if the user is processing metadata of massive size, a configurable knob to enable or skip deep validation could be considered as an optional feature. However, lvm2 was never really designed to work on multi-MiB metadata sets, so cutting off the validation time is not a major time saver either.
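The configurable knob mentioned above could be sketched like this minimal Python model. The function name, the CRC-based cheap check, and the structural check are all illustrative assumptions, not lvm2's actual validation code:

```python
import zlib

def load_metadata(raw: bytes, stored_crc: int, deep_validate: bool = True) -> bytes:
    """Cheap checksum check always runs; deep structural validation is
    optional. A hypothetical sketch of the knob discussed above -- lvm2's
    real validation is far more involved (cross-checking PVs, LVs, segments).
    """
    if zlib.crc32(raw) != stored_crc:
        raise ValueError("metadata checksum mismatch")
    if deep_validate:
        # Stand-in for expensive structural checks whose cost scales
        # with metadata size.
        if not raw.startswith(b"vg0 {") or not raw.rstrip().endswith(b"}"):
            raise ValueError("metadata structure invalid")
    return raw

blob = b"vg0 { lv0 { segments = 1 } }"
assert load_metadata(blob, zlib.crc32(blob)) == blob
```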
Would it be possible to document this in …?
Document what, exactly? That lvm2 slows down as metadata size increases? But patches are always welcome to enhance readability and understanding.
That lvm2 can’t perform the seemingly obvious optimizations I have asked about on the mailing list and on GitHub, such as caching the metadata between commands in the shell.
This is not easy to answer. The historical primary goal of lvm2 has always been bulletproof correctness, so whenever lvm2 drops the VG lock, after reacquiring it we always reload and validate all data directly from disk (and you probably wouldn't believe how many kinds of disk errors various VM machines have suffered from...). So 'rapid fire' of lvm2 commands was never a primary goal, where a temporary cache of live metadata in RAM would be the primary source of VG info (instead of the content written on disk). I could see some potential use cases, but so far there has never been any 'customer' behind such a request. Also, if the data are not validated after being obtained from disk, the possible risk of big data loss gets bigger, although there can be environments where such a risk could be taken to run commands at a much higher rate.
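The lock-drop/reload cycle described above can be sketched as a toy model with an in-memory "disk". All names here are invented for illustration; this is not lvm2's code:

```python
import threading

class FakeDisk:
    """Stands in for the on-disk VG metadata area."""
    def __init__(self, metadata: dict):
        self.metadata = dict(metadata)

def validate(md: dict) -> dict:
    # Stand-in for lvm2's full structural validation pass.
    if "vg_name" not in md:
        raise ValueError("invalid VG metadata")
    return md

def run_command(lock: threading.Lock, disk: FakeDisk, mutate) -> dict:
    """Every command reacquires the lock and re-reads from disk.
    A RAM copy from a previous command is never trusted, because another
    process may have changed the VG while the lock was dropped."""
    with lock:
        md = validate(dict(disk.metadata))  # fresh read + validation
        mutate(md)
        disk.metadata = md                  # write back before unlocking
    return md

lock = threading.Lock()
disk = FakeDisk({"vg_name": "vg0", "lv_count": 0})
bump = lambda md: md.__setitem__("lv_count", md["lv_count"] + 1)
run_command(lock, disk, bump)
run_command(lock, disk, bump)   # pays the full read+validate cost again
assert disk.metadata["lv_count"] == 2
```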
That makes a LOT of sense. I was hoping for a “never drop the lock” mode, but at that point I am not sure if dm-thin is the right choice. For instance, it has a limit of 2^24 transactions.
Certainly, if you do plan to write a disk management system to control a universe, it's a seriously limiting factor, but at the moment we have never faced any problems with this limitation in any real-life customer case. Surely for a higher range of transactions you would need to find some other product (although I admit I'd be curious to know about them myself and how they compare in performance). Also note there are other limiting factors, like fragmentation of data and metadata with the current thin-pool format v1.5, if you plan to make such a massive deployment (some of these are going to be addressed in the upcoming v2.0).
Has there been any work on this that I can see?
The best I can advise is to look at https://github.com/jthornber, but it will still take a couple of months before this gets fully upstream...
I see. Will that fix QubesOS/qubes-issues#3244?
Well, in the first place, I do not exactly understand what your exact problem with the 24-bit value is. Of course the size of the thin-pool metadata is a far bigger problem, and that limitation should get better with the newer format, as well as better support for chunk sizes smaller than 64k.
According to the reporter, there was an unchecked overflow, which would result in other data being corrupted. Also, what if thin 0 was still in use when the counter wrapped? It’s not too hard to imagine this situation, since often the first thin is the root filesystem, which will always be in use.
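The scale of the 2^24 limit and the wraparound concern above can be illustrated with plain arithmetic (this is not dm-thin's actual code, just a model of an unchecked 24-bit counter):

```python
TIME_BITS = 24
TIME_MASK = (1 << TIME_BITS) - 1  # 16,777,215 is the largest 24-bit value

def advance(t: int) -> int:
    """An unchecked 24-bit counter wraps silently back to zero."""
    return (t + 1) & TIME_MASK

assert TIME_MASK == 16_777_215
assert advance(TIME_MASK) == 0  # wraparound: the new value compares as "older"
```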
This will be very helpful for Qubes OS. My suspicion is that Qubes OS is a very break-sharing-intensive workload, especially with a CoW filesystem inside the guest. Right now, initializing writes seem to be slower than writes to preallocated space by a factor of 2 or more in a trivial …
Each thin volume has its DeviceID (24-bit), but that's about it. It keeps the list of associated chunks; a snapshot is just like any other thin LV, the only difference between a 'regular' thin LV and a snapshot thin LV being that it starts with pre-populated mappings. Of course, it's up to your user-space app to ensure you are not using an existing DeviceID when you make a new thin LV, but as said, having >2^24 thin LVs in a single thin-pool is unsupported; if that's what you want, you need to seek another solution. And yes, block provisioning IS expensive; the added value of thin-pool usage does not come for free... Also, if you are doing 'dd' benchmarking, setting a proper buffer size and direct writes is all that matters.
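The user-space responsibility mentioned above, never reusing a live DeviceID, can be sketched like this (a hypothetical helper for illustration, not lvm2 code):

```python
MAX_DEV_ID = (1 << 24) - 1  # dm-thin DeviceIDs are 24-bit

def allocate_device_id(in_use: set[int]) -> int:
    """Pick an unused DeviceID. The kernel does not police reuse, so
    userspace (e.g. lvm2) must track which IDs are live in the pool."""
    if len(in_use) > MAX_DEV_ID:
        raise RuntimeError("thin-pool DeviceID space (2^24) exhausted")
    for candidate in range(MAX_DEV_ID + 1):
        if candidate not in in_use:
            return candidate

assert allocate_device_id({0, 1, 2}) == 3
```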
The DeviceID isn’t what I am referring to. I am referring to the timestamp that dm-thin uses internally to know if it needs to break sharing.
Each thin LV has some mapping B-tree; as soon as it needs to write to any of its chunks, it needs to own that chunk exclusively. Not sure how the timestamp may affect this...
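The idea behind the timestamp question above can be modeled roughly as follows. This is a hedged sketch of the concept, not dm-thin's actual implementation: a chunk mapped before the device was last snapshotted may still be shared with a snapshot, so a write must copy it first, and a wrapping 24-bit time field would break a naive comparison:

```python
TIME_MASK = (1 << 24) - 1  # assumption: a 24-bit 'time' field per mapping

def needs_break_sharing(mapping_time: int, snapshotted_time: int) -> bool:
    """Sketch: a chunk mapped before the last snapshot may be shared,
    so writing it requires copying first (breaking sharing). A plain
    '<' comparison like this would misbehave once the counter wraps
    past 2^24 -- which is the overflow concern raised earlier."""
    return mapping_time < snapshotted_time

assert needs_break_sharing(mapping_time=5, snapshotted_time=7)       # may be shared
assert not needs_break_sharing(mapping_time=7, snapshotted_time=7)   # exclusively owned
```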
lvm2 having to revalidate metadata for every command is highly non-obvious, and not understanding the reasons behind it leads to confusion as to why lvm2’s shell mode doesn’t perform seemingly obvious optimizations.
#65 (comment) has some explanation, but I would prefer this to be in the lvm2 documentation.