RFE: document why lvm2 has to revalidate the metadata for every command #74
We have seen too many problems over the history of Linux & lvm2 development that having this validation always ON was never a big deal. It protects us from releasing bad code, and has caught various bugs at various stages (kernel, virtualization, users providing us with invalid hand-made data...). Of course, if the user is processing metadata of massive size, a configurable knob to enable or skip deep validation could be considered as an optional feature. However, lvm2 was never really designed to work on multi-MiB metadata sets, so cutting off the validation time is not a major time saver either.
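The configurable knob mentioned above could be sketched like this minimal Python model. The function name, the CRC-based cheap check, and the structural check are all illustrative assumptions, not lvm2's actual validation code:

```python
import zlib

def load_metadata(raw: bytes, stored_crc: int, deep_validate: bool = True) -> bytes:
    """Cheap checksum check always runs; deep structural validation is
    optional. A hypothetical sketch of the knob discussed above -- lvm2's
    real validation is far more involved (cross-checking PVs, LVs, segments).
    """
    if zlib.crc32(raw) != stored_crc:
        raise ValueError("metadata checksum mismatch")
    if deep_validate:
        # Stand-in for expensive structural checks whose cost scales
        # with metadata size.
        if not raw.startswith(b"vg0 {") or not raw.rstrip().endswith(b"}"):
            raise ValueError("metadata structure invalid")
    return raw

blob = b"vg0 { lv0 { segments = 1 } }"
assert load_metadata(blob, zlib.crc32(blob)) == blob
```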
Would it be possible to document this in …?
Document what, exactly? That lvm2 slows down as metadata size increases? But patches are always welcome to enhance readability and understanding.
That lvm2 can’t perform the seemingly obvious optimizations I have asked about on the mailing list and on GitHub, such as caching the metadata between commands in the shell.
This is not easy to answer. The historical primary goal of lvm2 has always been bulletproof correctness, so whenever lvm2 drops the VG lock, after reacquiring it we always reload and validate all data directly from disk (and you probably wouldn't believe how many kinds of disk errors various VM machines have suffered from...). So 'rapid fire' of lvm2 commands was never a primary goal, where a temporary cache of live metadata in RAM would be the primary source of VG info (instead of the content written on disk). I could see some potential use cases, but so far there has never been any 'customer' behind such a request. Also, if the data are not validated after being obtained from disk, the possible risk of big data loss gets bigger, although there can be environments where such a risk could be taken to run commands at a much higher rate.
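The lock-drop/reload cycle described above can be sketched as a toy model with an in-memory "disk". All names here are invented for illustration; this is not lvm2's code:

```python
import threading

class FakeDisk:
    """Stands in for the on-disk VG metadata area."""
    def __init__(self, metadata: dict):
        self.metadata = dict(metadata)

def validate(md: dict) -> dict:
    # Stand-in for lvm2's full structural validation pass.
    if "vg_name" not in md:
        raise ValueError("invalid VG metadata")
    return md

def run_command(lock: threading.Lock, disk: FakeDisk, mutate) -> dict:
    """Every command reacquires the lock and re-reads from disk.
    A RAM copy from a previous command is never trusted, because another
    process may have changed the VG while the lock was dropped."""
    with lock:
        md = validate(dict(disk.metadata))  # fresh read + validation
        mutate(md)
        disk.metadata = md                  # write back before unlocking
    return md

lock = threading.Lock()
disk = FakeDisk({"vg_name": "vg0", "lv_count": 0})
bump = lambda md: md.__setitem__("lv_count", md["lv_count"] + 1)
run_command(lock, disk, bump)
run_command(lock, disk, bump)   # pays the full read+validate cost again
assert disk.metadata["lv_count"] == 2
```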
That makes a LOT of sense. I was hoping for a “never drop the lock” mode, but at that point I am not sure if dm-thin is the right choice. For instance, it has a limit of 2^24 transactions.
Certainly, if you do plan to write a disk management system to control a universe, it's a seriously limiting factor, but at the moment we have never faced any problems with this limitation in any real-life customer case. Surely for a higher range of transactions you would need to find some other product (although I admit I'd be curious to know about them myself and how they compare in performance). Also note there are other limiting factors, like fragmentation of data and metadata with the current thin-pool format v1.5, if you plan to make such a massive deployment (some of these are going to be addressed in the upcoming v2.0).
Has there been any work on this that I can see?
The best I can advise is to look at https://github.com/jthornber, but it will still take a couple of months before this gets fully upstream...
I see. Will that fix QubesOS/qubes-issues#3244?
Well, in the first place, I do not exactly understand what your exact problem with the 24-bit value is. Of course the size of the thin-pool metadata is a far bigger problem, and that limitation should get better with the newer format, as well as better support for chunk sizes smaller than 64k.
According to the reporter, there was an unchecked overflow, which would result in other data being corrupted. Also, what if thin 0 was still in use when the counter wrapped? It’s not too hard to imagine this situation, since often the first thin is the root filesystem, which will always be in use.
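The scale of the 2^24 limit and the wraparound concern above can be illustrated with plain arithmetic (this is not dm-thin's actual code, just a model of an unchecked 24-bit counter):

```python
TIME_BITS = 24
TIME_MASK = (1 << TIME_BITS) - 1  # 16,777,215 is the largest 24-bit value

def advance(t: int) -> int:
    """An unchecked 24-bit counter wraps silently back to zero."""
    return (t + 1) & TIME_MASK

assert TIME_MASK == 16_777_215
assert advance(TIME_MASK) == 0  # wraparound: the new value compares as "older"
```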
This will be very helpful for Qubes OS. My suspicion is that Qubes OS is a very break-sharing-intensive workload, especially with a CoW filesystem inside the guest. Right now, initializing writes seem to be slower than writes to preallocated space by a factor of 2 or more in a trivial …
Each thin volume has its DeviceID (24-bit), but that's about it. It keeps the list of associated chunks; a snapshot is just like any other thin LV, the only difference between a 'regular' thin LV and a snapshot thin LV being that it starts with pre-populated mappings. Of course, it's up to your user-space app to ensure you are not using an existing DeviceID when you make a new thin LV, but as said, having >2^24 thin LVs in a single thin-pool is unsupported; if that's what you want, you need to seek another solution. And yes, block provisioning IS expensive; the added value of thin-pool usage does not come for free... Also, if you are doing 'dd' benchmarking, setting a proper buffer size and direct writes is all that matters.
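The user-space responsibility mentioned above, never reusing a live DeviceID, can be sketched like this (a hypothetical helper for illustration, not lvm2 code):

```python
MAX_DEV_ID = (1 << 24) - 1  # dm-thin DeviceIDs are 24-bit

def allocate_device_id(in_use: set[int]) -> int:
    """Pick an unused DeviceID. The kernel does not police reuse, so
    userspace (e.g. lvm2) must track which IDs are live in the pool."""
    if len(in_use) > MAX_DEV_ID:
        raise RuntimeError("thin-pool DeviceID space (2^24) exhausted")
    for candidate in range(MAX_DEV_ID + 1):
        if candidate not in in_use:
            return candidate

assert allocate_device_id({0, 1, 2}) == 3
```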
The DeviceID isn’t what I am referring to. I am referring to the timestamp that dm-thin uses internally to know if it needs to break sharing.
Each thin LV has some mapping B-tree; as soon as it needs to write to any of its chunks, it needs to own that chunk exclusively. Not sure how the timestamp may affect this...
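The idea behind the timestamp question above can be modeled roughly as follows. This is a hedged sketch of the concept, not dm-thin's actual implementation: a chunk mapped before the device was last snapshotted may still be shared with a snapshot, so a write must copy it first, and a wrapping 24-bit time field would break a naive comparison:

```python
TIME_MASK = (1 << 24) - 1  # assumption: a 24-bit 'time' field per mapping

def needs_break_sharing(mapping_time: int, snapshotted_time: int) -> bool:
    """Sketch: a chunk mapped before the last snapshot may be shared,
    so writing it requires copying first (breaking sharing). A plain
    '<' comparison like this would misbehave once the counter wraps
    past 2^24 -- which is the overflow concern raised earlier."""
    return mapping_time < snapshotted_time

assert needs_break_sharing(mapping_time=5, snapshotted_time=7)       # may be shared
assert not needs_break_sharing(mapping_time=7, snapshotted_time=7)   # exclusively owned
```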
lvm2 having to revalidate metadata for every command is highly non-obvious, and not understanding the reasons behind it leads to confusion as to why lvm2’s shell mode doesn’t perform seemingly obvious optimizations.
#65 (comment) has some explanation, but I would prefer this to be in the lvm2 documentation.