RFC: Possible architecture simplification #119
Comments
Well, the primary reason for pursuing the current logic where we have an instance of the DB ("snapshot") in each worker process besides the main instance of the DB is that:
The issue with the left-over call stack in a forked process - that's something I haven't yet looked into in much detail. But I still believe there must be a way to get rid of it. Of course, this requires some support from the event loop library too - at least the possibility to clean up/close all the inherited resources. (If we wanted to have only one DB, without real snapshots, we might as well put a DB file inside a ramfs like /run and then open/map it in each worker process, taking proper locks when reading/updating the DB. So we could probably even avoid the RPC. But there are those requirements, as noted above, which led me to follow the "separate snapshot DB for each worker" way.)
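For illustration, a rough sketch of that parenthesized alternative (invented names, not actual SID code): a single DB file on a ramfs/tmpfs like /run, opened and mapped by each worker, with an advisory lock taken around access:

```c
/* Rough sketch (invented names) of the alternative mentioned above:
 * one DB file placed on a ramfs/tmpfs such as /run, opened and mapped
 * by each worker, with an advisory lock taken around reads/updates. */
#include <fcntl.h>
#include <sys/file.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_shared_db(const char *path, size_t size, int *fd_out)
{
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        void *p;

        if (fd < 0)
                return NULL;

        if (flock(fd, LOCK_EX) < 0) {   /* serialize with other workers */
                close(fd);
                return NULL;
        }

        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        /* ... read/update records, then munmap() and flock(fd, LOCK_UN) ... */

        *fd_out = fd;
        return p == MAP_FAILED ? NULL : p;
}
```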
I think the disparity here is that we have differing views about what a "DB" is. When I see "DB", I think of an actual transactional database management system (MySQL, PostgreSQL, Oracle, MS SQL, etc.). When I read #118 (comment), I see that this "DB" does not fit what I consider an actual database. I now understand the problems you're trying to work around by using something like tkrzw.
An actual database provides "read consistency"; in fact, many support a number of different levels/variations of this.
With an actual database you don't specify locks. That's handled for you, with whatever is needed to fulfill the constraints of the DB. Better DB implementations provide row-level lock granularity, but you need not worry about it.
Again, not needed with an actual database.
In database design, this would violate 1st normal form.
IMHO it's still better to avoid this situation to begin with. There are small embeddable DBs that can give you a lot of functionality without a lot of overhead, which I believe would solve all of the things you're trying to work around by using tkrzw. For example, SQLite. Perhaps something to consider?
OK, we can call it a "store" if you prefer and if that's less confusing. Actually, the wrapping resource in SID is called that way.
Sure, that is a clear thing in theory... But I need to consider "write" as well. And here, inside usual databases, either the whole database is locked or more granular locks are held, but still locking at least the record itself. And then anyone else trying to access it would simply need to wait for the lock to be released.

The processing of an event (which is always related to a single device) in SID consists of several phases, and in each phase, different modules can process records (add, update, remove). The set of records which can be changed by a certain module is quite limited. If there's a record which is shared out of a module with other modules, it is protected by ownership, or the operation itself just adds/removes the single device that is currently being processed to/from a shared list. The only exception are future devices that a certain subsystem is waiting on based on scanned metadata - but they're not real devices yet, just something we expect to appear and wait for its real event. Events for the same device can never be processed in parallel, they are always queued - this is assured even at a lower level.

While the processing in SID takes several phases, switching among modules to execute, we need to consider this a long-running operation with respect to the store access. Now, each module can read and also write. These writes must be visible within the processing of the single concrete event, during all the phases, among all the modules, and still separated from the processing of events for any other device that may be happening in parallel. I don't want to take a long-running "write" transaction (which, in turn, inside the DB, would take the form of a lock) that would block others - I really can't do this. So to resolve this, I either need to:
I decided to take the "snapshot + collecting changes + syncing at the end of processing" approach. I use an in-memory store that is automatically snapshotted after fork. And that is, in its essence, a COW, so quite efficient. Note that we do have a way to detect whether there was an incoming change between starting the snapshot and syncing the collected changes - there are indicators for this (event sequence numbers, generation numbers I'm wrapping each record with...). So we can still detect that there was a change in between and react to it (in the worst case, repeating the processing of the event). But as I said, the set of records which can be changed by processing the one single concrete event for a concrete device is limited, so that situation is minimized.
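For illustration, a minimal C sketch of that idea - names like `kv_record`, `kv_update` and `kv_sync_update` are made up here, not SID's actual API. Each record carries a generation number, the worker remembers the generation it saw in its snapshot, and the sync step only applies an update if the generation hasn't moved in the meantime (otherwise the event would be reprocessed):

```c
/* Minimal sketch (hypothetical names) of "snapshot + collect changes + sync":
 * the worker is forked, so its in-memory store is a COW snapshot; it records
 * updates locally and the parent applies them only if the record's generation
 * number hasn't moved since the snapshot was taken. */
#include <stdbool.h>
#include <string.h>

struct kv_record {
        char            key[64];
        char            value[256];
        unsigned long   gen;            /* bumped on every committed update */
};

struct kv_update {
        char            key[64];
        char            value[256];
        unsigned long   snap_gen;       /* generation seen in the snapshot  */
};

/* Called in the main process when a worker sends back its collected changes.
 * Returns false if another event updated the record in the meantime, in which
 * case the event would be reprocessed against a fresh snapshot. */
static bool kv_sync_update(struct kv_record *main_rec, const struct kv_update *upd)
{
        if (main_rec->gen != upd->snap_gen)
                return false;           /* concurrent change detected */

        strncpy(main_rec->value, upd->value, sizeof(main_rec->value) - 1);
        main_rec->gen++;
        return true;
}
```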
Of course. I meant what the DB itself would take internally to perform the transaction. I do need to worry about how the DB does this and how it would lock, because I have events coming in parallel/being processed in parallel, and I really can't afford to wait for one transaction to complete before starting another one - I can't serialize this way - that goes against the nature of the event processing. The worst case is with databases which simply lock the complete DB while a write transaction is held. Simply, we need to look at this from a different perspective than the usual rigid DB transactional and relational way. Sure, some databases support more granular locks, but the locks are still there and usually, such DBs with fine-grained locking are the more advanced ones which are not really embeddable with a tiny footprint.
...yeah, as noted above, I didn't mean taking the lock myself, I meant the database itself taking a global lock...
I'm not considering the use of a relational database. Also, the values which I store can be either single or vector values. There's an API in SID's
Yeah, that would be ideal. Unfortunately, I wasn't aware that the systemd event loop library can't do at least the cleanup after fork. I started using it because I'm already linking against udev (which is in systemd's hands too), so it seemed natural to me to use that event loop library. And it had all the hooks I needed. Then I noticed there was some quirk when trying to call these cleanups, but I had to continue with other things and not spend much time inspecting and delving into this. Then, when I had time for that, I noticed that it refuses to clean up after fork because all the functions check the PID under which the structure got created against the current PID. For me, just closing the inherited FDs after fork and resetting the state would be enough, exiting the call stack properly. Which is, I think, something it should be able to do... (Then, just out of curiosity, I looked at how udevd does it, because it spawns processes too. There, the main loop uses the systemd event loop and then, at a certain point with a considerable call stack, it forks. It keeps the call stack in the child process, as it got inherited, and creates its own new simple epoll (or whatever else it was) event loop there. So even a project like that doesn't care about the inherited call stack - which I'm not saying I like, it's just that they hit their own restrictions, I think...)
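For reference, a rough sketch of the pattern described above for udevd - the forked child simply abandons the inherited sd-event loop and builds its own plain epoll instance instead of trying to make sd-event clean itself up after fork (names simplified, error handling omitted):

```c
/* Rough sketch of the udevd-style pattern: the parent keeps running its
 * sd-event loop; a forked child ignores the inherited loop (which refuses
 * to be driven from another PID) and sets up its own minimal epoll. */
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

static int spawn_worker(void)
{
        pid_t pid = fork();

        if (pid == 0) {
                /* child: build a fresh, minimal event loop of its own */
                int efd = epoll_create1(EPOLL_CLOEXEC);

                /* ... close inherited FDs, register the worker's FDs on efd,
                 *     run an epoll_wait() loop ... */
                (void) efd;
                _exit(0);
        }

        return pid > 0 ? 0 : -1;
}
```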
I considered sqlite at the very beginning, well, because I was raised with relational transactional databases like that and it was just the first thing I could think of. But as I realized later, the rigid form of a relational DB doesn't quite suit my needs. The records I need to store don't have rigid pre-defined schemas. For example, each module can store any information, and the design would need to count on modules providing their own DB schemas. Also, values can be vectors (sets/lists...) of different sizes. And that would get really complicated. The key-value store is something that fits this much better (...well, a graph database would probably fit even more, but when trying to look for some small embeddable one, I wasn't really successful). Also, you have a much better chance of really finding an embeddable small-footprint DB in the area of KV "stores".
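Just to illustrate the shape of the data (made-up types, not SID's actual API): a value under a single key may be one item or a vector of differently-sized items, which is awkward to model in a fixed relational schema:

```c
/* Illustrative sketch (invented types) of why a key-value store fits better
 * here: a value can be a single blob or a vector of items of varying size,
 * without forcing a fixed relational schema on every module. */
#include <stddef.h>

struct kv_item {
        const void      *data;
        size_t           size;
};

struct kv_value {
        unsigned         item_count;    /* 1 for a single value, >1 for a vector */
        struct kv_item   items[];       /* flexible array of items */
};
```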
The hash was used as a KV store backend for the prototype, with the intention for it to be replaced/extended by a backend providing other needed features (like those range selections now). The project is evolving in this step-by-step manner, because the development is done by me and just a few (1-2) volunteers who come (...and go :)).
SQLite supports a BLOB type; you could put whatever you wanted in it to allow modules the ability to store arbitrary data.
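A minimal sketch of that suggestion, assuming a single table keyed by (module, key) with the payload in a BLOB column - the table layout and function name here are invented for illustration:

```c
/* Sketch of storing arbitrary per-module data as a BLOB in SQLite.
 * Table and column names are made up for illustration. */
#include <sqlite3.h>

static int store_blob(sqlite3 *db, const char *module, const char *key,
                      const void *data, int size)
{
        sqlite3_stmt *stmt;
        int r;

        sqlite3_exec(db,
                     "CREATE TABLE IF NOT EXISTS kv"
                     " (module TEXT, key TEXT, value BLOB,"
                     "  PRIMARY KEY (module, key))",
                     NULL, NULL, NULL);

        r = sqlite3_prepare_v2(db,
                               "INSERT OR REPLACE INTO kv VALUES (?1, ?2, ?3)",
                               -1, &stmt, NULL);
        if (r != SQLITE_OK)
                return r;

        sqlite3_bind_text(stmt, 1, module, -1, SQLITE_STATIC);
        sqlite3_bind_text(stmt, 2, key, -1, SQLITE_STATIC);
        sqlite3_bind_blob(stmt, 3, data, size, SQLITE_STATIC);

        r = sqlite3_step(stmt);
        sqlite3_finalize(stmt);

        return r == SQLITE_DONE ? SQLITE_OK : r;
}
```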
It's a little more advanced than that, but not as advanced as the larger, more complex implementations which have low-level locking granularity. I'm having a hard time wrapping my head around all the focus on concurrency concerns. This problem seems IO-bound to me, and having anything block for any length of time would only be caused by holding a transaction open while interrogating disk signatures, which I think you could avoid. My initial thought is that this problem would be well suited to an async/await solution.
...and with that, I'd be losing most of the benefits of relational database - it'll boil down to key-value store - which is what I'm using right now.
Ah, yes, I remember reading this once... But still, the locking is there. But that is perfectly fine and understandable. It's just not what I'm looking for.
It's not just about holding a transaction while the disk is scanned, and actually it's not only about the scanning part (though that one is important). The infrastructure in SID aims to provide a consistent view throughout all the processing phases of a single event, where we switch into different modules. Scanning is only one part of that, but there might be logic attached which can result in storing/accessing the info and sharing it in between calling out to modules (e.g. blkid does a basic device scan and writes the result to the store (to the snapshot only at this moment); based on this info we switch into a module handling the specific subsystem, which in turn can do more scanning of the current device or read associated config and, based on that, write more information (again, to the snapshot), and/or read other properties which in turn can trigger further actions; and at the very end, we take the collected changes and sync them with the main store). So I don't want the store state to be interleaved among the events being processed. It's one event --> one snapshot of the store --> one set of changes with respect to the state at which the snapshot was taken --> one sync point for the changes collected while processing the event throughout all phases and modules. This makes it clear to follow and lets you see the change of state in one go for each event.
You mean initiating a scan from a module and then waiting for that while other modules can do other things? Then, a bit later, receiving results and writing them to the store? Well, the thing is, the information could be passed from one module to another, or the records can be a deciding point for the next steps to take (like that blkid info). That would be hard to manage, if possible at all. If it were just the scanning and storing and nothing else, then OK, but there are other actions associated with the results, chained in a way...
I'm referring to: https://en.wikipedia.org/wiki/Async/await . However, that requires language support.
After looking at #114 and reading comments on that thread, I'm wondering if something like this would work?
This prevents multiple instances of the DB, having to merge updates between database instances, duplicating the entire systemd event loop, the problem of cleaning up the call stack, etc. Concurrency on the DB is managed via the parent process handling the RPC API interface. Also, depending on how the API is designed/written, you could possibly even get away without requiring a full ACID-compliant DB. If the persistence part is only for hints, I think this could be workable.
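As a very rough sketch of how that could look (message layout and names are invented here): the parent owns the only store and each worker sends get/set requests over a socketpair() created before fork(), so all access is serialized by the parent:

```c
/* Rough sketch of the single-DB-behind-RPC idea: the parent owns the only
 * store; each worker talks to it over a socketpair() created before fork().
 * The message layout and names are invented for illustration only. */
#include <string.h>
#include <unistd.h>

enum kv_op { KV_GET, KV_SET };

struct kv_msg {
        enum kv_op      op;
        char            key[64];
        char            value[256];
};

/* worker side: ask the parent to set a key and wait for its reply */
static int kv_rpc_set(int sock, const char *key, const char *value)
{
        struct kv_msg msg = { .op = KV_SET };

        strncpy(msg.key, key, sizeof(msg.key) - 1);
        strncpy(msg.value, value, sizeof(msg.value) - 1);

        if (write(sock, &msg, sizeof(msg)) != (ssize_t) sizeof(msg))
                return -1;

        /* the parent serializes all requests, so no locking is needed here */
        return read(sock, &msg, sizeof(msg)) == (ssize_t) sizeof(msg) ? 0 : -1;
}
```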