
test using lmdb as the kv store #114

Closed
wants to merge 1 commit

Conversation

trgill commented Jul 21, 2021

No description provided.

trgill commented Jul 21, 2021

@prajnoha this isn't ready for review (doesn't work) - but I thought you might want to have a glance to see if you think the approach has potential. I think it might be something we could work out.

The database requires name/value pairs, so we'd need to rework the KV_STORE_VALUE_REF uses. We'd need to switch to storing the values in contiguous memory so they can be stored in lmdb. The keys work fine, but I'm still working out storing/decoding the values.

I think the child processes could make updates to the KV pairs and the parent could refresh to get the updates.

I added a couple tests for the kv_store. Might be worth you having a look to make sure the tests make sense and test how you intended types F and E to work.

The tests are failing because the Jenkins nodes don't have lmdb-devel installed. I thought I'd check to make sure you think this is a possibility before updating the test systems.

prajnoha commented:

Thanks, I'll have a look...

prajnoha commented Jul 23, 2021

> The database requires name/value pairs, so we'd need to rework the KV_STORE_VALUE_REF uses. We'd need to switch to storing the values in contiguous memory so they can be stored in lmdb. The keys work fine, but I'm still working out storing/decoding the values.

I see...

The KV_STORE_VALUE_REF is currently used only in places where we call kv_store_set_value with _kv_delta update function. The reason is to avoid a copy of the value when creating the struct kv_store_value wrapper around the value (so we simply reference the value inside struct kv_store_value instead of copying the whole value there). The reason we use a reference here and not a copy is because the _kv_delta callback will create a new struct kv_store_value wrapper anyway.

I mean, the sequence here is this:

  1. calling kv_store_set_value(..., value=X, KV_STORE_VALUE_VECTOR | KV_STORE_VALUE_REF, _kv_delta, ...)
  2. creating the kv_store_value wrapper with a reference to value X
  3. calling hash_update --> _hash_update_fn --> _kv_delta
  4. _kv_delta replacing the kv_store_value that holds the reference to X with the final vector, hence creating a completely new kv_store_value (either adding to or removing from any existing vector value, e.g. the value ending up as A, B, C, X when adding X to existing A, B, C)

So the kv_store_value created in step 2 is just a temporary one; we'll end up with the final value after executing _kv_delta in step 4.

Right now, we don't make use of KV_STORE_VALUE_REF in any other scenario.

What we could simply do here is state that we do not support the KV_STORE_VALUE_REF flag for certain backends. The hash backend would support it, but the lmdb backend would not. The price we'd pay for not supporting it is the extra value copy (allocation) even for temporary items (like step 2 described above), which are then discarded (freed) shortly afterwards (step 4 above).

KV_STORE_VALUE_REF makes sense only if the reference stays within the current process. If we ever used the reference in another process, it wouldn't make sense, because the reference points into the original process's memory. For this to work somehow, the memory region from which the allocation was made and into which we hold a reference would always need to be contiguous in every process, and the references would need to be offsets from the start of that region, so only the region's start address would change per process (...very similar to how "PIC - position independent code" works in shared libs, just applied to data, not code). IOW, supporting references across process boundaries would be a bit complex. So I think it's OK to say we simply don't support KV_STORE_VALUE_REF with certain backends, because with the way such a backend is implemented internally, the referenced value wouldn't even stay within the same process.
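
(Just to illustrate that "PIC for data" idea - this is not in the patch and all names below are made up - the reference stored in the db would be an offset from the start of a contiguous region, rebased in whatever process maps that region:)

```c
#include <stdint.h>

/* Hypothetical sketch: a "reference" kept in the db is an offset from the
 * start of a contiguous memory region, not an absolute pointer. */
typedef uint64_t kv_ref_off_t;

/* Turn a pointer within the region into a position-independent offset. */
static inline kv_ref_off_t kv_ref_to_offset(const void *region_base, const void *p)
{
	return (kv_ref_off_t) ((const char *) p - (const char *) region_base);
}

/* Turn the offset back into a pointer, using this process's mapping of the region. */
static inline void *kv_offset_to_ref(void *region_base, kv_ref_off_t off)
{
	return (char *) region_base + off;
}
```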

> I think the child processes could make updates to the KV pairs and the parent could refresh to get the updates.

Hmm, a few questions come to my mind here...

Looking at the patch, I noticed you always allocate new key/value when returning it from kv_store_get_value/kv_store_iter_next. Can we simply return the pointer directly to the db record here or are there any risks related to that? (...is this because lmdb might move the records around internally when new records are added and so the original pointer won't be valid anymore?)

Another thing I'm thinking about are the transactions. I assume the db is shared among processes so that the other process opens the database of the same name and when a transaction is started, the "snapshot" is created (somehow internally, using the mmap I suppose).

But in the patch, we create a separate transaction for each kv_store_set/get_value call. I think what we need instead is for us to be able to create a transaction within which we would be able to change several records at once - so more kv_store_set/get_value calls in between mdb_txn_begin and mdb_txn_commit (...well, I suppose that once we call mdb_txn_commit, the db record is not in the "snapshot" anymore, but gets written to the "main" database, right?).
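
Roughly, grouping several updates under one transaction with the raw lmdb API would look like this (just a sketch, not the kv-store wrapper; error handling trimmed):

```c
#include <lmdb.h>
#include <string.h>

/* Sketch: several record updates committed atomically in a single lmdb transaction. */
static int update_records_in_one_txn(MDB_env *env, MDB_dbi dbi)
{
	const char *keys[]   = {"key1", "key2"};
	const char *values[] = {"value1", "value2"};
	MDB_txn    *txn;
	MDB_val     key, data;
	int         r, i;

	if ((r = mdb_txn_begin(env, NULL, 0, &txn)) != MDB_SUCCESS)
		return r;

	for (i = 0; i < 2; i++) {
		key.mv_data  = (void *) keys[i];
		key.mv_size  = strlen(keys[i]);
		data.mv_data = (void *) values[i];
		data.mv_size = strlen(values[i]) + 1;

		/* None of these changes is visible to other transactions yet. */
		if ((r = mdb_put(txn, dbi, &key, &data, 0)) != MDB_SUCCESS) {
			mdb_txn_abort(txn);
			return r;
		}
	}

	/* Only the commit makes the whole batch visible in the "main" database. */
	return mdb_txn_commit(txn);
}
```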

If that's the case, then we probably need to change the kv_store interface a bit - adding "transaction_begin" and "transaction_commit" hooks (for the hash backend, the "transaction_begin" is actually the fork itself and "transaction_commit" would be the functionality that's currently around this code:

if ((r = _send_fd_over_unix_comms(sid_buffer_get_fd(ucmd_ctx->exp_buf), conn->fd)) < 0)

Hope my comments make sense...

I believe we can use lmdb as a backend, we just need to find a good way to map the functionality we need (...getting close to what we have with the hash backend and fork for the snapshot and transaction, with more than one set/get call inside...).

I'm still thinking about this, so these are just my preliminary thoughts/questions related to this... We'll surely discuss this a little bit more.

> I added a couple tests for the kv_store. Might be worth you having a look to make sure the tests make sense and test how you intended types F and E to work.

Yup, the tests do look like they test the essence of those types...

> The tests are failing because the Jenkins nodes don't have lmdb-devel installed. I thought I'd check to make sure you think this is a possibility before updating the test systems.

Sure...

prajnoha commented:

(...while looking at the kv code, I realized we could do a minor cleanup here: #115 )

trgill commented Jul 26, 2021

> Looking at the patch, I noticed you always allocate new key/value when returning it from kv_store_get_value/kv_store_iter_next. Can we simply return the pointer directly to the db record here or are there any risks related to that? (...is this because lmdb might move the records around internally when new records are added and so the original pointer won't be valid anymore?)

Yeah, here is a small snippet of the lmdb documentation about returned memory. It looks like they'll release or reuse any returned memory when the transaction is completed.

Note
    The memory pointed to by the returned values is owned by the database. The caller need not dispose of the memory, and may not modify it in any way. For values returned in a read-only transaction any modification attempts will cause a SIGSEGV. 
    Values returned from the database are valid only until a subsequent update operation, or the end of the transaction. 
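
In other words, with the per-call transactions we have now, the value has to be copied out before the transaction ends - roughly this pattern (plain lmdb calls, error handling trimmed):

```c
#include <lmdb.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: mdb_get() returns a pointer straight into the database mapping,
 * valid only while the transaction is alive, so we duplicate the value
 * before ending the transaction if the caller needs it afterwards. */
static void *get_value_copy(MDB_env *env, MDB_dbi dbi, const char *k, size_t *size)
{
	MDB_txn *txn;
	MDB_val  key  = {.mv_size = strlen(k), .mv_data = (void *) k};
	MDB_val  data;
	void    *copy = NULL;

	if (mdb_txn_begin(env, NULL, MDB_RDONLY, &txn) != MDB_SUCCESS)
		return NULL;

	if (mdb_get(txn, dbi, &key, &data) == MDB_SUCCESS) {
		/* data.mv_data points into the mmap - copy it out while the txn is open. */
		if ((copy = malloc(data.mv_size)))
			memcpy(copy, data.mv_data, data.mv_size);
		*size = data.mv_size;
	}

	mdb_txn_abort(txn); /* for a read-only txn, abort just releases the snapshot */
	return copy;
}
```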

trgill commented Jul 26, 2021

> What we could simply do here is state that we do not support the KV_STORE_VALUE_REF flag for certain backends. The hash backend would support it, but the lmdb backend would not. The price we'd pay for not supporting it is the extra value copy (allocation) even for temporary items (like step 2 described above), which are then discarded (freed) shortly afterwards (step 4 above).

> KV_STORE_VALUE_REF makes sense only if the reference stays within the current process. If we ever used the reference in another process, it wouldn't make sense, because the reference points into the original process's memory. For this to work somehow, the memory region from which the allocation was made and into which we hold a reference would always need to be contiguous in every process, and the references would need to be offsets from the start of that region, so only the region's start address would change per process (...very similar to how "PIC - position independent code" works in shared libs, just applied to data, not code). IOW, supporting references across process boundaries would be a bit complex. So I think it's OK to say we simply don't support KV_STORE_VALUE_REF with certain backends, because with the way such a backend is implemented internally, the referenced value wouldn't even stay within the same process.

I see. I hadn't considered that we might want to support both the hash table and lmdb as possible back ends. For SID, there is value in having the prior boot's device mapping available to the modules, right? In the case of the hash table back end, it wouldn't be available? Or would we also implement saving to disk for the hash table?

trgill commented Jul 26, 2021

> (...while looking at the kv code, I realized we could do a minor cleanup here: #115 )

Thanks. Once you merge it, I'll rebase.

prajnoha commented Jul 27, 2021

> Yeah, here is a small snippet of the lmdb documentation about returned memory. It looks like they'll release or reuse any returned memory when the transaction is completed.
>
> Note
>     The memory pointed to by the returned values is owned by the database. The caller need not dispose of the memory, and may not modify it in any way. For values returned in a read-only transaction any modification attempts will cause a SIGSEGV.
>     Values returned from the database are valid only until a subsequent update operation, or the end of the transaction.

...so as long as we're inside a transaction, we should be OK to use the returned value directly without creating a copy. We could then start the transaction at the very beginning of processing in the worker and end/commit it at the end of the processing (...around the code in _cmd_handler where we currently call _build_kv_buffer and worker_channel_send for the diffs in the hash-backed db in the worker to get synced with the db in the main process).

I wonder if lmdb is OK with such longer-running transactions, but I assume it is - all transactions should be working on top of a snapshot in LMDB. But that's something we need to make sure of... I haven't looked in more detail at whether it poses any restrictions and/or limitations on the "main db" while a transaction/snapshot is active.

Then supporting the vector values, where we might append or remove items from an existing vector, is another thing to check - whether we can map it onto LMDB somehow... This is actually the part I'm not yet sure how to map onto lmdb, because with the hash we use the kv_store_update_fn_t callback (declared in kv-store.h) to decide whether the old value should be kept as is, completely overwritten, or just edited (e.g. we have old value X1={A}, new value X2={B} and we want to end up with X3={A, B} actually written as the new value in the db when adding an item to a vector).
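
One (hand-wavy) way that update-callback logic might map onto lmdb is a read-modify-write inside a single write transaction: read the old value, let a callback playing the role of _kv_delta build the merged value, then put the result back. A rough sketch with the merge callback left abstract:

```c
#include <lmdb.h>
#include <string.h>

/* Hypothetical callback: builds the merged value (e.g. {A} + {B} -> {A, B})
 * into a buffer that stays valid until the mdb_put below is done. */
typedef int (*kv_merge_fn_t)(const void *old_data, size_t old_size,
                             const void *new_data, size_t new_size,
                             void **out_data, size_t *out_size, void *arg);

static int kv_merge_put(MDB_env *env, MDB_dbi dbi, const char *k,
                        const void *new_data, size_t new_size,
                        kv_merge_fn_t merge, void *merge_arg)
{
	MDB_txn *txn;
	MDB_val  key = {.mv_size = strlen(k), .mv_data = (void *) k};
	MDB_val  old = {0}, merged;
	void    *out;
	size_t   out_size;
	int      r;

	if ((r = mdb_txn_begin(env, NULL, 0, &txn)) != MDB_SUCCESS)
		return r;

	r = mdb_get(txn, dbi, &key, &old);
	if (r != MDB_SUCCESS && r != MDB_NOTFOUND)
		goto fail;

	/* The callback decides the final value, like _kv_delta does for the hash. */
	if ((r = merge(r == MDB_SUCCESS ? old.mv_data : NULL, old.mv_size,
	               new_data, new_size, &out, &out_size, merge_arg)) < 0)
		goto fail;

	merged.mv_data = out;
	merged.mv_size = out_size;
	if ((r = mdb_put(txn, dbi, &key, &merged, 0)) != MDB_SUCCESS)
		goto fail;

	return mdb_txn_commit(txn);
fail:
	mdb_txn_abort(txn);
	return r;
}
```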

prajnoha commented:

> I see. I hadn't considered that we might want to support both the hash table and lmdb as possible back ends. For SID, there is value in having the prior boot's device mapping available to the modules, right? In the case of the hash table back end, it wouldn't be available? Or would we also implement saving to disk for the hash table?

I'm considering the kv-store as general purpose "resource" so it can be reused in other (or future) parts of SID and/or modules to store information aside if needed and that might be used besides the main database SID uses. So I'd still like to keep the hash back end in...

I mean, we can still have kv-store support both the hash and lmdb backends (or even another back end if we ever needed one), just the hash wouldn't support the persistence/flushing-to-disk part. The lmdb-backed kv-store would be used for the main SID db because we want that persistence, while the hash-backed kv-store could still be used for quick lookups, caches etc. in other parts of the code/modules, for temporary kv-stores...

trgill commented Jul 28, 2021

> Another thing I'm thinking about are the transactions. I assume the db is shared among processes so that the other process opens the database of the same name and when a transaction is started, the "snapshot" is created (somehow internally, using the mmap I suppose).

Yes - records are not written to the database until mdb_txn_commit().

> But in the patch, we create a separate transaction for each kv_store_set/get_value call. I think what we need instead is for us to be able to create a transaction within which we would be able to change several records at once - so more kv_store_set/get_value calls in between mdb_txn_begin and mdb_txn_commit (...well, I suppose that once we call mdb_txn_commit, the db record is not in the "snapshot" anymore, but gets written to the "main" database, right?).

Right. We can easily expand the scope of the transactions to include more updates.

> If that's the case, then we probably need to change the kv_store interface a bit - adding "transaction_begin" and "transaction_commit" hooks (for the hash backend, the "transaction_begin" is actually the fork itself and "transaction_commit" would be the functionality that's currently around this code:
>
> if ((r = _send_fd_over_unix_comms(sid_buffer_get_fd(ucmd_ctx->exp_buf), conn->fd)) < 0)
>
> Hope my comments make sense...

Yes, I think I understand. Do you think we should add a kv_database.c and put the interface there? I've been attempting to restore the hashtable back end to the kv_store along with adding the lmdb and it quickly gets complicated.

I'm thinking it will get more complicated when we start using the functionality of the database to sync key/value pairs across processes.

Maybe I'm missing something? I can push my attempts to github on another branch if that is helpful.

prajnoha commented Jul 30, 2021

> > If that's the case, then we probably need to change the kv_store interface a bit - adding "transaction_begin" and "transaction_commit" hooks (for the hash backend, the "transaction_begin" is actually the fork itself and "transaction_commit" would be the functionality that's currently around this code:
> >
> > if ((r = _send_fd_over_unix_comms(sid_buffer_get_fd(ucmd_ctx->exp_buf), conn->fd)) < 0)
> >
> > Hope my comments make sense...
>
> Yes, I think I understand. Do you think we should add a kv_database.c and put the interface there? I've been attempting to restore the hashtable back end to the kv_store along with adding the lmdb and it quickly gets complicated.

You mean creating an internal interface? For example like we have for buffers where there's either linear or vector buffer with internal interface on top and then calling specific functions based on type? That way we have the code in separate files to handle each type (...in case of kv-store, that would be separate file for each backend).

> I'm thinking it will get more complicated when we start using the functionality of the database to sync key/value pairs across processes.

I see, it might get complicated a bit if we kept the code to handle all backends in one place, so maybe including that "internal interface" would help, I hope.

When it comes to syncing key/value pairs across processes with the hash backend, we probably won't support the "transaction begin" and "commit" directly there, because that functionality (currently) depends on the fork (for creating the snapshot) and own syncing mechanism and this functionality is simply layered on top of kv-store itself (that's why it's currently handled in ubridge code).

Practically, it means that calling "transaction begin" and "commit" would return something like "not supported" (-ENOTSUP) for the hash backend. Of course, unless we added support for transactions/snapshots somehow even for the hash (but in a different way than we do it now, not depending on the fork itself - because creating workers is a separate action and it shouldn't be related to database actions... the only reason it's currently bound to the fork is that it was easy to make use of fork's COW feature and I didn't need to implement it on my own).

But if we had another backend with direct support for transactions (like lmdb), we wouldn't need to bother about adding transactions to hash, we'd just use the lmdb backend instead. And the hash backend would stay there for another purpose - for creating simple look-aside db with kv pairs if needed.

The advantage is that we're using the same interface for all the backends. Just certain more advanced functionality (like those transactions) doesn't necessarily need to be supported by each backend. But we could still store vectors, mark values either as copies or references, merge values into a vector before storing, automatically free a value if it is marked that way etc. with the kv-store interface.

With this, if we can find a way to make use of lmdb as a backend, the code in ubridge.c that's handling the db snapshotting and syncing should get simpler (I hope), because we'd just call "begin transaction" after we fork a new worker and then "commit transaction" at the end of processing in the worker...
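
Something like an internal per-backend ops table could express that - a very rough idea only, all names made up; the hash backend would simply leave the transaction hooks returning -ENOTSUP:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical internal backend interface for the kv-store resource;
 * each backend (hash, lmdb, ...) fills in what it actually supports. */
struct kv_store_backend_ops {
	int   (*init)(void *backend_ctx, const void *params);
	void  (*destroy)(void *backend_ctx);
	void *(*set_value)(void *backend_ctx, const char *key, void *value, size_t size);
	void *(*get_value)(void *backend_ctx, const char *key, size_t *size);
	int   (*unset_value)(void *backend_ctx, const char *key);
	int   (*transaction_begin)(void *backend_ctx);
	int   (*transaction_commit)(void *backend_ctx);
};

/* The hash backend would not implement transactions at all. */
static int _hash_transaction_begin(void *backend_ctx)
{
	(void) backend_ctx;
	return -ENOTSUP;
}
```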

trgill commented Aug 3, 2021

> I'm considering the kv-store as general purpose "resource" so it can be reused in other (or future) parts of SID and/or modules to store information aside if needed and that might be used besides the main database SID uses. So I'd still like to keep the hash back end in...

@prajnoha I'm looking at writing a test for the 'kv-store' - currently it is coupled with other parts of SID, right? I added test_kvstore_iterate() to tests/test_kv_store.c. It is difficult to create a kv-store without other parts of SID (I think). Am I missing something?

Edit: I now see that the resource is only coupled with SID if the .init field is set to _init_sid.

I'd like to have a test for the hashtable back end to the kv-store, so I can quickly test the lmdb code to make sure I haven't broken either interface.

I'm having some trouble using the following... but I'm still walking through the debugger to make sure a kvstore is initialized properly. Maybe I should add an _init_kvstore?

const sid_resource_type_t sid_resource_type_sid = {
	.name            = "testkvstore",
	.with_event_loop = 1,
	.with_watchdog   = 1,
	.init            = NULL,
};

prajnoha commented Aug 4, 2021

(Sorry, was away for the last 2 days.)

With the snippet of code you pasted above, you'd be actually defining a completely new resource type ("class"). But what we need, is to create an instance ("object") of kv_store with sid_resource_create, like we do here:

sid/src/resource/ubridge.c

Lines 4434 to 4443 in f2f0456

if (!(kv_store_res = sid_resource_create(internal_res,
                                         &sid_resource_type_kv_store,
                                         SID_RESOURCE_RESTRICT_WALK_UP,
                                         MAIN_KV_STORE_NAME,
                                         &main_kv_store_res_params,
                                         SID_RESOURCE_PRIO_NORMAL,
                                         SID_RESOURCE_NO_SERVICE_LINKS))) {
	log_error(ID(res), "Failed to create main key-value store.");
	goto fail;
}

Then you'd get the resource instance which you can pass to all the kv-store interface functions (the ones defined in kv-store.h).

The 5th arg that is passed there (&main_kv_store_res_params) are the parameters for creating an instance of that type. In case of sid_resource_type_kv_store it's:

struct sid_kv_store_resource_params {
	kv_store_backend_t backend;
	union {
		struct kv_store_hash_backend_params hash;
	};
};

...so stating what backend the kv-store resource instance should use and then what are the parameters for that particular backend. In case of hash it's just the initial hash size:

struct kv_store_hash_backend_params {
	size_t initial_size;
};

...but we can add more params for other backends if needed. In case we created a new lmdb backend in addition to the existing hash backend, we would probably end up with:

typedef enum
{
        KV_STORE_BACKEND_HASH,
        KV_STORE_BACKEND_LMDB,
} kv_store_backend_t;

struct kv_store_lmdb_backend_params
{
  ...whatever we find good to be parametrized for lmdb backend...
}

struct sid_kv_store_resource_params {
        kv_store_backend_t backend;
        union {
                struct kv_store_hash_backend_params hash;
                struct kv_store_lmdb_backend_params lmdb;
        };
};

(Of course, the internals of _init_kv_store, which is the resource's "init" hook, need to recognize which backend it's being asked to use and, based on that, switch among the different backend initializations.)
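
So a test could create a standalone instance roughly like this (just a sketch based on the snippet above - the NULL parent and the exact constants would need checking against the real resource API):

```c
struct sid_kv_store_resource_params params = {
	.backend = KV_STORE_BACKEND_HASH,
	.hash    = {.initial_size = 32}, /* arbitrary size for a test */
};

sid_resource_t *kv_store_res;

/* Assuming a NULL parent is acceptable for a standalone test instance. */
if (!(kv_store_res = sid_resource_create(NULL,
                                         &sid_resource_type_kv_store,
                                         SID_RESOURCE_RESTRICT_WALK_UP,
                                         "testkvstore",
                                         &params,
                                         SID_RESOURCE_PRIO_NORMAL,
                                         SID_RESOURCE_NO_SERVICE_LINKS)))
	return -1;

/* ...then pass kv_store_res to the kv-store.h functions under test... */
```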

prajnoha commented Aug 4, 2021

(...and yes, then for testing, you only need to create the kv-store instance as a unit; we don't need all the other parts like the daemon itself and the other resources which, when glued together, make up the whole daemon. This was actually one of the goals for "resources" - to have the code clearly separated so we can test it separately too... and, of course, also to be able to reuse such units in the code.)

prajnoha commented Aug 4, 2021

When it comes to kv-store use in SID daemon itself, there are 3 layers:

  • the backend (hash, ...)
  • the kv-store resource (value wrapped as struct kv_store_value supporting VECTOR/REF/AUTOFREE flags, storing size of the value itself and internal flags for internal management)
  • more abstraction in ubridge.c (value wrapped as struct kv_value that also records event sequence numbers and KV_PERSISTENT/PROTECTED/PRIVATE/RESERVED flags for access control)

trgill commented Nov 23, 2021

@prajnoha

  • LMDB requires the database connection be closed by the child process after a fork(). The child needs to reestablish a unique connection to the database (see the sketch after this list).
  • The ordering of database updates is not guaranteed unless we implement some type of cross-process locking/ordering. I think this is true for both the database and the hash table. Currently, since we always fork new processes, the chance of a transaction being applied out of order is very small - but when we re-use the processes there will be a bigger window.
  • It isn't clear to me that there is a good way to unwind the systemd call stack in the children.
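
For the first point, the worker setup would boil down to something like this right after the fork (plain lmdb calls, error handling trimmed):

```c
#include <lmdb.h>

/* Sketch: after fork(), the child doesn't reuse the parent's MDB_env -
 * it drops the inherited handle and opens its own connection. */
static MDB_env *reopen_env_in_child(MDB_env *inherited_env, const char *db_path)
{
	MDB_env *env = NULL;

	/* Close the handle inherited from the parent without using it. */
	mdb_env_close(inherited_env);

	if (mdb_env_create(&env) != MDB_SUCCESS)
		return NULL;

	if (mdb_env_open(env, db_path, 0, 0600) != MDB_SUCCESS) {
		mdb_env_close(env);
		return NULL;
	}

	return env;
}
```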

I started to look at using a thread pool rather than forking the workers. Is that something you'd consider? I think you've said "no" to that, but I'm having a hard time finding your response.

prajnoha commented:

>   • LMDB requires the database connection be closed by the child process after a fork(). The child needs to reestablish a unique connection to the database.

That is a bit inconvenient, but unless there's a tangible performance degradation, I think it's still feasible...

>   • The ordering of database updates is not guaranteed unless we implement some type of cross-process locking/ordering. I think this is true for both the database and the hash table. Currently, since we always fork new processes, the chance of a transaction being applied out of order is very small - but when we re-use the processes there will be a bigger window.

When it comes to ordering, there's one thing we already need to take into account - the uevents are first processed in udev, and it does that in its own worker processes where the udev rules are evaluated (one of those rules being the usid scan that passes the uevent on to SID). That means we can receive reordered uevents in SID (when looking at the uevents' seqnos).

Now, these udevd workers are running in parallel, BUT udevd already implies some restrictions:

  • udevd queues a uevent for a device where any previous uevent processing hasn't been finished yet
  • also queues uevents for a device where a uevent for any parent/child device is still being processed (e.g. disk and its partitions)

Which means that for a single device, we will never receive reordered uevents - we need to finish processing the current one before starting to process another for the same device (...and usid scan is synchronous, hence udevd waits for it to complete before it continues processing another udev rule in sequence).

Looking at SID and its records in db, we have 4 namespaces:

  • UDEV which is per device and because of the restriction already posed by udevd, we never update the same UDEV record at once.

  • DEVICE which is per device too, hence the same restriction applies as for UDEV.

  • MODULE which is per module, so here we can get into a situation where the same record might be updated from parallel-running udevd (and hence also SID) processes. But since it's a per-module record, it's always handled by the exact same module. And the module can apply its own locks/ordering checks if needed.

  • GLOBAL which is, well, global, so it can be updated from any module on any device's event. This one is the most dangerous, with the highest risk of uncoordinated updates to the db. Honestly, I've been thinking about removing this namespace and relying just on the previous ones where it's much easier to control the access to the db... it depends on whether any module will ever need this kind of GLOBAL record and whether it can simply be replaced by wisely chosen MODULE type record(s).

I've been trying to direct the access to db in a way (by restricting the API fns) so that we only add/remove/update records that are only related to a device for which we're just processing the event. That means, we shouldn't be allowed to change records which are completely unrelated. Whether a device is related to another device(s) is given by placing the devices in a group. Then on each uevent, we could only:

  • change device-related records only if this is a device for which we're currently processing a uevent
  • the only operation permitted for other devices' records would be placing such devices in a group (say, we have a disk that is an LVM PV that belongs to a VG and based on the VG metadata we know that we expect more devices - either they have already been recognized or they are about to come - this is a part where we need to match against current db state)

So what I mean to say is that instead of using locks for a db, I'm trying to direct (and restrict) the db usage so that it's controlled and tries to avoid conflicts from concurrent writes. And if there are cases where the conflicts are unavoidable, we still have a possibility to hook a module for a solution - we have the callback on db updates where we can possibly call out to a module to decide which record takes precedence or whether any other action should be taken (e.g. we get different metadata info from different PVs - like with the out of sync VG seqnos - then we can call out to LVM to resolve the conflict coming from the inconsistency before we actually store any records in SID db).

>   • It isn't clear to me that there is a good way to unwind the systemd call stack in the children.

Yeah, that's a pain. I don't know either. Right now, we have three ways:

  • using a template process without systemd's sd-event hooks which would fork itself on a request from main process and the child would then initialize all the sd-event loop machinery
  • finding a different event loop implementation that can deal with forking (at least cleaning up after the fork)
  • or simply keep the things uncleaned after fork (I think this is exactly what udevd does)

> I started to look at using a thread pool rather than forking the workers. Is that something you'd consider? I think you've said "no" to that, but I'm having a hard time finding your response.

I think it was here, but very short: #33 (comment)

Simply, besides making use of COW-based memory after fork for practically creating a snapshot for the in-memory db, it also seems to me much safer to have the module code running in a separate process. I think we'll have more control over resources used that way (e.g. we can very easily kill if we detect timeouts/long-running ones; set various limits etc...). Also, with threads, we'd need to be very careful when accessing the structures ("resources") as they would all be shared and it would be harder to confine the module code. I'm not saying threads are not a possibility, it just seems more complex to me, hence more prone to bugs...

@trgill trgill changed the title test using lmdb as the kv store - not ready for review test using lmdb as the kv store Jan 5, 2022
trgill commented Jan 5, 2022

@prajnoha I think it would make sense for you to have a high level look at the code. The tests work - but it is a lot of change.

Any feedback is welcome. Just want to make sure I'm headed in the right direction.

prajnoha commented Jan 5, 2022

Sure, I'll have a look. Thanks!

prajnoha commented Jan 7, 2022

I think we need to consider using longer-running transactions. Right now, we begin and end the lmdb transaction right in the kv_store_{set,get}_value_db function itself. This means we have a consistent view only during that exact function call. However, we need the consistent view for the whole processing within a worker, passing through all the phases and call-outs to modules - that is practically the whole _cmd_handler, and in the concrete case of the usid scan, the whole _cmd_exec_scan:

sid/src/resource/ubridge.c

Lines 3406 to 3421 in 449e21d

for (phase = CMD_SCAN_PHASE_A_INIT; phase <= CMD_SCAN_PHASE_A_EXIT; phase++) {
	log_debug(ID(exec_arg->cmd_res), "Executing %s phase.", _cmd_scan_phase_regs[phase].name);
	ucmd_ctx->scan_phase = phase;
	if (_cmd_scan_phase_regs[phase].exec(exec_arg) < 0) {
		log_error(ID(exec_arg->cmd_res), "%s phase failed.", _cmd_scan_phase_regs[phase].name);
		/* if init or exit phase fails, there's nothing else we can do */
		if (phase == CMD_SCAN_PHASE_A_INIT || phase == CMD_SCAN_PHASE_A_EXIT)
			return -1;
		/* otherwise, call out modules to handle the error case */
		if (_cmd_scan_phase_regs[CMD_SCAN_PHASE_ERROR].exec(exec_arg) < 0)
			log_error(ID(exec_arg->cmd_res), "error phase failed.");
	}
}

Now, it looks like lmdb supports as many concurrent readers as we need, as well as concurrent readers alongside one writer. The tricky part is that while processing the phases, usually both the core and the modules need to write/update/remove records in the db as results of scans and processing. That practically means we'd need to take a "write" transaction for the whole processing. Considering that we could have more than one worker running in parallel, there could certainly be more than one request for a "write" transaction during a period of time - lmdb would serialize that, I suppose. The second and all the other requests would block/queue.

So, if possible, we need to find out how to deal with this in a way that keeps lmdb use feasible for us.
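
To make the reader/writer behaviour concrete (raw lmdb, error handling trimmed): any number of MDB_RDONLY transactions can run in parallel, each on its own consistent snapshot, while a second write transaction simply blocks until the current one commits or aborts:

```c
#include <lmdb.h>

/* Many workers can hold a read-only snapshot like this at the same time;
 * the view stays consistent until the transaction is ended. */
static int with_read_snapshot(MDB_env *env, MDB_dbi dbi,
                              int (*process)(MDB_txn *txn, MDB_dbi dbi, void *arg), void *arg)
{
	MDB_txn *txn;
	int      r;

	if ((r = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn)) != MDB_SUCCESS)
		return r;

	r = process(txn, dbi, arg); /* all mdb_get calls inside see one consistent view */

	mdb_txn_abort(txn);         /* for a read-only txn, this just releases the snapshot */
	return r;
}

/* Write transactions are serialized by lmdb itself. */
static int exclusive_write(MDB_env *env, MDB_dbi dbi, MDB_val *key, MDB_val *data)
{
	MDB_txn *txn;
	int      r;

	/* This call blocks while another write transaction is in progress. */
	if ((r = mdb_txn_begin(env, NULL, 0, &txn)) != MDB_SUCCESS)
		return r;

	if ((r = mdb_put(txn, dbi, key, data, 0)) != MDB_SUCCESS) {
		mdb_txn_abort(txn);
		return r;
	}

	return mdb_txn_commit(txn);
}
```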

With the hash table, we're using a snapshot of the db because of the COW-based memory nature after forking a new worker. Then we allow additions/edits/removals in the snapshot and we're actually collecting changes from the snapshot along the way and at the end of processing, we send those changes over to main process to synchronize them with the main db. So that can be considered as a kind of a "write batch". When synchronizing the changes with the main db, we are able to do additional checks before we actually update the records (that is the update_fn hook for the backend's hash_update call), for example, we can check sequence numbers and other properties that may help us to order the writes (not saying we have that 100% complete as of now, but we're prepared for that).

With lmdb, we still need to find a way how to:
A) allow for changes ("writing") to be still possible in the snapshot and yet not blocking others
B) have a way to chime in before the actual update in main db is done so we can decide about the order of writes and/or validity of the update itself

This is something that I suppose lmdb doesn't support directly, so we need to find our way here on top of it.

It might be that we would only take "read" transactions in the workers - we could still read the db in a consistent way and we could have several snapshots in parallel (because "read-only" transactions/snapshots don't block each other). Then any changes would need to be written aside somehow (either into a new db instance or anything else). But this feels a bit complicated - we'd have two databases in a worker, one with the frozen snapshot (the "read-only" db transaction) and one with the "changes" - then if we wanted to look up a value while doing further processing in the worker, we'd need to look in the "changes" first and, if not found there, in the "read-only" db. It looks really fragile; maybe there's a better way...
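
For illustration, that lookup path would end up something like this (entirely hypothetical code, just to show where it gets fragile - removals, for example, would need tombstone entries in the overlay):

```c
#include <lmdb.h>
#include <string.h>

/* Hypothetical per-worker state: a read-only lmdb snapshot plus a local
 * overlay holding the worker's own uncommitted changes. */
struct worker_db {
	MDB_txn *ro_txn;  /* MDB_RDONLY transaction held for the whole worker run */
	MDB_dbi  dbi;
	void    *changes; /* local hash (or similar) with pending adds/edits/removals */
};

/* Assumed helper for the local overlay lookup - not a real SID function. */
void *changes_lookup(void *changes, const char *key, size_t *size);

static void *worker_db_get(struct worker_db *db, const char *k, size_t *size)
{
	MDB_val key = {.mv_size = strlen(k), .mv_data = (void *) k};
	MDB_val data;
	void   *v;

	/* 1) pending local changes take precedence... */
	if ((v = changes_lookup(db->changes, k, size)))
		return v;

	/* 2) ...otherwise fall back to the frozen read-only snapshot. */
	if (mdb_get(db->ro_txn, db->dbi, &key, &data) == MDB_SUCCESS) {
		*size = data.mv_size;
		return data.mv_data; /* valid only while ro_txn stays open */
	}

	return NULL;
}
```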

That said, I understand why lmdb (as well as other databases) do not allow more than one writer at a time. Thing is that our use case is narrowed down when it comes to database updates - as I described above, we have those 4 namespaces, where the same record in UDEV and DEVICE namespace will never be updated at the same time - hence we avoid conflicts from parallel writes. The MODULE namespace is in the hand of the module itself so if really needed, it can use its own locks (or we can provide an API for the module to do so). So we need to specialize the db for our needs here a bit.

prajnoha commented Jan 7, 2022

As for the code, just a few things I noticed:

  • I think we don't need to define the kv store with the lmdb backend as a completely new resource. We can just go with passing the backend type in the params for sid_resource_create and then remember it inside (which we already do). Any further calls to the kv store are then redirected to the proper backend. So we can keep kv-store-db and kv-store-ht internal, while exposing only the KV_STORE_BACKEND_HASH and KV_STORE_BACKEND_LMDB types with their params for resource creation (sid_kv_store_resource_params). And we'd have only one const sid_resource_type_t sid_resource_type_kv_store that covers all the backends.

  • The _create_kv_store_value should be shared if there's no change (or only a minimal one we can parametrize the function for).

  • I noticed that the patches expose struct kv_store, struct kv_store_value, struct kv_update_fn_relay and struct kv_store_iter - these should be internal and hidden from the kv-store's external interface (so not part of kv-store.h).

prajnoha commented Jan 28, 2022

Todd, in the meantime, before we sculpt lmdb for our purpose, I'm trying to make use of tkrzw. This is just to replace the hash with a btree that has ordered keys and hence it would allow us to search for key ranges. That is, for us to be able to iterate records with keys that have certain prefix (e.g. give me all records belonging to certain device, that is, having D:<dev_id> prefix; or all records belonging to a module, that is having M:<mod_name> prefix and so on).
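
(For comparison, this is the kind of prefix/key-range iteration an ordered backend buys us - shown here with the lmdb cursor API since that's what we've been discussing; the b-tree case is the same idea:)

```c
#include <lmdb.h>
#include <string.h>

/* Sketch: iterate all records whose key starts with the given prefix,
 * relying on the backend keeping keys ordered. */
static int iterate_prefix(MDB_txn *txn, MDB_dbi dbi, const char *prefix,
                          void (*cb)(MDB_val *key, MDB_val *data, void *arg), void *arg)
{
	MDB_cursor *cur;
	size_t      prefix_len = strlen(prefix);
	MDB_val     key        = {.mv_size = prefix_len, .mv_data = (void *) prefix};
	MDB_val     data;
	int         r;

	if ((r = mdb_cursor_open(txn, dbi, &cur)) != MDB_SUCCESS)
		return r;

	/* Position at the first key >= prefix, then walk while the prefix still matches. */
	for (r = mdb_cursor_get(cur, &key, &data, MDB_SET_RANGE);
	     r == MDB_SUCCESS;
	     r = mdb_cursor_get(cur, &key, &data, MDB_NEXT)) {
		if (key.mv_size < prefix_len || memcmp(key.mv_data, prefix, prefix_len))
			break;
		cb(&key, &data, arg);
	}

	mdb_cursor_close(cur);
	return (r == MDB_SUCCESS || r == MDB_NOTFOUND) ? MDB_SUCCESS : r;
}
```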

tkrzw doesn't have any support for snapshots or anything of the sort, so we'd still be relying on the fork to do the COW of the db, but we'd have the btree with a key range iterator instead of the hash with shuffled keys.

There's a downside that tkrzw has a dependency on the C++ lib, but once we manage to include lmdb, we can easily switch and compile support for tkrzw conditionally.

trgill commented Jan 29, 2022

@prajnoha Ok. Thanks for letting me know.

Do you think we could get away with a single back end for the kv-store for v1?

We could put the lmdb branch on hold until we need it?

lmdb adds a fair number of dependencies and isn't ideally suited to the fork/COW design. In the Caveats section of the lmdb docs it warns "Use an MDB_env* in the process which opened it, without fork()ing." http://www.lmdb.tech/doc/

I had thought we might consider a multi-thread model, but I understand your decision to avoid it. lmdb would be better suited if we went in that direction.

I'm still concerned about how we are going to unwind the systemd call stack in the fork'd processes. Maybe we could discuss in the next meeting? My inability to figure out a clean solution was what inspired my thoughts about moving towards threads rather than processes.

Are you thinking we will switch to a C++ compiler - but keep a lot of stuff in plain C?

I looked at the docs/examples for tkrzw. I need to look more closely at the details of transactions/recovery - but at first glance it looks like it provides a lot of what lmdb does, right? : https://dbmx.net/tkrzw/#tips_acid

prajnoha commented:

> @prajnoha Ok. Thanks for letting me know.
>
> Do you think we could get away with a single back end for the kv-store for v1?
>
> We could put the lmdb branch on hold until we need it?

OK, we can park that for now and revisit later then...

> I'm still concerned about how we are going to unwind the systemd call stack in the fork'd processes. Maybe we could discuss in the next meeting? My inability to figure out a clean solution was what inspired my thoughts about moving towards threads rather than processes.

Yes, this is something that troubles me a bit too. We can collect ideas and see what is possible. One of them being replacement of the systemd event loop, if there's a better alternative out there. Or even, our own simple event loop, but would be better if there's anything we could just reuse, of course...

> Are you thinking we will switch to a C++ compiler - but keep a lot of stuff in plain C?

You mean related to the possible use of tkrzw, which is written in C++? That one has a C wrapper (directly included with it, not an external wrapper) and libs, so we can use it from C quite easily. I haven't checked the situation in various distros with the availability of tkrzw as a package (Fedora has it). Truth is, it's relatively new (2020) - it's a successor to Kyoto Cabinet, which in turn is a successor to Tokyo Cabinet, which in turn was inspired by dbm. (lmdb is actually in the same category)

> I looked at the docs/examples for tkrzw. I need to look more closely at the details of transactions/recovery - but at first glance it looks like it provides a lot of what lmdb does, right? : https://dbmx.net/tkrzw/#tips_acid

Well, yes, it's a simple key-value DB with support for a few backends on its own (hash, skip list, b-tree... either in-memory only or also backed by a file).

I'm thinking about making use of the BabyDBM (in-memory b-tree) from that list - but still keeping the forking/COW as it is now, just replacing the hash we have with the b-tree that would allow us to do key range selections. The logic in SID would stay the same for now - so having a snapshot of memory after fork in the worker process (including the DB), then sending changes/differences back to the main process. So the SID db backend replacement would just be about having the possibility of ordered keys with key range lookup.

From the transactions, we'd make use of the atomicity - so taking a transaction when we receive a request to sync the db in the main process, then applying all the changes/diffs, then committing the transaction. If anything goes wrong in the middle, we'd discard the transaction (...and probably restart the operation that caused the db changes/diffs). Currently, we don't have this kind of atomicity when syncing changes/diffs from snapshots into the main db with just a simple hash.

I've also noticed that tkrzw has support for calling callbacks on record changes - very similar to the hash_update_fn we use currently for the hash (...to do checks and/or further record edits before actually writing the record to the db). Which looks nice, as we can (I hope - that's what I'm playing with right now) map it to our existing logic.
