Update render queue limits #152
base: master
Conversation
Adjust the render queue depth so that a rendering server's dirty queue length is shifted from several minutes (1000 / 7 ~= 142 s ~= 2.3 minutes) to a whole day.
Metrics are taken from http://munin.osm.org/static/dynazoom.html?cgiurl_graph=/munin-cgi/munin-cgi-graph&plugin_name=openstreetmap/yevaud.openstreetmap/renderd_processed&size_x=800&size_y=400&start_epoch=1483213021&stop_epoch=1483904221
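For reference, the arithmetic behind the new limit, as a sketch (the ~7 metatiles/s render rate and the ~2.5 dirty tiles/s arrival rate are assumptions read off the munin graphs, not values from the code):

/* Old limit: drains in 1000 / 7 ~= 142 s at ~7 metatiles rendered per second. */
#define DIRTY_LIMIT (1000)

/* New limit: holds roughly a day's worth of dirty tiles at an assumed
 * arrival rate of ~2.5 tiles/s: 86400 s/day * 2.5 tiles/s = 216000. */
#define DIRTY_LIMIT (216000)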
includes/render_config.h (outdated)
-#define DIRTY_LIMIT (1000)
+#define DIRTY_LIMIT (216000)
+#define HASHIDX_SIZE (477856)
Why is a higher HASHIDX_SIZE needed with metatiles?
According to @gardster, this is not required and is redefined later anyway. I did it by analogy with the block below. It looks like it's safe to remove both.
The logic of this seems tailored specifically to OSM.org. Does it also apply in general? Is a large queue desirable behavior?
includes/render_config.h (outdated)
-#define DIRTY_LIMIT (1000)
+#define DIRTY_LIMIT (216000)
+#define HASHIDX_SIZE (477856)
 #else
 #define QUEUE_MAX (1024)
 #define REQ_LIMIT (512)
 #define DIRTY_LIMIT (10000)
Shouldn't the non-metatile limit be higher than the metatile limit?
The values are tailored based on osm.org statistics. On non-osm.org installs: …
Hello, what do we do to get it merged?
includes/render_config.h
 #else
 #define QUEUE_MAX (1024)
 #define REQ_LIMIT (512)
 #define DIRTY_LIMIT (10000)
-#define HASHIDX_SIZE 22123
What is HASHIDX_SIZE and why are you removing it?
This line is a no-op.
It is used here (line 39 in 62c4a5a):
return key % queue->hashidxSize;
and redefined here (mod_tile/includes/request_queue.h, line 31 in 62c4a5a):
#define HASHIDX_SIZE 2213
(see the include order, line 29 in 62c4a5a):
#include "render_config.h"
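In other words, a minimal reconstruction of the include-order problem (not the literal headers):

/* render_config.h (non-metatile branch): */
#define HASHIDX_SIZE 22123   /* intended override */

/* request_queue.h, included after render_config.h, redefines it: */
#define HASHIDX_SIZE 2213    /* compilers warn about the redefinition,
                                but this later value is the one used */

So the 22123 in render_config.h never takes effect.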
So does the hash table have 2213 buckets, or 22123 buckets? Not that either is anywhere near big enough once you have a two-million-entry queue - that will give an average chain length of either 100 or 1000, either of which seems way too large.
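The chain-length arithmetic, as a quick sketch (assuming a uniform hash; the 216000 figure is the proposed metatile DIRTY_LIMIT):

#include <stdio.h>

int main(void) {
    /* average chain length = queue entries / hash buckets */
    printf("%d\n", 2000000 / 2213);   /* ~903 with the default table   */
    printf("%d\n", 2000000 / 22123);  /* ~90 with the larger table     */
    printf("%d\n", 216000 / 22123);   /* ~9 at the proposed 216k limit */
    return 0;
}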
It seems somebody wants to increase HASHIDX_SIZE from the default 2213 to 22123 when METATILE is not defined. But it's so ugly and include-order dependent that it should be fixed.
@tomhughes I've updated the numbers to match the ratio.
We're not using non-metatile mode, so it's not 2M but 200k records.
@vng I've updated request_queue.h so that it no longer rewrites the value.
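Presumably the fix looks something like the following guard in request_queue.h (a guess at its shape, not the actual commit), so that a value set earlier in render_config.h survives:

/* request_queue.h: only provide the default when nothing set it earlier */
#ifndef HASHIDX_SIZE
#define HASHIDX_SIZE 2213
#endif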
Not having our servers running in full meltdown mode 24x7 could of course be considered a feature rather than a bug...
Incidentally, I believe the current queue lengths are the result of previous experimentation by @apmon, although I can't claim to remember all the details. I have some sort of memory that there were performance issues managing the queue if it was too big?
There's a lot of time between 2 minutes and a whole day. Is it possible to test a single server with something in between, like an hour or 10 minutes, for example?
@kocio-pl I can provide such a branch. Does anybody have a machine to run it on, with a proper load?
Not easily. At the very least you would need to provide Ubuntu-compatible binary packages.
I guess you could ask @rrzefox if he's willing to test it.
Maybe it makes sense to start with making these variables configurable through mod_tile.conf?
Sorry, no can do - I'm using tirex, not renderd. Which, incidentally, does not have a queue limit, at least none I've ever hit...
You probably mean renderd.conf? mod_tile.conf would be something read by Apache, and Apache is not handling the queue; the render daemon is.
Making it configurable would certainly be a good idea - then we could do a new build and test different configurations.
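One possible shape for that, as a sketch (assuming the iniparser library that renderd's config parsing is based on; the dirty_limit key below is hypothetical and does not currently exist):

#include <iniparser.h>
#include "render_config.h"   /* for the compiled-in DIRTY_LIMIT default */

static int dirty_limit;

void load_queue_limits(const char *conf_path) {
    dictionary *ini = iniparser_load(conf_path);
    /* hypothetical key; fall back to the compiled-in default if absent */
    dirty_limit = iniparser_getint(ini, "renderd:dirty_limit", DIRTY_LIMIT);
    iniparser_freedict(ini);
}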
I'm a bit confused by this proposed change. There are basically 5 different queue classes, which are handled in strict priority order: as long as there are queue entries in a higher class, they will be processed first. (I wonder why this doesn't lead to starvation issues, but that's another issue.) To my understanding, changing the DIRTY_LIMIT only affects the dirty queue. Isn't that queue only relevant if someone explicitly marks tiles as dirty?
Hey @mmd-osm, the dirty queue holds all the tiles that were rendered previously and marked dirty because someone made an edit in the area. They can be rendered on change instead of being rendered later, on the first request.
Thanks, your munin link makes a lot more sense than mine. So yes, the dirty queue at priority 4 both has to discard entries due to its currently limited size, and there's also a starvation issue here, because almost everything else is more important than processing those dirty tiles. Bounded overtaking would be one potential approach to limit those starvation effects, but it's not currently implemented; in the long run it may even be better than just increasing the dirty queue size. Also, I really thought only …
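Bounded overtaking could look roughly like this (a sketch with illustrative names and bound, not an existing implementation): after higher-priority work has overtaken the dirty queue a fixed number of times, the dirty queue gets one guaranteed turn.

enum { HIGH_PRIO_QUEUE, DIRTY_QUEUE };

#define MAX_OVERTAKES 16   /* illustrative bound */

static int overtakes;

/* caller ensures at least one of the two queues is non-empty */
int pick_queue(int high_prio_nonempty, int dirty_nonempty) {
    if (dirty_nonempty &&
        (!high_prio_nonempty || overtakes >= MAX_OVERTAKES)) {
        overtakes = 0;
        return DIRTY_QUEUE;    /* guaranteed turn, bounding starvation */
    }
    overtakes++;
    return HIGH_PRIO_QUEUE;
}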
Thought I'd throw in some clarifying remarks, assuming this information is still accurate - I haven't looked at the code in quite a while, so I don't know if things have changed in this respect.

The Priority Request Queue, Request Queue, and Low Priority Request Queue are all meant for on-the-fly rendering: if a user requests a tile and mod_tile finds it out of date in some way or another, mod_tile delays serving that tile while it tries to render it. These renderings are "time critical", as they delay the serving of a tile. The Priority Request Queue handles tiles that have no stale cached tile, i.e. if the on-the-fly rendering fails, mod_tile will deliver a 404 tile-not-found response. That is why it is the highest priority. The Request Queue serves tiles that have a cached stale tile for which we know the data has changed, so if mod_tile ends up serving the cached version because renderd doesn't render the tile in time, the user will see "wrong"/"old" geometry/features. The Low Priority Queue is for when the style sheet has changed, but not the data itself.

As all of these queues are designed for on-the-fly rendering, their size is limited to 32 entries. It is not reasonable to believe renderd will render more than 32 metatiles in the time before mod_tile times out and serves either a stale or 404 tile, so there is no point in having a longer queue there; that would just add to the starvation issue. If a tile request tries to be added to, let's say, the request queue but the queue is full, it spills over into the dirty queue, which is the "large batch background" rendering queue. That queue isn't "time critical" anymore, as there is no user directly waiting for the answer. It is there to try and steady the load on the server and make sure the server is still doing something useful during low-load times. So increasing the queue length beyond 2 minutes definitely seems good.

However, there are definitely a couple of issues with increasing the dirty queue size. As has been pointed out, the dirty queue can easily be starved out. If, during a burst of high load, the priority queue with missing tiles overflows, some of those missing-tile requests end up in the dirty queue. If the load then drops a little, so that the priority queue is no longer full but e.g. the low priority queue is still saturated, then renderd will only render those style-sheet-changed tiles and never get around to rendering the missing tile that has dropped into the dirty queue. If a new user now tries to look at that dirty tile, mod_tile will submit it to renderd again, but as it is already queued (in the dirty queue), renderd won't requeue it into the priority queue, even if the priority queue could now hold the request. So it is possible to starve out missing metatiles in the dirty queue, which is very bad. With a shorter dirty queue, this risk of starving out important tiles is somewhat reduced.

Furthermore, back then (not sure if that has changed by now), renderd didn't really use any library routines. So instead of using a decent-quality existing implementation of a priority queue, I coded up my own implementation, and that implementation was in the "good enough at the time" category, not really in the "good" category. So ideally that whole queuing implementation gets replaced by a proper library implementation of a priority queue, with re-queueing of higher-priority requests and with something like an epsilon-greedy dequeueing strategy to reduce starvation issues.
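The epsilon-greedy dequeueing idea mentioned above could be sketched like this (illustrative only; EPSILON and the queue layout are assumptions): with small probability the dispatcher pops from a random non-empty queue instead of the highest-priority one, so no class starves forever.

#include <stdlib.h>

#define NUM_QUEUES 5
#define EPSILON 0.05   /* illustrative exploration probability */

/* nonempty[i] != 0 means queue i has entries; index 0 is highest priority */
int choose_queue(const int nonempty[NUM_QUEUES]) {
    if ((double)rand() / RAND_MAX < EPSILON) {
        int candidates[NUM_QUEUES], n = 0;
        for (int i = 0; i < NUM_QUEUES; i++)
            if (nonempty[i]) candidates[n++] = i;
        if (n > 0)
            return candidates[rand() % n];   /* occasional random pick */
    }
    for (int i = 0; i < NUM_QUEUES; i++)     /* otherwise strict priority */
        if (nonempty[i]) return i;
    return -1;                               /* all queues empty */
}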