Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update render queue limits #152

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

Komzpa
Copy link

@Komzpa Komzpa commented Jan 8, 2017

Adjust render queue depth so that a rendering server dirty queue length is shifted from several minutes (1000 / 7 ~= 142 s ~= 2.3 minutes) to a whole day.

Metrics are taken from http://munin.osm.org/static/dynazoom.html?cgiurl_graph=/munin-cgi/munin-cgi-graph&plugin_name=openstreetmap/yevaud.openstreetmap/renderd_processed&size_x=800&size_y=400&start_epoch=1483213021&stop_epoch=1483904221

Adjust render queue depth so that a rendering server dirty queue length is shifted from several minutes (1000 / 7 ~= 142 s  ~= 2.3 minutes) to a whole day.
#define DIRTY_LIMIT (1000)

#define DIRTY_LIMIT (216000)
#define HASHIDX_SIZE (477856)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a higher HASHIDX_SIZE needed for with metatiles?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to @gardster, this is not required and is redefined anyway later. I did it by analogy to the below. It looks like it's safe to remove both.

Copy link
Contributor

@pnorman pnorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of this seems tailored specifically to OSM.org. Does it also apply in general? Is a large queue desirable behavior?

#define DIRTY_LIMIT (1000)

#define DIRTY_LIMIT (216000)
#define HASHIDX_SIZE (477856)
#else
#define QUEUE_MAX (1024)
#define REQ_LIMIT (512)
#define DIRTY_LIMIT (10000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the non-metatile limit be higher than the metatile limit?

@Komzpa
Copy link
Author

Komzpa commented Jan 8, 2017

The values are tailored based on osm.org statistics.

On non-osm.org installs:

  • if it's a local small install, the queue just won't grow. No major change.
  • if it's a popular but not overloaded install, the queue will grow in the same fashion as on osm.org, and will be shrinking on non-popular hours. Starting from day 2 the number of fresh tiles served from cache directly grows, outweighing the queuing overhead by a lot.
  • if it's an overloaded install, the queue would be large and won't shrink. This adds some time for queuing, but this shouldn't be noticeable as the install is overloaded and is unable to serve tiles on time anyway.

@Komzpa
Copy link
Author

Komzpa commented Nov 15, 2017

hello, what do we do to get it merged?

#else
#define QUEUE_MAX (1024)
#define REQ_LIMIT (512)
#define DIRTY_LIMIT (10000)
#define HASHIDX_SIZE 22123
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is HASHIDX_SIZE and why are you removing it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is no-op.
It is used here:

return key % queue->hashidxSize;

and redefined here:
#define HASHIDX_SIZE 2213

(see includes order in
#include "render_config.h"
)

Copy link
Member

@tomhughes tomhughes Nov 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does the hash table have 2213 buckets? or 22123 buckets? Not that either is anywhere big enough once you have a two million entry queue - that will give an average chain length of either 100 or 1000 either of which seems way too large.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like somebody wants to increase HASHIDX_SIZE from default 2213 to 22123 when METATILE is not defined. But it's so ugly and include order dependent that should be fixed.

Copy link

@vng vng Nov 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apmon According to the code history, here is the initial commit to use different values for HASHIDX_SIZE:
da19811

But later after the refactoring, it was broken, and now HASHIDX_SIZE is always (not obvious) is 2213:
707e125

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomhughes I've updated numbers to match ratio.
We're not using non-metatile mode, so it's not 2M, but 200k records.

@vng I've updated request_queue.h to not rewrite the value.

@tomhughes
Copy link
Member

Not having our servers running in full meltdown mode 24x7 could of course be considered a feature rather than a bug...

@tomhughes
Copy link
Member

Incidentally I believe the current queue lengths are the results of previous experimentation by @apmon although I can't claim to remember all the details.

I have some sort of memory that there were performance issues managing the queue if it was too big?

@kocio-pl
Copy link

There's a lot of time between 2 minutes and the whole day. Is it possible to test a single server with something like an hour or 10 minutes for example?

@Komzpa
Copy link
Author

Komzpa commented Nov 26, 2017

@kocio-pl I can provide such a branch. Does anybody have a machine to run it on, with a proper load?
@tomhughes can we do a split-test on tile.osm.org? :)

@tomhughes
Copy link
Member

Not easily. At the very least you would need to provide Ubuntu compatible binary packages.

@kocio-pl
Copy link

I guess you could ask @rrzefox if he's willing to test it.

@kocio-pl
Copy link

Maybe it makes sense to start with making these variables configurable through mod_tile.conf. Once the code and package are updated, it would be much easier to experiment and even tweak it the way any given server admins like it.

@rrzefox
Copy link

rrzefox commented Nov 26, 2017

I guess you could ask @rrzefox if he's willing to test it.

Sorry, no can do - I'm using tirex, not renderd. Which incidentally does not have a queue limit, at least none I've ever hit...

Maybe it makes sense to start with making these variables configurable through mod_tile.conf.

You probably mean renderd.conf? mod_tile.conf would be something read by apache, and apache is not handling the queue, the render-daemon is.
But indeed, something like queue length is a value I'd expect to be configurable, not hardcoded.

@tomhughes
Copy link
Member

Making it configurable would certainly be a good idea - then we could do a new build and test different configurations.

@mmd-osm
Copy link

mmd-osm commented Dec 3, 2017

I'm a bit confused by this proposed change. There are basically 5 different queue classes, which are being handled by request_queue_fetch_request in the following order:

  1. Priority Request Queue,
  2. Request Queue,
  3. Low Priority Request Queue,
  4. Dirty Queue, and
  5. Bulk Request queue.

As long as there are queue entries in a higher class, they will be processed first. (I wonder why this doesn't lead to starvation issues, but that's another issue).

To my understanding, changing the DIRTY_LIMIT will only affect entries in category (4), that is the Dirty Queue. However, this queue seems pretty empty most of the time anyway: munin link

Isn't this queue only relevant, if someone adds /dirty to an URL to manually trigger a re-rendering of a tile?

@Komzpa
Copy link
Author

Komzpa commented Dec 3, 2017

hey @mmd-osm,
Dirty Queue is not rendered during high load times, and not kept until low load times.
https://munin.openstreetmap.org/openstreetmap/render.openstreetmap/renderd_queue.html

Dirty queue holds all the tiles that were rendered previously and marked dirty due to someone making an edit in the area. They can be rendered on change instead of being rendered later on the first request.

@mmd-osm
Copy link

mmd-osm commented Dec 3, 2017

Thanks, your munin link makes a lot more sense than mine. So yes, Dirty Queue as prio 4 both has to discard entries due to its currently limited size, and there's also a starvation issue here, because most everything else is more important than processing those dirty tiles.

Bounded overtaking would be one potential approach to limit those starvation effects, but that's currently not implemented. It may be even better in the long run than just increasing the dirty queue size.

Also, I really thought, only /dirty would add new entries to the Dirty Queue, thanks for clarifying that bit.

@apmon
Copy link
Member

apmon commented Dec 3, 2017

Thought I'd throw in some clarifying remarks.

Assuming this information is still accurate, I haven't looked at the code in quite a while so I don't know if things have changed in this respect.

Priority Request Queue, Request Queue, Low Priority Request Queue are all meant for on the fly rendering. i.e. if a user requests a tile in mod_tile and it is out of date in some way or another, mod_tile delays serving that tile while it tries to render it. So these renderings are "time critical" as they delay the serving of a tile. Priority Request Queue thereby handles tiles that have no stale cached tile. I.e. if the on the fly rendering fails, mod_tile will deliver a 404 tile not found response. That is why this is the highest priority. Request Queue servers tiles that have a cached stale tile for which we know the data has changed, so if mod_tile ends up serving the cached version as renderd doesn't render the tile in time, the user will see "wrong"/"old" geometry/features. Low Priority Queue is for when the style sheet has changed, but not the data it self.

As all of these queues are designed for on the fly rendering, their size is limited to 32 entries. It is not reasonable to believe renderd will render more than 32 metatiles in the time before mod_tile times out and serves either a stale or 404 tile. So there is no point in having a longer queue there, as that would just add to the starvation issue.

If a tile tries to be added to lets say the request queue, but the queue is full, it spills over into the dirty queue, which is the "large batch background" rendering queue which isn't "time critical" anymore, as there is no user directly waiting for the answer. This queue is there to try and steady the load on the server and make sure that the server is still doing something useful during low load times.

So increasing the queue length beyond 2 minutes definitely seems good. However, there are definitely a couple of issue with increasing the dirty queue size. As as been pointed out, the dirty queue can easily be starved out. Now if during a burst of high load the priority queue with missing tiles overflow, some of those missing tiles request end up in the dirty queue. If the load then drops a little so that the priority queue is no longer full, but e.g. the low priority queue is still saturated, then renderd will only render those style sheet changed tiles, but never get around to rendering the missing tile that has dropped into the dirty queue. If a new user now tries to look at that dirty tile, mod_tile will submit it to renderd again, but as it is already queued (in the dirty queue) renderd won't requeue it into the priority queue, even if the priority queue could now hold the request. So it is possible to starve out missing metatiles in the dirty queue, which is very bad. With a shorter dirty queue this risk of starving important tiles out is somewhat reduced.

Furthermore, back then (not sure if that has changed by now), renderd didn't really use any library routines. So instead of using a decent quality existing implementation of a priority queue, I coded up my own implementation. And that implementation was in the category "good enough at the time", not really in the "good" category.

So ideally that hole queuing implementation gets replaced by a proper library implementation of a priory queue with re-queueing of higher priority requests and with something like an epsilon greedy dequeueing strategy to reduce starvation issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants