Serve most recent augmented diff from disk cache only #342

Closed
mmd-osm opened this issue Nov 27, 2016 · 6 comments

@mmd-osm
Contributor

mmd-osm commented Nov 27, 2016

Today, several clients request the current augmented diff at more or less the same time via a call to /api/augmented_diff?id=xyz. This approach does not scale to a larger number of clients: each query takes about 5-30 seconds and consumes considerable CPU time, yet every client effectively receives exactly the same result.

I'm proposing a few changes to this process:

  • Keep the most recent 60-120 augmented diffs precalculated on disk, possibly as compressed files, and maybe for a longer period of time.
    Compared to the previous approach, which kept everything on disk, this should keep the disk space requirements low.
  • Precalculate the most recent augmented diff right after update_from_disk completes.
  • Announce a newly available diff via augmented_diff_status only after precalculation has finished.
  • Serve the most recent augmented diffs from the disk cache only, avoiding query execution altogether. Older augmented diffs (outside the 60-120 buffer range) may still trigger a query.
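The four steps above could be sketched roughly as follows. This is only an illustration of the idea, not the actual Overpass API implementation: the class name `AdiffDiskCache`, the `run_query` callback, the `.adiff.gz` file naming, and the `announced_id` attribute (standing in for what augmented_diff_status would report) are all hypothetical.

```python
import gzip
import os
from pathlib import Path
from typing import Callable, Optional


class AdiffDiskCache:
    """Hypothetical sketch: keep the newest `window` augmented diffs as gzip files."""

    def __init__(self, cache_dir: str, window: int = 120) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.window = window
        # Stand-in for the value a status endpoint would announce.
        self.announced_id: Optional[int] = None

    def _path(self, diff_id: int) -> Path:
        return self.cache_dir / f"{diff_id}.adiff.gz"

    def precalculate(self, diff_id: int, run_query: Callable[[int], bytes]) -> None:
        """Run the expensive query once, store the result, then announce it."""
        payload = run_query(diff_id)
        final_path = self._path(diff_id)
        tmp_path = final_path.with_suffix(".tmp")
        with gzip.open(tmp_path, "wb") as fh:
            fh.write(payload)
        # Rename into place so readers never see a partially written file.
        os.replace(tmp_path, final_path)
        self._evict(diff_id)
        # Announce only after the cached file exists.
        self.announced_id = diff_id

    def _evict(self, newest_id: int) -> None:
        """Delete cached diffs that fell out of the window."""
        for path in self.cache_dir.glob("*.adiff.gz"):
            cached_id = int(path.name.split(".")[0])
            if cached_id <= newest_id - self.window:
                path.unlink()

    def serve(self, diff_id: int, run_query: Callable[[int], bytes]) -> bytes:
        """Serve from disk if cached; otherwise fall back to running the query."""
        path = self._path(diff_id)
        if path.exists():
            with gzip.open(path, "rb") as fh:
                return fh.read()
        return run_query(diff_id)
```

In this sketch, `precalculate` would be called once per new diff after the database update, so any number of clients asking for a recent id only cause a file read. Requests for ids outside the window still reach `run_query`, matching the last bullet.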

Pros:

  • Much better scalability to a larger number of data consumers
  • Less overall system load
  • Less frustration for data consumers due to 429 error messages

Cons:

  • None
  • (maybe a few seconds more delay for data consumers)

Thanks to @pa5cal for suggestions.

@tyrasd
Contributor

tyrasd commented Nov 27, 2016

> each query takes about 5-30 seconds

In some of my tests (especially when requesting slightly older adiffs), generating a single minutely diff sometimes required more than 1 minute of server time.

> Keep the most recent 60-120 augmented diffs precalculated on disk.

Why not more? Having at least a day or two (better: a week or two) of cache would make sense IMHO. I think a common use case for adiffs is updating statistics that were initially generated from a planet dump (e.g. for something like tyrasd/taghistory#10). Since those planets are typically generated on a daily or weekly basis, this requires fetching adiffs going back ~10 days or more.
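To put the window sizes in perspective, here is a back-of-the-envelope sketch. It assumes one augmented diff per minute; the function names and the average compressed size passed to `disk_estimate_mib` are hypothetical placeholders, not measured values.

```python
MINUTES_PER_DAY = 24 * 60


def diffs_in_window(days: float) -> int:
    """Number of minutely diffs covering `days` days (one diff per minute)."""
    return round(days * MINUTES_PER_DAY)


def disk_estimate_mib(days: float, avg_compressed_kib: float) -> float:
    """Rough disk usage in MiB for a cache spanning `days` days.

    `avg_compressed_kib` is an assumed average file size, not a measurement.
    """
    return diffs_in_window(days) * avg_compressed_kib / 1024


if __name__ == "__main__":
    for days in (1, 7, 10, 14):
        print(f"{days:>2} days -> {diffs_in_window(days)} diffs")
```

By comparison, the proposed 60-120 diffs cover only one to two hours, so a cache spanning one or two weeks is larger by roughly two orders of magnitude.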

@mmd-osm
Contributor Author

mmd-osm commented Nov 28, 2016

@tyrasd : 60-120 was just an initial starting point for the discussion based on current log files. I guess once we have this as a configurable parameter it could be easily extended to longer periods of time, if disk space permits.

@mmd-osm
Contributor Author

mmd-osm commented Dec 5, 2016

Fixed in ea9bb55 & f77be3a

@mmd-osm mmd-osm closed this as completed Dec 5, 2016
@mmd-osm
Contributor Author

mmd-osm commented Dec 5, 2016

@tyrasd : I think Roland put in a default of 60 minutes now. For bigger values you could try a bit of lobbying for your use case... ;)

@pa5cal

pa5cal commented Dec 5, 2016

Perfecto!
Thank you very much.

@pa5cal

pa5cal commented Feb 17, 2017

Could we increase the caching to a minimum of 3 hours, or better, 6 hours?

In the last seven days, I got many timeouts and 429 response codes while processing the augmented diff files.
