# What should we do with Total-Records? (#1624)
Speaking personally now, I feel that the 2nd option ("best effort") … Thinking about this more, I think my current preference is something like category 3, but with a caveat: if a user does a HEAD request, we calculate the count.
Yes! That! Pagination is best done with a cursor technique: if there is a Next-URL header, keep going. Any client that tries to do this:

```python
import requests

res = requests.get(url)
all = res.json()
# Doomed: derives a page count from Total-Records and a hardcoded
# page size, instead of following the pagination cursor.
pages = int(res.headers['Total-Records']) // 10000 + 1
for p in range(pages):
    res = requests.get(res.headers['next-url'])
    all.extend(res.json())
```

...is doomed to fail and burn. Category 3 talks about "could stick a constant value in there", which I think is best avoided. I'd vote for something like:

```python
if (request.headers.get('prefer') == 'include-total-records'
        or request.method == 'HEAD'
        or request.query.get('include') == 'total-records'):
    response.set_header('Total-Records',
                        self._count_records(collection_id, parent_id, filters))
else:
    response.set_header('Total-Records-Excluded',
                        "See documentation if you want the 'Total-Records' header")
```
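For contrast with the doomed snippet above, here's a minimal sketch of the cursor-following client this comment advocates. The `Next-URL` header name is carried over from the comment, not from any documented API; substitute whatever pagination header the server actually sends:

```python
import requests

def fetch_all(url):
    """Fetch every page by following the pagination cursor.

    Never looks at Total-Records, so it works whether or not the
    server computes an exact count.
    """
    records = []
    while url:
        res = requests.get(url)
        res.raise_for_status()
        records.extend(res.json())
        # The header is absent on the last page, which ends the loop.
        url = res.headers.get('Next-URL')
    return records
```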
Because I'm not sure where else to share this: here's a benchmark I did by manually breaking the combined query up into individual SQL statements. What it shows is that for a collection with 10 records, it takes...

That's approximately a 10x difference. For collections with 100 records, it's also about a 10x difference. And for 1,000 records it's about 5x. When there was filtering, I edited 1 record per collection to have some JSON and then I used … More to come! In particular, this idea done properly in the Kinto Python code, and also as a pure SQL exercise.
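To make concrete what "breaking up the query" means here, a rough sketch of the two query shapes being compared: a single statement that produces the total via a window function, versus fetching the page and counting in separate statements. The table layout and column names are simplified placeholders, not Kinto's actual storage schema (the functions take any psycopg2-style cursor):

```python
def page_with_inline_count(cur, parent_id, limit):
    # One statement: COUNT(*) OVER () makes Postgres visit every
    # matching row just to attach the total to each returned row.
    cur.execute("""
        SELECT data, COUNT(*) OVER () AS total_records
          FROM records
         WHERE parent_id = %s
         ORDER BY last_modified DESC
         LIMIT %s
        """, (parent_id, limit))
    return cur.fetchall()

def page_then_count(cur, parent_id, limit):
    # Individual statements: the page query can stop after LIMIT
    # rows, and the count becomes a separate, skippable round trip.
    cur.execute("""
        SELECT data
          FROM records
         WHERE parent_id = %s
         ORDER BY last_modified DESC
         LIMIT %s
        """, (parent_id, limit))
    page = cur.fetchall()
    cur.execute("SELECT COUNT(*) FROM records WHERE parent_id = %s",
                (parent_id,))
    return page, cur.fetchone()[0]
```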
That spreadsheet "proved" that the individual SQL statements were a performance boost, but I wanted to test this by running the whole thing end-to-end. That would give me a number for how long it takes to make, say, 100 GET requests via HTTP.

Using …

Using my branch: …

For each of the 4 different collections (named per their cardinality), the performance gain is 0.1 seconds. Not very impressive, but it's what we have to deal with. I think what's going on is that the more performant SQL query is just a tiny minority of all the other work: the Pyramid HTTP machinery, auth, and all the inserts for the timestamp.
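For reference, the harness for a measurement like this can be as small as the following sketch (the URL and request count are placeholders):

```python
import time

import requests

def time_plural_endpoint(url, n=100):
    # Wall-clock time for n sequential GETs, so the storage query is
    # measured together with everything else on the request path
    # (routing, auth, serialization, timestamp bookkeeping, ...).
    session = requests.Session()
    t0 = time.time()
    for _ in range(n):
        session.get(url).raise_for_status()
    return time.time() - t0

# e.g. time_plural_endpoint('http://localhost:8888/v1/buckets/b/collections/c/records')
```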
I know I've said that we shouldn't worry about Buildhub's crazy 1M records per collection, but at least this branch gives some hope:

… versus …

Same output, obviously different headers, but it's 33 seconds vs. 0.152 seconds.
Extracted from #1622, #1615, #1507, etc. One of the fundamental issues here is whether we provide a `Total-Records` header in plural endpoint responses. Doing so has a performance cost, because we have to enumerate every single record that matches the request rather than just the first N (where N is given by `limit=N` or the maximum response length in the config).

Here are some proposals for what to do with `Total-Records`, each of which has a different performance cost and accuracy tradeoff:

1. Compute `Total-Records` by counting every record that matches the request's current filter, in a way that is free from race conditions. This is what 9.0.0 does and the behavior I try to preserve in Cleanup Postgres get_all (#1622). This is the most expensive option and the most accurate.
2. Compute `Total-Records` but with some amount of latitude for race conditions. This is what is implied by [WIP] Optimize postgresql storage get all fixes 1507 (#1615) and @peterbe's suggestion on #1622 of using a cache to store the previously-computed value for a short duration. Different approaches can still fall into this category even if they have different tolerances for race conditions. In general, the greater the tolerance for race conditions, the better the performance gains.
3. Admit defeat with the `Total-Records` header. We can't outright remove it from the response without breaking clients, but we could stick a constant value in there, or otherwise provide something that will allow pagination code to continue working without actually computing `Total-Records`. For example, if there were any records to return, we could put `offset+1` in there, and otherwise `previous_offset`. Or, if there were any records, we could put `max_int` and otherwise 0 (sketched below).

Of course, we could cut a new API version where `Total-Records` is not computed by default. There are some other alternatives that could alleviate the performance cost of `Total-Records`, such as allowing clients to tell us not to compute it (e.g. `Prefer: no-total-records` or `?select=!total_records` or something), but because the current default is to compute it, and because we can't change that default without breaking the API, I feel this is not going to help much in practice.
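As an illustration of category 3's fallback values only (a sketch of the proposal, not an agreed design; the function names are hypothetical):

```python
import sys

def pagination_compatible_total(records, offset, previous_offset):
    """Keep Total-Records present without ever running a COUNT.

    If this page has records, claim a total just past the current
    offset so offset-based clients request another page; on an empty
    page, claim previous_offset so they conclude they are done.
    """
    if records:
        return offset + 1
    return previous_offset

def pagination_compatible_total_constant(records):
    # The alternative constant scheme: max_int when non-empty, else 0.
    return sys.maxsize if records else 0
```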