
What should we do with Total-Records? #1624

Closed
glasserc opened this issue May 8, 2018 · 5 comments

@glasserc
Contributor

glasserc commented May 8, 2018

Extracted from #1622, #1615, #1507, etc. One of the fundamental issues here is whether we provide a Total-Records header in plural endpoint responses. Doing so has a performance cost, because we have to enumerate every single record that matches the request rather than just the first N (where N is given by limit=N or the maximum response length in the config).

Here are some proposals for what to do with Total-Records, each of which has a different performance cost and accuracy tradeoff.

  1. Compute Total-Records by counting every record that matches the request's current filter, in a way that is free from race conditions. This is what 9.0.0 does, and it's the behavior I try to preserve in Cleanup Postgres get_all #1622. This is the most expensive option and the most accurate.

  2. Compute Total-Records but with some amount of latitude for race conditions. This is what is implied by [WIP] Optimize postgresql storage get all fixes 1507 #1615 and @peterbe's suggestions on Cleanup Postgres get_all #1622 of using a cache to store the previously-computed value for a short duration. Different approaches can still fall into this category even if they have different tolerances for race conditions. In general, the greater the tolerance for race conditions, the better the performance gains.

  3. Admit defeat with the Total-Records header. We can't outright remove it from the response without breaking clients, but we could stick a constant value in there or otherwise provide something that will allow pagination code to continue working but without actually computing Total-Records. For example, if there were any records to return, we could put offset+1 in there, and otherwise previous_offset. Or if there were any records, we could put max_int and otherwise 0.

Of course, we could cut a new API version where Total-Records is not computed by default. There are some other alternatives that could alleviate the performance cost of Total-Records, such as allowing clients to tell us (e.g. Prefer: no-total-records or ?select=!total_records or something) not to compute it, but because the current default is to compute it and because we can't change that default without breaking the API, I feel that this is not going to help much in practice.
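The sentinel trick in category 3 can be made concrete. A minimal sketch of the idea, using a hypothetical helper name (sentinel_total_records is not real Kinto code):

```python
# Sketch of the "category 3" sentinel idea: pick a cheap stand-in for
# Total-Records that keeps pagination code moving without counting anything.
# The function name and signature are hypothetical, not Kinto internals.
def sentinel_total_records(page_records, offset, previous_offset):
    """If the current page has records, claim there is at least one more
    record than the current offset; otherwise report the previous page's
    offset, so clients conclude they are past the end."""
    if page_records:
        return offset + 1
    return previous_offset

# A non-empty page starting at offset 40 advertises "at least 41".
assert sentinel_total_records([{"id": "a"}, {"id": "b"}], 40, 30) == 41
# An empty page reports the previous offset, signalling the end.
assert sentinel_total_records([], 40, 30) == 30
```

The values are deliberately meaningless as counts; they exist only so that naive pagination loops keep advancing and then stop.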

@glasserc
Contributor Author

glasserc commented May 8, 2018

Speaking personally now, I feel that the 2nd option ("best effort" Total-Records) is the worst of both worlds. We pay the performance penalty, but we don't get an accurate result.

Thinking about this more, Total-Records is essentially always exposed to race conditions, because even if it's accurate now, there's no guarantee that it will be accurate when the client receives it, parses it, or sends another request. In other words, relying on Total-Records to do correct pagination is essentially always wrong; any client that relies on the first request's Total-Records is doomed, and even relying on the most recent request's Total-Records is wrong because new records could have been inserted before or after the current page. So my feeling is that we should scrap the options in category 1.

I think my current preference is something like category 3, but with a caveat that if a user does a HEAD request, we calculate the Total-Records correctly (this is what kinto-http.js's getTotalRecords function does). There are use cases where category 2 is useful, but I think such cases are better solved by forcing clients to request total records explicitly, and I think category 3 is the closest we can come without actually breaking the API.

@peterbe
Contributor

peterbe commented May 9, 2018

relying on Total-Records to do correct pagination is essentially always wrong

Yes! That!

Pagination is best done with a cursor technique: if there is a Next-URL header, keep going. Any client that tries to do this:

res = requests.get(url)
all_records = res.json()
pages = int(res.headers['Total-Records']) // 10000 + 1
for p in range(pages):
    res = requests.get(res.headers['next-url'])
    all_records.extend(res.json())

...is doomed to fail and burn.
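A cursor-style loop, by contrast, never consults Total-Records at all. A minimal sketch, assuming the server exposes a Next-Page-style header and a JSON body with a `data` list (both the header name and the body shape are assumptions here, not confirmed Kinto specifics):

```python
def fetch_all(url, get):
    """Follow cursor-style pagination: keep requesting the URL given in
    the Next-Page header until no such header is present, ignoring
    Total-Records entirely.

    `get` is any callable returning a response object with .json() and
    .headers (requests.get fits); this is a sketch, not actual client code.
    """
    records = []
    while url:
        res = get(url)
        records.extend(res.json()["data"])
        url = res.headers.get("Next-Page")  # absent header => last page
    return records
```

With requests this would be `fetch_all(collection_url, requests.get)`; passing the getter in makes the loop trivially testable with canned responses.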

Category 3 talks about "could stick a constant value in there" which I think is best avoided. I'd vote for something like:

if (request.headers.get('prefer') == 'include-total-records'
        or request.method == 'HEAD'
        or request.query.get('include') == 'total-records'):
    response.set_header('Total-Records', self._count_records(collection_id, parent_id, filters))
else:
    response.set_header('Total-Records-Excluded', "See documentation if you want the 'Total-Records' header")

@peterbe
Contributor

peterbe commented Dec 6, 2018

Because I'm not sure where else to share this: here's a benchmark I did by manually breaking up the combined WITH collection_filtered AS (SELECT, COUNT) query into two individual statements, a plain SELECT COUNT(*) and a SELECT id, data, ... LIMIT.
https://docs.google.com/spreadsheets/d/1l7UgSJWxg_PRoCF1gIn8Y1gqODJxeT3YgDRmWaRpFdM/edit?usp=sharing

What it shows is that for a collection with 10 records, it takes...

  • 1.46ms to do the combined COUNT + SELECT
  • 0.09ms to do just the SELECT
  • 0.12ms to do just the COUNT

That's approximately a 10x difference.

For collections with 100 records, it's also about 10x difference. And for 1,000 records it's about 5x.

When there was filtering, I edited 1 record per collection to have some JSON, and then I filtered with (data->'name' IS NOT NULL AND data->'name' = '"peter"'), which means that each individual count would yield exactly 1 record.

More to come! In particular, this idea done properly in the Kinto Python code rather than just as a pure SQL exercise.
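The shape of that split can be illustrated structurally (not as a performance measurement) with sqlite for portability; the actual benchmark ran against PostgreSQL with JSON filters, and the table and column names below are simplified stand-ins:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, data TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [(f"r{i}", "{}") for i in range(25)])

# Combined form: one statement computes the count and the page together,
# roughly the WITH collection_filtered AS (...) pattern being benchmarked.
combined = conn.execute(
    "WITH collection_filtered AS (SELECT id, data FROM records) "
    "SELECT (SELECT COUNT(*) FROM collection_filtered), id, data "
    "FROM collection_filtered LIMIT 10"
).fetchall()

# Split form: two statements, each doing only one job.
total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
page = conn.execute("SELECT id, data FROM records LIMIT 10").fetchall()

assert total == 25
assert len(page) == 10
assert combined[0][0] == 25 and len(combined) == 10
```

Both forms return the same count and the same page; the spreadsheet's claim is that on Postgres the split form is cheaper, not that the results differ.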

@peterbe
Contributor

peterbe commented Dec 10, 2018

That spreadsheet "proved" that the individual SQL statements were a performance boost, but I wanted to test this by running the whole thing end-to-end. That would give me a number for how long it takes to make, say, 100 GET requests via HTTP.

Using master:

▶ python bench-queries.py 100 http://localhost:8888/v1/buckets/mybucket/collections/10/records http://localhost:8888/v1/buckets/mybucket/collections/100/records http://localhost:8888/v1/buckets/mybucket/collections/1000/records http://localhost:8888/v1/buckets/mybucket/collections/10000/records
==================http://localhost:8888/v1/buckets/mybucket/collections/10/records==================
SUM    1.7185s
 BEST   0.0126s
 WORST  0.0392s
 MEAN   0.0172s
 MEDIAN 0.0173s


=================http://localhost:8888/v1/buckets/mybucket/collections/100/records==================
SUM    1.9184s
 BEST   0.0151s
 WORST  0.0534s
 MEAN   0.0192s
 MEDIAN 0.0185s


=================http://localhost:8888/v1/buckets/mybucket/collections/1000/records=================
SUM    2.4350s
 BEST   0.0205s
 WORST  0.0322s
 MEAN   0.0244s
 MEDIAN 0.0242s


================http://localhost:8888/v1/buckets/mybucket/collections/10000/records=================
SUM    8.8490s
 BEST   0.0734s
 WORST  0.1346s
 MEAN   0.0885s
 MEDIAN 0.0817s

Using my branch:

▶ python bench-queries.py 100 http://localhost:8888/v1/buckets/mybucket/collections/10/records http://localhost:8888/v1/buckets/mybucket/collections/100/records http://localhost:8888/v1/buckets/mybucket/collections/1000/records http://localhost:8888/v1/buckets/mybucket/collections/10000/records
==================http://localhost:8888/v1/buckets/mybucket/collections/10/records==================
SUM    1.6176s
 BEST   0.0120s
 WORST  0.0482s
 MEAN   0.0162s
 MEDIAN 0.0159s


=================http://localhost:8888/v1/buckets/mybucket/collections/100/records==================
SUM    1.6034s
 BEST   0.0125s
 WORST  0.0210s
 MEAN   0.0160s
 MEDIAN 0.0160s


=================http://localhost:8888/v1/buckets/mybucket/collections/1000/records=================
SUM    2.3436s
 BEST   0.0177s
 WORST  0.0350s
 MEAN   0.0234s
 MEDIAN 0.0228s


================http://localhost:8888/v1/buckets/mybucket/collections/10000/records=================
SUM    8.7577s
 BEST   0.0706s
 WORST  0.1923s
 MEAN   0.0876s
 MEDIAN 0.0796s

For each of the 4 different collections (named per their cardinality), the performance gain is about 0.1 seconds over the 100 requests. Not very impressive, but it's what we have to deal with.

I think what's going on is that the more performant SQL query accounts for only a tiny fraction of the total request time, which is dominated by all the other stuff such as the Pyramid HTTP machinery, auth, and all the inserts for the timestamp.
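The bench-queries.py script itself isn't shown in the thread; a minimal stand-in that would produce the SUM/BEST/WORST/MEAN/MEDIAN summary above might look like this (every detail here is a guess at the script's shape, not its actual source):

```python
import statistics
import time

def bench(get, url, n):
    """Time `n` sequential GETs of `url` via the callable `get`
    (e.g. requests.get) and summarize the durations in seconds."""
    durations = []
    for _ in range(n):
        t0 = time.perf_counter()
        get(url)
        durations.append(time.perf_counter() - t0)
    return {
        "SUM": sum(durations),
        "BEST": min(durations),
        "WORST": max(durations),
        "MEAN": statistics.mean(durations),
        "MEDIAN": statistics.median(durations),
    }
```

Taking the getter as a parameter keeps the timing logic separate from the HTTP client, so the harness can be exercised without a running server.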

@peterbe
Copy link
Contributor

peterbe commented Dec 10, 2018

I know I've said that we shouldn't worry about Buildhub's crazy 1M records per collection, but at least this branch gives some hope.

dev/KINTO/kinto  master ✗
▶ time curl http://localhost:8888/v1/buckets/build-hub/collections/releases/records\?_limit\=1
{"data":[{"build":{"as":"z:/build/build/src/vs2017_15.8.4/VC/bin/Hostx64/x64/ml64.exe","cc":"z:/build/build/src/clang/bin/clang-cl.exe -Xclang -std=gnu99 -fms-compatibility-version=19.15.26726","id":"20181122182000","cxx":"z:/build/build/src/clang/bin/clang-cl.exe -fms-compatibility-version=19.15.26726","date":"2018-11-22T18:20:00Z","host":"x86_64-pc-mingw32","number":1,"target":"x86_64-pc-mingw32"},"source":{"tree":"releases/mozilla-beta","product":"firefox","revision":"4fcfd15911a112ef9444082b78bfcc41ec974216","repository":"https://hg.mozilla.org/releases/mozilla-beta"},"target":{"os":"win","locale":"zh-TW","channel":"beta","version":"64.0b12","platform":"win64"},"download":{"url":"https://archive.mozilla.org/pub/firefox/releases/64.0b12/win64/zh-TW/Firefox Setup 64.0b12.exe","date":"2018-11-23T06:11:35Z","size":44851760,"mimetype":"application/msdos-windows"},"id":"firefox_beta_64-0b12_win64_zh-tw","last_modified":1543262158074}]}curl   0.01s user 0.02s system 0% cpu 33.094 total

versus

▶ time curl http://localhost:8888/v1/buckets/build-hub/collections/releases/records\?_limit\=1
{"data":[{"build":{"as":"z:/build/build/src/vs2017_15.8.4/VC/bin/Hostx64/x64/ml64.exe","cc":"z:/build/build/src/clang/bin/clang-cl.exe -Xclang -std=gnu99 -fms-compatibility-version=19.15.26726","id":"20181122182000","cxx":"z:/build/build/src/clang/bin/clang-cl.exe -fms-compatibility-version=19.15.26726","date":"2018-11-22T18:20:00Z","host":"x86_64-pc-mingw32","number":1,"target":"x86_64-pc-mingw32"},"source":{"tree":"releases/mozilla-beta","product":"firefox","revision":"4fcfd15911a112ef9444082b78bfcc41ec974216","repository":"https://hg.mozilla.org/releases/mozilla-beta"},"target":{"os":"win","locale":"zh-TW","channel":"beta","version":"64.0b12","platform":"win64"},"download":{"url":"https://archive.mozilla.org/pub/firefox/releases/64.0b12/win64/zh-TW/Firefox Setup 64.0b12.exe","date":"2018-11-23T06:11:35Z","size":44851760,"mimetype":"application/msdos-windows"},"id":"firefox_beta_64-0b12_win64_zh-tw","last_modified":1543262158074}]}curl   0.01s user 0.01s system 12% cpu 0.152 total

Same output, obviously different headers, but it's 33 seconds vs. 0.152 seconds.
