# Performance of plural endpoint with millions of records #1507
The SELECT it does looks like this:

```sql
WITH collection_filtered AS (
    SELECT id, last_modified, data, deleted
    FROM records
    WHERE parent_id = '/buckets/build-hub/collections/releases'
      AND collection_id = 'record'
      AND NOT deleted
),
total_filtered AS (
    SELECT COUNT(id) AS count_total
    FROM collection_filtered
    WHERE NOT deleted
),
paginated_records AS (
    SELECT DISTINCT id
    FROM collection_filtered
)
SELECT count_total,
       a.id, as_epoch(a.last_modified) AS last_modified, a.data
FROM paginated_records AS p JOIN collection_filtered AS a ON (a.id = p.id),
     total_filtered
ORDER BY last_modified DESC
LIMIT 10;
```

Even though it's limited to the first 10 rows, the whole query becomes a "beast" for Postgres:
~9 seconds!!

Compare with this simple query:

```sql
SELECT id, as_epoch(last_modified) AS last_modified, data
FROM records
WHERE parent_id = '/buckets/build-hub/collections/releases'
  AND collection_id = 'record'
  AND deleted = false
ORDER BY last_modified DESC
LIMIT 10;
```
The count is quite expensive:

```sql
SELECT count(*)
FROM records
WHERE parent_id = '/buckets/build-hub/collections/releases'
  AND collection_id = 'record'
  AND NOT deleted;
```

That returns 772,688 here in my database.

Note what happens if I create an index:

```sql
CREATE INDEX records_parent_id_collection_id_not_deleted_idx
ON records (parent_id, collection_id)
WHERE NOT deleted;
```
So the combined query takes ~9 seconds. But if you break it up into a count and a plain select, it could be 4.152 ms + 414.734 ms. And with the index, it would be 4.152 ms + 122.577 ms instead. Basically, what takes 9 seconds can take 0.5 seconds if we break it up.
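As a sketch of what "breaking it up" means in practice (psycopg2, the placeholder DSN, and the client-side timing harness are assumptions here, not the actual Kinto backend code):

```python
# Hedged sketch: run the count and the page fetch as two separate statements
# instead of the single combined CTE query.
import time
import psycopg2

PARENT_ID = "/buckets/build-hub/collections/releases"

conn = psycopg2.connect("dbname=buildhub")  # placeholder DSN
cur = conn.cursor()

t0 = time.perf_counter()
cur.execute(
    "SELECT count(*) FROM records"
    " WHERE parent_id = %s AND collection_id = 'record' AND NOT deleted;",
    (PARENT_ID,),
)
(count_total,) = cur.fetchone()
t1 = time.perf_counter()

cur.execute(
    "SELECT id, as_epoch(last_modified) AS last_modified, data FROM records"
    " WHERE parent_id = %s AND collection_id = 'record' AND NOT deleted"
    " ORDER BY last_modified DESC LIMIT 10;",
    (PARENT_ID,),
)
rows = cur.fetchall()
t2 = time.perf_counter()

print(f"count: {count_total} rows in {(t1 - t0) * 1000:.1f} ms")
print(f"page:  {len(rows)} rows in {(t2 - t1) * 1000:.1f} ms")
conn.close()
```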
---

Used best-explain-analyze.py to run the big fat query a bunch of times to see the best possible time you can get. The best it can do (after 10 attempts) is 8543.617 ms. For the record, if I change it from

In conclusion, doing
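The script itself isn't included in this thread; a best-of-N harness along these lines could reproduce the measurement (timing client-side with `time.perf_counter()` rather than parsing EXPLAIN ANALYZE output; the file name and DSN are placeholders):

```python
# Hedged sketch of a "best of N runs" timing harness.
import time
import psycopg2

N = 10
QUERY = open("big-fat-query.sql").read()  # the WITH ... LIMIT 10 query above

conn = psycopg2.connect("dbname=buildhub")  # placeholder DSN
cur = conn.cursor()

best = float("inf")
for _ in range(N):
    t0 = time.perf_counter()
    cur.execute(QUERY)
    cur.fetchall()
    best = min(best, (time.perf_counter() - t0) * 1000)

print(f"best of {N} runs: {best:.3f} ms")
conn.close()
```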
---

And in the same vein, if I run the simple select

```sql
SELECT id, as_epoch(last_modified) AS last_modified, data
FROM records
WHERE parent_id = '/buckets/build-hub/collections/releases'
  AND collection_id = 'record'
  AND deleted = false
ORDER BY last_modified DESC
LIMIT 10;
```

100 times, I get:

So, instead of ~4 ms the actual number is
---

The scary thing about this is that during those 9 seconds it's really CPU intensive, and can cause other more pressing and complex queries to be blocked and build up.

I have a question though: why the DISTINCT? If you run

```sql
SELECT DISTINCT id, as_epoch(last_modified) AS last_modified, data
FROM records
WHERE parent_id = '/buckets/build-hub/collections/releases'
  AND collection_id = 'record'
  AND deleted = false
ORDER BY last_modified DESC
LIMIT 10;
```

then you get:

It looks like this:

If there are multiple different IDs in that parent_id/collection_id, the DISTINCT changes nothing. After all, you can't ever have two rows with the same id for the same parent_id/collection_id. THIS IS IMPOSSIBLE (because of the index
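The index being referred to is cut off above; presumably it is a uniqueness guarantee on the records table. As an illustration only (the constraint name and exact definition in Kinto's actual schema are assumptions), the guarantee would come from something like:

```python
# Hedged sketch: the kind of unique index that would make duplicate
# (id, parent_id, collection_id) rows impossible. All names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=buildhub")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE UNIQUE INDEX IF NOT EXISTS records_id_parent_collection_uniq
        ON records (id, parent_id, collection_id);
    """)
conn.close()
```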
---

Ah! I get it. If the parent_id is a wildcard expression, you will benefit from the DISTINCT.

So this is what I propose:

---

I'm still a bit confused about the DISTINCT at all. If you ask to get all records in
---

I don't recall, but I am pretty sure that if you remove that DISTINCT and run the PostgreSQL storage tests, one test will fail and explain it.
---

I am skeptical that separating the
---

The select one doesn't scan the table at all. It just uses an index based on

---

Will do! 🤠
---

I guess I should also mention that keeping it in one query is probably better for reentrancy, although it may not matter much for this particular use case.
---

Tried that. I still think it's a weird thing to use DISTINCT at all. But I don't want to solve that as part of this issue. This issue can ignore that debate and work around the "problem".
---

It seems like we're going to discuss this next week when we're all together in Orlando. In the interests of readiness, could we have some information about what the context/use case here is? It seems like these requests are happening in Buildhub, which we know has a ton of records all in one collection. What kinds of requests are being made? What are they used for?

I want to rule out the risk that we implement a "fix" but end up still slow because of other requests that repeatedly filter a large collection. Since we acknowledge that this is largely specific to the pathological use that Buildhub makes of Kinto, maybe the right fix is to just add Buildhub-specific indexes to the Buildhub Kinto DB.
---

Mind you, Buildhub is being slowly replaced with Buildhub2, which doesn't use Kinto. It's a slowly dying project.

One thing I'm curious about is: are there other collections in the wild (I guess we are restricted to knowing about Mozilla projects) that have a large number of records per collection?

```
▶ cat ~/count.sql
select parent_id, collection_id, count(*) as count from records
group by parent_id, collection_id
order by count desc limit 20;

▶ psql buildhub < ~/count.sql
                   parent_id                   | collection_id |  count
-----------------------------------------------+---------------+---------
 /buckets/build-hub/collections/releases       | record        | 1037997
                                               | bucket        |       2
 /buckets/2ca20155-dd42-fa2b-8e18-7102d2c3af79 | collection    |       1
 /buckets/build-hub                            | collection    |       1
(4 rows)

▶ psql workon < ~/count.sql
                            parent_id                            | collection_id | count
-----------------------------------------------------------------+---------------+-------
 /buckets/22cf4aa5-9bf3-2539-da76-a7d382a7c354/collections/todos | record        |    58
 /buckets/396037c3-8d15-495d-2e0a-037068da6dfa/collections/todos | record        |    56
 /buckets/f6d78db1-bcd7-8518-2f94-01545e622769/collections/todos | record        |     8
 /buckets/4a96c0e8-bed0-5c26-dcad-3b0a7619a6a5/collections/todos | record        |     6
                                                                 | bucket        |     5
 /buckets/f6d78db1-bcd7-8518-2f94-01545e622769                   | collection    |     1
 /buckets/396037c3-8d15-495d-2e0a-037068da6dfa                   | collection    |     1
 /buckets/ff27fd9d-5a5b-5d94-ef7c-c7e8103bb25c                   | collection    |     1
 /buckets/4a96c0e8-bed0-5c26-dcad-3b0a7619a6a5                   | collection    |     1
 /buckets/22cf4aa5-9bf3-2539-da76-a7d382a7c354                   | collection    |     1
(10 rows)
```

If Mozilla's Buildhub1 is a globally unique instance, then your point @glasserc is extremely relevant.

By the way, would it be an idea to release a plugin called "kinto-without-counts" or something that basically overrides the SELECT queries as per my naive patches mentioned above? I don't think I would even personally care about using it for Buildhub, under the circumstances, even if it existed.
---

Per my comments in #1624, I think having two SELECT queries violates abstractions without any real benefit. If the only problem it solves is Buildhub, and if you don't want to use it for Buildhub, I would say no, let's not release it. I'm open to trying to optimize Buildhub more generally, which is a broader topic, but in order to know how best to attack it, I'd need to have some information about what the context is. What kinds of requests are being made? What are they used for?
---

I think the gist of #1624 isn't about doing two SELECT queries. It's about doing 1
---

Is this issue still outstanding, or can we close it now that #1931 landed?

---

Yes! Thanks!
---

*Original issue description:*

Reaching a plural endpoint with a million records should be super fast when using pagination (`?_limit=10`) or filtering. Apparently, that's not the case. I suspect #1267 to be responsible for the regression.
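For example, the slow path here would be exercised by a paginated listing request against the plural records endpoint. A minimal sketch (the server URL is a placeholder; treat the response shape and header names as assumptions recalled from Kinto's HTTP API, not confirmed by this thread):

```python
# Hedged sketch: request one small page from the plural records endpoint.
import requests

SERVER = "http://localhost:8888/v1"  # placeholder
url = f"{SERVER}/buckets/build-hub/collections/releases/records?_limit=10"

resp = requests.get(url)
resp.raise_for_status()
print(len(resp.json()["data"]), "records in this page")
print("Total-Records:", resp.headers.get("Total-Records"))  # the costly count
print("Next-Page:", resp.headers.get("Next-Page"))
```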
For anyone interested in tackling this, here is a quick way to fill up a collection with fake records:
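The snippet originally attached here was lost in this copy of the issue. As a stand-in, one quick way is to bulk-insert rows directly into the PostgreSQL backend (column names are taken from the queries above; the DSN, id scheme, and value types are assumptions):

```python
# Hedged sketch: fill a collection with a million fake records by writing
# straight into the storage backend. Schema details are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=buildhub")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        INSERT INTO records (id, parent_id, collection_id,
                             last_modified, data, deleted)
        SELECT
            md5(n::text),                               -- fake record id
            '/buckets/build-hub/collections/releases',  -- parent object path
            'record',
            TIMESTAMP '2018-01-01'
                + n * INTERVAL '1 millisecond',         -- distinct timestamps
            '{}'::jsonb,                                -- empty record body
            false                                       -- not a tombstone
        FROM generate_series(1, 1000000) AS n;
    """)
conn.close()
```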