Do we need to support get_all with a wildcard parent_id? #1616

glasserc · 2018-04-26T19:56:18Z

While investigating #1507, @peterbe discovered that the DISTINCT doesn't really make sense unless you have a wildcard parent_id. He proved this to his own satisfaction by removing DISTINCT and observing that no tests break.

Under what circumstances do we even have a wildcard parent_id? Unlike delete, where you might want to delete a thing and all its children, I couldn't think of a way to invoke this mechanism using the HTTP API. Indeed, there's a subtle bug in the current query and the fact that nobody has reported it makes me suspect that nobody ever actually uses it. Can we get rid of it?

The text was updated successfully, but these errors were encountered:

leplatrem · 2018-04-26T20:37:39Z

My position on stuff like this has always been: tests are specifications. If tests don't fail you can remove whatever you want :)

glasserc · 2018-04-26T20:50:00Z

It turns out there actually are two tests that cover get_all with a wildcard parent_id. But my assertion is that outside of those tests, nobody actually uses it. (The only code path I could see is if someone creates a Model subclass with a parent_id that has a wildcard in it.) What if I delete the functionality and the two tests that cover it, can I still remove it? 😀

Natim · 2018-04-26T22:02:17Z

Are you sure it's not a feature that we added to support the quota features?

leplatrem · 2018-04-27T10:29:32Z

Can you give us pointers to the tests?

Which PR / blame ? Any mention in CHANGELOG? Are those tests present in the original cliquet repo?

leplatrem · 2018-04-27T10:32:09Z

OK, I grep the usage of get_all and found this in the CHANGELOG:
The storage backend now allows ``parent_id`` pattern matching in ``kinto.core.storage.get_all``. (#821)

@Natim implemented it and I approved it in #821

leplatrem · 2018-04-27T10:42:33Z

I found its usage by searching "pattern matching" :)

kinto/kinto/plugins/accounts/views.py

Lines 82 to 83 in 5904085

    
           # Admin see all accounts. 
        
           return '*'

I remember now that we suffered when implementing the accounts parent id stuff. Each account entry is isolated using parent_id=user_id. As admin, I can see every entries, thus parent_id=*.

There might other ways to implement this, and I don't think the «account admin» feature is important enough to justify having lower perfs on get_all()

Natim · 2018-04-27T11:50:06Z

I also found why I implemented it in the first place:

https://github.com/Kinto/kinto-webpush/pull/4/files#diff-737c25e4e87da6befa4a70fef60cc91aR38

peterbe · 2018-04-27T12:31:37Z

Check this out. I messed with my buildhub_copy database and now I have two records that have identical ID, last_modified and data. The only difference is that they have different parent_id. This is normal.

(Note: In this database every single row belongs to the same collection_id)

First I run the original query from get_all:

WITH collection_filtered AS (
    SELECT id, last_modified, data, deleted
      FROM records
     WHERE parent_id LIKE '/buckets/build-hub/collections/%'
       AND collection_id = 'record'
       AND NOT deleted

),
total_filtered AS (
    SELECT COUNT(id) AS count_total
      FROM collection_filtered
     WHERE NOT deleted
),
paginated_records AS (
    SELECT DISTINCT id
      FROM collection_filtered

)
 SELECT count_total,
       a.id, as_epoch(a.last_modified) AS last_modified,
       MD5(jsonb_pretty(a.data)) AS data
  FROM paginated_records AS p JOIN collection_filtered AS a ON (a.id = p.id),
       total_filtered
  ORDER BY last_modified DESC
 LIMIT 10;

and I get:

 count_total |                      id                      | last_modified |               data
-------------+----------------------------------------------+---------------+----------------------------------
      183241 | firefox_beta_58-0b15rc1_linux-i686_cs        | 1523875992587 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_58-0b15rc1_linux-i686_cs2       | 1523861592005 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_60-0b10rc1_win32-eme-free_en-za | 1523630817174 | 54efb0c5bcfae989a74c1a9fd68daeb3
      183241 | firefox_beta_60-0b10rc1_win32-eme-free_en-gb | 1523630817158 | 1e7f84a3ae2c7b4f66e045d99473fff7
      183241 | firefox_beta_60-0b10rc1_macosx_ur            | 1523630817149 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_60-0b10rc1_macosx_ur            | 1523630817149 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_60-0b10rc1_win32-eme-free_el    | 1523630817147 | d5681a5f6f040ff5c44910acb9812d41
      183241 | firefox_beta_60-0b10rc1_macosx_uk            | 1523630817138 | ef2ab418fb87dc88fb49c38d2e289740
      183241 | firefox_beta_60-0b10rc1_win32-eme-free_dsb   | 1523630817137 | 9d4c8cb0ba8d1d4e0328453fe9c5e4cc
      183241 | firefox_beta_60-0b10rc1_macosx_tr            | 1523630817128 | 2098bd485166b7b5088a8d0fa71a75f9

Note the two identical rows with ID firefox_beta_60-0b10rc1_macosx_ur. Arguably useless and useful too as a JSON response. Who am I to judge.

Next, change that SQL to NOT do a DISTINCT:

 WITH collection_filtered AS (
     SELECT id, last_modified, data, deleted
       FROM records
      WHERE parent_id LIKE '/buckets/build-hub/collections/%'
        AND collection_id = 'record'
        AND NOT deleted

 ),
 total_filtered AS (
     SELECT COUNT(id) AS count_total
       FROM collection_filtered
      WHERE NOT deleted
 ),
 paginated_records AS (
     SELECT id
       FROM collection_filtered

 )
  SELECT count_total,
        a.id, as_epoch(a.last_modified) AS last_modified,
        MD5(jsonb_pretty(a.data)) AS data
   FROM paginated_records AS p JOIN collection_filtered AS a ON (a.id = p.id),
        total_filtered
   ORDER BY last_modified DESC
  LIMIT 10;

Now you get this:

 count_total |                      id                      | last_modified |               data
-------------+----------------------------------------------+---------------+----------------------------------
      183241 | firefox_beta_58-0b15rc1_linux-i686_cs        | 1523875992587 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_58-0b15rc1_linux-i686_cs2       | 1523861592005 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_60-0b10rc1_win32-eme-free_en-za | 1523630817174 | 54efb0c5bcfae989a74c1a9fd68daeb3
      183241 | firefox_beta_60-0b10rc1_win32-eme-free_en-gb | 1523630817158 | 1e7f84a3ae2c7b4f66e045d99473fff7
      183241 | firefox_beta_60-0b10rc1_macosx_ur            | 1523630817149 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_60-0b10rc1_macosx_ur            | 1523630817149 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_60-0b10rc1_macosx_ur            | 1523630817149 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_60-0b10rc1_macosx_ur            | 1523630817149 | 3d29a75fcf0ed7dfff86d3db8f92fc69
      183241 | firefox_beta_60-0b10rc1_win32-eme-free_el    | 1523630817147 | d5681a5f6f040ff5c44910acb9812d41
      183241 | firefox_beta_60-0b10rc1_macosx_uk            | 1523630817138 | ef2ab418fb87dc88fb49c38d2e289740

What the hell?! 4 records with ID firefox_beta_60-0b10rc1_macosx_ur. I can see the benefit of the DISTINCT for this particular query.

Compare this with my prototype in #1615
Ignore the count_total thing and let's just look at the SELECT:

select id, as_epoch(last_modified) AS last_modified,
MD5(jsonb_pretty(data)) AS data
from records
WHERE parent_id LIKE '/buckets/build-hub/collections/%'
ORDER BY last_modified DESC
LIMIT 10;

YOu get:

                      id                      | last_modified |               data
----------------------------------------------+---------------+----------------------------------
 firefox_beta_58-0b15rc1_linux-i686_cs        | 1523875992587 | 3d29a75fcf0ed7dfff86d3db8f92fc69
 firefox_beta_58-0b15rc1_linux-i686_cs2       | 1523861592005 | 3d29a75fcf0ed7dfff86d3db8f92fc69
 firefox_beta_60-0b10rc1_win32-eme-free_en-za | 1523630817174 | 54efb0c5bcfae989a74c1a9fd68daeb3
 firefox_beta_60-0b10rc1_win32-eme-free_en-gb | 1523630817158 | 1e7f84a3ae2c7b4f66e045d99473fff7
 firefox_beta_60-0b10rc1_macosx_ur            | 1523630817149 | 3d29a75fcf0ed7dfff86d3db8f92fc69
 firefox_beta_60-0b10rc1_macosx_ur            | 1523630817149 | 3d29a75fcf0ed7dfff86d3db8f92fc69
 firefox_beta_60-0b10rc1_win32-eme-free_el    | 1523630817147 | d5681a5f6f040ff5c44910acb9812d41
 firefox_beta_60-0b10rc1_macosx_uk            | 1523630817138 | ef2ab418fb87dc88fb49c38d2e289740
 firefox_beta_60-0b10rc1_win32-eme-free_dsb   | 1523630817137 | 9d4c8cb0ba8d1d4e0328453fe9c5e4cc
 firefox_beta_60-0b10rc1_macosx_tr            | 1523630817128 | 2098bd485166b7b5088a8d0fa71a75f9

Just like the original query (the one currently in master, first in this comment). Two identical rows with ID firefox_beta_60-0b10rc1_macosx_ur.
Meaning, even though the ID, last_modified and data are identical, you get repeated IDs with the code we have in master. At least you only get 2 identical rows and not 4.

If you really want to get distinct ID, last_modified, data combos you need to run this query:

select DISTINCT id, as_epoch(last_modified) AS last_modified,
MD5(jsonb_pretty(data)) AS data
from records
WHERE parent_id LIKE '/buckets/build-hub/collections/%'
ORDER BY last_modified DESC
LIMIT 10;

Then you get:

                      id                      | last_modified |               data
----------------------------------------------+---------------+----------------------------------
 firefox_beta_58-0b15rc1_linux-i686_cs        | 1523875992587 | 3d29a75fcf0ed7dfff86d3db8f92fc69
 firefox_beta_58-0b15rc1_linux-i686_cs2       | 1523861592005 | 3d29a75fcf0ed7dfff86d3db8f92fc69
 firefox_beta_60-0b10rc1_win32-eme-free_en-za | 1523630817174 | 54efb0c5bcfae989a74c1a9fd68daeb3
 firefox_beta_60-0b10rc1_win32-eme-free_en-gb | 1523630817158 | 1e7f84a3ae2c7b4f66e045d99473fff7
 firefox_beta_60-0b10rc1_macosx_ur            | 1523630817149 | 3d29a75fcf0ed7dfff86d3db8f92fc69
 firefox_beta_60-0b10rc1_win32-eme-free_el    | 1523630817147 | d5681a5f6f040ff5c44910acb9812d41
 firefox_beta_60-0b10rc1_macosx_uk            | 1523630817138 | ef2ab418fb87dc88fb49c38d2e289740
 firefox_beta_60-0b10rc1_win32-eme-free_dsb   | 1523630817137 | 9d4c8cb0ba8d1d4e0328453fe9c5e4cc
 firefox_beta_60-0b10rc1_macosx_tr            | 1523630817128 | 2098bd485166b7b5088a8d0fa71a75f9
 firefox_beta_60-0b10rc1_win32-eme-free_de    | 1523630817126 | 98d706e1bd14644d0378259fc6714475

All truly distinct rows.

peterbe · 2018-04-27T12:37:51Z

What's weird about this is that the count_total is likely to be off. At the very least; very confusing.
If you have a certain identical ID+last_modified+data combination all spread across 100 different parent_id. E.g. parent_id='/parent001', parent_id='/parent002', etc.
And you do: SELECT COUNT(*) FROM records WHERE parent_id LIKE '/parent%' the answer will 100. But then when you do the actual records you'll get exactly 1 single row. But there are 100 rows in the database but you only return 1. It's a bit like doing a query like:
get_all(first_name='peter', last_name='*', age=38, location='USA') and getting 1 row when clearly there are many different 38 year old Peters in the USA.
It smells like a weird result and I honestly don't get it.

peterbe · 2018-04-27T12:57:01Z

Another (perhaps unimportant) reason why I'm pessimistic towards the use of DISTINCT is that it's really expensive.
Look at these two queries:

explain analyze
select id, as_epoch(last_modified) AS last_modified,
MD5(jsonb_pretty(data)) AS data
from records
WHERE parent_id LIKE '/buckets/build-hub/collections/%'
ORDER BY last_modified DESC
LIMIT 10;

explain analyze
select DISTINCT id, as_epoch(last_modified) AS last_modified,
MD5(jsonb_pretty(data)) AS data
from records
WHERE parent_id LIKE '/buckets/build-hub/collections/%'
ORDER BY last_modified DESC
LIMIT 10;

The results:

                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..4.75 rows=10 width=87) (actual time=0.064..0.170 rows=10 loops=1)
   ->  Index Scan Backward using idx_records_last_modified_epoch on records  (cost=0.42..79412.79 rows=183242 width=87) (actual time=0.064..0.168 rows=10 loops=1)
         Filter: (parent_id ~~ '/buckets/build-hub/collections/%'::text)
         Rows Removed by Filter: 1
 Planning time: 0.127 ms
 Execution time: 0.199 ms
(6 rows)

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=101126.69..101126.79 rows=10 width=87) (actual time=1833.516..1833.522 rows=10 loops=1)
   ->  Unique  (cost=101126.69..102959.11 rows=183242 width=87) (actual time=1833.515..1833.521 rows=10 loops=1)
         ->  Sort  (cost=101126.69..101584.80 rows=183242 width=87) (actual time=1833.515..1833.517 rows=11 loops=1)
               Sort Key: (as_epoch(last_modified)) DESC, id COLLATE "C", (md5(jsonb_pretty(data)))
               Sort Method: external merge  Disk: 18352kB
               ->  Seq Scan on records  (cost=0.00..76337.24 rows=183242 width=87) (actual time=0.067..1699.084 rows=183241 loops=1)
                     Filter: (parent_id ~~ '/buckets/build-hub/collections/%'::text)
                     Rows Removed by Filter: 3
 Planning time: 0.091 ms
 Execution time: 1841.550 ms
(10 rows)

1.8 seconds versus. 1.3 milliseconds.

peterbe · 2018-04-27T13:00:13Z

Yikes! The above comment was on my little buildhub_copy database. This is what happens when I try it with my larger buildhub database:

                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..11.48 rows=10 width=84) (actual time=0.073..0.185 rows=10 loops=1)
   ->  Index Scan Backward using idx_records_last_modified_epoch on records  (cost=0.42..854442.25 rows=772692 width=84) (actual time=0.073..0.183 rows=10 loops=1)
         Filter: (parent_id ~~ '/buckets/build-hub/collections/%'::text)
         Rows Removed by Filter: 2
 Planning time: 0.201 ms
 Execution time: 0.227 ms
(6 rows)

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=541965.59..541965.69 rows=10 width=84) (actual time=13976.616..13976.623 rows=10 loops=1)
   ->  Unique  (cost=541965.59..549692.51 rows=772692 width=84) (actual time=13976.615..13976.620 rows=10 loops=1)
         ->  Sort  (cost=541965.59..543897.32 rows=772692 width=84) (actual time=13976.614..13976.616 rows=10 loops=1)
               Sort Key: (as_epoch(last_modified)) DESC, id COLLATE "C", (md5(jsonb_pretty(data)))
               Sort Method: external merge  Disk: 75328kB
               ->  Seq Scan on records  (cost=0.00..392443.11 rows=772692 width=84) (actual time=0.558..13193.046 rows=775553 loops=1)
                     Filter: (parent_id ~~ '/buckets/build-hub/collections/%'::text)
                     Rows Removed by Filter: 4
 Planning time: 0.117 ms
 Execution time: 14020.070 ms

14 seconds versus 0.2 milliseconds.

Natim · 2018-04-27T13:01:56Z

Ok we need to fix that, thanks for investigating ❤️

peterbe · 2018-04-27T13:29:13Z

Ok we need to fix that, thanks for investigating

"fixed" :)

glasserc · 2018-04-27T16:17:51Z

Each account entry is isolated using parent_id=user_id. As admin, I can see every entries, thus parent_id=*.

I also found why I implemented it in the first place: [kinto webpush uses it]

Thanks for digging this up, I would never have thought to look here.

In both of these use cases, the intent is to support some kind of scoping (typically by user) but having some kind of special functionality to which the scoping doesn't apply. In these circumstances you use UserResource instead of ShareableResource.

Naively, I guess I would have assumed that we use the permission backend to do this kind of scoping, because it seems cleaner and more powerful. But I guess doing that is pretty tricky because permissions are inherited -- if you have read on a collection, you have read on all its children.

I will open a PR to document this a little better. Thanks!

peterbe · 2018-04-27T19:21:47Z

This issue title is "Do we need to support get_all with a wildcard parent_id?" So I guess the answer to that is Yes.

I uncovered that if you remove the DISTINCT from the get_all query, no tests broke and that was sad.

Something that's still a weird is that if you have two identical rows in records with only the parent_id being different, without the DISTINCT you'd get these duplicated. I.e. You'd see the same ID 4 times. With the DISTINCT you only see the ID 2 times.

If you want to I can attempt to add a test that simulates that. I.e. that you get records in a list that are indistinguishable. I.e.

self.storage.clear()
now = datetime.utcnow()
data = {'foo': 'bar'}
self.storage.add(
  collection_id='a', 
  parent_id='a1', 
  last_modified=now,
  data=data,
  id='myid'
)
self.storage.add(
  collection_id='a', 
  parent_id='a2',  # Different
  last_modified=now,
  data=data,
  id='myid'
)

total, records = self.storage.get_all(collection_id='a', parent_id='a*')
assert total == 2
assert len(records) == 2
assert records[0] == records[1]

glasserc · 2018-04-27T19:45:26Z

I think it's clear that we don't have robust tests on the wildcard matching. I think it would be good to add some. I think it would be prudent to figure out what the correct behavior is and write tests that reflect that correct behavior before updating with the query.

peterbe · 2018-05-01T12:38:37Z

@glasserc It sucked that removing the DISTINCT from the query didn't break any tests. And I agree that that indicates that we lack a test.

It's more pressing to me that, today, get_all will yield two identical records in the final JSON with no distinct difference. I.e. you do:

now = datetime.now()
store..create(id='peter', last_modified=now, collection_id='guys', data={}, parent_id='Bengtsson')
store..create(id='peter', last_modified=now, collection_id='guys', data={}, parent_id='Bennet')

If I then query for: get_all(parent_id='B*', collection_id='guys') I'm going to get:

[
    {"id": "peter", "last_modified": "2018-05-01T12:00:00", "data": {}},
    {"id": "peter", "last_modified": "2018-05-01T12:00:00", "data": {}},
]

(it's implicit that I don't need to know the returned records' collection_id because I just specified that in my get_all query)

I think what I'm saying is that, let's think about that first and then we can either fix this or add a test case.

glasserc · 2018-05-02T03:10:22Z

Yes, I think we should think about that too. That's why I wrote we should figure out what the correct behavior should be and write tests that reflect that. What do you think the correct behavior should be?

Because parent_id with wildcards can match different objects that have the same ID, this old use of records indexed by IDs to get the union of a bunch of records no longer works. Instead, rely on the fact that the same record, matched multiple times, is the same object in memory, and use its memory location as its unique ID. This is kind of a hack, but I couldn't think of a better solution. Although this is one obvious place where we rely on every object in a given "result set" having a unique ID, there may be others hidden elsewhere. Refs: Kinto#1616

peterbe mentioned this issue Apr 27, 2018

[WIP] Optimize postgresql storage get all fixes 1507 #1615

Closed

6 tasks

glasserc mentioned this issue Apr 27, 2018

Document the use of wildcard parent_ids in Storage.get_all #1617

Merged

leplatrem closed this as completed in #1617 May 2, 2018

glasserc mentioned this issue May 4, 2018

Cleanup Postgres get_all #1622

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do we need to support get_all with a wildcard parent_id? #1616

Do we need to support get_all with a wildcard parent_id? #1616

glasserc commented Apr 26, 2018

leplatrem commented Apr 26, 2018

glasserc commented Apr 26, 2018

Natim commented Apr 26, 2018

leplatrem commented Apr 27, 2018

leplatrem commented Apr 27, 2018

leplatrem commented Apr 27, 2018

Natim commented Apr 27, 2018

peterbe commented Apr 27, 2018

peterbe commented Apr 27, 2018

peterbe commented Apr 27, 2018

peterbe commented Apr 27, 2018

Natim commented Apr 27, 2018 •

edited

Loading

peterbe commented Apr 27, 2018

glasserc commented Apr 27, 2018

peterbe commented Apr 27, 2018 •

edited

Loading

glasserc commented Apr 27, 2018

peterbe commented May 1, 2018 •

edited

Loading

glasserc commented May 2, 2018

Do we need to support get_all with a wildcard parent_id? #1616

Do we need to support get_all with a wildcard parent_id? #1616

Comments

glasserc commented Apr 26, 2018

leplatrem commented Apr 26, 2018

glasserc commented Apr 26, 2018

Natim commented Apr 26, 2018

leplatrem commented Apr 27, 2018

leplatrem commented Apr 27, 2018

leplatrem commented Apr 27, 2018

Natim commented Apr 27, 2018

peterbe commented Apr 27, 2018

peterbe commented Apr 27, 2018

peterbe commented Apr 27, 2018

peterbe commented Apr 27, 2018

Natim commented Apr 27, 2018 • edited Loading

peterbe commented Apr 27, 2018

glasserc commented Apr 27, 2018

peterbe commented Apr 27, 2018 • edited Loading

glasserc commented Apr 27, 2018

peterbe commented May 1, 2018 • edited Loading

glasserc commented May 2, 2018

Natim commented Apr 27, 2018 •

edited

Loading

peterbe commented Apr 27, 2018 •

edited

Loading

peterbe commented May 1, 2018 •

edited

Loading