Delete unacceptable thumbnails from catalog DB after the image data refresh is finished #1816
Comments
Since the data refresh process is now finished, I will proceed with this issue. First, I'm backing up the thumbnail data, just in case:
CREATE TEMPORARY TABLE small_image_thumbnails AS (
SELECT identifier, provider, thumbnail FROM IMAGE
WHERE thumbnail IS NOT NULL AND provider IN
('rijksmuseum', 'sketchfab', 'sciencemuseum', 'thingiverse')
);
\copy small_image_thumbnails TO '/tmp/small_image_thumbnails_2023_04_05.tsv' DELIMITER E'\t' CSV HEADER;
Uploaded to
The next step is to apply the update.
The changes for the small providers were applied quite fast:
UPDATE image SET thumbnail = NULL WHERE provider IN
('rijksmuseum', 'sketchfab', 'sciencemuseum', 'thingiverse');
UPDATE 198754
Time: 150.083s (2 minutes 30 seconds), executed in: 150.083s (2 minutes 30 seconds)
I tried to do the same for Flickr separately, since this provider is considerably larger. As expected, it took much longer to create the temporary table.
-- Backup
CREATE TEMPORARY TABLE flickr_thumbnails_deleted_2023_04_05 AS (
SELECT identifier, thumbnail FROM image WHERE provider='flickr' AND thumbnail IS NOT NULL
);
SELECT 497009314
Time: 1984.467s (33 minutes 4 seconds), executed in: 1984.466s (33 minutes 4 seconds)
And since the resulting table is huge:
SELECT pg_size_pretty(pg_relation_size('flickr_thumbnails_deleted_2023_04_05'));
+----------------+
| pg_size_pretty |
|----------------|
| 51 GB |
+----------------+
I tried to upload the TSV directly to S3, but it seems that our Postgres version doesn't support exports to Amazon S3 🫤
aws rds describe-db-engine-versions --region us-east-1 --engine postgres --engine-version 13.2 | grep s3Export
# returns nothing
So we have two options before applying the update for Flickr:
CREATE TABLE flickr_thumbnails_deleted_2023_04_05 (
identifier uuid, -- not sure if this would be allowed, since in theory UUIDs are unique across the DB
thumbnail varchar(3000)
);
-- this would be the backup
INSERT INTO flickr_thumbnails_deleted_2023_04_05
SELECT identifier, thumbnail FROM image
WHERE provider='flickr' AND thumbnail IS NOT NULL;
I slightly prefer the first option as it is the simplest. If we really need these thumbnails, we could probably get them from one of the full DB backups, although that would be more cumbersome. @WordPress/openverse What do you think?
Dang, 51 GB for just two values 🥲 I think we have no shortage of space on the catalog at this point, so my preference would be to save it to a non-temporary table!
Ditto. Unless there's a reason to save those 51 GB of space, keeping the table is probably going to be easier to recover from than extracting the thumbnails from a backup.
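For illustration, recovering from a persistent backup table could look roughly like the sketch below. It uses an in-memory SQLite database so it can run anywhere; the table name `thumbnails_backup`, the helper names, and the sample rows are assumptions made for the example, not the catalog's actual schema or tooling.

```python
import sqlite3


def build_demo_db():
    """Create a tiny stand-in for the catalog (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (identifier TEXT PRIMARY KEY, provider TEXT, thumbnail TEXT)"
    )
    conn.execute("CREATE TABLE thumbnails_backup (identifier TEXT, thumbnail TEXT)")
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?)",
        [
            ("a", "flickr", None),    # nullified, present in the backup
            ("b", "flickr", None),    # nullified, not in the backup
            ("c", "flickr", "keep"),  # never nullified
        ],
    )
    conn.execute("INSERT INTO thumbnails_backup VALUES ('a', 'thumb-a')")
    conn.commit()
    return conn


def restore_thumbnails(conn):
    """Copy thumbnails back from the backup table for rows that are still NULL."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            """
            UPDATE image
            SET thumbnail = (
                SELECT b.thumbnail FROM thumbnails_backup AS b
                WHERE b.identifier = image.identifier
            )
            WHERE thumbnail IS NULL
              AND identifier IN (SELECT identifier FROM thumbnails_backup)
            """
        )
    return cur.rowcount  # number of rows restored
```

Only rows that are both NULL and present in the backup are touched, so thumbnails re-ingested after the deletion would not be overwritten.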
Done! Flickr thumbnails are backed up to the persistent
It's been 4 hours and counting...
The previous manual update query was terminated on May 16th before it completed, since there was no way to know when it would finish and it was blocking other work. I created a simple DAG to do this progressively.
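The idea of updating progressively can be sketched as a loop that nullifies a limited number of rows per transaction until none are left. This is not the DAG mentioned above, only a minimal illustration of the batching approach against an in-memory SQLite database; the function names, batch size, and sample data are assumptions.

```python
import sqlite3


def build_demo_db(flickr_rows=25, other_rows=5):
    """Create a small stand-in image table (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (identifier TEXT PRIMARY KEY, provider TEXT, thumbnail TEXT)"
    )
    rows = [(f"f{i}", "flickr", f"thumb-f{i}") for i in range(flickr_rows)]
    rows += [(f"o{i}", "other", f"thumb-o{i}") for i in range(other_rows)]
    conn.executemany("INSERT INTO image VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def nullify_in_batches(conn, provider, batch_size=10):
    """Set thumbnail to NULL for one provider, one small batch per transaction."""
    total = 0
    while True:
        with conn:  # each batch commits on its own, so locks are held briefly
            cur = conn.execute(
                """
                UPDATE image SET thumbnail = NULL
                WHERE identifier IN (
                    SELECT identifier FROM image
                    WHERE provider = ? AND thumbnail IS NOT NULL
                    LIMIT ?
                )
                """,
                (provider, batch_size),
            )
        if cur.rowcount == 0:  # nothing left to update
            break
        total += cur.rowcount
    return total
```

Because every batch is committed separately, the job can be stopped and resumed without losing progress, which a single long-running `UPDATE` does not allow.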
@krysal Should we reopen this and wait to close it until the thumbnails themselves are deleted?
@AetherUnbound that's right.
This is completed.
Description
In #1811 the need arose to delete the thumbnails of the following providers: 'rijksmuseum', 'sketchfab', 'sciencemuseum', 'thingiverse' and 'flickr'.
Description
Make a backup (just in case something goes wrong) and apply the update to nullify those providers' thumbnails:
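The two steps above can be sketched as follows. This is a minimal, self-contained illustration on an in-memory SQLite database rather than the queries run against the catalog; the backup table name `thumbnails_backup`, the helper names, and the sample rows are assumptions for the example.

```python
import sqlite3

SMALL_PROVIDERS = ("rijksmuseum", "sketchfab", "sciencemuseum", "thingiverse")


def build_demo_db():
    """Create a tiny stand-in image table (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (identifier TEXT PRIMARY KEY, provider TEXT, thumbnail TEXT)"
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?)",
        [
            ("a", "rijksmuseum", "https://example.org/a.jpg"),
            ("b", "sketchfab", None),
            ("c", "thingiverse", "https://example.org/c.jpg"),
            ("d", "flickr", "https://example.org/d.jpg"),
        ],
    )
    conn.commit()
    return conn


def backup_and_nullify(conn, providers):
    """Copy matching thumbnails to a backup table, then set them to NULL."""
    marks = ", ".join("?" for _ in providers)
    condition = f"thumbnail IS NOT NULL AND provider IN ({marks})"
    # Plain CREATE TABLE fails if a backup already exists, which avoids
    # silently mixing two backups.
    conn.execute(
        "CREATE TABLE thumbnails_backup (identifier TEXT, provider TEXT, thumbnail TEXT)"
    )
    with conn:  # the insert and the update commit or roll back together
        conn.execute(
            "INSERT INTO thumbnails_backup "
            f"SELECT identifier, provider, thumbnail FROM image WHERE {condition}",
            providers,
        )
        backed_up = conn.execute("SELECT COUNT(*) FROM thumbnails_backup").fetchone()[0]
        updated = conn.execute(
            f"UPDATE image SET thumbnail = NULL WHERE {condition}", providers
        ).rowcount
        if backed_up != updated:
            raise RuntimeError(f"backed up {backed_up} rows but updated {updated}")
    return updated
```

Comparing the backup count with the number of updated rows is a cheap sanity check that nothing is nullified without a saved copy.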
Additional context
Related to #675.