diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
index 5bb09340401..c16384559cc 100644
--- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
+++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
@@ -88,7 +88,7 @@ which only had indexes for the `identifier` field. These files are saved to the
 disk of the Ingestion Server EC2 instances, and worked fine for files with URL
 corrections since this type of fields is relatively short, but became a problem
 when trying to save tags, as the file turned too large and filled up the disk,
-causing problems to the data refresh execution.
+causing issues for the data refresh execution.
 
 [aws_mpu]:
   https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
@@ -105,17 +105,19 @@ writing this IP, except for tags, which we don't have accurate registries since
 file saving was disabled.
 
 The alternative is to upload TSV files to the Amazon Simple Storage Service
-(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The
+(S3), creating a new bucket or using a subfolder within `openverse-catalog`. The
 benefit of using S3 buckets is that they have streaming capabilities and will
 allow us to read the files in chunks later if necessary for performance. The
-downside is that objects in S3 don't allow appending, so it may require to
-upload files with different part numbers or evaluate if the [multipart upload
-process][aws_mpu] will serve us here.
+downside is that objects in S3 don't natively allow appending, so it may require
+uploading files with different part numbers, or evaluating whether the [multipart
+upload process][aws_mpu] or, more simply, `smart_open` could serve us here.
+
+[smart_open]: https://github.com/piskvorky/smart_open
 
 ### Make and run a batched update DAG for one-time cleanup
 
 A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
-should take the files of the previous step to perform an batched update on the
+should take the files of the previous step to perform a batched update on the
 catalog's image table, while handling deadlocking and timeout concerns, similar
 to the [batched_update][batched_update]. This table is constantly in use by
 other DAGs, such as those from providers ingestion or the data refresh process,
@@ -126,10 +128,15 @@ and ideally can't be singly blocked by any DAG.
 
 A [proof of concept PR](https://github.com/WordPress/openverse/pull/3601)
 consisted of uploading each file to temporary `UNLOGGED` DB tables (which
 provides huge gains in writing performance while their disadventages are not
-relevant to us, they won't be permanent), and including a `row_id` serial number
-used later to query it in batches. Adding an index in this last column after
-filling up the table could improve the query performance. An adaptation will be
-needed to handle the column type of tags (`jsonb`).
+relevant to us as they won't be permanent), and including a `row_id` serial
+number used later to query it in batches (sketched below). The following must
+be included:
+
+- Add an index for the `identifier` column in the temporary table after filling
+  it up, to improve the query performance
+- Adapt the update to handle the column type of tags (`jsonb`) and to modify
+  the `metadata`
+- Include a DAG task for reporting the number of rows affected per column to
+  Slack
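+
+A rough sketch of what this batched update could look like, assuming a
+hypothetical `temp_cleaned_tags` table, connection ID, and batch size (the
+actual names and batching logic will be defined in the DAG):
+
+```python
+# Illustrative sketch only: the table name, connection ID, and batch size are
+# assumptions, not decisions.
+from airflow.providers.postgres.hooks.postgres import PostgresHook
+
+BATCH_SIZE = 10_000
+
+
+def update_tags_in_batches(temp_table: str = "temp_cleaned_tags") -> None:
+    pg = PostgresHook(postgres_conn_id="postgres_openledger_upstream")
+    max_row_id = pg.get_first(f"SELECT max(row_id) FROM {temp_table}")[0] or 0
+    for first in range(1, max_row_id + 1, BATCH_SIZE):
+        # Each statement touches a bounded slice of rows, keeping locks short
+        # so provider DAGs and the data refresh are not blocked for long.
+        pg.run(
+            f"""
+            UPDATE image
+            SET tags = tmp.tags
+            FROM {temp_table} AS tmp
+            WHERE image.identifier = tmp.identifier
+              AND tmp.row_id BETWEEN %(first)s AND %(last)s;
+            """,
+            parameters={"first": first, "last": first + BATCH_SIZE - 1},
+        )
+```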
 
 ### Run an image data refresh to confirm cleaning time is reduced
 
@@ -149,10 +156,11 @@ later.
 
 No changes needed. The Ingestion Server already has the credentials required to
 [connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).
-
+
+
+Requires installing and becoming familiar with the [smart_open][smart_open] utility.
 
 ### Other projects or work