diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
new file mode 100644
index 00000000000..2f6d830dc56
--- /dev/null
+++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
@@ -0,0 +1,177 @@

# 2024-02-27 Implementation Plan: Catalog Data Cleaning

**Author**: @krysal

## Reviewers

- [ ] TBD
- [ ] TBD

## Project links

- [Project Thread](https://github.com/WordPress/openverse/issues/430)

This project does not have a project proposal because the scope and rationale
of the project are clear, as defined in the project thread. If in doubt, check
the [Expected Outcomes](#expected-outcomes) section below.

## Overview

One of the steps of the [data refresh process for images][img-data-refresh] is
cleaning the data that is not fit for production. This process is triggered
weekly by an Airflow DAG and runs in the Ingestion Server, taking just over
**20 hours** to complete, according to an inspection of the latest executions.
The cleaned data is saved only to the API database, which is itself replaced
during each data refresh, so the _same_ corrections have to be applied again on
every run.

The cleaning process was designed this way to speed up the row updates, since
the priority was to serve correct data to users via the API. Most of the
affected rows were added before the creation of the `MediaStore` class in the
Catalog (possibly by the discontinued CommonCrawl ingestion), which is nowadays
responsible for validating provider data. However, this approach wastes
resources, both in time (which continues to increase) and in the machines (CPU)
it uses, and could easily be avoided by making the changes permanent, saving
them in the upstream database.

This implementation plan (IP) describes a path to save these resources and
finally normalise the catalog DB data, pushing the cleaning steps to the media
storage class and/or the provider API DAGs.

[img-data-refresh]:
  https://github.com/WordPress/openverse-catalog/blob/main/DAGs.md#image_data_refresh

## Expected Outcomes

- The catalog database (upstream) preserves the cleaned data produced by the
  current Ingestion Server's cleaning steps.
- The image data refresh process is simplified by reducing the time spent on
  the cleaning steps to nearly zero (and optionally removing them altogether).

## Step-by-step plan

The cleaning functions that the Ingestion Server applies are already ported to
the Catalog in the `MediaStore` class: see its `_tag_blacklisted` method (which
should probably be renamed) and the [url utilities][url_utils] file. The only
part that is not there, and cannot be ported, is the filtering of
low-confidence tags, since provider scripts don't save an "accuracy" per tag.

With this in place, the plan starts in the Ingestion Server with the following
steps:

1. [Save TSV files of cleaned data to AWS S3](#save-tsv-files-of-cleaned-data-to-aws-s3)
1. [Make and run a batched update DAG for one-time cleanup](#make-and-run-a-batched-update-dag-for-one-time-cleanup)
1. [Run an image Data Refresh to confirm cleaning time is reduced](#run-an-image-data-refresh-to-confirm-cleaning-time-is-reduced)

[url_utils]:
  https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/catalog/dags/common/urls.py

## Step details

### Save TSV files of cleaned data to AWS S3

In a previous exploration, the Ingestion Server was set to store TSV files of
the cleaned data in the form of ` `, which can later be used to perform the
updates efficiently in the catalog DB, which only has indexes on the
`identifier` field. These files are saved to the disk of the Ingestion Server
EC2 instances. This worked fine for files with URL corrections, since those
fields are relatively short, but became a problem when trying to save tags: the
file grew too large, filled up the disk, and caused problems for the data
refresh execution.

The alternative is to upload the TSV files to Amazon Simple Storage Service
(S3), either creating a new bucket or using a subfolder of the
`openverse-catalog` bucket. The benefit of using S3 is that it supports
streaming, which will allow us to read the files in chunks later if necessary
for performance. The downside is that S3 objects can't be appended to, so we
may have to upload files with distinct part numbers, or evaluate whether the
[multipart upload process][aws_mpu] will serve us here.

[aws_mpu]:
  https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html

| timestamp (UTC)     | `url` | `creator_url` | `foreign_landing_url` | `tags` |
| ------------------- | :---: | :-----------: | :-------------------: | :----: |
| 2024-02-27 04:05:26 | 22156 |    9035458    |        8809213        |   0    |
| 2024-02-20 04:06:56 | 22157 |    9035456    |        8809209        |   0    |
| 2024-02-13 04:41:22 | 22155 |    9035451    |        8809204        |   0    |

The table above shows the number of records cleaned per field for the last runs
at the time of writing this IP, except for tags, for which we don't have
accurate counts since file saving was disabled.
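To make the S3 option concrete, below is a minimal sketch of how the Ingestion
Server could upload the cleaned-data TSVs as part-numbered objects, assuming
`boto3` is available. The bucket name and key layout are placeholders for
illustration only; the real names would be decided during implementation.

```python
import boto3

# Hypothetical bucket and key layout (could be a new bucket or a subfolder of
# the existing `openverse-catalog` bucket).
BUCKET = "openverse-catalog"
PREFIX = "data-refresh-cleaned-data"

s3 = boto3.client("s3")


def upload_cleaned_tsv(local_path: str, field: str, part_number: int) -> str:
    """Upload one TSV chunk of cleaned values for `field` as its own object.

    Since S3 objects can't be appended to, each chunk gets a distinct key; a
    single multipart upload per field would be the alternative approach.
    """
    key = f"{PREFIX}/{field}/part-{part_number:05}.tsv"
    s3.upload_file(local_path, BUCKET, key)
    return key
```

If appending turns out to matter, the same layout could be swapped for one
multipart upload per field, at the cost of tracking part numbers and completing
the upload explicitly.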
### Make and run a batched update DAG for one-time cleanup

A batched catalog cleaner DAG (or potentially a `batched_update_from_files`
DAG) should take the files from the previous step and perform an arbitrary
batched update on a Catalog media table, while handling deadlocking and timeout
concerns, similarly to the [batched_update][batched_update] DAG. A
[proof of concept PR](https://github.com/WordPress/openverse/pull/3601) was
shown to work locally for the URL fields, so it will only need to be adapted
for tags.

[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update
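To illustrate the kind of statement such a DAG could run, here is a sketch of a
batched `UPDATE … FROM` against a temporary table loaded from one of the TSV
files. The temporary table and its columns are assumptions for illustration
only; the actual DAG would reuse the existing `batched_update` machinery for
batching, retries, and deadlock handling.

```python
# Illustrative SQL for a one-time cleanup of the `url` field, assuming the TSV
# has been loaded into a temp table `temp_cleaned_url(identifier, url)`.
# Names other than `image.identifier` are hypothetical.
CREATE_TEMP_TABLE_SQL = """
CREATE TEMP TABLE temp_cleaned_url (
    identifier uuid PRIMARY KEY,
    url text NOT NULL
);
"""

# Each run updates one slice of identifiers so that transactions stay short and
# row locks are released frequently; the DAG would loop over the offsets and
# commit after each batch.
UPDATE_BATCH_SQL = """
WITH batch AS (
    SELECT identifier, url
    FROM temp_cleaned_url
    ORDER BY identifier
    LIMIT %(batch_size)s OFFSET %(batch_offset)s
)
UPDATE image
SET url = batch.url
FROM batch
WHERE image.identifier = batch.identifier;
"""
```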
### Run an image Data Refresh to confirm cleaning time is reduced

Finally, after the previous steps are done, running a data refresh will confirm
that no more updates are applied at ingestion. If the time isn't significantly
reduced, it will be necessary to check what was missed in the previous steps.

## Dependencies

### Infrastructure

No changes needed. The Ingestion Server already has the credentials required to
[connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).

### Other projects or work

Once the steps have been completed and the method has been proven to work, we
could make additional, similar corrections following the same procedure. Some
potentially related issues are:

- [Some images have duplicate incorrectly decoded unicode tags #1303](https://github.com/WordPress/openverse/issues/1303)
- [Provider scripts may include html tags in record titles #1441](https://github.com/WordPress/openverse/issues/1441)
- [Fix Wikimedia image titles #1728](https://github.com/WordPress/openverse/issues/1728)

This will also open up space for more structural changes to the Openverse DB
schemas in a [second phase](https://github.com/WordPress/openverse/issues/244)
of the Data Normalization endeavor.

## Alternatives

## Rollback

In the rare case we need the old data back, we can resort to the DB backups,
which are performed [weekly][db_snapshots].

[db_snapshots]: ./../../../catalog/reference/DAGs.md#rotate_db_snapshots

## Prior art

- Previous attempt from cc-archive:
  [Clean preexisting data using ImageStore #517][mathemancer_pr]
- @obulat's PR to
  [add logging and save cleaned up data in the Ingestion Server](https://github.com/WordPress/openverse/pull/904)

[mathemancer_pr]: https://github.com/cc-archive/cccatalog/pull/517

diff --git a/documentation/projects/proposals/data_normalization/index.md b/documentation/projects/proposals/data_normalization/index.md
new file mode 100644
index 00000000000..2d2f7d8e966
--- /dev/null
+++ b/documentation/projects/proposals/data_normalization/index.md
@@ -0,0 +1,8 @@

# Data Normalization

```{toctree}
:titlesonly:
:glob:

*
```