Add Data Cleaning implementation plan

WordPress · Feb 29, 2024 · 933cbd3
1 parent a930ee0
commit 933cbd3

Showing 2 changed files with 185 additions and 0 deletions.
diff --git a/...posals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/...posals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
@@ -0,0 +1,177 @@
+# 2024-02-27 Implementation Plan: Catalog Data Cleaning
+
+**Author**: @krysal
+
+## Reviewers
+
+- [ ] TBD
+- [ ] TBD
+
+## Project links
+
+<!-- Enumerate any references to other documents/pages, including milestones and other plans -->
+
+- [Project Thread](https://github.com/WordPress/openverse/issues/430)
+
+This project does not have a project proposal because the scope and rationale of
+the project are clear, as defined in the project thread. When in doubt, check
+the [Expected Outcomes](#expected-outcomes) section below.
+
+## Overview
+
+One of the steps of the [data refresh process for images][img-data-refresh] is
+cleaning the data that is not fit for production. This process is triggered
+weekly by an Airflow DAG and then runs in the Ingestion Server, taking just
+over **20 hours** to complete, according to an inspection of the latest
+executions. The cleaned data is only saved to the API database, which is
+replaced during each data refresh, so the _same_ corrections have to be
+repeated every time.
+
+This cleaning process was designed this way to speed up the row updates, since
+the priority was to provide the correct data to users via the API. Most of the
+affected rows were added (possibly by the discontinued CommonCrawl ingestion)
+before the creation of the `MediaStore` class in the Catalog, which is
+nowadays responsible for validating the provider data. However, this approach
+wastes resources, both in time, which continues to increase, and in the
+machines (CPU) it uses. This waste could easily be avoided by making the
+changes permanent, that is, by saving them in the upstream database.
+
+This implementation plan (IP) describes a path to save these resources and
+finally normalise the catalog DB data, pushing cleaning steps to the media
+storage class and/or provider API DAGs.
+
+[img-data-refresh]:
+  https://github.com/WordPress/openverse-catalog/blob/main/DAGs.md#image_data_refresh
+
+## Expected Outcomes
+
+<!-- List any succinct expected products from this implementation plan. -->
+
+- The catalog database (upstream) preserves the cleaned data results of the
+  current Ingestion Server's cleaning steps.
+- The image Data Refresh process is simplified by reducing the time spent on
+  the cleaning steps to nearly zero (and optionally removing them).
+
+## Step-by-step plan
+
+The cleaning functions that the Ingestion Server applies are already ported to
+the Catalog in the `MediaStore` class: see its `_tag_blacklisted` method (which
+should probably be renamed) and the [url utilities][url_utils] file. The only
+part that is not there and can't be ported is the filtering of low-confidence
+tags, since provider scripts don't save an "accuracy" per tag.
+
+With this in mind, the plan starts in the Ingestion Server with the following
+steps:
+
+1. [Save TSV files of cleaned data to AWS S3](#save-tsv-files-of-cleaned-data-to-aws-s3)
+1. [Make and run a batched update DAG for one-time cleanup](#make-and-run-a-batched-update-dag-for-one-time-cleanup)
+1. [Run an image Data Refresh to confirm cleaning time is reduced](#run-an-image-data-refresh-to-confirm-cleaning-time-is-reduced)
+
+[url_utils]:
+  https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/catalog/dags/common/urls.py
+
+## Step details
+
+### Save TSV files of cleaned data to AWS S3
+
+In a previous exploration, the Ingestion Server was set up to store TSV files
+of the cleaned data in the form of `<identifier> <cleaned_field>`, which can be
+used later to perform the updates efficiently in the catalog DB, which only had
+indexes for the `identifier` field. These files are saved to the disk of the
+Ingestion Server EC2 instances. This worked fine for files with URL
+corrections, since these fields are relatively short, but became a problem
+when trying to save tags: the file grew too large and filled up the disk,
+causing problems for the data refresh execution.
+
+The alternative is to upload the TSV files to the Amazon Simple Storage Service
+(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The
+benefit of using S3 buckets is that they have streaming capabilities and will
+allow us to read the files in chunks later if necessary for performance. The
+downside is that objects in S3 don't allow appending, so it may be necessary to
+upload files with different part numbers or to evaluate whether the
+[multipart upload process][aws_mpu] will serve us here.
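As a rough illustration of the "different part numbers" option, the sketch below buffers cleaned rows in memory and hands each full buffer to an uploader as a separate numbered object. Everything named here (`upload_tsv_in_parts`, `rows_per_part`, the key prefix and the key layout) is hypothetical and not taken from the Ingestion Server; the `upload` callable stands in for whichever S3 client call is eventually chosen.

```python
import csv
import io
from typing import Callable, Iterable


def upload_tsv_in_parts(
    rows: Iterable[tuple[str, str]],
    field: str,
    upload: Callable[[str, bytes], None],
    rows_per_part: int = 1_000_000,
    prefix: str = "data-refresh-cleaned-data",  # hypothetical key prefix
) -> list[str]:
    """Write (identifier, cleaned_value) rows as numbered TSV parts.

    S3 objects cannot be appended to, so each full buffer becomes its own
    object instead of growing a single file on the local disk.
    """
    keys: list[str] = []
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    buffered = 0

    def flush() -> None:
        # Upload the current buffer (if any) and start a fresh one.
        nonlocal buffer, writer, buffered
        if buffered == 0:
            return
        key = f"{prefix}/{field}_part{len(keys) + 1:04d}.tsv"
        upload(key, buffer.getvalue().encode("utf-8"))
        keys.append(key)
        buffer = io.StringIO()
        writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
        buffered = 0

    for identifier, cleaned_value in rows:
        writer.writerow([identifier, cleaned_value])
        buffered += 1
        if buffered >= rows_per_part:
            flush()
    flush()  # upload the final, possibly partial, part
    return keys
```

With boto3, `upload` could be something like `lambda key, body: s3.put_object(Bucket="openverse-catalog", Key=key, Body=body)` where `s3 = boto3.client("s3")`; the bucket choice and part size would still need to be decided as part of this step.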
+
+[aws_mpu]:
+  https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
+
+| timestamp (UTC)     | 'url' | 'creator_url' | 'foreign_landing_url' | 'tags' |
+| ------------------- | :---: | :-----------: | :-------------------: | :----: |
+| 2024-02-27 04:05:26 | 22156 |    9035458    |        8809213        |   0    |
+| 2024-02-20 04:06:56 | 22157 |    9035456    |        8809209        |   0    |
+| 2024-02-13 04:41:22 | 22155 |    9035451    |        8809204        |   0    |
+
+The previous table shows the number of records cleaned per field in the latest
+runs at the time of writing this IP, except for tags, for which we don't have
+accurate records since file saving was disabled.
+
+### Make and run a batched update DAG for one-time cleanup
+
+A batched catalog cleaner DAG (or potentially a `batched_update_from_files`)
+should take the files from the previous step to perform an arbitrary batched
+update on a Catalog media table, while handling deadlocking and timeout
+concerns, similar to the [batched_update][batched_update] DAG. A
+[proof of concept PR](https://github.com/WordPress/openverse/pull/3601) was
+shown to work locally for URL fields, so it will just need to be adapted to
+the tags.
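The core idea of applying a TSV file in batches can be sketched as follows. This is not the proof-of-concept DAG: it uses an in-memory-friendly `sqlite3` connection as a stand-in for the PostgreSQL catalog, and the names `apply_tsv_updates`, `batched` and `ALLOWED_FIELDS` are assumptions made for illustration. Each batch runs in its own short transaction, which is one common way to limit lock duration and timeouts.

```python
import csv
import io
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, TextIO

# Only these columns may be interpolated into the query (hypothetical list).
ALLOWED_FIELDS = {"url", "creator_url", "foreign_landing_url"}


def batched(rows: Iterable[list[str]], size: int) -> Iterator[list[list[str]]]:
    """Yield consecutive lists of at most `size` rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def apply_tsv_updates(
    conn: sqlite3.Connection, tsv_file: TextIO, field: str, batch_size: int = 1000
) -> int:
    """Apply `<identifier>\t<cleaned_field>` rows to the `image` table."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"Unexpected field: {field}")
    query = f"UPDATE image SET {field} = ? WHERE identifier = ?"
    reader = csv.reader(tsv_file, delimiter="\t")
    updated = 0
    for batch in batched(reader, batch_size):
        # One transaction per batch: committed on success, rolled back on error.
        with conn:
            cursor = conn.executemany(
                query, [(value, identifier) for identifier, value in batch]
            )
            updated += cursor.rowcount
    return updated
```

In the real DAG the rows would be read from the S3 files of the previous step, and the update would go through the Catalog's PostgreSQL connection rather than `sqlite3`.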
+
+[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update
+
+### Run an image Data Refresh to confirm cleaning time is reduced
+
+Finally, after the previous steps are done, running a data refresh will confirm
+that no more updates are applied at ingestion. If the cleaning time isn't
+significantly reduced, it will be necessary to check what was missed in the
+previous steps.
+
+## Dependencies
+
+### Infrastructure
+
+<!-- Describe any infrastructure that will need to be provisioned or modified. In particular, identify associated potential cost changes. -->
+
+No changes needed. The Ingestion Server already has the credentials required to
+[connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).
+
+<!--
+### Tools & packages
+
+ Describe any tools or packages which this work might be dependent on. If multiple options are available, try to list as many as are reasonable with your own recommendation. -->
+
+### Other projects or work
+
+Once the steps have been completed and the method has proved to work, we could
+make additional similar corrections following the same procedure. Some
+potentially related issues are:
+
+- [Some images have duplicate incorrectly decoded unicode tags #1303](https://github.com/WordPress/openverse/issues/1303)
+- [Provider scripts may include html tags in record titles #1441](https://github.com/WordPress/openverse/issues/1441)
+- [Fix Wikimedia image titles #1728](https://github.com/WordPress/openverse/issues/1728)
+
+This will also open up space for more structural changes to the Openverse DB
+schemas in a [second phase](https://github.com/WordPress/openverse/issues/244)
+of the Data Normalization endeavor.
+
+## Alternatives
+
+<!-- Describe any alternatives considered and why they were not chosen or recommended. -->
+
+## Rollback
+
+<!-- How do we roll back this solution in the event of failure? Are there any steps that can not easily be rolled back? -->
+
+In the rare case we need the old data back, we can resort to DB backups, which
+are performed [weekly][db_snapshots].
+
+[db_snapshots]: ./../../../catalog/reference/DAGs.md#rotate_db_snapshots
+
+<!--
+## Risks
+
+What risks are we taking with this solution? Are there risks that once taken can’t be undone?-->
+
+## Prior art
+
+- Previous attempt from cc-archive:
+  [Clean preexisting data using ImageStore #517][mathemancer_pr]
+- @obulat's PR to
+  [add logging and save cleaned up data in the Ingestion Server](https://github.com/WordPress/openverse/pull/904)
+
+[mathemancer_pr]: https://github.com/cc-archive/cccatalog/pull/517
diff --git a/documentation/projects/proposals/data_normalization/index.md b/documentation/projects/proposals/data_normalization/index.md
@@ -0,0 +1,8 @@
+# Data Normalization
+
+```{toctree}
+:titlesonly:
+:glob:
+
+*
+```