Add Data Cleaning implementation plan
krysal committed Feb 29, 2024
1 parent a930ee0 commit 933cbd3
Showing 2 changed files with 185 additions and 0 deletions.
@@ -0,0 +1,177 @@
# 2024-02-27 Implementation Plan: Catalog Data Cleaning

**Author**: @krysal

## Reviewers

- [ ] TBD
- [ ] TBD

## Project links

<!-- Enumerate any references to other documents/pages, including milestones and other plans -->

- [Project Thread](https://github.com/WordPress/openverse/issues/430)

This project does not have a project proposal because the scope and rationale of
the project are clear, as defined in the project thread. If in doubt, check the
[Expected Outcomes](#expected-outcomes) section below.

## Overview

One of the steps of the [data refresh process for images][img-data-refresh] is
cleaning the data that is not fit for production. This process is triggered
weekly by an Airflow DAG and then runs in the Ingestion Server, taking just over
**20 hours** to complete, according to an inspection of the latest executions.
The cleaned data is only saved to the API database, which is replaced during
each data refresh, so the _same_ corrections have to be repeated every time.

This cleaning process was designed this way to speed up the row updates, since
the priority was to provide the correct data to users via the API. Most of the
affected rows were added (possibly by the discontinued CommonCrawl ingestion)
before the creation of the `MediaStore` class in the Catalog, which is nowadays
responsible for validating the provider data. However, this approach wastes
resources, both in time, which continues to increase, and in the machines (CPU)
it uses. This waste could easily be avoided by making the changes permanent,
that is, by saving them in the upstream database.

This implementation plan (IP) describes a path to save these resources and
finally normalise the catalog DB data, pushing the cleaning steps to the media
storage class and/or the provider API DAGs.

[img-data-refresh]:
https://github.com/WordPress/openverse-catalog/blob/main/DAGs.md#image_data_refresh

## Expected Outcomes

<!-- List any succinct expected products from this implementation plan. -->

- The catalog database (upstream) preserves the cleaned data results of the
current Ingestion Server's cleaning steps.
- The image Data Refresh process is simplified by reducing the time of the
cleaning steps to nearly zero (and optionally removing them).

## Step-by-step plan

The cleaning functions that the Ingestion Server applies are already ported to
the Catalog in the `MediaStore` class: see its `_tag_blacklisted` method (which
should probably be renamed) and the [url utilities][url_utils] file. The only
part that is not there and can't be ported is the filtering of low-confidence
tags, since provider scripts don't save an "accuracy" for each tag.

With this in place, the plan starts in the Ingestion Server with the following
steps:

1. [Save TSV files of cleaned data to AWS S3](#save-tsv-files-of-cleaned-data-to-aws-s3)
1. [Make and run a batched update DAG for one-time cleanup](#make-and-run-a-batched-update-dag-for-one-time-cleanup)
1. [Run an image Data Refresh to confirm cleaning time is reduced](#run-an-image-data-refresh-to-confirm-cleaning-time-is-reduced)

[url_utils]:
https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/catalog/dags/common/urls.py
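
For reference, the low-confidence tag filtering mentioned above can be sketched
as follows. This is only an illustration, not the Ingestion Server's actual
code: the tag shape (a dict with an optional `accuracy` key), the function name
and the `0.9` threshold are all assumptions made for the example.

```python
from typing import Any

# Hypothetical threshold, chosen only for this example.
MIN_TAG_ACCURACY = 0.9


def filter_low_confidence_tags(
    tags: list[dict[str, Any]], min_accuracy: float = MIN_TAG_ACCURACY
) -> list[dict[str, Any]]:
    """Keep tags that have no accuracy value or meet the threshold."""
    kept = []
    for tag in tags:
        accuracy = tag.get("accuracy")
        # Tags without an "accuracy" cannot be judged, so they are kept.
        if accuracy is None or accuracy >= min_accuracy:
            kept.append(tag)
    return kept


# Made-up sample tags, only to show the behaviour.
tags = [
    {"name": "cat", "provider": "flickr"},
    {"name": "animal", "provider": "machine", "accuracy": 0.97},
    {"name": "dog", "provider": "machine", "accuracy": 0.42},
]
cleaned = filter_low_confidence_tags(tags)
```

Because tags saved by provider scripts carry no accuracy value, a filter like
this leaves them untouched, which is why this step has no equivalent to port to
the Catalog.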

## Step details

### Save TSV files of cleaned data to AWS S3

A previous exploration set up the Ingestion Server to store TSV files of the
cleaned data in the form of `<identifier> <cleaned_field>`, which can be used
later to perform the updates efficiently in the catalog DB, which only has
indexes for the `identifier` field. These files were saved to the disk of the
Ingestion Server EC2 instances. This worked fine for files with URL corrections,
since these fields are relatively short, but became a problem when trying to
save tags: the file grew too large and filled up the disk, causing problems for
the data refresh execution.

The alternative is to upload the TSV files to Amazon Simple Storage Service
(S3), creating a new bucket or using `openverse-catalog` with a subfolder. The
benefit of using S3 buckets is that they have streaming capabilities and will
allow us to read the files in chunks later if necessary for performance. The
downside is that objects in S3 don't allow appending, so it may be necessary to
upload files with different part numbers or to evaluate whether the [multipart
upload process][aws_mpu] will serve us here.

[aws_mpu]:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
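
The "different part numbers" option could look roughly like the sketch below,
which buffers rows in memory and uploads each full buffer as a separate numbered
object. It is an assumption-laden illustration: the `TsvPartWriter` class, the
`data-refresh-cleaned-data/` key prefix and the `max_rows` default are
hypothetical, and the `upload` callback stands in for whatever S3 client call is
used (for example, a thin wrapper around boto3's `put_object`).

```python
import csv
import io
from typing import Callable


class TsvPartWriter:
    """Buffer `<identifier>\t<cleaned_value>` rows and upload numbered parts."""

    def __init__(
        self,
        field: str,
        upload: Callable[[str, bytes], None],
        max_rows: int = 10_000,  # hypothetical default, tune to memory limits
    ):
        self.field = field
        self.upload = upload
        self.max_rows = max_rows
        self.part = 0
        self._reset()

    def _reset(self) -> None:
        self.rows = 0
        self.buffer = io.StringIO()
        self.writer = csv.writer(self.buffer, delimiter="\t", lineterminator="\n")

    def write(self, identifier: str, value: str) -> None:
        self.writer.writerow([identifier, value])
        self.rows += 1
        if self.rows >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        """Upload the buffered rows as the next part, if there are any."""
        if self.rows == 0:
            return
        self.part += 1
        # Hypothetical key layout: one numbered object per part and field.
        key = f"data-refresh-cleaned-data/{self.field}_part{self.part:04d}.tsv"
        self.upload(key, self.buffer.getvalue().encode("utf-8"))
        self._reset()


# Usage with an in-memory dict standing in for the bucket.
uploaded: dict[str, bytes] = {}
part_writer = TsvPartWriter("url", upload=uploaded.__setitem__, max_rows=2)
for identifier, url in [
    ("a1", "https://example.com/1"),
    ("b2", "https://example.com/2"),
    ("c3", "https://example.com/3"),
]:
    part_writer.write(identifier, url)
part_writer.flush()  # upload the last, partially filled part
```

Keeping each part small bounds memory use and avoids writing to the instance
disk at all, at the cost of having more objects to list and read later.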

| timestamp (UTC) | `url` | `creator_url` | `foreign_landing_url` | `tags` |
| ------------------- | :---: | :-----------: | :-------------------: | :----: |
| 2024-02-27 04:05:26 | 22156 | 9035458 | 8809213 | 0 |
| 2024-02-20 04:06:56 | 22157 | 9035456 | 8809209 | 0 |
| 2024-02-13 04:41:22 | 22155 | 9035451 | 8809204 | 0 |

The table above shows the number of records cleaned per field in the latest runs
at the time of writing this IP, except for tags, for which we don't have
accurate records since file saving was disabled.

### Make and run a batched update DAG for one-time cleanup

A batched catalog cleaner DAG (or potentially a `batched_update_from_files`)
should take the files from the previous step and perform an arbitrary batched
update on a Catalog media table, while handling deadlocking and timeout
concerns, similar to the [batched_update][batched_update] DAG. A
[proof of concept PR](https://github.com/WordPress/openverse/pull/3601) proved
to work locally for URL fields, so it will just need to be adapted for the tags.

[batched_update]: ./../../../catalog/reference/DAGs.md#batched_update
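
As a rough illustration of the batching idea (not the proof of concept itself),
the sketch below reads `<identifier> <cleaned_field>` rows from a TSV file and
applies them in small batches, committing after each one so that no single
transaction stays open for long. It uses an in-memory SQLite database as a
stand-in for the catalog's database, and the `update_from_tsv` function, the
`image` table layout and the batch size are assumptions made for the example.

```python
import csv
import io
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, TextIO

# Column names cannot be passed as query parameters, so only known ones pass.
ALLOWED_FIELDS = {"url", "creator_url", "foreign_landing_url", "tags"}


def batches(rows: Iterable[list[str]], size: int) -> Iterator[list[list[str]]]:
    """Yield consecutive lists of at most `size` rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def update_from_tsv(
    conn: sqlite3.Connection, tsv_file: TextIO, field: str, batch_size: int = 1000
) -> int:
    """Apply the cleaned values from the TSV and return the rows updated."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"Unexpected field: {field}")
    query = f"UPDATE image SET {field} = ? WHERE identifier = ?"
    updated = 0
    reader = csv.reader(tsv_file, delimiter="\t")
    for batch in batches(reader, batch_size):
        params = [(value, identifier) for identifier, value in batch]
        cursor = conn.executemany(query, params)
        conn.commit()  # one short transaction per batch
        updated += cursor.rowcount
    return updated


# Demo with made-up rows; SQLite stands in for the catalog database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (identifier TEXT PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [
        ("a1", "example.com/1"),
        ("b2", "example.com/2"),
        ("c3", "https://example.com/3"),
    ],
)
tsv = io.StringIO("a1\thttps://example.com/1\nb2\thttps://example.com/2\n")
updated = update_from_tsv(conn, tsv, "url", batch_size=1)
```

Looking rows up by `identifier` keeps every update on the indexed column, which
is the reason the TSV files store the identifier alongside the cleaned value.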

### Run an image Data Refresh to confirm cleaning time is reduced

Finally, after the previous steps are done, running a data refresh will confirm
that no more updates are applied at ingestion. If the time isn't significantly
reduced, it will be necessary to check what was missed in the previous steps.

## Dependencies

### Infrastructure

<!-- Describe any infrastructure that will need to be provisioned or modified. In particular, identify associated potential cost changes. -->

No changes needed. The Ingestion Server already has the credentials required to
[connect with AWS](https://github.com/WordPress/openverse/blob/a930ee0f1f116bac77cf56d1fb0923989613df6d/ingestion_server/ingestion_server/indexer_worker.py#L23-L28).

<!--
### Tools & packages
Describe any tools or packages which this work might be dependent on. If multiple options are available, try to list as many as are reasonable with your own recommendation. -->

### Other projects or work

Once the steps have been completed and the method has proved to work, we could
make additional similar corrections following the same procedure. Some
potentially related issues are:

- [Some images have duplicate incorrectly decoded unicode tags #1303](https://github.com/WordPress/openverse/issues/1303)
- [Provider scripts may include html tags in record titles #1441](https://github.com/WordPress/openverse/issues/1441)
- [Fix Wikimedia image titles #1728](https://github.com/WordPress/openverse/issues/1728)

This will also open up space for more structural changes to the Openverse DB
schemas in a [second phase](https://github.com/WordPress/openverse/issues/244)
of the Data Normalization endeavor.

## Alternatives

<!-- Describe any alternatives considered and why they were not chosen or recommended. -->

## Rollback

<!-- How do we roll back this solution in the event of failure? Are there any steps that can not easily be rolled back? -->

In the rare case we need the old data back, we can resort to DB backups, which
are performed [weekly][db_snapshots].

[db_snapshots]: ./../../../catalog/reference/DAGs.md#rotate_db_snapshots

<!--
## Risks
What risks are we taking with this solution? Are there risks that once taken can’t be undone?-->

## Prior art

- Previous attempt from cc-archive:
[Clean preexisting data using ImageStore #517][mathemancer_pr]
- @obulat's PR to
[add logging and save cleaned up data in the Ingestion Server](https://github.com/WordPress/openverse/pull/904)

[mathemancer_pr]: https://github.com/cc-archive/cccatalog/pull/517
8 changes: 8 additions & 0 deletions documentation/projects/proposals/data_normalization/index.md
@@ -0,0 +1,8 @@
# Data Normalization

```{toctree}
:titlesonly:
:glob:
*
```
