From a399420edb11b12cf865d2c94290e26b8e695563 Mon Sep 17 00:00:00 2001
From: Krystle Salazar
Date: Mon, 4 Mar 2024 14:20:18 -0400
Subject: [PATCH] Apply editorial suggestions

Co-authored-by: Madison Swain-Bowden
Co-authored-by: Olga Bulat
---
 ...plementation_plan_catalog_data_cleaning.md | 44 ++++++++++---------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
index bd6de7ac301..a631cc40c51 100644
--- a/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
+++ b/documentation/projects/proposals/data_normalization/20240227-implementation_plan_catalog_data_cleaning.md
@@ -19,24 +19,26 @@ the project are clear, as defined in the project thread. In doubt, check the
 
 ## Overview
 
-This document describes a solution for incorrect data in the catalog database
-(DB) that has to be cleaned up every time a data refresh is run, avoiding wasted
-resources.
+This document describes a mechanism for rectifying incorrect data in the catalog
+database (DB) that currently has to be cleaned up every time a data refresh is
+run. This one-time fix is an effort to avoid wasting resources and data refresh
+runtime.
 
 ## Background
 
 One of the steps of the [data refresh process for images][img-data-refresh] is
 cleaning the data that is not fit for production. This process is triggered
-weekly by an Airflow DAG, and then runs in the Ingestion Server, taking
+weekly by an Airflow DAG, which then runs in the Ingestion Server, taking
 approximately just over **20 hours** to complete, according to a inspection of
-latest executions. The cleaned data is only saved to the API database, which is
-replaced each time during the same data refresh, causing it to have to be
-repeated each time to make the _same_ corrections.
+recent executions as of the time of drafting this document. The cleaned data is
+only saved to the API database, which is replaced each time during the same data
+refresh, meaning this process has to be repeated each time to make the _same_
+corrections.
 
 This cleaning process was designed this way to speed the rows update up since
 the relevant part was to provide the correct data to users via the API. Most of
-the rows affected were added previous to the creation of the `MediaStore` class
-in the Catalog (possibly by the discontinued CommonCrawl ingestion) which is
+the rows affected were added prior to the creation of the `MediaStore` class in
+the Catalog (possibly by the discontinued CommonCrawl ingestion) which is
 nowadays responsible for validating the provider data. However, it entails a
 problem of wasting resources both in time, which continues to increase, and in
 the machines (CPU) it uses, which could easily be avoided making the changes
@@ -49,7 +51,7 @@ permanent by saving them in the upstream database.
 
 
 
-- The catalog database (upstream) preserves the cleaned data results of the
+- The catalog database (upstream) contains the cleaned data outputs of the
   current Ingestion Server's cleaning steps
 - The image Data Refresh process is simplified by reducing the cleaning steps
   time to nearly zero (and optionally removing them).
@@ -75,14 +77,14 @@ With this the plan then starts in the Ingestion Server with the following steps:
 
 ### Save TSV files of cleaned data to AWS S3
 
-In a previous exploration, it was set to store TSV files of the cleaned data in
-the form of ` `, which can be used later to perform
-the updates efficiently in the catalog DB, which only had indexes for the
-`identifier` field. These files are saved to the disk of the Ingestion Server
-EC2 instances, and worked fine for files with URL corrections since this type of
-fields is relatively short, but became a problem when trying to save tags, as
-the file turned too large and filled up the disk, causing problems to the data
-refresh execution.
+In a previous exploration, the Ingestion Server was set to store TSV files of
+the cleaned data in the form of ` `, which can be
+used later to perform the updates efficiently in the catalog DB, which only had
+indexes for the `identifier` field. These files are saved to the disk of the
+Ingestion Server EC2 instances, and worked fine for files with URL corrections
+since these fields are relatively short, but became a problem when trying
+to save tags, as the file turned too large and filled up the disk, causing
+problems to the data refresh execution.
 
 The alternative is to upload TSV files to the Amazon Simple Storage Service
 (S3), creating a new bucket or using `openverse-catalog` with a subfolder. The
@@ -101,7 +103,7 @@ process][aws_mpu] will serve us here.
 | 2024-02-20 04:06:56 | 22157 | 9035456 | 8809209 | 0 |
 | 2024-02-13 04:41:22 | 22155 | 9035451 | 8809204 | 0 |
 
-To have some numbers of the problem we are delaing with, the previous table
+To have some numbers of the problem we are dealing with, the previous table
 shows the number of records cleaned by field for last runs at the moment of
 writing this IP, except for tags, which we don't have accurate registries since
 file saving was disabled.
@@ -112,8 +114,8 @@ A batched catalog cleaner DAG (or potentially a `batched_update_from_file`)
 should take the files of the previous step to perform an batched update on the
 catalog's image table, while handling deadlocking and timeout concerns, similar
 to the [batched_update][batched_update]. This table is constantly in use by
-other DAGs, such as those from API providers or the data refresh process, and
-ideally can't be singly blocked by any DAG.
+other DAGs, such as those from provider ingestion or the data refresh process,
+and ideally can't be blocked by any single DAG.
 
 [batched_update]: ./../../../catalog/reference/DAGs.md#batched_update
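
To make the S3 step above concrete, the following is a minimal sketch of how a
cleaned-data TSV could be uploaded with boto3's managed transfer, which performs
a multipart upload automatically once a file exceeds the configured threshold.
The bucket, key prefix, local file name, and size values are illustrative
assumptions, not decisions made in this plan.

```python
"""Sketch: upload a cleaned-data TSV to S3 using boto3's managed transfer."""
import boto3
from boto3.s3.transfer import TransferConfig

MB = 1024 * 1024

# Files larger than the threshold are split into parts and uploaded concurrently.
transfer_config = TransferConfig(
    multipart_threshold=100 * MB,
    multipart_chunksize=64 * MB,
    max_concurrency=4,
)

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/tmp/cleaned_url.tsv",  # hypothetical TSV produced by the cleaning steps
    Bucket="openverse-catalog",  # or a new dedicated bucket, per the plan
    Key="data-refresh-cleaned-data/cleaned_url.tsv",  # assumed subfolder layout
    Config=transfer_config,
)
```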
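
Likewise, a minimal sketch of what a batched update from one of these files
could look like, assuming a TSV of `identifier`/corrected-value pairs: stage the
pairs in a temporary table, then update the catalog's image table in small
batches, committing between batches so no single transaction holds locks for
long. The connection string, column names, and batch size are hypothetical; the
actual work would live in an Airflow DAG similar to `batched_update`.

```python
"""Sketch: apply (identifier, corrected_url) pairs to the image table in batches."""
import csv

import psycopg2

DSN = "postgresql://deploy@localhost:5432/openledger"  # hypothetical catalog DSN
BATCH_SIZE = 10_000

with open("cleaned_url.tsv", newline="") as tsv:
    corrections = list(csv.reader(tsv, delimiter="\t"))

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    # Stage the corrections so updates can join on the indexed identifier column.
    cur.execute(
        "CREATE TEMP TABLE url_corrections (identifier uuid PRIMARY KEY, url text)"
    )
    cur.executemany(
        "INSERT INTO url_corrections (identifier, url) VALUES (%s, %s)", corrections
    )

    # Update in batches and commit between them to keep lock windows short.
    for start in range(0, len(corrections), BATCH_SIZE):
        batch_ids = [row[0] for row in corrections[start : start + BATCH_SIZE]]
        cur.execute(
            """
            UPDATE image
            SET url = c.url
            FROM url_corrections AS c
            WHERE image.identifier = c.identifier
              AND image.identifier = ANY(%s::uuid[])
            """,
            (batch_ids,),
        )
        conn.commit()
```

Committing after each batch mirrors the deadlock and timeout concerns called out
for the batched cleaner DAG, and leaves room for retries or pauses between
batches so other DAGs writing to the same table are not starved.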