Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Content harvest rm tmp files #1124

Merged
merged 1 commit into from
Oct 7, 2024
Merged

Conversation

barbarahui
Copy link
Collaborator

@barbarahui barbarahui commented Oct 4, 2024

The disk is filling up during content harvesting. I think the file removal statements in this bit of code got dropped during the recent caching implementation:

if media and os.path.exists(media.tmp_filepath):
os.remove(media.tmp_filepath)
downloaded_urls.pop(media.src_url, None)
if thumbnail and os.path.exists(thumbnail.tmp_filepath):
os.remove(thumbnail.tmp_filepath)
downloaded_urls.pop(thumbnail.src_url, None)
if derivative_filepath and os.path.exists(derivative_filepath):
os.remove(derivative_filepath)

The code in this PR removes source files and derivative files from local disk once they're no longer needed. I tried to find a neater way to remove the files all at once, i.e. at the end of harvest_record_content, but it proved tricky.

I also made a change to derivatives.subprocess_exception_handler so that it raises an error instead of returning None if there is an error. As was, the DAG was succeeding even when, for example, the pdf_to_thumb subprocess was failing. It seemed like the return value of None was not meant to be permanent, given the fact that raise(e) was commented out?

@amywieliczka
Copy link
Collaborator

Hey @barbarahui - thanks, the deletion code looks good. I think we actually still want these subprocess conversion errors to be non-blocking to the rest of the job, though - could you please revert the second of these two commits?

@barbarahui
Copy link
Collaborator Author

barbarahui commented Oct 7, 2024

@amywieliczka Do we want to add some code to handle the error somewhere? At the moment no error is raised and the job succeeds even if a subprocess throws an error.

I'll go ahead and revert the second commit so we can get the file deletion code through and add an issue for the subprocess error handling.

@amywieliczka
Copy link
Collaborator

amywieliczka commented Oct 7, 2024

@barbarahui I know, but that's the intended behavior at the request of the rest of the team - if you look at the blame on that line that returns None rather than raising an error, it's about making it non-blocking. The job should succeed even if the subprocess fails. A message should get printed to the logs, though.

In the Calisphere UI, iirc, this results in an icon being used instead of the thumbnail, which is another way it should get caught during QA. Once we're on Airflow 2.9 it should be a bit easier to find which mapped Airflow task ran which page, and trace back to the logs to understand why there's no thumbnail.

I did try to implement some rollup reporting, but due to the iffy-at-best return values from a RunTask operator, found it prohibitively difficult to do so. I think there's an outstanding issue about it somewhere. I think it bears thinking about in a larger context rather than shoehorning into this particular bug fix.

@amywieliczka
Copy link
Collaborator

Issue: #969

@barbarahui barbarahui force-pushed the content-harvest-rm-tmp-files branch from 5729d9e to a910bc7 Compare October 7, 2024 17:00
@barbarahui
Copy link
Collaborator Author

@amywieliczka Ahhh gotcha. I kind of remember this now but I think I might've been off doing aspace or nuxeo things. Thanks for explaining!!

@barbarahui barbarahui merged commit cb35b73 into main Oct 7, 2024
2 checks passed
@barbarahui barbarahui deleted the content-harvest-rm-tmp-files branch October 7, 2024 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants