-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multithreaded manifest-based downloader #57
Conversation
mtauraso
commented
Sep 5, 2024
- Support for multiple connections via 'concurrent_connections' config
- Replaced FailedChunkCollector and old resume system in downloadCutout.py with manifest-based resume system
- Added hash and eq to downloadCutout Rects so they can be dictionary keys
- Pulled downloader functions into a class so class constants could be namespaced
- Support for multiple connections via 'concurrent_connections' config - Replaced FailedChunkCollector and old resume system in downloadCutout.py with manifest-based resume system - Added __hash__ and __eq__ to downloadCutout Rects so they can be dictionary keys - Pulled downloader functions into a class so class constants could be namespaced
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #57 +/- ##
==========================================
- Coverage 47.18% 42.16% -5.03%
==========================================
Files 16 16
Lines 587 657 +70
==========================================
Hits 277 277
- Misses 310 380 +70 ☔ View full report in Codecov by Sentry. |
@mtauraso I was testing this out and ran a few examples. In |
"""Takes our runtime config and loads cutout config | ||
common to all cutouts into a prototypical Rect for downloading | ||
# Create thread objects for each of our worker threads | ||
num_threads = config.get("concurrent_connections", 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default .toml
has a max_connections
parameter. Is this different from concurrent_connections
? Since this is an important parameter, might be good to put concurrent_connections
in the default.toml
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is new and different. Really max_connections
ought not to have been added in the first place.
I'll address this in a followup PR, because I already have several race-condition fixes in my local checkout that are all fixups to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually strike that, I'm jut going to roll all the fixes into this PR.
column_names = ["object_id"] + variable_fields | ||
locations = filterfits(fits_file, column_names) | ||
# These are the column names we retain when writing a rect out to the manifest.fits file | ||
RECT_COLUMN_NAMES = VARIABLE_FIELDS + ["filter", "sw", "sh", "rerun", "type"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although for the downloads we will be doing, "filter", "sh", "sw", "rerun", "type" will be set to the same value for the whole download; this might not be the case for other situations.
I think it would be best to merge this code as-is and open a low-priority feature-request issue so that the downloader can support variable "filter", "sh", "sw", "rerun", "type".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I Agree. Filed: #58
I delved into the code, and it seems like that in the manifest file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! Please take a look at/resolve comments prior to merging.
This is indeed what is happening, I'll add some documentation of the manifest file format so its super clear |
Global sort of rects means in a multi-threaded situation, the requests that end up in each thread generally are for the same area of the sky. This means when the server unzips a patch there is a greater likelihood the same request will contain all sources in that patch for the entire download.
- Removed pwd contextmanager in favor of absolute path filename representation in Rects. pwd context manager was not compatible with threading.Thread - Stats system now outputs every 30s (controllable with stats_print_interval config) - Stats system creates sensible metrics for a multithreaded system including how efficiently its multiple connections are being used (e.g. if the server is throttling how many connections we can hold - Several termination issues have been fixed by making all non-main threads daemon threads. The result is that everything shuts down dirty on KeyboardInterrupt. This is okay because the only thing the main thread does on shutdown is read append-only manifest dicts, so additional download activity during the shutdown will be presumed to have failed and be overwritten on resume.