
PredictionWriter: optional gzip, use ThreadPoolExecutor #286

Merged: 15 commits merged into main from sf-predictwriter-gzip-threadpoolexecutor on Jan 23, 2025

Conversation

@sjfleming (Contributor) commented Jan 17, 2025

Closes #285

These changes add two init args to PredictionWriter: gzip (bool) and max_threadpool_workers (int), each of which has a default value.

PredictionWriter now gzips saved CSVs by default, and runs the saving-and-gzipping process in a background thread so that it does not block further Lightning compute.

NewLightningCLI in cli.py also now injects return_predictions=False into calls to trainer.predict().

Testing indicates the following outcomes for scVI reconstructions, which involve computing a dense output CSV with 250 columns:

|            | before changes | after changes |
|------------|----------------|---------------|
| file size  | 12 MB          | 5 MB          |
| total time | 18 hr          | 5 hr          |

(The 18 hr figure comes from gzipping the CSVs without the ThreadPool; if you don't gzip at all, the total time is about 7.5 hours.)
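For illustration, here is a minimal sketch of how the new arguments might be used. Everything other than gzip and max_threadpool_workers (the import path, the output_dir argument name, and the trainer setup) is assumed for the sketch rather than taken from this PR:

```python
from lightning.pytorch import Trainer

from cellarium.ml.callbacks import PredictionWriter  # import path assumed

# Hypothetical construction: only gzip and max_threadpool_workers are new in this PR.
prediction_writer = PredictionWriter(
    output_dir="./predictions",  # assumed argument name for the output location
    gzip=True,                   # write batch CSVs as .csv.gz (the new default)
    max_threadpool_workers=8,    # background threads used for writing/gzipping
)

trainer = Trainer(callbacks=[prediction_writer])
# trainer.predict(model, datamodule=datamodule, return_predictions=False)
```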

@sjfleming sjfleming changed the title PredictWriter: optional gzip, use ThreadPoolExecutor PredictionWriter: optional gzip, use ThreadPoolExecutor Jan 17, 2025

@sjfleming (Contributor, Author) commented Jan 17, 2025

Here begins my stream-of-consciousness during development:

The only undesirable thing about this is that the ThreadPoolExecutor can fall behind. In my current test run, for example, the prediction writing seems to be about 100 batches behind the actual Lightning compute. There might be a danger of OOM if max_threadpool_workers is so low that the output writing really lags behind.

@sjfleming (Contributor, Author)

Empirically, it seems that 8 thread workers do not fall behind systematically, so for now I will change the default to 8.

@sjfleming sjfleming requested a review from ordabayevy January 17, 2025 17:54
@sjfleming sjfleming marked this pull request as draft January 17, 2025 19:25

@sjfleming (Contributor, Author) commented Jan 17, 2025

Ah, dang it: it was killed with 8 workers due to OOM after 2.5 hours... That may be because I pushed the batch size too high, but perhaps this approach is a bit fragile.

Interested in your thoughts @ordabayevy

@sjfleming (Contributor, Author) commented Jan 17, 2025

Okay, I have now implemented a BoundedThreadPoolExecutor, which prevents the executor's queue (and thus memory usage) from growing without limit. Let's see if this works.

Projected total runtime now looks like about 5.5 hours instead of 5 (the projection with the unbounded queue).
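The PR's actual BoundedThreadPoolExecutor is not reproduced in this thread; below is a minimal sketch of the standard semaphore-bounded pattern it describes, with max_queued_tasks as an illustrative parameter name rather than the real one:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from threading import Semaphore
from typing import Any, Callable


class BoundedThreadPoolExecutor(ThreadPoolExecutor):
    """ThreadPoolExecutor whose backlog of pending tasks is capped, so that
    submitters block instead of letting the queue (and memory) grow without limit."""

    def __init__(self, max_workers: int, max_queued_tasks: int) -> None:
        super().__init__(max_workers=max_workers)
        # One slot per running or queued task.
        self._slots = Semaphore(max_workers + max_queued_tasks)

    def submit(self, fn: Callable[..., Any], /, *args: Any, **kwargs: Any) -> Future:
        self._slots.acquire()  # blocks the caller when the backlog is full
        try:
            future = super().submit(fn, *args, **kwargs)
        except Exception:
            self._slots.release()
            raise
        # Free the slot once the task finishes (successfully or not).
        future.add_done_callback(lambda _: self._slots.release())
        return future
```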

@sjfleming (Contributor, Author)

Added an (untested) check to fail fast if it can be projected that the total size of the prediction output files will not fit in the allocated disk space. (I ran into this problem myself, and it was only after several hours that I found out. :( )
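A minimal sketch of what such a fail-fast check could look like; the function name, the way the total size is projected, and the use of shutil.disk_usage are assumptions for illustration, not the PR's implementation (which was later removed anyway):

```python
import shutil


def check_projected_disk_usage(output_dir: str, bytes_per_batch: int, num_batches: int) -> None:
    """Fail fast if the projected total size of the prediction output files
    would exceed the free space on the target filesystem."""
    projected_bytes = bytes_per_batch * num_batches
    free_bytes = shutil.disk_usage(output_dir).free
    if projected_bytes > free_bytes:
        raise RuntimeError(
            f"Projected prediction output ({projected_bytes / 1e9:.1f} GB) exceeds "
            f"free disk space ({free_bytes / 1e9:.1f} GB) at {output_dir}."
        )
```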

@sjfleming (Contributor, Author) commented Jan 18, 2025

The OOM problem was delayed but never completely went away. I now think the issue has to do with the following...

I needed to reach into cli.py to implement the note in the docstring here:

.. note::
To prevent an out-of-memory error, set the ``return_predictions`` argument of the
:class:`~lightning.pytorch.Trainer` to ``False``.

i.e., the return_predictions=False kwarg needs to be passed to trainer.predict(). I think the changes to cli.py are the correct way to make this happen.
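For context, return_predictions is an argument of Trainer.predict itself. A tiny self-contained example (a toy model, not cellarium-ml code) showing the call and its effect:

```python
import torch
from lightning.pytorch import LightningModule, Trainer
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def predict_step(self, batch, batch_idx):
        (x,) = batch
        return self.linear(x)


dataloader = DataLoader(TensorDataset(torch.randn(32, 4)), batch_size=8)
trainer = Trainer(logger=False, enable_checkpointing=False)

# Predictions are handled by callbacks (e.g. a prediction writer) instead of
# being accumulated in memory and returned.
out = trainer.predict(TinyModel(), dataloaders=dataloader, return_predictions=False)
assert out is None
```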

@sjfleming (Contributor, Author)

The above cli.py modification did fix the full scVI run.

@sjfleming sjfleming marked this pull request as ready for review January 18, 2025 18:43
Review thread on cellarium/ml/cli.py (outdated, resolved)

@sjfleming (Contributor, Author)

I can certainly get rid of the changes to cli.py and leave return_predictions: false to the config file, if you think that makes the most sense.

The only thing that bothers me about that solution is that somebody like me could come in, not know they have to include that in the config file (or where it goes), and waste a bunch of time hitting out-of-memory errors :)

What I like about modifying cli.py in the manner above is that return_predictions becomes False whenever predict is called, regardless of the config file. So it prevents users from making mistakes. My thinking is just that the simpler the config files can be, the better.

But I can see both sides... and I guess it would be okay with me if we left things as-is and just put in some kind of prominent UserWarning to check that someone hasn't forgotten to add return_predictions: false to their config file by mistake.

What do you think @ordabayevy ?

@ordabayevy (Contributor)

I prefer using the config file instead of hard-coding, and doing the following:

  1. Improve the documentation of PredictionWriter and add an example of how return_predictions should be added to the config file.
  2. If you think that is not enough, add a warning message if return_predictions=True.

... but if we're never really going to need return_predictions=True, maybe it makes sense to hard-code it, I don't know. What happens if your changes are applied and return_predictions=True is set?

@sjfleming (Contributor, Author)

Okay, I've implemented those changes: got rid of the hard-coded return_predictions=False, got rid of the fail-fast check for predictions that won't fit on disk (thinking ahead to #290), added a UserWarning when running predict with return_predictions=True, and included an explicit config file example in the docstring for PredictionWriter.

Snippet under review (the tail of the new warnings.warn call):

"This can be set at indent level 0 in the config file. Example:\n"
"model: ...\ndata: ...\ntrainer: ...\nreturn_predictions: false",
UserWarning,
)
@ordabayevy (Contributor)

I think this might not work if a config file is used. The config file is parsed later, in the init method of LightningCLI, so this logic probably needs to be moved there. It should happen after the line self.parse_arguments(self.parser, args). This hook https://github.com/Lightning-AI/pytorch-lightning/blob/a944e7744e57a5a2c13f3c73b9735edf2f71e329/src/lightning/pytorch/cli.py#L554 might be a good place.
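A rough sketch of what that suggestion could look like; the exact access pattern for the parsed config (a jsonargparse Namespace keyed by subcommand) and the warning text are approximations, not the code that was actually merged:

```python
import warnings

from lightning.pytorch.cli import LightningCLI


class NewLightningCLI(LightningCLI):
    def before_instantiate_classes(self) -> None:
        # This hook runs after parse_arguments(), so values coming from a
        # config file have already been merged into self.config.
        if self.subcommand == "predict":
            predict_config = self.config["predict"]
            if predict_config.get("return_predictions") is not False:
                warnings.warn(
                    "Set `return_predictions: false` in the config to avoid "
                    "holding all predictions in memory during predict.",
                    UserWarning,
                )
```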

@sjfleming (Contributor, Author)

Right you are! Thank you.

I have now included an explicit test to make sure I'm actually doing what I wanted. Indeed, you were right: it didn't work with config files. I've tried to implement what you suggested, and the tests seem to pass.

@ordabayevy (Contributor) left a review

Looks great!

Snippet under review (from the new test):

warning_message = r.message.args[0]
if match_str in warning_message:
    n += 1
assert n < 2, "Unexpected UserWarning when running predict with return_predictions=false"
@ordabayevy (Contributor)

What does this test do?

@sjfleming (Contributor, Author)

Yeah so this is asserting that the UserWarning is not emitted if running prediction with return_predictions: false.

I might be doing it in a weird way, but I'm not sure what the right way is. It's easy to assert that a warning is emitted, but not so easy to test that a warning is not emitted (from what I can tell). The only way I could figure out was to count the warnings matching a certain match string. (And I needed at least one such warning, or the counting mechanism would not work; hence the assertion n < 2. There is one "fake" warning to enable counting, and any further warning would be the real one.)
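A minimal, self-contained sketch of the counting trick described above; this is not the PR's actual test, and the sentinel message text is made up for illustration:

```python
import warnings

match_str = "return_predictions"  # substring that identifies the warning of interest

with warnings.catch_warnings(record=True) as records:
    warnings.simplefilter("always")

    # ... run `predict` here with `return_predictions: false` in the config ...

    # One "fake" (sentinel) warning containing the match string is always emitted,
    # so a broken recording setup cannot be confused with "no warning was raised".
    warnings.warn(f"sentinel warning mentioning {match_str}", UserWarning)

# Count warnings whose message contains the match string; only the sentinel should match.
n = sum(1 for r in records if match_str in str(r.message))
assert n < 2, "Unexpected UserWarning when running predict with return_predictions=false"
```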

@ordabayevy merged commit 8e7c817 into main on Jan 23, 2025
8 checks passed
@ordabayevy deleted the sf-predictwriter-gzip-threadpoolexecutor branch on January 23, 2025 at 17:37
Linked issue: PredictionWriter: gzip csvs and run in background thread (#285)