
IPPM graph refactor #391

Open · caiw wants to merge 65 commits into main from ippm-graph-refactor
Conversation

caiw (Member) commented Oct 25, 2024

Main changes

Note

I'M SORRY that this is such a big diff. I'm aware it makes it hard to review. If I had had more time I'd have written a shorter letter, etc.

Also, I'm sorry that I have renamed some classes/files in a way which prevents GitHub from showing diffs properly. In particular, builder.py and IPPMBuilder are more or less replaced by graph.py and IPPMGraph. data_tools.py was something of a collection of miscellaneous functionality, which has been moved into other classes.

Little fixes

A few issues I created and fixed in the course of the refactor.

Unblocks

Issues removed from scope, but which can now be tackled, hopefully more easily.

@caiw added labels ⚙️ refactor (changes which relate to code structure rather than functionality) and IPPM generation on Oct 25, 2024
@caiw caiw self-assigned this Oct 25, 2024
@caiw force-pushed the ippm-graph-refactor branch from f91409f to b949c97 on November 1, 2024 13:19
@caiw force-pushed the ippm-graph-refactor branch from cd9e56b to 26d14ec on November 1, 2024 16:14
caiw (Member, Author) commented Jan 11, 2025

Down to 22 failing tests from 36 :)

caiw (Member, Author) commented Jan 11, 2025

And all the changes are improvements: fixing things, adding new tests, improving comments. Testing rules! Thanks AGAIN to @anirudh1666 for setting up such a comprehensive suite for this code. 🙏

@caiw caiw requested review from anirudh1666 and neukym January 17, 2025 21:30
@caiw caiw marked this pull request as ready for review January 17, 2025 23:16
@caiw added the 💪 enhancement (new feature or request) label on Jan 17, 2025
caiw (Member, Author) commented Jan 18, 2025

Note to self, for speed: instead of clearing tens of thousands of points, simply filter down to the few you want to keep. It'll be much faster, and uses nearly the same numpy code!
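A minimal sketch of the idea in the note above: rather than deleting the many points you do not want, build a boolean mask of the few you do want and index with it once. The array names and threshold here are illustrative, not from the PR.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for tens of thousands of (latency, log p-value) points.
latencies = rng.uniform(0.0, 0.8, size=50_000)
logp_values = rng.uniform(-50.0, 0.0, size=50_000)

# Keep only the points more significant than an (illustrative) threshold.
threshold = -35.0
keep = logp_values < threshold  # boolean mask of points to retain

# One vectorised fancy-indexing pass, instead of repeated deletions.
filtered_latencies = latencies[keep]
filtered_logp = logp_values[keep]
```

Boolean-mask indexing copies only the surviving points in a single pass, whereas repeatedly removing points (e.g. with `np.delete`) copies the whole array each time.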

neukym (Member) commented Jan 18, 2025

Amazing. @anirudh1666 - can you take a first look?

caiw (Member, Author) commented Jan 18, 2025

Oof... it's really slow. I did some optimisations this morning, but it's still noticeably slower than the original code. Something to work on with proper profiling, though I do have a couple more ideas.

Also, I confirmed (by running the demo) that the failing tests are due to the denoising not working as intended. I'm sure it's an easy fix, as I was only changing data structures, not algorithms; probably just something silly. Still, sorry to be giving you broken code to read, @anirudh1666, but I'd really value your input at this stage.

expression_set = denoising_strategy.denoise(expression_set)

# Build the graph
self._graphs: dict[str, IPPMGraph] = dict()
Collaborator comment:

nit: For clarity, it may help to place self._graphs near the other attribute definitions. Usually the constructor contains all of a class's attributes in one place, for easy reference.
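A short sketch of the convention the reviewer is suggesting: declare every attribute in `__init__`, in one place, even if some are only populated later. `IPPMGraph` is the class from this PR; the surrounding class name and second attribute are illustrative assumptions only.

```python
from __future__ import annotations  # lets the IPPMGraph annotation stay unevaluated


class IPPM:
    """Illustrative container class; only the attribute-declaration style matters."""

    def __init__(self) -> None:
        # All attributes declared up front, in one place, for easy reference.
        self._graphs: dict[str, IPPMGraph] = dict()  # populated after denoising
        self._built: bool = False                    # flipped once graphs are built
```

Declaring attributes together in the constructor gives a reader a single place to see the full state of an instance, rather than discovering attributes scattered through later methods.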

@@ -5,26 +5,24 @@
from __future__ import annotations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of densify_data_block in this file?

should_merge_hemis: bool = False,
should_exclude_insignificant: bool = True,
should_shuffle: bool = True,
should_normalise: bool,
Collaborator comment:

It might be worthwhile to add default values for some of these parameters. It comes down to which parameters you think will have high variance across runs and which will remain constant; the constant ones can be made optional with default values.
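A hedged sketch of what that could look like. The flag names (`should_merge_hemis`, `should_exclude_insignificant`, `should_shuffle`, `should_normalise`, etc.) appear in the diff and comments; gathering them into this one keyword-only signature, and the particular defaults, are illustrative assumptions.

```python
def configure_denoiser(
    *,
    should_merge_hemis: bool = False,
    should_exclude_insignificant: bool = True,
    should_shuffle: bool = True,              # constant across most runs
    should_normalise: bool = False,           # rarely varied; defaulted off
    should_cluster_only_latency: bool = False,
    should_max_pool: bool = False,
) -> dict[str, bool]:
    """Collect denoising flags, defaulting the ones that rarely vary."""
    return {
        "should_merge_hemis": should_merge_hemis,
        "should_exclude_insignificant": should_exclude_insignificant,
        "should_shuffle": should_shuffle,
        "should_normalise": should_normalise,
        "should_cluster_only_latency": should_cluster_only_latency,
        "should_max_pool": should_max_pool,
    }
```

Keyword-only parameters (`*,`) make call sites self-documenting, and callers only need to spell out the flags they are actually varying.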

Collaborator comment:

I think should_normalise, should_cluster_only_latency and should_max_pool will be False for most runs, unless someone wants to experiment.

should_shuffle will be True. As for the threshold for significance, I'm not sure, although estimating it from the data seems better than an absolute threshold (i.e. exclude_points_above_n_sigma rather than exclude_logp_vals_above).
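A small sketch of the data-driven threshold idea in the comment above, as opposed to a fixed absolute cut-off. The function name and the convention that more-negative log p-values are more significant are assumptions for illustration; `exclude_points_above_n_sigma` itself is only named, not defined, in the thread.

```python
import numpy as np


def sigma_threshold(logp_vals: np.ndarray, n_sigma: float = 3.0) -> float:
    """Estimate a significance cut-off from the data itself.

    Returns the value n_sigma standard deviations below the mean log p-value
    (more negative = more significant), instead of a fixed absolute threshold.
    """
    return float(np.mean(logp_vals) - n_sigma * np.std(logp_vals))


# Illustrative data: log p-values roughly centred on -10 with spread 5.
rng = np.random.default_rng(1)
logp_vals = rng.normal(loc=-10.0, scale=5.0, size=10_000)

threshold = sigma_threshold(logp_vals, n_sigma=3.0)
significant = logp_vals[logp_vals < threshold]
```

Because the cut-off scales with the spread of the data, the same `n_sigma` setting adapts across datasets where a hard-coded log-p threshold would not.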
