[python] Support custom obs encoders #1191

Merged
merged 30 commits into main on Jul 6, 2024
Conversation

@ebezzi (Member) commented Jun 12, 2024

Add support for custom obs encoders. Example use case: when multiple obs columns need to be batched together before tensor creation.
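As a rough illustration of that use case (class and method names below are assumptions for this sketch, not the API merged in this PR), a custom encoder might fit on, and transform, the join of several obs columns:

```python
import pandas as pd

# Illustrative sketch only: ConcatEncoder, register, and transform are
# assumed names for this example, not the merged API.
class ConcatEncoder:
    """Encode the concatenation of several obs columns as one integer code."""

    def __init__(self, cols):
        self.cols = cols
        self._mapping = {}

    def register(self, obs: pd.DataFrame) -> None:
        # Learn the vocabulary from the joined column values.
        joined = obs[self.cols].astype(str).agg("_".join, axis=1)
        self._mapping = {v: i for i, v in enumerate(sorted(joined.unique()))}

    def transform(self, obs: pd.DataFrame) -> list:
        joined = obs[self.cols].astype(str).agg("_".join, axis=1)
        return [self._mapping[v] for v in joined]

obs = pd.DataFrame(
    {"cell_type": ["B", "T", "B"], "tissue": ["lung", "lung", "liver"]}
)
enc = ConcatEncoder(["cell_type", "tissue"])
enc.register(obs)
codes = enc.transform(obs)  # one integer code per row
```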

@atolopko-czi (Collaborator) left a comment

Noted two issues that need to be addressed (see ⚠️ comments). Otherwise, just some nits.

@@ -164,6 +212,8 @@ def __next__(self) -> _SOMAChunk:
)
assert obs_batch.shape[0] == obs_joinids_chunk.shape[0]

# print("obs_batch", obs_batch)

Collaborator

rm

@@ -700,7 +766,8 @@ def obs_encoders(self) -> Encoders:
self._init()
assert self._encoders is not None

return self._encoders
# return self._encoders
Collaborator

rm

for enc in self._custom_encoders:
enc.register(obs)
encoders.append(enc)
else:
Collaborator

⚠️ Should this still register default encoders for columns that are not the source of any custom encoder? This complicates matters, I realize, since each encoder would then have to publicize its source columns explicitly. But without it, any object-typed columns that do not have a custom encoder will not be usable in a Tensor. Minimally, the class should present a clear error if after transform, any non-numeric columns remain (with a hint to specify an encoder for each unencoded column).

Member Author

My idea here is that, if the user specifies custom encoders, they are responsible for ensuring that all the columns are transformed properly. With the new changes (see lines 346-347), if a column isn't explicitly registered by an encoder, it won't be added to the final obs_tensor. I believe this addresses the concern, but feel free to let me know if I am still missing something 😄

That said, I think we should expand the docstrings to better explain the consequences of using custom encoders.

Collaborator

In that case, I suppose the obs_columns and encoders could be mutually exclusive. That is, it would be nice if one could specify just the encoders without having to also specify the source obs_columns, since this could lead to errors. But here again, the Encoder would need to be able to report what columns it depends on so that the ExpDataPipe can retrieve them correctly.

But I also agree this could just be handled via clear documentation and useful error messages.

Member Author

I do agree that, if using custom encoders, obs_columns could be omitted. There is a slight performance benefit to specifying them, though, since it avoids fetching the whole obs. Maybe we can just make them optional?

Collaborator

My 2 cents is that this should error if both are passed, and the encoders should be able to report which columns they are going to grab. This links onto my other comment about adopting a scikit-learn like (or maybe even compatible) API, where column transformers and metadata routing are used to retrieve information like column names for their transformers.

Collaborator

In my heart of hearts, I would probably even allow specifying an encoder for X. E.g. instead of specifying use_sparse_X you could have a DenseEncoder().
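A rough sketch of that idea (DenseEncoder is purely hypothetical; nothing like it exists in this PR):

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical X-encoder: instead of a use_sparse_X flag, an encoder
# object decides how X is materialized. DenseEncoder is illustrative only.
class DenseEncoder:
    def transform(self, X):
        # Densify sparse input; pass dense input through as an ndarray.
        return X.toarray() if sp.issparse(X) else np.asarray(X)

X = sp.csr_matrix(np.eye(3))
dense = DenseEncoder().transform(X)
```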

Member Author

I'm OK with implementing the logic where either parameter can be defined. I also think that using an encoder for X is a great idea; I'll create a ticket for that, as it goes out of scope for this PR.

Collaborator

Looks like obs_columns and encoders are now mutually exclusive, so this comment is nearly addressed. Does the X encoder idea still need an issue created for it? (Didn't see one.)

Member Author

Added: #1219

        self._encoder = LabelEncoder()
        self.col = col

    def register(self, obs: pd.DataFrame) -> None:
Collaborator

Could this be called fit? This would make the API very similar to a scikit-learn transformer, which I believe is close to what this is replacing.

It could be nice to be able to use some of their existing transformers here, or at least provide a familiar API
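For reference, the scikit-learn transformer convention being suggested (this is sklearn's real LabelEncoder, shown only to illustrate the fit/transform naming that register would mirror):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["B cell", "T cell", "B cell"])       # learn the label vocabulary
codes = le.transform(["T cell", "B cell"])   # labels -> integer codes
labels = le.inverse_transform(codes)         # codes -> labels
```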

Collaborator

There is also an existing concept of a ColumnEncoder which keeps track of column names

Member Author

I agree that fit is a better name. Will change it.

Comment on lines 96 to 99
    @property
    def name(self) -> str:
        """Name of the encoder."""
        return self.col
Collaborator

Since the user might provide multiple encoders, I think it would be nice if they could provide multiple encoders of the same class. This would mean that the name probably has to be dynamic.

It could also be nice to provide each encoder as encoders={name: Encoder(), ...} and then be able to access the encoded values at training time as batch[name]. I think it's a little error prone right now that the positions of the encoded values are set by the positions of the encoders, even though the code that needs to stay in sync isn't in the same place. E.g. if the first encoder is removed, the training loop may need to change each y_batch[:, i]-style statement.
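A sketch of the name-keyed alternative (the dict-based API and the CodeEncoder class here are assumptions for illustration, not code from this PR):

```python
import pandas as pd

# Stand-in encoder; the real encoder classes would differ.
class CodeEncoder:
    def __init__(self, col):
        self.col = col

    def transform(self, obs):
        # factorize assigns integer codes in order of first appearance
        return list(pd.factorize(obs[self.col])[0])

obs = pd.DataFrame(
    {"cell_type": ["B", "T", "B"], "tissue": ["lung", "liver", "lung"]}
)

# Name-keyed: each encoded column is addressed by name, so removing one
# encoder cannot silently shift the meaning of a positional y_batch[:, i].
encoders = {"cell_type": CodeEncoder("cell_type"), "tissue": CodeEncoder("tissue")}
batch = {name: enc.transform(obs) for name, enc in encoders.items()}
y_cell_type = batch["cell_type"]
```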


@@ -419,6 +421,7 @@ def test_distributed__returns_data_partition_for_rank(
measurement_name="RNA",
X_name="raw",
obs_column_names=["label"],
encoders=[DefaultEncoder("soma_joinid"), DefaultEncoder("label")],
@ebezzi (Member Author) commented Jun 25, 2024

These tests use the soma_joinid to assert positional conditions, so this is how you force soma_joinid to be part of the encoded values.

Collaborator

Ah, then this point really should be explained in the docstring for encoders param.

Collaborator

Also, would obs_column_names=["soma_joinid", "label"] be equivalent?

Member Author

Not the same: soma_joinid is ignored because, with the default behavior, only string columns are encoded. Using custom encoders overrides this.
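To make the distinction concrete (the selection rule below is a simplification of the default behavior described above, not the library's actual code):

```python
import pandas as pd

obs = pd.DataFrame({"soma_joinid": [10, 11, 12], "label": ["a", "b", "a"]})

# Default path: only string/object columns get an encoder, so
# soma_joinid is skipped even when listed in obs_column_names.
default_encoded = [c for c in obs.columns if obs[c].dtype == object]

# Custom-encoder path: every column you register is encoded, numeric or not.
custom_encoded = ["soma_joinid", "label"]
```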

codecov bot commented Jun 28, 2024

Codecov Report

Attention: Patch coverage is 95.10490% with 7 lines in your changes missing coverage. Please review.

Project coverage is 91.30%. Comparing base (f775282) to head (f165f7f).
Report is 2 commits behind head on main.

Files Patch % Lines
...s/src/cellxgene_census/experimental/ml/encoders.py 89.70% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1191      +/-   ##
==========================================
+ Coverage   91.19%   91.30%   +0.11%     
==========================================
  Files          77       80       +3     
  Lines        5971     6256     +285     
==========================================
+ Hits         5445     5712     +267     
- Misses        526      544      +18     
Flag Coverage Δ
unittests 91.30% <95.10%> (+0.11%) ⬆️


@atolopko-czi (Collaborator) left a comment

LGTM w/a couple of minor comments and maybe a couple more tests:

The various existing tests do exercise both obs_columns and encoders, but the encoders param doesn't have an explicit test. So we could add a couple of tests, perhaps:

  • A test similar to test_encoders, but for the encoders param
  • A custom encoder that reads from multiple columns (since that was a motivation for this work)

if obs_column_names and encoders:
    raise ValueError(
        "Cannot specify both `obs_column_names` and `encoders`. If `encoders` are specified, columns will be inferred automatically."
    )
Collaborator

👍


if len(encoders) != len({enc.name for enc in encoders}):
    raise ValueError("Encoders must have unique names")

self.obs_column_names = list(itertools.chain(*[enc.columns for enc in encoders]))
Collaborator

What happens if two encoders rely on the same column?

My guess is that it would error since this errors: query.obs(column_names=["soma_joinid", "soma_joinid"]).concat().to_pandas()

Member Author

I'll add a dedup here.
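One way to dedup while keeping the encoder-declared column order (a sketch only; the merged fix may differ):

```python
import itertools

# Minimal stand-in with the .columns attribute used in the snippet above.
class Enc:
    def __init__(self, columns):
        self.columns = columns

# Two encoders sharing a source column.
encoders = [Enc(["cell_type"]), Enc(["cell_type", "tissue"])]

all_cols = itertools.chain.from_iterable(enc.columns for enc in encoders)
# dict.fromkeys preserves insertion order, so duplicates collapse in place.
obs_column_names = list(dict.fromkeys(all_cols))
```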

@ivirshup (Collaborator) left a comment

LGTM. I'm not totally sure about the API (e.g. what happens if an encoder returns a non-1d value?), but this can be revisited in future.

@ebezzi force-pushed the ebezzi/support-custom-obs-encoders branch from 498b8a0 to 96e0650 on July 3, 2024 20:27
@pablo-gar (Contributor) left a comment

Notebook looks good!

@ebezzi ebezzi merged commit 8457e3f into main Jul 6, 2024
17 checks passed
@ebezzi ebezzi deleted the ebezzi/support-custom-obs-encoders branch July 6, 2024 01:38