[python] Support custom obs encoders #1191
Conversation
Noted two issues that need to be addressed (see
@@ -164,6 +212,8 @@ def __next__(self) -> _SOMAChunk:
            )
            assert obs_batch.shape[0] == obs_joinids_chunk.shape[0]

            # print("obs_batch", obs_batch)
rm
@@ -700,7 +766,8 @@ def obs_encoders(self) -> Encoders:
        self._init()
        assert self._encoders is not None

        return self._encoders
        # return self._encoders
rm
            for enc in self._custom_encoders:
                enc.register(obs)
                encoders.append(enc)
        else:
`object`-typed columns that do not have a custom encoder will not be usable in a Tensor. Minimally, the class should present a clear error if, after transform, any non-numeric columns remain (with a hint to specify an encoder for each unencoded column).
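The suggested safety check could look something like this sketch (the function name and error message are hypothetical, not from the PR):

```python
import numpy as np
import pandas as pd

def check_all_numeric(obs: pd.DataFrame) -> None:
    # After all encoders have run, any remaining non-numeric column
    # cannot be packed into a Tensor, so fail early with a hint.
    bad = [c for c in obs.columns if not np.issubdtype(obs[c].dtype, np.number)]
    if bad:
        raise ValueError(
            f"Columns {bad} are non-numeric after encoding; "
            "specify an encoder for each unencoded column."
        )
```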
My idea here is that, if the user specifies custom encoders, then they are responsible for ensuring that all the columns are transformed properly. With the new changes (see lines 346-347), if a column isn't explicitly registered by the encoder, it won't be added to the final obs_tensor. I believe this addresses this concern, but feel free to let me know if I am still missing something 😄
That said, I think we should expand the docstrings to better explain the consequences of using custom encoders.
In that case, I suppose the `obs_columns` and `encoders` could be mutually exclusive. That is, it would be nice if one could specify just the `encoders` without having to also specify the source `obs_columns`, since this could lead to errors. But here again, the `Encoder` would need to be able to report what columns it depends on so that the `ExpDataPipe` can retrieve them correctly.
But I also agree this could just be handled via clear documentation and useful error messages.
I do agree that, if using custom encoders, `obs_columns` could be omitted. There is a slight performance benefit to specifying them, though, since it avoids fetching the whole `obs`. Maybe we can just make them optional?
My 2 cents is that this should error if both are passed, and the encoders should be able to report which columns they are going to grab. This ties into my other comment about adopting a scikit-learn-like (or maybe even compatible) API, where column transformers and metadata routing are used to retrieve information like column names for their transformers.
In my heart of hearts, I would probably even allow specifying an encoder for `X`. E.g. instead of specifying `use_sparse_X` you could have a `DenseEncoder()`.
I'm ok with implementing the logic where either parameter can be defined. I also think that using an encoder for X is a great idea - I'll create a ticket for that, as it goes out of scope for this PR.
Looks like `obs_columns` vs `encoders` are now mutually exclusive, so the comment is nearly addressed--does the X encoder idea still need an issue created for it? (didn't see one)
Added: #1219
…ml/pytorch.py Co-authored-by: Andrew Tolopko <[email protected]>
        self._encoder = LabelEncoder()
        self.col = col

    def register(self, obs: pd.DataFrame) -> None:
Could this be called `fit`? This would make the API very similar to a scikit-learn transformer, which I believe is close to what this is replacing. It could be nice to be able to use some of their existing transformers here, or at least provide a familiar API.
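As a sketch of what a scikit-learn-style encoder could look like here (the class and method names are illustrative assumptions, not the PR's final API; the label mapping is hand-rolled to avoid depending on sklearn):

```python
import pandas as pd

class CategoricalObsEncoder:
    """Encodes one string obs column to integer codes via fit/transform."""

    def __init__(self, col: str):
        self.col = col
        self._mapping: dict[str, int] = {}

    @property
    def name(self) -> str:
        return self.col

    @property
    def columns(self) -> list[str]:
        # Reporting dependent columns lets the data pipe fetch only
        # what the encoder actually needs from obs.
        return [self.col]

    def fit(self, obs: pd.DataFrame) -> None:
        # Learn a stable label -> integer-code mapping.
        for i, label in enumerate(sorted(obs[self.col].unique())):
            self._mapping[label] = i

    def transform(self, obs: pd.DataFrame) -> pd.Series:
        # Map each value in the column to its learned code.
        return obs[self.col].map(self._mapping)
```

A `fit`/`transform` pair would also let existing sklearn transformers be wrapped with little adaptation.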
There is also an existing concept of a `ColumnEncoder`, which keeps track of column names.
I agree that `fit` is a better name. Will change it.
    @property
    def name(self) -> str:
        """Name of the encoder."""
        return self.col
Since the user might provide multiple encoders, I think it would be nice if they could provide multiple encoders of the same class. This would mean that the name probably has to be dynamic.
It could also be nice to provide each encoder as `encoders={name: Encoder(), ...}` and then be able to access the encoded values at training time as `batch[name]`. I think it's a little error-prone right now that the positions of the encoded values are set by the positions of the encoders, even though the code that needs to synchronize this isn't in the same place. E.g. if the first encoder is removed, the training loop may need to change each `variable = y_batch[:, i]`-like statement.
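The fragility being described can be sketched in a few lines (the dict-based batch is the hypothetical alternative, not the current API):

```python
import numpy as np

# Positional layout (current behavior): column i of y_batch corresponds
# to encoder i, so removing or reordering encoders silently shifts
# every later index used in the training loop.
y_batch = np.array([[0, 2], [1, 2], [0, 1]])
cell_type = y_batch[:, 0]
tissue = y_batch[:, 1]

# Named layout (suggested): each encoder's output is looked up by its
# unique name, so the training loop is insensitive to encoder order.
batch = {"cell_type": np.array([0, 1, 0]), "tissue": np.array([2, 2, 1])}
cell_type = batch["cell_type"]
tissue = batch["tissue"]
```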
@@ -419,6 +421,7 @@ def test_distributed__returns_data_partition_for_rank(
        measurement_name="RNA",
        X_name="raw",
        obs_column_names=["label"],
        encoders=[DefaultEncoder("soma_joinid"), DefaultEncoder("label")],
These tests use the `soma_joinid` to assert positional conditions, so this is how you force the `soma_joinid` to be part of the encoded values.
Ah, then this point really should be explained in the docstring for the `encoders` param.
Also, would `obs_column_names=["soma_joinid", "label"]` be equivalent?
Not the same - `soma_joinid` is ignored because, by default, only string columns are encoded. Using custom encoders overrides this.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1191      +/-   ##
==========================================
+ Coverage   91.19%   91.30%   +0.11%
==========================================
  Files          77       80       +3
  Lines        5971     6256     +285
==========================================
+ Hits         5445     5712     +267
- Misses        526      544      +18
LGTM w/ a couple of minor comments and maybe a couple more tests:
The various existing tests do exercise both `obs_columns` and `encoders`, but the `encoders` param doesn't have an explicit test. So could add a couple of tests perhaps:
- A test similar to `test_encoders` but for the `encoders` param
- A custom encoder that reads from multiple columns (since that was a motivation for this work)
        if obs_column_names and encoders:
            raise ValueError(
                "Cannot specify both `obs_column_names` and `encoders`. If `encoders` are specified, columns will be inferred automatically."
            )
👍
        if len(encoders) != len({enc.name for enc in encoders}):
            raise ValueError("Encoders must have unique names")

        self.obs_column_names = list(itertools.chain(*[enc.columns for enc in encoders]))
What happens if two encoders rely on the same column? My guess is that it would error, since this errors: `query.obs(column_names=["soma_joinid", "soma_joinid"]).concat().to_pandas()`
I'll add a dedup here.
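The dedup can preserve the encoders' column order with `dict.fromkeys`, e.g. (a sketch; `encoder_columns` stands in for `[enc.columns for enc in encoders]`):

```python
import itertools

# Columns reported by two hypothetical encoders; "tissue" is shared.
encoder_columns = [["cell_type", "tissue"], ["tissue"]]

# Plain chaining keeps the duplicate, which would make the obs query
# request "tissue" twice and error out.
flat = itertools.chain.from_iterable(encoder_columns)

# dict.fromkeys dedups while preserving first-seen order.
obs_column_names = list(dict.fromkeys(flat))
```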
…ml/pytorch.py Co-authored-by: Isaac Virshup <[email protected]>
Co-authored-by: Isaac Virshup <[email protected]>
…uckerberg/cell-census into ebezzi/support-custom-obs-encoders
…ml/pytorch.py Co-authored-by: Isaac Virshup <[email protected]>
LGTM. I'm not totally sure about the API (e.g. what happens if an encoder returns a non-1D value?), but this can be revisited in the future.
Force-pushed from 498b8a0 to 96e0650.
Notebook looks good!
Add support for custom obs encoders. Example use case: when multiple obs columns need to be batched together before tensor creation.