BUG: MultiIndex union/difference not commutative #60642
Comments
pickle can perform arbitrary code execution and thus presents security issues. Can you post your example without using pickle?
ix1 = pd.MultiIndex(levels=ix1.levels, codes=ix1.codes, names=ix1.names, verify_integrity=False)
Thanks @rhshadrach - that's really helpful for boiling down the essential parameters of the multiindex to reproduce the issue without using pickle. When I extracted the
Compare the following two snippets:
>>> pd.__version__
'2.1.1'
>>> np.__version__
'1.24.4'
>>> ix1 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'])
>>> ix1
MultiIndex([(nan, '2018-06-01')],
names=['dim1', 'dim2'])
>>> ix1.codes
FrozenList([[-1], [0]])
>>>
>>> df = pd.DataFrame({'foo': [42]}, index=ix1)
>>> df = df.reset_index(drop=False)
>>> ix2 = df.groupby(['dim1', 'dim2'], dropna=False, observed=True).first().index
>>>
>>> ix2
MultiIndex([(nan, '2018-06-01')],
names=['dim1', 'dim2'])
>>> ix2.codes
FrozenList([[0], [0]])
#            ^
#            supposedly wrong code (should be -1?)
>>>
>>> ix1.union(ix2)
MultiIndex([(nan, '2018-06-01')],
names=['dim1', 'dim2'])
2.1.1 has a 0-code for a
>>> pd.__version__
'2.2.3'
>>> np.__version__
'2.2.1'
>>> ix1 = pd.MultiIndex.from_product([[np.nan], [pd.Timestamp('2018-06-01 00:00:00')]], names=['dim1', 'dim2'])
>>> ix1
MultiIndex([(nan, '2018-06-01')],
names=['dim1', 'dim2'])
>>> ix1.codes
FrozenList([[-1], [0]])
>>>
>>> df = pd.DataFrame({'foo': [42]}, index=ix1)
>>> df = df.reset_index(drop=False)
>>> ix2 = df.groupby(['dim1', 'dim2'], dropna=False, observed=True).first().index
>>>
>>> ix2
MultiIndex([(nan, '2018-06-01')],
names=['dim1', 'dim2'])
>>> ix2.codes
FrozenList([[0], [0]])
>>>
>>> ix1.union(ix2)
MultiIndex([(nan, '2018-06-01'),
(nan, '2018-06-01')], # <- wrong - double up of rows (should be only 1)
names=['dim1', 'dim2'])
>>>
# let's try to fix `ix2` (`from_frame` seems to be "fixing" the index)
>>> ix2_fixed = pd.MultiIndex.from_frame(pd.DataFrame(index=ix2).reset_index())
>>> ix2_fixed.codes
FrozenList([[-1], [0]])
#            ^
#            expected for nan
>>>
>>> ix1.union(ix2_fixed)
MultiIndex([(nan, '2018-06-01')], # <- correct result, 1 row only
names=['dim1', 'dim2'])
The question is - is
I went down the avenue of assuming that groupby should have -1 codes for nan values (so that other indices with nan levels are compatible). This led me to
class Grouping:
    ...
    def _codes_and_uniques(self):
        ...
            uniques = self._uniques
        else:
            # GH35667, replace dropna=False with use_na_sentinel=False
            # error: Incompatible types in assignment (expression has type "Union[
            # ndarray[Any, Any], Index]", variable has type "Categorical")
            codes, uniques = algorithms.factorize(  # type: ignore[assignment]
                self.grouping_vector, sort=self._sort, use_na_sentinel=self._dropna
            )
        return codes, uniques
Since However, I don't think that matters much as the codes may not be consistent across indices (i.e. the code-level relationship can be different for different indices). Ideally, Strangely enough, one cannot set up the above example via the default constructor of MultiIndex.
works as expected as I'm still puzzled as to whether there's an implicit assumption that
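The link between `dropna` and the NaN code in the `Grouping` snippet above can be sketched with the public `pd.factorize` function. This is only an illustration under the assumption that `algorithms.factorize` behaves like the public function here; the sample array is made up. With `use_na_sentinel=True` the NaN entries should receive the sentinel code -1, while with `use_na_sentinel=False` (what `dropna=False` maps to in the snippet) NaN becomes a regular unique value with a nonnegative code.

```python
import numpy as np
import pandas as pd

# Made-up sample values containing NaN (not taken from the issue data).
values = np.array([np.nan, 1.0, np.nan])

# NaN is treated as missing: it is left out of the uniques and coded as -1.
codes_sentinel, uniques_sentinel = pd.factorize(values, use_na_sentinel=True)

# NaN is treated as an ordinary value: it appears in the uniques
# and receives a nonnegative code like any other entry.
codes_no_sentinel, uniques_no_sentinel = pd.factorize(values, use_na_sentinel=False)

print(codes_sentinel, uniques_sentinel)
print(codes_no_sentinel, uniques_no_sentinel)
```

If this reading is right, it would explain why the index produced by `groupby(..., dropna=False)` carries a 0 code for nan while `from_product` produces -1.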
You need to pass
The Cython groupby implementation is heavily tied to allowing nonnegative codes for NA values to implement
pandas/pandas/_libs/groupby.pyx Lines 304 to 310 in a81d52f
I agree this appears to be a bug in
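A minimal sketch of the constructor behaviour discussed in the comments above. It assumes that the argument elided after "You need to pass" is `verify_integrity=False`, as used in the constructor call in the first comment; the levels and codes below are hand-written to mimic the groupby result and are not extracted from the original data. With the default integrity check, a code that points at a NaN level entry should be rewritten to -1, whereas `verify_integrity=False` keeps the code exactly as given.

```python
import numpy as np
import pandas as pd

# Hand-written levels/codes (an assumption, mimicking the groupby index):
# the first level contains NaN as a real level value, and code 0 points at it.
levels = [pd.Index([np.nan]), pd.DatetimeIndex(["2018-06-01"])]
codes = [[0], [0]]
names = ["dim1", "dim2"]

# Default constructor: integrity verification normalises the NaN code.
checked = pd.MultiIndex(levels=levels, codes=codes, names=names)

# Verification skipped: the nonnegative code for NaN is preserved.
unchecked = pd.MultiIndex(
    levels=levels, codes=codes, names=names, verify_integrity=False
)

print(list(checked.codes[0]))
print(list(unchecked.codes[0]))
```

Both objects display the same single row, so the difference is only visible through `.codes`.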
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
I wasn't able to extract the data for this example to set up the test case more programmatically, but I managed to reduce the data significantly without compromising the behaviour. I hope the below is somewhat portable (please let me know if it isn't). I used numpy==2.2.1 and pandas==2.2.3
Issue Description
Creating the union of two indices with a nan level causes the result to depend on the order of the call (index1.union(index2) vs. index2.union(index1)). In other words, one of the calls yields the wrong result because it deems every row to be distinct. I'm fairly certain that this is due to the nan value in dim1, but if I recreate the example programmatically, the behaviour is as expected. However, in test cases for a rather large application, I arrive at the state from the pickle example. I'm not sure what's different from the working example.
Expected Behavior
I would expect the difference of the two indices from the pickled example to be empty and the union to be the same as the two indices.
I am also at a loss as to why I can't reproduce the wrong behaviour programmatically.
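The expectation above can be written down as a small check. `set_ops_report` is a hypothetical helper name, not part of pandas, and the two indices in the usage lines are NaN-free placeholders that only show how the helper is called; they do not reproduce the bug.

```python
import pandas as pd


def set_ops_report(left: pd.MultiIndex, right: pd.MultiIndex) -> dict:
    """Hypothetical helper: report whether two equal-valued indices behave
    as expected under union and difference, in both call orders."""
    union_lr = left.union(right)
    union_rl = right.union(left)
    return {
        # The union should not depend on the order of the call.
        "union_commutes": union_lr.equals(union_rl),
        # For equal-valued indices the union should not gain extra rows.
        "union_same_length": len(union_lr) == len(left) == len(union_rl),
        # For equal-valued indices both differences should be empty.
        "differences_empty": len(left.difference(right)) == 0
        and len(right.difference(left)) == 0,
    }


# Placeholder indices without NaN, used only to demonstrate the call.
a = pd.MultiIndex.from_tuples([(1, "a"), (2, "b")], names=["dim1", "dim2"])
b = pd.MultiIndex.from_tuples([(1, "a"), (2, "b")], names=["dim1", "dim2"])
report = set_ops_report(a, b)
print(report)
```

Running the same helper on the `ix1` and `ix2` objects from the comments would be one way to turn the report into a regression test.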
Installed Versions