-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Save & load responses as parquet #8684
Conversation
3e063f7
to
378376d
Compare
31a824e
to
0d8ddd4
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8684 +/- ##
==========================================
+ Coverage 91.42% 91.47% +0.05%
==========================================
Files 344 344
Lines 21120 21243 +123
==========================================
+ Hits 19308 19433 +125
+ Misses 1812 1810 -2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
c2309fa
to
7b8976a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this improves readability and makes the responses more generic, which is good 👍
assert all( | ||
fopr.data.columns.get_level_values("data_index").values == list(range(200)) | ||
) | ||
# Why 210, not 200? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outdated comment?
@@ -1,212 +1,212 @@ | |||
------------ ------------------- ----- ----- ----- ----- ------ ----- ------ | |||
FOPR 2010-01-10T00:00:00 0.002 0.100 5.657 0.566 0.076 0.105 Active |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To decrease your diff you could probably just fix the formatting here 😅 Not a big deal though, see that it is only the formatting that changed.
pivoted = responses_for_type.pivot( | ||
on="realization", | ||
index=["response_key", *response_cls.primary_key], | ||
aggregate_function="mean", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the implication of mean
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It said so in the comment 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will that be output somewhere? Is it possible to for example log it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is for the edge case where we end up with duplicate values for one response at one index, for example a given time. In that case, we need to aggregate them for the pivoted table to make sense, else the index
used to pivot contains duplicates. So taking the average of the duplicate response values on the timestep seems to be somewhat "close enough" to do what we want, we could set it to use min,max,median,first, etc, could configure it, but not sure if it would be interesting to users to do this?
Example from running test_that_duplicate_summary_time_steps_does_not_fail
:
responses_for_type.pivot(
on="realization",
index=["response_key", *response_cls.primary_key],
aggregate_function="mean",
)
Out[9]:
shape: (1, 5)
┌──────────────┬─────────────────────┬───────────┬────────┬──────────┐
│ response_key ┆ time ┆ 0 ┆ 1 ┆ 2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[ms] ┆ f32 ┆ f32 ┆ f32 │
╞══════════════╪═════════════════════╪═══════════╪════════╪══════════╡
│ FOPR ┆ 2014-09-10 00:00:00 ┆ -1.603837 ┆ 0.0641 ┆ 0.740891 │
└──────────────┴─────────────────────┴───────────┴────────┴──────────┘
responses_for_type
Out[10]:
shape: (4, 4)
┌─────────────┬──────────────┬─────────────────────┬───────────┐
│ realization ┆ response_key ┆ time ┆ values │
│ --- ┆ --- ┆ --- ┆ --- │
│ u16 ┆ str ┆ datetime[ms] ┆ f32 │
╞═════════════╪══════════════╪═════════════════════╪═══════════╡
│ 0 ┆ FOPR ┆ 2014-09-10 00:00:00 ┆ -1.603837 │
│ 1 ┆ FOPR ┆ 2014-09-10 00:00:00 ┆ 0.0641 │
│ 2 ┆ FOPR ┆ 2014-09-10 00:00:00 ┆ 0.740891 │
│ 2 ┆ FOPR ┆ 2014-09-10 00:00:00 ┆ 0.740891 │
└─────────────┴──────────────┴─────────────────────┴───────────┘
Alternatively we could strive to achieve something like this:
┌──────────────┬─────────────────────┬───────────┬────────┬──────────┐
│ response_key ┆ time ┆ 0 ┆ 1 ┆ 2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[ms] ┆ f32 ┆ f32 ┆ f32 │
╞══════════════╪═════════════════════╪═══════════╪════════╪══════════╡
│ FOPR ┆ 2014-09-10 00:00:00 ┆ -1.603837 ┆ 0.0641 ┆ 0.740891 │
│ FOPR ┆ 2014-09-10 00:00:00 ┆ NaN ┆ NaN ┆ 0.740891 │
└──────────────┴─────────────────────┴───────────┴────────┴──────────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be logged / given as a warning somehow, I'm not so familiar with when/why it happens, which may be relevant to what the warning/logging message should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Performance-wise it might be slow to always check if some values were aggregated, or a naive try-catch around the pivot, as it will pass if there are no duplicate values)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a good, somewhat performant way of warning the user this has happened, that would be good. My hunch is that this would typically happen in pressure tests where the time resolution is quite high, and the simulator does not have the same resolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be OK to do this in a separate PR? I think the try-catch, first trying without an aggregation, then trying with one, should be easy to add / easy to remove if it turns out to have bad side effects. Should maybe be tested as its own thing just to be sure.
18d2de1
to
2bc6bf8
Compare
# We need to either assume that if there is a time column | ||
# we will approx-join that, or we could specify in response configs | ||
# that there is a column that requires an approx "asof" join. | ||
# Suggest we simplify and assume that there is always only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, if and when we add new response types where this might be relevant we can add it then.
src/ert/config/ert_config.py
Outdated
self.observations: Dict[str, xr.Dataset] = self.enkf_obs.datasets | ||
self.observations: Dict[str, polars.DataFrame] = self.enkf_obs.datasets | ||
|
||
def write_observations_to_folder(self, dest: Path) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nitpick, but should this function be here? Maybe it belongs with the observations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it to enkf_obs
}, | ||
return polars.DataFrame( | ||
{ | ||
"report_step": polars.Series( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This made it much easier to read!
src/ert/config/observation_vector.py
Outdated
if self.observation_type == EnkfObservationImplementationType.GEN_OBS: | ||
datasets = [] | ||
actual_response_key = self.data_key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use self.data_key
directly? Same on the next line, seems it is only used once.
@@ -61,8 +80,12 @@ def __getitem__(self, key: str) -> ObsVector: | |||
def __eq__(self, other: object) -> bool: | |||
if not isinstance(other, EnkfObs): | |||
return False | |||
|
|||
if self.datasets.keys() != other.datasets.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isnt this duplicated in ErtConfig
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appears so, but this is for the EnkfObs
, and in ErtConfig
it is for the dict mapping response type to obs ds. Long-term we should maybe cut out enkfobs and only keep the dict but right now it is a bit duplicated and necessary.
2769d67
to
41aae5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice job, just some minor comments.
@@ -183,6 +183,9 @@ def summary_observations( | |||
"error_mode": draw(st.sampled_from(ErrorMode)), | |||
"value": draw(positive_floats), | |||
} | |||
|
|||
assume(kws["error_mode"] == ErrorMode.ABS or kws["error"] < 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in a separate commit, but think it has effect on logic from the first commit? If so they should be squashed so the tests pass on all commits.
@@ -236,3 +236,36 @@ def test_that_mismatched_responses_gives_nan_measured_data(ert_config, prior_ens | |||
assert pd.isna(fopr_2.loc[0].iloc[0]) | |||
assert pd.isna(fopr_2.loc[1].iloc[0]) | |||
assert pd.isna(fopr_1.loc[2].iloc[0]) | |||
|
|||
|
|||
def test_reading_past_2263_is_ok(ert_config, storage, prior_ensemble): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be squashed into the previous commit as the bug is fixed there, and so the test belongs along side that. Feel free to write a longer commit body of the first commit explaining the reason behind this change and the implications.
* Datetime reading past 2263 should now work, added test asserting that it does work * Enforced f32 precision for observations & responses
6086201
to
c0fb05c
Compare
(semeio PR: equinor/semeio#648 )
Issue
Towards combining datasets without xr nan artifacts etc
Approach
read&write parquet files with polars
Closes: #6525
Some benchmarking: