TimeSeriesDataset: support multiple aggregation methods #369

Merged

epa095
Contributor

@epa095 epa095 commented Jul 7, 2019

TimeSeriesDataset now supports an optional argument `aggregation_methods`,
which defaults to `mean`. It can be either a single aggregation method or
a list of aggregation methods, and any aggregation method supported by pandas
is accepted. If multiple aggregation methods are provided, the resulting
dataframe has a multi-level column index with the tag name as the top
level and the aggregation method as the second level. If a single aggregation
method is provided, only the first level (tag name) is used.
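
The column layout described above matches what plain pandas produces when resampling with one method versus a list of methods. The following is a minimal sketch of that behaviour, not the code from this PR: the `resample` helper, the toy tag names, and the sample values are assumptions made for illustration, and the real dataset reads its data from a data provider.

```python
import pandas as pd

# Toy data standing in for two tags (hypothetical values, one sample per minute).
index = pd.date_range("2016-01-01", periods=6, freq="min")
raw = pd.DataFrame(
    {"Tag 1": [1, 2, 3, 4, 5, 6], "Tag 2": [6, 5, 4, 3, 2, 1]}, index=index
)


def resample(df, resolution, aggregation_methods="mean"):
    """Hypothetical helper: aggregate with one method (str) or a list of methods."""
    return df.resample(resolution).agg(aggregation_methods)


# A single method keeps the flat, tag-name-only columns.
single = resample(raw, "3min", "max")

# A list of methods yields two column levels: (tag name, aggregation method).
multiple = resample(raw, "3min", ["mean", "max"])
```

With the multi-level result, a single series can be picked out with a tuple key such as `multiple[("Tag 1", "mean")]`.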

This closes #278

@codecov

codecov bot commented Jul 7, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@1c2c410).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master     #369   +/-   ##
=========================================
  Coverage          ?   89.32%           
=========================================
  Files             ?       48           
  Lines             ?     2211           
  Branches          ?        0           
=========================================
  Hits              ?     1975           
  Misses            ?      236           
  Partials          ?        0
Impacted Files                          Coverage Δ
gordo_components/dataset/datasets.py    97.5% <100%> (ø)
gordo_components/dataset/base.py        85.71% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1c2c410...ccb4562.

@epa095 epa095 force-pushed the dataset_support_multiple_resample_methods branch 2 times, most recently from 5e07a8b to 717ca0e Compare July 7, 2019 20:01
@flikka
Contributor

flikka commented Jul 8, 2019

I like the idea, but I think the PR is WIP?

@epa095 epa095 force-pushed the dataset_support_multiple_resample_methods branch from 717ca0e to 0e58bbe Compare July 8, 2019 07:38
@epa095 epa095 changed the title TimeSeriesDataset: support multiple aggregation methods WIP: TimeSeriesDataset: support multiple aggregation methods Aug 1, 2019
@epa095 epa095 force-pushed the dataset_support_multiple_resample_methods branch from 0e58bbe to 7f6a539 Compare August 1, 2019 10:17
@epa095 epa095 requested a review from milesgranger August 1, 2019 10:20
@epa095
Contributor Author

epa095 commented Aug 1, 2019

I think it is pretty much good to go, except that I have not updated the documentation yet; I will do that as well.
But are we happy with how it works?

  • If there is only a single resample method, the column names are the same as they were given, i.e. "Tag 1", "Tag 2", etc.
  • If there are several resample methods, the column names become "Tag 1_max", "Tag 1_mean", "Tag 2_max", "Tag 2_mean", etc. It is a bit ugly that the naming differs depending on how many aggregation methods we have, but maybe it is what the user would expect? It is also backwards compatible.
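
The flat naming scheme in the second bullet can be sketched by joining the two column levels that pandas returns for a list of aggregation methods. This is only an illustration under assumed toy data; the variable names are hypothetical and this is not necessarily how the PR builds the names.

```python
import pandas as pd

# Toy data standing in for two tags (hypothetical values).
index = pd.date_range("2016-01-01", periods=4, freq="min")
raw = pd.DataFrame({"Tag 1": [1, 3, 5, 7], "Tag 2": [2, 4, 6, 8]}, index=index)

# Aggregating with a list gives (tag, method) column pairs.
aggregated = raw.resample("2min").agg(["max", "mean"])

# Flatten each (tag, method) pair into a single "<tag>_<method>" column name.
flat = aggregated.copy()
flat.columns = [f"{tag}_{method}" for tag, method in aggregated.columns]
```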

@epa095
Contributor Author

epa095 commented Aug 1, 2019

This PR also does not verify that multiple aggregation methods actually work in a workflow. I guess it makes sense to wait for #368 for that. It involves some dirty stuff, especially since the Grafana dashboards will probably not work with several aggregation methods.

For now, the full workflow works when selecting a single aggregation method (e.g. "max" instead of the default "mean"), but multiple aggregation methods are only useful when using the dataset locally.

@milesgranger
Contributor

milesgranger commented Aug 1, 2019

In general, what do you think about using pd.MultiIndex for columns so we'd have things like this coming out:

                    tag1     tag2    
                    mean sum mean sum
2016-01-01 00:00:00    4  36    4  36
2016-01-01 01:00:00    9   9    9   9

or if a single agg:

                    tag1 tag2    
                    mean mean
2016-01-01 00:00:00    4    4
2016-01-01 01:00:00    9    9

Might result in more complex changes?
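
One way to get the consistent two-level layout shown above, including for a single aggregation, is to always pass pandas a list of methods. The sketch below assumes toy data chosen so that the first hour holds the values 0 through 8 and the second hour holds only 9; `aggregate_multilevel` is a hypothetical helper, not part of gordo_components.

```python
import pandas as pd

# Ten samples, 7 minutes apart: nine fall in the first hour, one in the second.
index = pd.date_range("2016-01-01", periods=10, freq="7min")
raw = pd.DataFrame({"tag1": range(10), "tag2": range(10)}, index=index)


def aggregate_multilevel(df, resolution, aggregation_methods):
    """Hypothetical helper: always return (tag, method) columns."""
    if isinstance(aggregation_methods, str):
        # Wrapping a single method in a list makes pandas keep the method level.
        aggregation_methods = [aggregation_methods]
    return df.resample(resolution).agg(list(aggregation_methods))


both = aggregate_multilevel(raw, "60min", ["mean", "sum"])
only_mean = aggregate_multilevel(raw, "60min", "mean")
```

Here `only_mean` still has two column levels, so downstream code can index both results the same way, for example `both[("tag1", "sum")]` and `only_mean[("tag1", "mean")]`.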

@epa095 epa095 changed the title WIP: TimeSeriesDataset: support multiple aggregation methods TimeSeriesDataset: support multiple aggregation methods Aug 2, 2019
@epa095
Contributor Author

epa095 commented Aug 2, 2019

> In general, what do you think about using pd.MultiIndex for columns so we'd have things like this coming out:
>
>                     tag1     tag2
>                     mean sum mean sum
> 2016-01-01 00:00:00    4  36    4  36
> 2016-01-01 01:00:00    9   9    9   9
>
> or if a single agg:
>
>                     tag1 tag2
>                     mean mean
> 2016-01-01 00:00:00    4    4
> 2016-01-01 01:00:00    9    9
>
> Might result in more complex changes?

I like it. I am a bit uncertain how big of an API change it is. Maybe not that big? At least it has a nicer consistency between the single and multiple aggregation methods.

@epa095 epa095 force-pushed the dataset_support_multiple_resample_methods branch from cc817ff to cb53c9e Compare August 2, 2019 13:56
@epa095
Contributor Author

epa095 commented Aug 2, 2019

So, now it works as follows: it uses dataframes with multi-level columns when there are several aggregation methods, but the old-style single-level columns when there is a single aggregation method. This keeps it backwards compatible and makes for a smaller PR. Two things:

  • I must update the notebook (or ditch that and add documentation somewhere else. Proposals?)
  • Maybe have an explicit test in e.g. the workflow generator that disallows multiple aggregation methods, since they are incompatible with the project at large (Grafana in particular)?
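
The guard suggested in the second bullet could look roughly like the sketch below. The function name, its placement, and the error message are all hypothetical; nothing here is taken from the workflow generator itself, and it only handles method names given as strings.

```python
from typing import List, Union


def ensure_single_aggregation_method(
    aggregation_methods: Union[str, List[str]] = "mean",
) -> str:
    """Hypothetical check: accept one aggregation method, reject several."""
    if isinstance(aggregation_methods, str):
        return aggregation_methods
    methods = list(aggregation_methods)
    if len(methods) != 1:
        # Assumed policy: the wider workflow only handles one method per tag.
        raise ValueError(
            f"Expected exactly one aggregation method, got {methods!r}"
        )
    return methods[0]
```

A caller would run this on the dataset configuration before generating the workflow, so that a list such as `["mean", "max"]` fails early with a clear message.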

@milesgranger
Contributor

Seems reasonable to me. I've added #382 so we can address using the multi-level column dataframes, and thus support multiple agg functions throughout the workflow, later on.

For the example, I think https://github.com/equinor/gordo-test-project might be a good place to add it?

@epa095 epa095 force-pushed the dataset_support_multiple_resample_methods branch from cb53c9e to e3c8c4b Compare August 6, 2019 11:04
@epa095 epa095 force-pushed the dataset_support_multiple_resample_methods branch 2 times, most recently from 157d1db to 98e4b2c Compare August 9, 2019 10:36
examples/Gordo-Workflow-Semi-Low-Level.ipynb (resolved)
gordo_components/dataset/base.py (outdated, resolved)
gordo_components/dataset/base.py (resolved)
gordo_components/dataset/base.py (outdated, resolved)
@epa095 epa095 force-pushed the dataset_support_multiple_resample_methods branch 2 times, most recently from a56c45f to 2e84c08 Compare August 9, 2019 12:13
Contributor

@milesgranger milesgranger left a comment

Rebase, mahn.

Now the TimeSeriesDataset supports an optional argument `aggregation_methods`
with default value `mean`. It can be either a single aggregation method or
a list of aggregation methods. Any aggregation method supported by pandas
is supported. If multiple aggregation methods are provided then the resulting
dataframe contains a multi-level column index with the tag-name as the top
level, and the aggregation method as the second level. If a single aggregation
method is provided then only the first level (tag-name) is used.
Erik Parmann added 2 commits August 9, 2019 15:59
Also changed the dates to some dates with more interesting values when
comparing max/min.
@epa095 epa095 force-pushed the dataset_support_multiple_resample_methods branch from 2e84c08 to ccb4562 Compare August 9, 2019 14:00
@epa095 epa095 merged commit 37a10e9 into equinor:master Aug 9, 2019
Successfully merging this pull request may close these issues.

Dataset: Add ability to have several resample methods
3 participants