TimeSeriesDataset: support multiple aggregation methods #369

epa095 · 2019-07-07T18:01:50Z

The TimeSeriesDataset now supports an optional argument `aggregation_methods`
with default value `mean`. It can be either a single aggregation method or
a list of aggregation methods. Any aggregation method that pandas supports
is accepted. If multiple aggregation methods are provided, the resulting
dataframe has a multi-level column index with the tag-name as the top
level and the aggregation method as the second level. If a single aggregation
method is provided, only the first level (tag-name) is used.

This closes #278
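To illustrate the column layouts described above, here is a minimal sketch using plain pandas resampling. It does not call TimeSeriesDataset itself; the data, tag names, and resolution are made up, and the assumption is only that the dataset's output mirrors what pandas produces for a string versus a list of aggregation methods.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: two tags sampled every 10 minutes for two hours.
index = pd.date_range("2016-01-01", periods=12, freq="10min")
raw = pd.DataFrame(
    {"tag1": np.arange(12, dtype=float), "tag2": np.arange(12, dtype=float) * 2},
    index=index,
)

# A single method given as a string keeps flat columns (tag names only).
single = raw.resample("1h").agg("max")

# A list of methods gives two column levels: (tag name, aggregation method).
multiple = raw.resample("1h").agg(["mean", "max"])

print(single.columns.tolist())
print(multiple.columns.tolist())
```

With two levels, a single series is addressed by a tuple, for example `multiple[("tag1", "max")]`.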

codecov · 2019-07-07T18:11:04Z

Codecov Report

❗ No coverage uploaded for pull request base (master@1c2c410).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master     #369   +/-   ##
=========================================
  Coverage          ?   89.32%           
=========================================
  Files             ?       48           
  Lines             ?     2211           
  Branches          ?        0           
=========================================
  Hits              ?     1975           
  Misses            ?      236           
  Partials          ?        0

| Impacted Files | Coverage Δ |
| --- | --- |
| gordo_components/dataset/datasets.py | 97.5% <100%> (ø) |
| gordo_components/dataset/base.py | 85.71% <100%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1c2c410...ccb4562.

gordo_components/dataset/base.py

flikka · 2019-07-08T06:22:12Z

I like the idea, but I think the PR is WIP?

epa095 · 2019-08-01T10:23:23Z

I think it is pretty much good to go, except that I have not updated the documentation; I will do that as well.
But are we good with how it works?

If there is only a single resample method, the column names are the same as they were given, i.e. "Tag 1", "Tag 2", etc.
If there are several resample methods, the column names become "Tag 1_max", "Tag 1_mean", "Tag 2_max", "Tag 2_mean", etc. It is a bit ugly that the naming differs depending on how many aggregation methods we have, but it is maybe what the user would expect? It is also backwards-compatible.
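The flat naming scheme described in this comment could be sketched roughly as follows. The helper `resample_flat` is hypothetical and is not the code in this PR; it only shows one way to get "Tag 1_max"-style names from a pandas resample, using made-up data.

```python
import pandas as pd

# Made-up data: two tags sampled every 30 minutes.
index = pd.date_range("2016-01-01", periods=4, freq="30min")
raw = pd.DataFrame(
    {"Tag 1": [1.0, 3.0, 5.0, 7.0], "Tag 2": [2.0, 4.0, 6.0, 8.0]}, index=index
)


def resample_flat(df, resolution, aggregation_methods="mean"):
    """Hypothetical helper: flat column names in both cases."""
    if isinstance(aggregation_methods, str):
        # Single method: column names stay exactly as given.
        return df.resample(resolution).agg(aggregation_methods)
    resampled = df.resample(resolution).agg(list(aggregation_methods))
    # Several methods: join (tag, method) pairs into "<tag>_<method>".
    resampled.columns = [f"{tag}_{method}" for tag, method in resampled.columns]
    return resampled


one = resample_flat(raw, "1h")
several = resample_flat(raw, "1h", ["max", "mean"])
print(one.columns.tolist())
print(several.columns.tolist())
```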

epa095 · 2019-08-01T10:42:30Z

This PR also does not verify that multiple aggregation methods actually work in a workflow. I guess it makes sense to wait for #368 for that. It involves some dirty stuff, especially since the grafana dashboards will probably not work with several aggregation methods.

For now the full workflow works when selecting a single aggregation method (e.g. "max" instead of the default "mean"), but multiple aggregation methods are only useful when using the dataset locally.

gordo_components/dataset/datasets.py

tests/gordo_components/dataset/test_dataset.py

milesgranger · 2019-08-01T14:22:52Z

In general, what do you think about using pd.MultiIndex for columns so we'd have things like this coming out:

                    tag1     tag2    
                    mean sum mean sum
2016-01-01 00:00:00    4  36    4  36
2016-01-01 01:00:00    9   9    9   9

or if a single agg:

                    tag1 tag2    
                    mean mean
2016-01-01 00:00:00    4    4
2016-01-01 01:00:00    9    9

Might result in more complex changes?
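As a rough sketch of this proposal, wrapping a single method in a list makes pandas return the two-level columns in both cases. The helper `resample_multiindex` is hypothetical, and the data is made up so that the numbers resemble the tables above.

```python
import pandas as pd

# Made-up data: nine samples in the first hour, one in the second.
index = pd.date_range("2016-01-01 00:00", periods=9, freq="5min").append(
    pd.DatetimeIndex(["2016-01-01 01:00"])
)
raw = pd.DataFrame({"tag1": range(10), "tag2": range(10)}, index=index)


def resample_multiindex(df, resolution, aggregation_methods="mean"):
    """Hypothetical helper: always return (tag, method) columns."""
    if isinstance(aggregation_methods, str):
        methods = [aggregation_methods]
    else:
        methods = list(aggregation_methods)
    return df.resample(resolution).agg(methods)


both = resample_multiindex(raw, "1h", ["mean", "sum"])
only_mean = resample_multiindex(raw, "1h")

# Selecting one method across all tags drops back to flat tag-name columns.
means = both.xs("mean", axis=1, level=1)
print(both)
print(only_mean)
```

The trade-off is that callers expecting flat columns for the single-method case would need something like the `xs` call to get them back.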

epa095 · 2019-08-02T07:43:43Z

> In general, what do you think about using pd.MultiIndex for columns so we'd have things like this coming out:
>
>                     tag1     tag2
>                     mean sum mean sum
> 2016-01-01 00:00:00    4  36    4  36
> 2016-01-01 01:00:00    9   9    9   9
>
> or if a single agg:
>
>                     tag1 tag2
>                     mean mean
> 2016-01-01 00:00:00    4    4
> 2016-01-01 01:00:00    9    9
>
> Might result in more complex changes?

I like it. I am a bit uncertain how big of an API change it is. Maybe not that big? At least it has a nicer consistency between the single and multiple aggregation methods.

epa095 · 2019-08-02T14:06:35Z

So, now it works as follows: it uses multi-level dataframes when there are several aggregation methods, but the old-style single-level ones when there is a single aggregation method. This keeps it backwards compatible and makes it a smaller PR. Two things:

I must update the notebook (or ditch that and add documentation somewhere else. Proposal?)
Maybe have an explicit test in e.g. the workflow generator which disallows multiple aggregation methods, since they are incompatible with the project at large (grafana in particular)?
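A guard of the kind suggested in the second point might look something like the sketch below. The function name and where it would be called from are assumptions; nothing like it is confirmed to exist in the workflow generator, and it only handles string method names.

```python
from typing import List, Union


def ensure_single_aggregation_method(
    aggregation_methods: Union[str, List[str]]
) -> str:
    """Hypothetical check: accept exactly one aggregation method.

    Returns the method name so the caller can pass it on unchanged.
    """
    if isinstance(aggregation_methods, str):
        return aggregation_methods
    methods = list(aggregation_methods)
    if len(methods) != 1:
        raise ValueError(
            "The workflow supports a single aggregation method, "
            f"got {len(methods)}: {methods}"
        )
    return methods[0]


print(ensure_single_aggregation_method("max"))
```

Failing early with a clear message would avoid producing multi-level dataframes that later steps such as the dashboards cannot consume.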

milesgranger · 2019-08-06T07:43:22Z

Seems reasonable to me. I've added #382 so we can address using the multi-level column dataframes, and thus support multiple agg functions throughout the workflow, later on.

For the example, I think https://github.com/equinor/gordo-test-project might be a good place to add it?

gordo_components/dataset/base.py

examples/Gordo-Workflow-Semi-Low-Level.ipynb

gordo_components/dataset/base.py

milesgranger

Rebase, mahn.


epa095 force-pushed the dataset_support_multiple_resample_methods branch 2 times, most recently from 5e07a8b to 717ca0e on July 7, 2019 20:01

flikka reviewed Jul 8, 2019

gordo_components/dataset/base.py (outdated)

flikka reviewed Jul 8, 2019

gordo_components/dataset/base.py (outdated)

flikka reviewed Jul 8, 2019

gordo_components/dataset/base.py (outdated)

epa095 force-pushed the dataset_support_multiple_resample_methods branch from 717ca0e to 0e58bbe on July 8, 2019 07:38

epa095 changed the title from "TimeSeriesDataset: support multiple aggregation methods" to "WIP: TimeSeriesDataset: support multiple aggregation methods" Aug 1, 2019

epa095 force-pushed the dataset_support_multiple_resample_methods branch from 0e58bbe to 7f6a539 on August 1, 2019 10:17

epa095 requested a review from milesgranger August 1, 2019 10:20

milesgranger reviewed Aug 1, 2019

gordo_components/dataset/datasets.py (outdated)

tests/gordo_components/dataset/test_dataset.py

epa095 changed the title from "WIP: TimeSeriesDataset: support multiple aggregation methods" to "TimeSeriesDataset: support multiple aggregation methods" Aug 2, 2019

epa095 force-pushed the dataset_support_multiple_resample_methods branch from cc817ff to cb53c9e on August 2, 2019 13:56

milesgranger mentioned this pull request Aug 6, 2019

Support/think about using MultiIndex columned dataframes in the whole workflow #382

Open

epa095 force-pushed the dataset_support_multiple_resample_methods branch from cb53c9e to e3c8c4b on August 6, 2019 11:04

milesgranger reviewed Aug 7, 2019

gordo_components/dataset/base.py (outdated)

epa095 force-pushed the dataset_support_multiple_resample_methods branch 2 times, most recently from 157d1db to 98e4b2c on August 9, 2019 10:36

milesgranger reviewed Aug 9, 2019

examples/Gordo-Workflow-Semi-Low-Level.ipynb

gordo_components/dataset/base.py (outdated)

gordo_components/dataset/base.py

gordo_components/dataset/base.py (outdated)

epa095 force-pushed the dataset_support_multiple_resample_methods branch 2 times, most recently from a56c45f to 2e84c08 on August 9, 2019 12:13

milesgranger approved these changes Aug 9, 2019


Erik Parmann added 2 commits August 9, 2019 15:59

Clean up dataset notebook example

a099955

Also changed the dates to some dates with more interesting values when comparing max/min.

Notebook: Document aggregation_methods

ccb4562

epa095 force-pushed the dataset_support_multiple_resample_methods branch from 2e84c08 to ccb4562 on August 9, 2019 14:00

epa095 merged commit 37a10e9 into equinor:master Aug 9, 2019
