Ingester/Distributor: Add support for exporting cost attribution metrics #10269

Merged
merged 108 commits into from
Jan 17, 2025
Changes from 107 commits
e315ebb
Poc: cost attribution proposal 2
ying-jeanne Oct 24, 2024
f04c28f
refectory
ying-jeanne Dec 17, 2024
2c422d1
add experimental features in about-versioning.md
ying-jeanne Dec 19, 2024
d2eab6b
change const variable to private
ying-jeanne Dec 19, 2024
1f39282
make timer service
ying-jeanne Dec 19, 2024
9b4337d
rename TrackerForUser to Tracker
ying-jeanne Dec 19, 2024
1a523e1
use fine locking
ying-jeanne Dec 19, 2024
f10f787
add comments explain why we use unchecked collector
ying-jeanne Dec 19, 2024
cc0e939
rename deleteUserTracker to deleteTracker
ying-jeanne Dec 19, 2024
c020be0
rename cat in cost attribution package to t or tracker
ying-jeanne Dec 19, 2024
71e4666
avoid get tracker twice
ying-jeanne Dec 19, 2024
9dd101b
refactor inactiveObservationsForUser
ying-jeanne Dec 19, 2024
7d4ea9a
refactor shouldDelete function
ying-jeanne Dec 19, 2024
6754666
rename calabels and calabelmap to labels and index
ying-jeanne Dec 19, 2024
fffc5b3
remove getter and setter of max cardinality and cooldown duration
ying-jeanne Dec 19, 2024
2cf8c3e
rename CompareLabels to hasSameLabels
ying-jeanne Dec 19, 2024
f994034
remove the mapping logic since the slices are ordered
ying-jeanne Dec 19, 2024
b060c09
remove unnecessary tracker nil checking
ying-jeanne Dec 19, 2024
e35a8d9
fix linting
ying-jeanne Dec 19, 2024
5cc0b5d
refactor updateOverflow method
ying-jeanne Dec 19, 2024
389dff0
remove stream in comments
ying-jeanne Dec 19, 2024
116a69e
make observation struct private
ying-jeanne Dec 19, 2024
9c30445
remove unnecessary pointers
ying-jeanne Dec 19, 2024
88ef49e
rename discardSampleMtx to discardedSampleMtx
ying-jeanne Dec 19, 2024
130636a
rename variable observedMtx because I write with feet
ying-jeanne Dec 19, 2024
b701ba7
update test name dum dum
ying-jeanne Dec 19, 2024
dccd9c8
remove test result
ying-jeanne Dec 19, 2024
eebd028
address doc change
ying-jeanne Dec 19, 2024
8386503
remove time checking
ying-jeanne Dec 24, 2024
d8f1e9b
add createIfDoesNotExist parameter
ying-jeanne Dec 24, 2024
b9efb94
add more condition for trigger newTracker
ying-jeanne Dec 24, 2024
a37e6de
remove the label adapter to labels call
ying-jeanne Dec 24, 2024
211b3a2
remove useless function dum dum
ying-jeanne Dec 24, 2024
f697e6f
make hardcoded increment value
ying-jeanne Dec 24, 2024
fe8a1e5
rename + make cooldownuntil a normal int64 and lock with observedMtx
ying-jeanne Dec 24, 2024
8b5836f
use build-in functon dum dum
ying-jeanne Dec 24, 2024
888d8b0
modify the copy of calabels instead of directly the slice
ying-jeanne Dec 24, 2024
b15b487
update mimir-prometheus
ying-jeanne Dec 24, 2024
87209d6
Merge remote-tracking branch 'origin/r322' into final-cost-attribution
ying-jeanne Dec 24, 2024
4706bde
vendor new mimir-prometheus
ying-jeanne Dec 24, 2024
1ab1f00
rename function
ying-jeanne Dec 24, 2024
8111b6c
fix lint
ying-jeanne Dec 24, 2024
17b64a9
add unittest in active series
ying-jeanne Dec 26, 2024
a191044
copy slice instead
ying-jeanne Dec 26, 2024
2bb1845
add test for discarded samples
ying-jeanne Dec 26, 2024
ddd507d
change small map to slice since it is quicker
ying-jeanne Dec 27, 2024
b27e379
remove unused parameter
ying-jeanne Dec 27, 2024
a79fac7
add new parameter
ying-jeanne Dec 27, 2024
37901b7
update config file
ying-jeanne Dec 27, 2024
f7115f4
Update pkg/costattribution/manager.go
ying-jeanne Dec 27, 2024
679f2cc
take config before locking tracker map
ying-jeanne Dec 30, 2024
66accc9
simplify logics
ying-jeanne Dec 30, 2024
f4a4efd
remove useless initialization
ying-jeanne Dec 30, 2024
f90ac0e
change int64 to time.x
ying-jeanne Dec 30, 2024
1ab89c5
change pointer to instance
ying-jeanne Dec 30, 2024
23b32cf
change instance to pointer in map
ying-jeanne Dec 30, 2024
7a60c7d
remove callback
ying-jeanne Dec 30, 2024
0287bf6
use string when create new key in map
ying-jeanne Dec 30, 2024
9c4c2df
move the logic to different place
ying-jeanne Dec 30, 2024
f8f2a49
get cat once out of loop
ying-jeanne Dec 30, 2024
1ad99ad
update tracker per request for received samples
ying-jeanne Dec 30, 2024
fa62ee1
make the lock fanny by dum dum
ying-jeanne Dec 30, 2024
1b0fb00
make ingester work
ying-jeanne Dec 30, 2024
ced8346
fix lock
ying-jeanne Dec 30, 2024
0a7c858
add changelog
ying-jeanne Dec 31, 2024
4336f7f
update changelog
ying-jeanne Dec 31, 2024
67b6cea
update doc with correct metrics name
ying-jeanne Dec 31, 2024
800fe85
remove useless function
ying-jeanne Dec 31, 2024
80e69fb
cast only once
ying-jeanne Dec 31, 2024
a2ffe5a
stop using string
ying-jeanne Jan 2, 2025
f28d672
simplify logics
ying-jeanne Jan 2, 2025
a5c3944
add new tracker for active series only
ying-jeanne Jan 10, 2025
aee9049
add new sample tracker vs active series tracker
ying-jeanne Jan 13, 2025
3bd23e4
Merge remote-tracking branch 'origin/main' into final-cost-attribution
ying-jeanne Jan 13, 2025
749bafd
remove conflict
ying-jeanne Jan 13, 2025
5f07f16
clean up code
ying-jeanne Jan 13, 2025
37ba4f2
fix
ying-jeanne Jan 13, 2025
8d06204
Update pkg/distributor/distributor.go
ying-jeanne Jan 14, 2025
68410c8
address comments
ying-jeanne Jan 14, 2025
c4a44eb
update docs
ying-jeanne Jan 14, 2025
091e5c2
update tests
ying-jeanne Jan 14, 2025
78ea839
correct the metrics name
ying-jeanne Jan 14, 2025
f0b0b40
fix lint
ying-jeanne Jan 15, 2025
fbb2fac
update examples
ying-jeanne Jan 15, 2025
70c1d9e
remove test files
ying-jeanne Jan 16, 2025
39b888f
change tests
ying-jeanne Jan 16, 2025
23ca840
rename cat to cast
ying-jeanne Jan 16, 2025
d7886f0
remove the unnecessary indent
ying-jeanne Jan 16, 2025
5a7dbbb
format
ying-jeanne Jan 16, 2025
5c26a0b
move the order function to the caller
ying-jeanne Jan 16, 2025
56beeaa
add comments
ying-jeanne Jan 16, 2025
21a6d3a
Update pkg/costattribution/sample_tracker.go
ying-jeanne Jan 16, 2025
73a9881
remove useless function
ying-jeanne Jan 16, 2025
6d27724
change overflowSince to time.Time
ying-jeanne Jan 16, 2025
5ac64f5
change the lock to RWMutex
ying-jeanne Jan 16, 2025
e5cda41
change the condition of recovered to less than maxcardinality
ying-jeanne Jan 16, 2025
41e9f47
remove useless function
ying-jeanne Jan 16, 2025
cb5512b
change overflowsince to time.time
ying-jeanne Jan 16, 2025
c968b95
Update pkg/costattribution/manager.go
ying-jeanne Jan 16, 2025
5733955
Update pkg/costattribution/sample_tracker.go
ying-jeanne Jan 16, 2025
80b64dc
fix test
ying-jeanne Jan 16, 2025
34fd8b3
formatting
ying-jeanne Jan 16, 2025
72c16ea
fix dum dum
ying-jeanne Jan 16, 2025
f1da054
just defer
ying-jeanne Jan 16, 2025
673d607
use write lock to write, sounds reasonable hum?
ying-jeanne Jan 16, 2025
aa57e2f
update lock
ying-jeanne Jan 16, 2025
582e92e
Merge remote-tracking branch 'origin/main' into final-cost-attribution
ying-jeanne Jan 17, 2025
c5eb8c2
changelog update
ying-jeanne Jan 17, 2025
1 change: 1 addition & 0 deletions CHANGELOG.md
Expand Up @@ -9,6 +9,7 @@
* [CHANGE] Querier: pass query matchers to queryable `IsApplicable` hook. #10256
* [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_strong_consistency_requests_total`, `cortex_ingest_storage_strong_consistency_failures_total`, and `cortex_ingest_storage_strong_consistency_wait_duration_seconds` metrics. #10220
* [CHANGE] Ruler: cap the rate of retries for remote query evaluation to 170/sec. This is configurable via `-ruler.query-frontend.max-retries-rate`. #10375 #10403
* [CHANGE] Ingester/Distributor: Add support for exporting cost attribution metrics (`cortex_ingester_attributed_active_series`, `cortex_distributor_received_attributed_samples_total`, and `cortex_discarded_attributed_samples_total`) with labels specified by customers to a custom Prometheus registry. This feature enables more flexible billing data tracking. #10269
Review comment (Contributor):

Sorry for a late comment, but this should be a FEATURE.

* [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_reader_last_produced_offset_requests_total`, `cortex_ingest_storage_reader_last_produced_offset_failures_total`, `cortex_ingest_storage_reader_last_produced_offset_request_duration_seconds`, `cortex_ingest_storage_reader_partition_start_offset_requests_total`, `cortex_ingest_storage_reader_partition_start_offset_failures_total`, `cortex_ingest_storage_reader_partition_start_offset_request_duration_seconds` metrics. #10462
* [ENHANCEMENT] Query Frontend: Return server-side `samples_processed` statistics. #10103
* [ENHANCEMENT] Distributor: OTLP receiver now converts also metric metadata. See also https://github.com/prometheus/prometheus/pull/15416. #10168
Expand Down
77 changes: 77 additions & 0 deletions cmd/mimir/config-descriptor.json
Expand Up @@ -4400,6 +4400,50 @@
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_labels",
"required": false,
"desc": "Defines labels for cost attribution. Applies to metrics like cortex_distributor_received_attributed_samples_total. To disable, set to an empty string. For example, 'team,service' produces metrics such as cortex_distributor_received_attributed_samples_total{team='frontend', service='api'}.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "validation.cost-attribution-labels",
"fieldType": "string",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_labels_per_user",
"required": false,
"desc": "Maximum number of cost attribution labels allowed per user, the value is capped at 4.",
"fieldValue": null,
"fieldDefaultValue": 2,
"fieldFlag": "validation.max-cost-attribution-labels-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_cost_attribution_cardinality_per_user",
"required": false,
"desc": "Maximum cardinality of cost attribution labels allowed per user.",
"fieldValue": null,
"fieldDefaultValue": 10000,
"fieldFlag": "validation.max-cost-attribution-cardinality-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_cooldown",
"required": false,
"desc": "Defines how long cost attribution stays in overflow before attempting a reset, with received/discarded samples extending the cooldown if overflow persists, while active series reset and restart tracking after the cooldown.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "validation.cost-attribution-cooldown",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "ruler_evaluation_delay_duration",
Expand Down Expand Up @@ -19681,6 +19725,39 @@
"fieldFlag": "timeseries-unmarshal-caching-optimization-enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_eviction_interval",
"required": false,
"desc": "Specifies how often inactive cost attributions for received and discarded sample trackers are evicted from the counter, ensuring they do not contribute to the cost attribution cardinality per user limit. This setting does not apply to active series, which are managed separately.",
"fieldValue": null,
"fieldDefaultValue": 1200000000000,
"fieldFlag": "cost-attribution.eviction-interval",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_registry_path",
"required": false,
"desc": "Defines a custom path for the registry. When specified, Mimir exposes cost attribution metrics through this custom path. If not specified, cost attribution metrics aren't exposed.",
"fieldValue": null,
"fieldDefaultValue": "",
"fieldFlag": "cost-attribution.registry-path",
"fieldType": "string",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cost_attribution_cleanup_interval",
"required": false,
"desc": "Time interval at which the cost attribution cleanup process runs, ensuring inactive cost attribution entries are purged.",
"fieldValue": null,
"fieldDefaultValue": 180000000000,
"fieldFlag": "cost-attribution.cleanup-interval",
"fieldType": "duration",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
Expand Down
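The duration defaults in `config-descriptor.json` are stored as raw nanosecond counts, while the help text and reference docs print them in Go's duration notation. The following minimal sketch shows how the two forms relate; `formatDefault` is a hypothetical helper written for this note, not part of Mimir.

```go
package main

import (
	"fmt"
	"time"
)

// formatDefault is a hypothetical helper: it renders a raw nanosecond
// default the way Go's time.Duration prints it in flag help output.
func formatDefault(nanos int64) string {
	return time.Duration(nanos).String()
}

func main() {
	fmt.Println(formatDefault(1200000000000)) // cost_attribution_eviction_interval: 20m0s
	fmt.Println(formatDefault(180000000000))  // cost_attribution_cleanup_interval: 3m0s
	fmt.Println(formatDefault(0))             // cost_attribution_cooldown: 0s
}
```

These values match the `(default 20m0s)` and `(default 3m0s)` strings in `help-all.txt.tmpl` and the `default = 0s` entry in the configuration reference.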
14 changes: 14 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
Expand Up @@ -1283,6 +1283,12 @@ Usage of ./cmd/mimir/mimir:
Expands ${var} or $var in config according to the values of the environment variables.
-config.file value
Configuration file to load.
-cost-attribution.cleanup-interval duration
[experimental] Time interval at which the cost attribution cleanup process runs, ensuring inactive cost attribution entries are purged. (default 3m0s)
-cost-attribution.eviction-interval duration
[experimental] Specifies how often inactive cost attributions for received and discarded sample trackers are evicted from the counter, ensuring they do not contribute to the cost attribution cardinality per user limit. This setting does not apply to active series, which are managed separately. (default 20m0s)
-cost-attribution.registry-path string
[experimental] Defines a custom path for the registry. When specified, Mimir exposes cost attribution metrics through this custom path. If not specified, cost attribution metrics aren't exposed.
-debug.block-profile-rate int
Fraction of goroutine blocking events that are reported in the blocking profile. 1 to include every blocking event in the profile, 0 to disable.
-debug.mutex-profile-fraction int
Expand Down Expand Up @@ -3323,10 +3329,18 @@ Usage of ./cmd/mimir/mimir:
Enable anonymous usage reporting. (default true)
-usage-stats.installation-mode string
Installation mode. Supported values: custom, helm, jsonnet. (default "custom")
-validation.cost-attribution-cooldown duration
[experimental] Defines how long cost attribution stays in overflow before attempting a reset, with received/discarded samples extending the cooldown if overflow persists, while active series reset and restart tracking after the cooldown.
-validation.cost-attribution-labels comma-separated-list-of-strings
[experimental] Defines labels for cost attribution. Applies to metrics like cortex_distributor_received_attributed_samples_total. To disable, set to an empty string. For example, 'team,service' produces metrics such as cortex_distributor_received_attributed_samples_total{team='frontend', service='api'}.
-validation.create-grace-period duration
Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + creation_grace_period)'. This configuration is enforced in the distributor and ingester. (default 10m)
-validation.enforce-metadata-metric-name
Enforce every metadata has a metric name. (default true)
-validation.max-cost-attribution-cardinality-per-user int
[experimental] Maximum cardinality of cost attribution labels allowed per user. (default 10000)
-validation.max-cost-attribution-labels-per-user int
[experimental] Maximum number of cost attribution labels allowed per user, the value is capped at 4. (default 2)
-validation.max-label-names-per-info-series int
Maximum number of label names per info series. Has no effect if less than the value of the maximum number of label names per series option (-validation.max-label-names-per-series) (default 80)
-validation.max-label-names-per-series int
Expand Down
13 changes: 13 additions & 0 deletions docs/sources/mimir/configure/about-versioning.md
Expand Up @@ -46,6 +46,19 @@ Experimental configuration and flags are subject to change.

The following features are currently experimental:

- Cost attribution
- Configure labels for cost attribution
- `-validation.cost-attribution-labels`
- Configure cost attribution limits, such as label cardinality and the maximum number of cost attribution labels
- `-validation.max-cost-attribution-labels-per-user`
- `-validation.max-cost-attribution-cardinality-per-user`
- Configure cooldown periods and eviction intervals for cost attribution
- `-validation.cost-attribution-cooldown`
- `-cost-attribution.eviction-interval`
- Configure the metrics endpoint dedicated to cost attribution
- `-cost-attribution.registry-path`
- Configure the cost attribution cleanup process run interval
- `-cost-attribution.cleanup-interval`
- Alertmanager
- Enable a set of experimental API endpoints to help support the migration of the Grafana Alertmanager to the Mimir Alertmanager.
- `-alertmanager.grafana-alertmanager-compatibility-enabled`
Expand Down
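The label and limit flags listed in this file interact: `-validation.cost-attribution-labels` takes a comma-separated list, and `-validation.max-cost-attribution-labels-per-user` is described as capped at 4. The sketch below is one way such validation could look. The function name `parseCostAttributionLabels`, the clamping of the limit, and the choice to reject an over-long list are all assumptions made for illustration; the PR's actual validation code is not shown in this excerpt.

```go
package main

import (
	"fmt"
	"strings"
)

// hardLabelCap mirrors the "capped at 4" wording in the flag description.
const hardLabelCap = 4

// parseCostAttributionLabels is a hypothetical helper, not Mimir's code.
// It splits the comma-separated flag value and rejects lists longer than
// the per-user limit (itself clamped to hardLabelCap).
func parseCostAttributionLabels(raw string, maxLabels int) ([]string, error) {
	if maxLabels > hardLabelCap {
		maxLabels = hardLabelCap
	}
	var labels []string
	for _, part := range strings.Split(raw, ",") {
		if name := strings.TrimSpace(part); name != "" {
			labels = append(labels, name)
		}
	}
	if len(labels) > maxLabels {
		return nil, fmt.Errorf("%d cost attribution labels configured, limit is %d", len(labels), maxLabels)
	}
	return labels, nil
}

func main() {
	labels, err := parseCostAttributionLabels("team,service", 2)
	fmt.Println(labels, err)

	_, err = parseCostAttributionLabels("team,service,env", 2)
	fmt.Println(err)
}
```

An empty string yields no labels, which lines up with the documented way to disable the feature.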
43 changes: 43 additions & 0 deletions docs/sources/mimir/configure/configuration-parameters/index.md
Expand Up @@ -455,6 +455,24 @@ overrides_exporter:
# (experimental) Enables optimized marshaling of timeseries.
# CLI flag: -timeseries-unmarshal-caching-optimization-enabled
[timeseries_unmarshal_caching_optimization_enabled: <boolean> | default = true]

# (experimental) Specifies how often inactive cost attributions for received and
# discarded sample trackers are evicted from the counter, ensuring they do not
# contribute to the cost attribution cardinality per user limit. This setting
# does not apply to active series, which are managed separately.
# CLI flag: -cost-attribution.eviction-interval
[cost_attribution_eviction_interval: <duration> | default = 20m]

# (experimental) Defines a custom path for the registry. When specified, Mimir
# exposes cost attribution metrics through this custom path. If not specified,
# cost attribution metrics aren't exposed.
# CLI flag: -cost-attribution.registry-path
[cost_attribution_registry_path: <string> | default = ""]

# (experimental) Time interval at which the cost attribution cleanup process
# runs, ensuring inactive cost attribution entries are purged.
# CLI flag: -cost-attribution.cleanup-interval
[cost_attribution_cleanup_interval: <duration> | default = 3m]
```

### common
Expand Down Expand Up @@ -3599,6 +3617,31 @@ The `limits` block configures default and per-tenant limits imposed by component
# CLI flag: -querier.active-series-results-max-size-bytes
[active_series_results_max_size_bytes: <int> | default = 419430400]

# (experimental) Defines labels for cost attribution. Applies to metrics like
# cortex_distributor_received_attributed_samples_total. To disable, set to an
# empty string. For example, 'team,service' produces metrics such as
# cortex_distributor_received_attributed_samples_total{team='frontend',
# service='api'}.
# CLI flag: -validation.cost-attribution-labels
[cost_attribution_labels: <string> | default = ""]

# (experimental) Maximum number of cost attribution labels allowed per user, the
# value is capped at 4.
# CLI flag: -validation.max-cost-attribution-labels-per-user
[max_cost_attribution_labels_per_user: <int> | default = 2]

# (experimental) Maximum cardinality of cost attribution labels allowed per
# user.
# CLI flag: -validation.max-cost-attribution-cardinality-per-user
[max_cost_attribution_cardinality_per_user: <int> | default = 10000]

# (experimental) Defines how long cost attribution stays in overflow before
# attempting a reset, with received/discarded samples extending the cooldown if
# overflow persists, while active series reset and restart tracking after the
# cooldown.
# CLI flag: -validation.cost-attribution-cooldown
[cost_attribution_cooldown: <duration> | default = 0s]

# Duration to delay the evaluation of rules to ensure the underlying metrics
# have been pushed.
# CLI flag: -ruler.evaluation-delay-duration
Expand Down
6 changes: 6 additions & 0 deletions pkg/api/api.go
Expand Up @@ -20,6 +20,7 @@ import (
"github.com/grafana/dskit/middleware"
"github.com/grafana/dskit/server"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"

"github.com/grafana/mimir/pkg/alertmanager"
"github.com/grafana/mimir/pkg/alertmanager/alertmanagerpb"
Expand Down Expand Up @@ -281,6 +282,11 @@ func (a *API) RegisterDistributor(d *distributor.Distributor, pushConfig distrib
a.RegisterRoute("/distributor/ha_tracker", d.HATracker, false, true, "GET")
}

// RegisterCostAttribution registers a Prometheus HTTP handler for the cost attribution metrics.
func (a *API) RegisterCostAttribution(customRegistryPath string, reg *prometheus.Registry) {
a.RegisterRoute(customRegistryPath, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), false, false, "GET")
}

// Ingester is defined as an interface to allow for alternative implementations
// of ingesters to be passed into the API.RegisterIngester() method.
type Ingester interface {
Expand Down
2 changes: 1 addition & 1 deletion pkg/blockbuilder/tsdb.go
Expand Up @@ -50,7 +50,7 @@ type TSDBBuilder struct {
var softErrProcessor = mimir_storage.NewSoftAppendErrorProcessor(
func() {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
func() {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func([]mimirpb.LabelAdapter) {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
)
Expand Down