Ingester/Distributor: Add support for exporting cost attribution metrics (#10269)

* Poc: cost attribution proposal 2
* refectory
* add experimental features in about-versioning.md
* change const variable to private
* make timer service
* rename TrackerForUser to Tracker
* use fine locking
* add comments explain why we use unchecked collector
* rename deleteUserTracker to deleteTracker
* rename cat in cost attribution package to t or tracker
* avoid get tracker twice
* refactor inactiveObservationsForUser
* refactor shouldDelete function
* rename calabels and calabelmap to labels and index
* remove getter and setter of max cardinality and cooldown duration
* rename CompareLabels to hasSameLabels
* remove the mapping logic since the slices are ordered
* remove unnecessary tracker nil checking
* fix linting
* refactor updateOverflow method
* remove stream in comments
* make observation struct private
* remove unnecessary pointers
* rename discardSampleMtx to discardedSampleMtx
* rename variable observedMtx because I write with feet
* update test name dum dum
* remove test result
* address doc change
* remove time checking
* add createIfDoesNotExist parameter
* add more condition for trigger newTracker
* remove the label adapter to labels call
* remove useless function dum dum
* make hardcoded increment value
* rename + make cooldownuntil a normal int64 and lock with observedMtx
* use build-in functon dum dum
* modify the copy of calabels instead of directly the slice
* update mimir-prometheus
* vendor new mimir-prometheus
* rename function
* fix lint
* add unittest in active series
* copy slice instead
* add test for discarded samples
* change small map to slice since it is quicker
* remove unused parameter
* add new parameter
* update config file
* Update pkg/costattribution/manager.go Co-authored-by: Oleg Zaytsev <[email protected]>
* take config before locking tracker map
* simplify logics
* remove useless initialization
* change int64 to time.x
* change pointer to instance
* change instance to pointer in map
* remove callback
* use string when create new key in map
* move the logic to different place
* get cat once out of loop
* update tracker per request for received samples
* make the lock fanny by dum dum
* make ingester work
* fix lock
* add changelog
* update changelog
* update doc with correct metrics name
* remove useless function
* cast only once
* stop using string
* simplify logics
* add new tracker for active series only
* add new sample tracker vs active series tracker
* remove conflict
* clean up code
* fix
* Update pkg/distributor/distributor.go Co-authored-by: Oleg Zaytsev <[email protected]>
* address comments
* update docs
* update tests
* correct the metrics name
* fix lint
* update examples
* remove test files
* change tests
* rename cat to cast
* remove the unnecessary indent
* format
* move the order function to the caller
* add comments
* Update pkg/costattribution/sample_tracker.go Co-authored-by: Oleg Zaytsev <[email protected]>
* remove useless function
* change overflowSince to time.Time
* change the lock to RWMutex
* change the condition of recovered to less than maxcardinality
* remove useless function
* change overflowsince to time.time
* Update pkg/costattribution/manager.go Co-authored-by: Oleg Zaytsev <[email protected]>
* Update pkg/costattribution/sample_tracker.go Co-authored-by: Oleg Zaytsev <[email protected]>
* fix test
* formatting
* fix dum dum
* just defer
* use write lock to write, sounds reasonable hum?
* update lock
* changelog update

---------

Co-authored-by: Oleg Zaytsev <[email protected]>

grafana · Jan 17, 2025 · bd6e14b
1 parent 504dd37
Showing 36 changed files with 2,217 additions and 344 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,7 @@
 
 ### Grafana Mimir
 
+* [FEATURE] Ingester/Distributor: Add support for exporting cost attribution metrics (`cortex_ingester_attributed_active_series`, `cortex_distributor_received_attributed_samples_total`, and `cortex_discarded_attributed_samples_total`) with labels specified by customers to a custom Prometheus registry. This feature enables more flexible billing data tracking. #10269
 * [CHANGE] Querier: pass context to queryable `IsApplicable` hook. #10451
 * [CHANGE] Distributor: OTLP and push handler replace all non-UTF8 characters with the unicode replacement character `\uFFFD` in error messages before propagating them. #10236
 * [CHANGE] Querier: pass query matchers to queryable `IsApplicable` hook. #10256

diff --git a/cmd/mimir/config-descriptor.json b/cmd/mimir/config-descriptor.json
@@ -4400,6 +4400,50 @@
           "fieldType": "int",
           "fieldCategory": "experimental"
         },
+        {
+          "kind": "field",
+          "name": "cost_attribution_labels",
+          "required": false,
+          "desc": "Defines labels for cost attribution. Applies to metrics like cortex_distributor_received_attributed_samples_total. To disable, set to an empty string. For example, 'team,service' produces metrics such as cortex_distributor_received_attributed_samples_total{team='frontend', service='api'}.",
+          "fieldValue": null,
+          "fieldDefaultValue": "",
+          "fieldFlag": "validation.cost-attribution-labels",
+          "fieldType": "string",
+          "fieldCategory": "experimental"
+        },
+        {
+          "kind": "field",
+          "name": "max_cost_attribution_labels_per_user",
+          "required": false,
+          "desc": "Maximum number of cost attribution labels allowed per user, the value is capped at 4.",
+          "fieldValue": null,
+          "fieldDefaultValue": 2,
+          "fieldFlag": "validation.max-cost-attribution-labels-per-user",
+          "fieldType": "int",
+          "fieldCategory": "experimental"
+        },
+        {
+          "kind": "field",
+          "name": "max_cost_attribution_cardinality_per_user",
+          "required": false,
+          "desc": "Maximum cardinality of cost attribution labels allowed per user.",
+          "fieldValue": null,
+          "fieldDefaultValue": 10000,
+          "fieldFlag": "validation.max-cost-attribution-cardinality-per-user",
+          "fieldType": "int",
+          "fieldCategory": "experimental"
+        },
+        {
+          "kind": "field",
+          "name": "cost_attribution_cooldown",
+          "required": false,
+          "desc": "Defines how long cost attribution stays in overflow before attempting a reset, with received/discarded samples extending the cooldown if overflow persists, while active series reset and restart tracking after the cooldown.",
+          "fieldValue": null,
+          "fieldDefaultValue": 0,
+          "fieldFlag": "validation.cost-attribution-cooldown",
+          "fieldType": "duration",
+          "fieldCategory": "experimental"
+        },
         {
           "kind": "field",
           "name": "ruler_evaluation_delay_duration",
@@ -19681,6 +19725,39 @@
       "fieldFlag": "timeseries-unmarshal-caching-optimization-enabled",
       "fieldType": "boolean",
       "fieldCategory": "experimental"
+    },
+    {
+      "kind": "field",
+      "name": "cost_attribution_eviction_interval",
+      "required": false,
+      "desc": "Specifies how often inactive cost attributions for received and discarded sample trackers are evicted from the counter, ensuring they do not contribute to the cost attribution cardinality per user limit. This setting does not apply to active series, which are managed separately.",
+      "fieldValue": null,
+      "fieldDefaultValue": 1200000000000,
+      "fieldFlag": "cost-attribution.eviction-interval",
+      "fieldType": "duration",
+      "fieldCategory": "experimental"
+    },
+    {
+      "kind": "field",
+      "name": "cost_attribution_registry_path",
+      "required": false,
+      "desc": "Defines a custom path for the registry. When specified, Mimir exposes cost attribution metrics through this custom path. If not specified, cost attribution metrics aren't exposed.",
+      "fieldValue": null,
+      "fieldDefaultValue": "",
+      "fieldFlag": "cost-attribution.registry-path",
+      "fieldType": "string",
+      "fieldCategory": "experimental"
+    },
+    {
+      "kind": "field",
+      "name": "cost_attribution_cleanup_interval",
+      "required": false,
+      "desc": "Time interval at which the cost attribution cleanup process runs, ensuring inactive cost attribution entries are purged.",
+      "fieldValue": null,
+      "fieldDefaultValue": 180000000000,
+      "fieldFlag": "cost-attribution.cleanup-interval",
+      "fieldType": "duration",
+      "fieldCategory": "experimental"
     }
   ],
   "fieldValue": null,

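Note on the duration defaults in this descriptor: the `fieldDefaultValue` entries for duration fields are raw integers. Read as nanosecond counts, which is the unit of Go's `time.Duration`, they agree with the `20m` and `3m` defaults documented for the same flags elsewhere in this commit. A minimal check, where `describeDefault` is a hypothetical helper and not part of Mimir:

```go
package main

import "time"

// describeDefault converts a raw integer, interpreted as nanoseconds,
// into Go's human-readable duration string.
// Hypothetical helper for illustration; it does not exist in Mimir.
func describeDefault(nanos int64) string {
	return time.Duration(nanos).String()
}
```

`describeDefault(1200000000000)` yields `20m0s` (the eviction interval) and `describeDefault(180000000000)` yields `3m0s` (the cleanup interval).
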
diff --git a/cmd/mimir/help-all.txt.tmpl b/cmd/mimir/help-all.txt.tmpl
@@ -1283,6 +1283,12 @@ Usage of ./cmd/mimir/mimir:
     	Expands ${var} or $var in config according to the values of the environment variables.
   -config.file value
     	Configuration file to load.
+  -cost-attribution.cleanup-interval duration
+    	[experimental] Time interval at which the cost attribution cleanup process runs, ensuring inactive cost attribution entries are purged. (default 3m0s)
+  -cost-attribution.eviction-interval duration
+    	[experimental] Specifies how often inactive cost attributions for received and discarded sample trackers are evicted from the counter, ensuring they do not contribute to the cost attribution cardinality per user limit. This setting does not apply to active series, which are managed separately. (default 20m0s)
+  -cost-attribution.registry-path string
+    	[experimental] Defines a custom path for the registry. When specified, Mimir exposes cost attribution metrics through this custom path. If not specified, cost attribution metrics aren't exposed.
   -debug.block-profile-rate int
     	Fraction of goroutine blocking events that are reported in the blocking profile. 1 to include every blocking event in the profile, 0 to disable.
   -debug.mutex-profile-fraction int
@@ -3323,10 +3329,18 @@ Usage of ./cmd/mimir/mimir:
     	Enable anonymous usage reporting. (default true)
   -usage-stats.installation-mode string
     	Installation mode. Supported values: custom, helm, jsonnet. (default "custom")
+  -validation.cost-attribution-cooldown duration
+    	[experimental] Defines how long cost attribution stays in overflow before attempting a reset, with received/discarded samples extending the cooldown if overflow persists, while active series reset and restart tracking after the cooldown.
+  -validation.cost-attribution-labels comma-separated-list-of-strings
+    	[experimental] Defines labels for cost attribution. Applies to metrics like cortex_distributor_received_attributed_samples_total. To disable, set to an empty string. For example, 'team,service' produces metrics such as cortex_distributor_received_attributed_samples_total{team='frontend', service='api'}.
   -validation.create-grace-period duration
     	Controls how far into the future incoming samples and exemplars are accepted compared to the wall clock. Any sample or exemplar will be rejected if its timestamp is greater than '(now + creation_grace_period)'. This configuration is enforced in the distributor and ingester. (default 10m)
   -validation.enforce-metadata-metric-name
     	Enforce every metadata has a metric name. (default true)
+  -validation.max-cost-attribution-cardinality-per-user int
+    	[experimental] Maximum cardinality of cost attribution labels allowed per user. (default 10000)
+  -validation.max-cost-attribution-labels-per-user int
+    	[experimental] Maximum number of cost attribution labels allowed per user, the value is capped at 4. (default 2)
   -validation.max-label-names-per-info-series int
     	Maximum number of label names per info series. Has no effect if less than the value of the maximum number of label names per series option (-validation.max-label-names-per-series) (default 80)
   -validation.max-label-names-per-series int

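Note on `-validation.cost-attribution-labels` and `-validation.max-cost-attribution-labels-per-user`: the help text describes a comma-separated list, a per-user maximum (default 2), and a hard cap of 4. The sketch below shows one way these three rules could combine. It is an assumption for illustration only: `parseCostAttributionLabels` and `hardLabelCap` are hypothetical names, and this excerpt does not show whether Mimir rejects or truncates an oversized list; the sketch rejects it.

```go
package main

import (
	"fmt"
	"strings"
)

// hardLabelCap reflects the flag description: "the value is capped at 4".
const hardLabelCap = 4

// parseCostAttributionLabels splits a comma-separated flag value into label
// names and checks the count against the effective limit.
// An empty input returns no labels, which corresponds to the feature being disabled.
func parseCostAttributionLabels(raw string, maxLabels int) ([]string, error) {
	if maxLabels > hardLabelCap {
		maxLabels = hardLabelCap // the configured limit cannot exceed the cap
	}
	var labels []string
	for _, part := range strings.Split(raw, ",") {
		if name := strings.TrimSpace(part); name != "" {
			labels = append(labels, name)
		}
	}
	if len(labels) > maxLabels {
		// Assumption: reject rather than silently truncate.
		return nil, fmt.Errorf("%d cost attribution labels configured, limit is %d", len(labels), maxLabels)
	}
	return labels, nil
}
```

With the default limit of 2, `"team,service"` is accepted and `"team,service,region"` is rejected. Raising the limit to 10 still rejects five labels because of the cap.
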
diff --git a/docs/sources/mimir/configure/about-versioning.md b/docs/sources/mimir/configure/about-versioning.md
@@ -46,6 +46,19 @@ Experimental configuration and flags are subject to change.
 
 The following features are currently experimental:
 
+- Cost attribution
+  - Configure labels for cost attribution
+    - `-validation.cost-attribution-labels`
+  - Configure cost attribution limits, such as label cardinality and the maximum number of cost attribution labels
+    - `-validation.max-cost-attribution-labels-per-user`
+    - `-validation.max-cost-attribution-cardinality-per-user`
+  - Configure cooldown periods and eviction intervals for cost attribution
+    - `-validation.cost-attribution-cooldown`
+    - `-cost-attribution.eviction-interval`
+  - Configure the metrics endpoint dedicated to cost attribution
+    - `-cost-attribution.registry-path`
+  - Configure the cost attribution cleanup process run interval
+    - `-cost-attribution.cleanup-interval`
 - Alertmanager
   - Enable a set of experimental API endpoints to help support the migration of the Grafana Alertmanager to the Mimir Alertmanager.
     - `-alertmanager.grafana-alertmanager-compatibility-enabled`

diff --git a/docs/sources/mimir/configure/configuration-parameters/index.md b/docs/sources/mimir/configure/configuration-parameters/index.md
@@ -455,6 +455,24 @@ overrides_exporter:
 # (experimental) Enables optimized marshaling of timeseries.
 # CLI flag: -timeseries-unmarshal-caching-optimization-enabled
 [timeseries_unmarshal_caching_optimization_enabled: <boolean> | default = true]
+
+# (experimental) Specifies how often inactive cost attributions for received and
+# discarded sample trackers are evicted from the counter, ensuring they do not
+# contribute to the cost attribution cardinality per user limit. This setting
+# does not apply to active series, which are managed separately.
+# CLI flag: -cost-attribution.eviction-interval
+[cost_attribution_eviction_interval: <duration> | default = 20m]
+
+# (experimental) Defines a custom path for the registry. When specified, Mimir
+# exposes cost attribution metrics through this custom path. If not specified,
+# cost attribution metrics aren't exposed.
+# CLI flag: -cost-attribution.registry-path
+[cost_attribution_registry_path: <string> | default = ""]
+
+# (experimental) Time interval at which the cost attribution cleanup process
+# runs, ensuring inactive cost attribution entries are purged.
+# CLI flag: -cost-attribution.cleanup-interval
+[cost_attribution_cleanup_interval: <duration> | default = 3m]
 ```
 
 ### common
@@ -3599,6 +3617,31 @@ The `limits` block configures default and per-tenant limits imposed by component
 # CLI flag: -querier.active-series-results-max-size-bytes
 [active_series_results_max_size_bytes: <int> | default = 419430400]
 
+# (experimental) Defines labels for cost attribution. Applies to metrics like
+# cortex_distributor_received_attributed_samples_total. To disable, set to an
+# empty string. For example, 'team,service' produces metrics such as
+# cortex_distributor_received_attributed_samples_total{team='frontend',
+# service='api'}.
+# CLI flag: -validation.cost-attribution-labels
+[cost_attribution_labels: <string> | default = ""]
+
+# (experimental) Maximum number of cost attribution labels allowed per user, the
+# value is capped at 4.
+# CLI flag: -validation.max-cost-attribution-labels-per-user
+[max_cost_attribution_labels_per_user: <int> | default = 2]
+
+# (experimental) Maximum cardinality of cost attribution labels allowed per
+# user.
+# CLI flag: -validation.max-cost-attribution-cardinality-per-user
+[max_cost_attribution_cardinality_per_user: <int> | default = 10000]
+
+# (experimental) Defines how long cost attribution stays in overflow before
+# attempting a reset, with received/discarded samples extending the cooldown if
+# overflow persists, while active series reset and restart tracking after the
+# cooldown.
+# CLI flag: -validation.cost-attribution-cooldown
+[cost_attribution_cooldown: <duration> | default = 0s]
+
 # Duration to delay the evaluation of rules to ensure the underlying metrics
 # have been pushed.
 # CLI flag: -ruler.evaluation-delay-duration

diff --git a/pkg/api/api.go b/pkg/api/api.go
@@ -20,6 +20,7 @@ import (
 	"github.com/grafana/dskit/middleware"
 	"github.com/grafana/dskit/server"
 	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/promhttp"
 
 	"github.com/grafana/mimir/pkg/alertmanager"
 	"github.com/grafana/mimir/pkg/alertmanager/alertmanagerpb"
@@ -281,6 +282,11 @@ func (a *API) RegisterDistributor(d *distributor.Distributor, pushConfig distrib
 	a.RegisterRoute("/distributor/ha_tracker", d.HATracker, false, true, "GET")
 }
 
+// RegisterCostAttribution registers a Prometheus HTTP handler for the cost attribution metrics.
+func (a *API) RegisterCostAttribution(customRegistryPath string, reg *prometheus.Registry) {
+	a.RegisterRoute(customRegistryPath, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), false, false, "GET")
+}
+
 // Ingester is defined as an interface to allow for alternative implementations
 // of ingesters to be passed into the API.RegisterIngester() method.
 type Ingester interface {

diff --git a/pkg/blockbuilder/tsdb.go b/pkg/blockbuilder/tsdb.go
@@ -50,7 +50,7 @@ type TSDBBuilder struct {
 var softErrProcessor = mimir_storage.NewSoftAppendErrorProcessor(
 	func() {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
 	func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {}, func(int64, []mimirpb.LabelAdapter) {},
-	func() {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
+	func([]mimirpb.LabelAdapter) {}, func([]mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
 	func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
 	func(error, int64, []mimirpb.LabelAdapter) {}, func(error, int64, []mimirpb.LabelAdapter) {},
 )
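
Note on the callback signature change above: one no-op callback changes from `func()` to `func([]mimirpb.LabelAdapter)`. A plausible reason, stated here as an assumption, is that a callback needs the series labels to attribute an event to the configured cost attribution labels. The sketch below shows that idea in isolation. The `label` type stands in for `mimirpb.LabelAdapter` (assumed to carry a name and a value), and `attributionValues` and `newDiscardCallback` are hypothetical names.

```go
package main

import "strings"

// label is a stand-in for a series label with a name and a value.
type label struct {
	Name, Value string
}

// attributionValues returns, for each tracked label name, the value found on
// the series. A missing label yields "" (a placeholder choice of this sketch).
func attributionValues(series []label, tracked []string) []string {
	values := make([]string, len(tracked))
	for i, name := range tracked {
		for _, l := range series {
			if l.Name == name {
				values[i] = l.Value
				break
			}
		}
	}
	return values
}

// newDiscardCallback returns a callback that takes series labels and
// increments a counter keyed by the attribution values.
// Joining with "," is adequate for a sketch but would collide if values
// themselves contained commas.
func newDiscardCallback(tracked []string, counts map[string]int) func([]label) {
	return func(series []label) {
		key := strings.Join(attributionValues(series, tracked), ",")
		counts[key]++
	}
}
```

A callback without arguments could only bump a global counter. With the labels available, a series carrying `team="frontend"` and `service="api"` can be counted under its own attribution key.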