
Ingester/Distributor: Add support for exporting cost attribution metrics #10269

Merged: 108 commits into main from final-cost-attribution on Jan 17, 2025

Conversation

@ying-jeanne (Contributor) commented Dec 17, 2024

What this PR does

This is a follow-up to #9733.

This PR exports additional attributed metrics from the distributor and ingester, so that received samples, discarded samples, and active series can be attributed by the cost attribution label.
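
For context, the metrics added here (named in the CHANGELOG entry further down) are exported with customer-specified labels to a custom Prometheus registry. A minimal, hypothetical sketch of that kind of wiring with client_golang; the `team` label, tenant value, and endpoint path are illustrative only, not taken from this PR:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Dedicated registry so cost attribution series stay separate from the default /metrics output.
	costAttributionReg := prometheus.NewRegistry()

	// Received samples, attributed by tenant and by the tenant-configured cost attribution label
	// ("team" is only an example of such a label).
	receivedAttributedSamples := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cortex_distributor_received_attributed_samples_total",
		Help: "The total number of samples received, attributed by cost attribution label.",
	}, []string{"user", "team"})
	costAttributionReg.MustRegister(receivedAttributedSamples)

	// On the write path, the distributor would increment this per request.
	receivedAttributedSamples.WithLabelValues("tenant-1", "team-a").Add(100)

	// Expose the custom registry on its own endpoint.
	http.Handle("/cost-attribution/metrics", promhttp.HandlerFor(costAttributionReg, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":8080", nil)
}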

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@ying-jeanne force-pushed the final-cost-attribution branch from 5165a5b to 6f36b5f on December 17, 2024 21:07
@ying-jeanne changed the title from "Final cost attribution" to "MVP: Cost attribution" on Dec 17, 2024
@ying-jeanne requested a review from colega on December 17, 2024 21:10
@ying-jeanne force-pushed the final-cost-attribution branch from 6f36b5f to 077a94a on December 17, 2024 22:01
@ying-jeanne force-pushed the final-cost-attribution branch from 077a94a to f04c28f on December 17, 2024 22:08
@ying-jeanne marked this pull request as ready for review on December 17, 2024 22:13
@ying-jeanne requested review from tacole02 and a team as code owners on December 17, 2024 22:13
Resolved (outdated) review threads on:
  • pkg/costattribution/manager.go (4)
  • pkg/ingester/activeseries/active_series.go (2)
  • pkg/ingester/activeseries/active_series_test.go
  • pkg/ingester/ingester.go
  • pkg/mimir/modules.go
@tacole02 (Contributor) left a comment:

Thanks for updating the docs! I left a few suggestions.

Comment on lines 203 to 209
o.discardedSampleMtx.Lock()
if _, ok := o.discardedSample[*reason]; ok {
	o.discardedSample[*reason].Add(discardedSampleIncrement)
} else {
	o.discardedSample[*reason] = atomic.NewFloat64(discardedSampleIncrement)
}
o.discardedSampleMtx.Unlock()
Reviewer (Contributor):
Either we switch discardedSampleMtx to a RWMutex (calling .Add() while holding just the RLock, and grabbing the full Lock for creation, with a second check), or we switch the map to use non-atomic values (why do we need them atomic while we're holding a mutex?).

Given that the number of reasons is most likely small, I would go for a RWMutex.

If we don't do this, then a customer that is discarding lots of samples would cause lock contention here.
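
A minimal sketch of the RWMutex fast-path/slow-path shape being suggested here (the type, method name, and string map key are assumptions inferred from the snippet above; the merged code may differ):

package costattribution

import (
	"sync"

	"go.uber.org/atomic"
)

// Hypothetical container for the per-reason discarded-sample counters.
type observation struct {
	discardedSampleMtx sync.RWMutex
	discardedSample    map[string]*atomic.Float64
}

func (o *observation) addDiscardedSample(reason string, inc float64) {
	// Fast path: the reason already exists, so a read lock plus the atomic Add is enough.
	o.discardedSampleMtx.RLock()
	if c, ok := o.discardedSample[reason]; ok {
		c.Add(inc)
		o.discardedSampleMtx.RUnlock()
		return
	}
	o.discardedSampleMtx.RUnlock()

	// Slow path: take the write lock to create the entry, re-checking in case another
	// goroutine created it between the two lock acquisitions.
	o.discardedSampleMtx.Lock()
	if c, ok := o.discardedSample[reason]; ok {
		c.Add(inc)
	} else {
		o.discardedSample[reason] = atomic.NewFloat64(inc)
	}
	o.discardedSampleMtx.Unlock()
}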

@ying-jeanne (Contributor, Author):

Addressed here: 5ac64f5

func (st *SampleTracker) recoveredFromOverflow(deadline time.Time) bool {
	st.observedMtx.RLock()
	if st.overflowSince.Load() > 0 && time.Unix(st.overflowSince.Load(), 0).Add(st.cooldownDuration).Before(deadline) {
		if len(st.observed) <= st.maxCardinality {
Reviewer (Contributor):
Previously we checked:

		// if it is not known, we need to check if the max cardinality is exceeded
		if len(st.observed) >= st.maxCardinality {
			st.overflowSince.Store(ts.Unix())
		}

So equals meant we're in overflow, but now equals means we've recovered from overflow. This is going to cause continuous flapping there.

@ying-jeanne (Contributor, Author):

recoveredFromOverflow was introduced to reduce active_series recounting; what do you think about removing it?

@ying-jeanne (Contributor, Author):

In the meantime, changed the condition to strictly less than maxCardinality: e5cda41
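
For clarity, the resulting boundary conditions are complementary: `>=` puts the tracker into overflow and only a strict `<` counts as recovery, so the boundary value itself cannot flap between the two states. Illustrative helpers only; the real checks also consider overflowSince and the cooldown, as in the snippets above:

package costattribution

// enterOverflow mirrors the existing check: at or above the limit means overflow.
func enterOverflow(observed, maxCardinality int) bool {
	return observed >= maxCardinality
}

// recoveredFromOverflow uses the strict comparison from e5cda41: only strictly
// below the limit counts as recovery.
func recoveredFromOverflow(observed, maxCardinality int) bool {
	return observed < maxCardinality
}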

Comment on lines 220 to 223
invalidKeys := st.inactiveObservations(deadline)
for _, key := range invalidKeys {
	st.cleanupTrackerAttribution(key)
}
@colega (Contributor) commented Jan 16, 2025:

This is very inefficient because you have to take a write mutex for each observation to clean up.

On the other hand, you're not re-checking the observations before cleaning them up.

I would suggest delegating this logic to the tracker:

Suggested change:
-	invalidKeys := st.inactiveObservations(deadline)
-	for _, key := range invalidKeys {
-		st.cleanupTrackerAttribution(key)
-	}
+	st.cleanupInactiveObservations(deadline)

Then in cleanupInactiveObservations, build a slice of observations to clean up while holding a read mutex, then take the write mutex and iterate through those, deleting the ones that are still inactive (maybe they became active in the meantime?).
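
A sketch of what the suggested cleanupInactiveObservations could look like (the observation type and its lastUpdate field are assumptions; the version merged in 41e9f47 may differ):

package costattribution

import (
	"sync"
	"time"

	"go.uber.org/atomic"
)

// Hypothetical shapes, inferred from the snippets in this thread.
type observation struct {
	lastUpdate atomic.Int64
}

type SampleTracker struct {
	observedMtx sync.RWMutex
	observed    map[string]*observation
}

func (st *SampleTracker) cleanupInactiveObservations(deadline time.Time) {
	// First pass: collect candidate keys while holding only the read lock.
	st.observedMtx.RLock()
	invalidKeys := make([]string, 0, len(st.observed))
	for key, o := range st.observed {
		if o.lastUpdate.Load() <= deadline.Unix() {
			invalidKeys = append(invalidKeys, key)
		}
	}
	st.observedMtx.RUnlock()

	if len(invalidKeys) == 0 {
		return
	}

	// Second pass: take the write lock once and delete only the keys that are still
	// inactive, since they may have become active in the meantime.
	st.observedMtx.Lock()
	for _, key := range invalidKeys {
		if o, ok := st.observed[key]; ok && o.lastUpdate.Load() <= deadline.Unix() {
			delete(st.observed, key)
		}
	}
	st.observedMtx.Unlock()
}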

@ying-jeanne (Contributor, Author):

Addressed here: 41e9f47

@ying-jeanne ying-jeanne requested a review from colega January 16, 2025 14:34
		at.overflowCounter.Dec()
		return
	}
	defer at.observedMtx.RUnlock()
Reviewer (Contributor):

This should be before the if. We're not unlocking in the previous return statement.
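
A sketch of the suggested ordering: take the read lock and defer the unlock immediately, so the early return in the overflow branch also releases it. The type fields, function name, and overflow condition below are placeholders, not the exact code from this PR:

package costattribution

import (
	"sync"

	"go.uber.org/atomic"
)

// Hypothetical fields, inferred from the snippet above.
type ActiveSeriesTracker struct {
	observedMtx     sync.RWMutex
	overflowSince   atomic.Int64
	overflowCounter atomic.Int64
}

func (at *ActiveSeriesTracker) decrementActiveSeries() {
	at.observedMtx.RLock()
	defer at.observedMtx.RUnlock()

	if at.overflowSince.Load() > 0 {
		at.overflowCounter.Dec()
		return // the deferred RUnlock still runs on this path
	}

	// ... non-overflow handling elided ...
}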

@ying-jeanne (Contributor, Author):

Addressed here: 80b64dc

@ying-jeanne ying-jeanne requested a review from colega January 16, 2025 15:35
CHANGELOG.md (outdated diff)
@@ -9,6 +9,7 @@
* [CHANGE] Querier: pass query matchers to queryable `IsApplicable` hook. #10256
* [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_strong_consistency_requests_total`, `cortex_ingest_storage_strong_consistency_failures_total`, and `cortex_ingest_storage_strong_consistency_wait_duration_seconds` metrics. #10220
* [CHANGE] Ruler: cap the rate of retries for remote query evaluation to 170/sec. This is configurable via `-ruler.query-frontend.max-retries-rate`. #10375 #10403
* [CHANGE] Ingester/Distributor: Add support for exporting cost attribution metrics (`cortex_ingester_attributed_active_series`, `cortex_distributor_received_attributed_samples_total`, and `cortex_discarded_attributed_samples_total`) with labels specified by customers to a custom Prometheus registry. This feature enables more flexible billing data tracking. #10269
Reviewer (Contributor):

Sorry for the late comment, but this should be a FEATURE.

@ying-jeanne merged commit bd6e14b into main on Jan 17, 2025
31 checks passed
@ying-jeanne deleted the final-cost-attribution branch on January 17, 2025 09:23