
Ingester/Distributor: Add support for exporting cost attribution metrics #10269

Merged: 108 commits into main from final-cost-attribution on Jan 17, 2025

Conversation

@ying-jeanne (Contributor) commented Dec 17, 2024

What this PR does

This is a follow-up to #9733.

This PR exports additional attributed metrics from the distributor and ingester, so that received samples, discarded samples, and active series can be attributed by the cost attribution label.
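
For context, the metrics added here (named in the CHANGELOG entry further down) are exported with customer-specified labels to a custom Prometheus registry. A minimal, hypothetical sketch of that kind of wiring with client_golang; the `team` label, tenant value, and endpoint path are illustrative only, not taken from this PR:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Dedicated registry so cost attribution series stay separate from the default /metrics output.
	costAttributionReg := prometheus.NewRegistry()

	// Received samples, attributed by tenant and by the tenant-configured cost attribution label
	// ("team" is only an example of such a label).
	receivedAttributedSamples := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cortex_distributor_received_attributed_samples_total",
		Help: "The total number of samples received, attributed by cost attribution label.",
	}, []string{"user", "team"})
	costAttributionReg.MustRegister(receivedAttributedSamples)

	// On the write path, the distributor would increment this per request.
	receivedAttributedSamples.WithLabelValues("tenant-1", "team-a").Add(100)

	// Expose the custom registry on its own endpoint.
	http.Handle("/cost-attribution/metrics", promhttp.HandlerFor(costAttributionReg, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":8080", nil)
}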

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@ying-jeanne force-pushed the final-cost-attribution branch from 5165a5b to 6f36b5f on December 17, 2024 21:07
@ying-jeanne changed the title from "Final cost attribution" to "MVP: Cost attribution" on Dec 17, 2024
@ying-jeanne requested a review from colega on December 17, 2024 21:10
@ying-jeanne force-pushed the final-cost-attribution branch from 6f36b5f to 077a94a on December 17, 2024 22:01
@ying-jeanne force-pushed the final-cost-attribution branch from 077a94a to f04c28f on December 17, 2024 22:08
@ying-jeanne marked this pull request as ready for review on December 17, 2024 22:13
@ying-jeanne requested review from tacole02 and a team as code owners on December 17, 2024 22:13
Resolved (outdated) review threads on:
  • pkg/costattribution/manager.go (4)
  • pkg/ingester/activeseries/active_series.go (2)
  • pkg/ingester/activeseries/active_series_test.go
  • pkg/ingester/ingester.go
  • pkg/mimir/modules.go
@tacole02 (Contributor) left a comment:

Thanks for updating the docs! I left a few suggestions.

Comment on lines 203 to 209
o.discardedSampleMtx.Lock()
if _, ok := o.discardedSample[*reason]; ok {
	o.discardedSample[*reason].Add(discardedSampleIncrement)
} else {
	o.discardedSample[*reason] = atomic.NewFloat64(discardedSampleIncrement)
}
o.discardedSampleMtx.Unlock()
Reviewer (Contributor):
Either we switch discardedSampleMtx to a RWMutex (calling .Add() while holding just the RLock, and grabbing the full Lock for creation, with a second check), or we switch the map to use non-atomic values (why do we need them atomic while we're holding a mutex?).

Given that the number of reasons is most likely small, I would go for a RWMutex.

If we don't do this, then a customer that is discarding lots of samples would cause lock contention here.
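
A minimal sketch of the RWMutex fast-path/slow-path shape being suggested here (the type, method name, and string map key are assumptions inferred from the snippet above; the merged code may differ):

package costattribution

import (
	"sync"

	"go.uber.org/atomic"
)

// Hypothetical container for the per-reason discarded-sample counters.
type observation struct {
	discardedSampleMtx sync.RWMutex
	discardedSample    map[string]*atomic.Float64
}

func (o *observation) addDiscardedSample(reason string, inc float64) {
	// Fast path: the reason already exists, so a read lock plus the atomic Add is enough.
	o.discardedSampleMtx.RLock()
	if c, ok := o.discardedSample[reason]; ok {
		c.Add(inc)
		o.discardedSampleMtx.RUnlock()
		return
	}
	o.discardedSampleMtx.RUnlock()

	// Slow path: take the write lock to create the entry, re-checking in case another
	// goroutine created it between the two lock acquisitions.
	o.discardedSampleMtx.Lock()
	if c, ok := o.discardedSample[reason]; ok {
		c.Add(inc)
	} else {
		o.discardedSample[reason] = atomic.NewFloat64(inc)
	}
	o.discardedSampleMtx.Unlock()
}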

@ying-jeanne (Contributor, Author):

Addressed here: 5ac64f5

func (st *SampleTracker) recoveredFromOverflow(deadline time.Time) bool {
	st.observedMtx.RLock()
	if st.overflowSince.Load() > 0 && time.Unix(st.overflowSince.Load(), 0).Add(st.cooldownDuration).Before(deadline) {
		if len(st.observed) <= st.maxCardinality {
Reviewer (Contributor):
Previously we checked:

		// if it is not known, we need to check if the max cardinality is exceeded
		if len(st.observed) >= st.maxCardinality {
			st.overflowSince.Store(ts.Unix())
		}

So equals meant we're in overflow, but now equals means we've recovered from overflow. This is going to cause continuous flapping there.

@ying-jeanne (Contributor, Author):

recoveredFromOverflow was introduced to reduce active_series recounting; what do you think about removing it?

@ying-jeanne (Contributor, Author):

In the meantime, changed the condition to strictly less than maxCardinality: e5cda41
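
For clarity, the resulting boundary conditions are complementary: `>=` puts the tracker into overflow and only a strict `<` counts as recovery, so the boundary value itself cannot flap between the two states. Illustrative helpers only; the real checks also consider overflowSince and the cooldown, as in the snippets above:

package costattribution

// enterOverflow mirrors the existing check: at or above the limit means overflow.
func enterOverflow(observed, maxCardinality int) bool {
	return observed >= maxCardinality
}

// recoveredFromOverflow uses the strict comparison from e5cda41: only strictly
// below the limit counts as recovery.
func recoveredFromOverflow(observed, maxCardinality int) bool {
	return observed < maxCardinality
}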

Comment on lines 220 to 223
invalidKeys := st.inactiveObservations(deadline)
for _, key := range invalidKeys {
	st.cleanupTrackerAttribution(key)
}
@colega (Contributor) commented Jan 16, 2025:

This is very inefficient because you have to take a write mutex for each observation to clean up.

On the other hand, you're not re-checking the observations before cleaning them up.

I would suggest delegating this logic to the tracker:

Suggested change:
-	invalidKeys := st.inactiveObservations(deadline)
-	for _, key := range invalidKeys {
-		st.cleanupTrackerAttribution(key)
-	}
+	st.cleanupInactiveObservations(deadline)

Then in cleanupInactiveObservations, build a slice of observations to clean up while holding a read mutex, then take the write mutex and iterate through those, deleting the ones that are still inactive (maybe they became active in the meantime?).
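
A sketch of what the suggested cleanupInactiveObservations could look like (the observation type and its lastUpdate field are assumptions; the version merged in 41e9f47 may differ):

package costattribution

import (
	"sync"
	"time"

	"go.uber.org/atomic"
)

// Hypothetical shapes, inferred from the snippets in this thread.
type observation struct {
	lastUpdate atomic.Int64
}

type SampleTracker struct {
	observedMtx sync.RWMutex
	observed    map[string]*observation
}

func (st *SampleTracker) cleanupInactiveObservations(deadline time.Time) {
	// First pass: collect candidate keys while holding only the read lock.
	st.observedMtx.RLock()
	invalidKeys := make([]string, 0, len(st.observed))
	for key, o := range st.observed {
		if o.lastUpdate.Load() <= deadline.Unix() {
			invalidKeys = append(invalidKeys, key)
		}
	}
	st.observedMtx.RUnlock()

	if len(invalidKeys) == 0 {
		return
	}

	// Second pass: take the write lock once and delete only the keys that are still
	// inactive, since they may have become active in the meantime.
	st.observedMtx.Lock()
	for _, key := range invalidKeys {
		if o, ok := st.observed[key]; ok && o.lastUpdate.Load() <= deadline.Unix() {
			delete(st.observed, key)
		}
	}
	st.observedMtx.Unlock()
}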

@ying-jeanne (Contributor, Author):

Addressed here: 41e9f47

@ying-jeanne ying-jeanne requested a review from colega January 16, 2025 14:34
		at.overflowCounter.Dec()
		return
	}
	defer at.observedMtx.RUnlock()
Reviewer (Contributor):

This should be before the if. We're not unlocking in the previous return statement.
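
A sketch of the suggested ordering: take the read lock and defer the unlock immediately, so the early return in the overflow branch also releases it. The type fields, function name, and overflow condition below are placeholders, not the exact code from this PR:

package costattribution

import (
	"sync"

	"go.uber.org/atomic"
)

// Hypothetical fields, inferred from the snippet above.
type ActiveSeriesTracker struct {
	observedMtx     sync.RWMutex
	overflowSince   atomic.Int64
	overflowCounter atomic.Int64
}

func (at *ActiveSeriesTracker) decrementActiveSeries() {
	at.observedMtx.RLock()
	defer at.observedMtx.RUnlock()

	if at.overflowSince.Load() > 0 {
		at.overflowCounter.Dec()
		return // the deferred RUnlock still runs on this path
	}

	// ... non-overflow handling elided ...
}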

@ying-jeanne (Contributor, Author):

Addressed here: 80b64dc

@ying-jeanne ying-jeanne requested a review from colega January 16, 2025 15:35
CHANGELOG.md (outdated diff)
@@ -9,6 +9,7 @@
* [CHANGE] Querier: pass query matchers to queryable `IsApplicable` hook. #10256
* [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_strong_consistency_requests_total`, `cortex_ingest_storage_strong_consistency_failures_total`, and `cortex_ingest_storage_strong_consistency_wait_duration_seconds` metrics. #10220
* [CHANGE] Ruler: cap the rate of retries for remote query evaluation to 170/sec. This is configurable via `-ruler.query-frontend.max-retries-rate`. #10375 #10403
* [CHANGE] Ingester/Distributor: Add support for exporting cost attribution metrics (`cortex_ingester_attributed_active_series`, `cortex_distributor_received_attributed_samples_total`, and `cortex_discarded_attributed_samples_total`) with labels specified by customers to a custom Prometheus registry. This feature enables more flexible billing data tracking. #10269
Reviewer (Contributor):

Sorry for the late comment, but this should be a FEATURE.

@ying-jeanne merged commit bd6e14b into main on Jan 17, 2025
31 checks passed
@ying-jeanne deleted the final-cost-attribution branch on January 17, 2025 09:23