Ingester/Distributor: Add support for exporting cost attribution metrics #10269
Conversation
Force-pushed from 5165a5b to 6f36b5f
Force-pushed from 6f36b5f to 077a94a
Force-pushed from 077a94a to f04c28f
Thanks for updating the docs! I left a few suggestions.
```go
o.discardedSampleMtx.Lock()
if _, ok := o.discardedSample[*reason]; ok {
	o.discardedSample[*reason].Add(discardedSampleIncrement)
} else {
	o.discardedSample[*reason] = atomic.NewFloat64(discardedSampleIncrement)
}
o.discardedSampleMtx.Unlock()
```
Either we switch `discardedSampleMtx` to a `RWMutex` (and we call `.Add()` while holding just the `RLock`, grabbing the `Lock` for creation, with a second check), or we switch the map to use non-atomic values (why do we need them atomic while we're holding a mutex?).

Given that the number of reasons is most likely small, I would go for a `RWMutex`.

If we don't do this, then a customer that is discarding lots of samples would cause lock contention here.
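To make the suggestion concrete, here is a minimal sketch of the read-fast-path/write-slow-path pattern, assuming the map shape from the snippet above; the `observer` type and `addDiscardedSample` name are invented for illustration, not the PR's final code.

```go
package costattribution

import (
	"sync"

	"go.uber.org/atomic"
)

// observer is a stand-in for the real struct owning the map.
type observer struct {
	discardedSampleMtx sync.RWMutex
	discardedSample    map[string]*atomic.Float64
}

func (o *observer) addDiscardedSample(reason string, inc float64) {
	// Fast path: the reason usually exists already, so the read lock is
	// enough; the value is atomic, so concurrent Adds are safe under RLock.
	o.discardedSampleMtx.RLock()
	c, ok := o.discardedSample[reason]
	o.discardedSampleMtx.RUnlock()
	if ok {
		c.Add(inc)
		return
	}

	// Slow path: take the write lock to create the entry, with a second
	// check because another goroutine may have created it in the meantime.
	o.discardedSampleMtx.Lock()
	defer o.discardedSampleMtx.Unlock()
	if c, ok := o.discardedSample[reason]; ok {
		c.Add(inc)
		return
	}
	o.discardedSample[reason] = atomic.NewFloat64(inc)
}
```

Under this scheme the atomic values earn their keep: they are what makes `Add` safe while holding only the read lock.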
Addressed here: 5ac64f5
```go
func (st *SampleTracker) recoveredFromOverflow(deadline time.Time) bool {
	st.observedMtx.RLock()
	if st.overflowSince.Load() > 0 && time.Unix(st.overflowSince.Load(), 0).Add(st.cooldownDuration).Before(deadline) {
		if len(st.observed) <= st.maxCardinality {
```
Previously we checked:

```go
// if it is not known, we need to check if the max cardinality is exceeded
if len(st.observed) >= st.maxCardinality {
	st.overflowSince.Store(ts.Unix())
}
```

So equals meant we're in overflow, but now equals means we've recovered from overflow. This is going to cause continuous flapping there.
The recoveredFromOverflow check was introduced to reduce active_series recounting; what do you think about removing it?
Changed the condition to strict less-than against maxCardinality in the meantime: e5cda41
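For illustration, here is a hedged sketch of the two boundary checks after that change; these free functions are invented to isolate the comparison and are not the tracker's real methods.

```go
// enteredOverflow mirrors the write-path check: reaching maxCardinality
// puts the tracker into overflow.
func enteredOverflow(observed, maxCardinality int) bool {
	return observed >= maxCardinality
}

// recoveredFromOverflow must use strict less-than: with <=, a tracker
// sitting exactly at maxCardinality would count as both "in overflow"
// and "recovered", flapping on every cooldown evaluation.
func recoveredFromOverflow(observed, maxCardinality int) bool {
	return observed < maxCardinality
}
```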
pkg/costattribution/manager.go (outdated)
```go
invalidKeys := st.inactiveObservations(deadline)
for _, key := range invalidKeys {
	st.cleanupTrackerAttribution(key)
}
```
This is very inefficient because you have to take a write mutex for each observation to clean up. On the other hand, you're not re-checking the observations before cleaning them up.

I would suggest delegating this logic to the tracker:

```diff
-invalidKeys := st.inactiveObservations(deadline)
-for _, key := range invalidKeys {
-	st.cleanupTrackerAttribution(key)
-}
+st.cleanupInactiveObservations(deadline)
```

Then in `cleanupInactiveObservations` you build a slice of observations to clean up while taking a read mutex; then you take a write mutex and iterate through those, deleting the ones that are still inactive (maybe they became active in the meantime?).
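A sketch of what that could look like, under the assumption that `observed` maps attribution keys to values carrying a last-activity timestamp; the `trackedObservation` type and its `lastUpdate` field are placeholders for illustration, not the PR's actual code.

```go
// trackedObservation stands in for the real per-key value; SampleTracker is
// assumed to hold observedMtx sync.RWMutex and
// observed map[string]*trackedObservation.
type trackedObservation struct {
	lastUpdate *atomic.Int64 // unix seconds of the last sample seen (assumed field)
}

func (st *SampleTracker) cleanupInactiveObservations(deadline time.Time) {
	// First pass: collect candidates under the read lock only, so the hot
	// ingest path is not blocked while we scan.
	st.observedMtx.RLock()
	candidates := make([]string, 0, len(st.observed))
	for key, o := range st.observed {
		if o.lastUpdate.Load() < deadline.Unix() {
			candidates = append(candidates, key)
		}
	}
	st.observedMtx.RUnlock()

	// Second pass: take the write lock once, re-check each candidate, and
	// delete only the ones still inactive (they may have become active in
	// the meantime).
	st.observedMtx.Lock()
	defer st.observedMtx.Unlock()
	for _, key := range candidates {
		if o, ok := st.observed[key]; ok && o.lastUpdate.Load() < deadline.Unix() {
			delete(st.observed, key)
		}
	}
}
```

This takes the write lock exactly once per cleanup cycle instead of once per observation, while the re-check preserves correctness against concurrent updates.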
Addressed here: 41e9f47
```go
	at.overflowCounter.Dec()
	return
}
defer at.observedMtx.RUnlock()
```
This should be before the `if`. We're not unlocking in the previous `return` statement.
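A sketch of the fixed shape, with the surrounding function and field types invented for illustration; the point is only that the deferred `RUnlock` precedes the `if`, so the early return releases the lock too.

```go
// ActiveTracker stands in for the real receiver type; overflowCounter is
// assumed to be an atomic integer for this sketch.
type ActiveTracker struct {
	observedMtx     sync.RWMutex
	overflowCounter *atomic.Int64
}

func (at *ActiveTracker) decrement(overflow bool) {
	at.observedMtx.RLock()
	// Deferred before any branching: now every return path unlocks.
	defer at.observedMtx.RUnlock()

	if overflow {
		at.overflowCounter.Dec()
		return // previously this path returned while still holding the read lock
	}
	// ... normal path continues under the read lock ...
}
```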
Addressed here: 80b64dc
Co-authored-by: Oleg Zaytsev <[email protected]>
CHANGELOG.md (outdated)
```diff
@@ -9,6 +9,7 @@
 * [CHANGE] Querier: pass query matchers to queryable `IsApplicable` hook. #10256
 * [CHANGE] Query-frontend: Add `topic` label to `cortex_ingest_storage_strong_consistency_requests_total`, `cortex_ingest_storage_strong_consistency_failures_total`, and `cortex_ingest_storage_strong_consistency_wait_duration_seconds` metrics. #10220
 * [CHANGE] Ruler: cap the rate of retries for remote query evaluation to 170/sec. This is configurable via `-ruler.query-frontend.max-retries-rate`. #10375 #10403
+* [CHANGE] Ingester/Distributor: Add support for exporting cost attribution metrics (`cortex_ingester_attributed_active_series`, `cortex_distributor_received_attributed_samples_total`, and `cortex_discarded_attributed_samples_total`) with labels specified by customers to a custom Prometheus registry. This feature enables more flexible billing data tracking. #10269
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for a late comment, but this should be a FEATURE.
What this PR does
This is a follow-up to #9733. This PR intends to export extra attributed metrics in the distributor and ingester, in order to track samples received, samples discarded, and active_series attributed by the cost attribution labels.
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.