Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reformat procfs scraping to handle errors #1178

Merged
merged 4 commits into from
Jan 7, 2025

Conversation

cmetz100
Copy link
Contributor

@cmetz100 cmetz100 commented Dec 19, 2024

What does this PR do?

Extracts all the per-pid operations into a helper function so that if we cant read something for a given pid it doesnt blow up the whole observer.

Also removes process information (which we retain over poll calls) for pids not seen in the current poll call (pids that have exited).

Motivation

Processes exiting mid poll loop can cause numerous errors, all of which, if not handled properly, polute the collection of data from active processes.

Related issues

https://datadoghq.atlassian.net/browse/SMPTNG-574

First attempts at this in #1175 and #1142
After some discussion we thought it better to just catch all per-pid related errors rather than keeping up with new errors on individual functions related to collecting per process information.

Validating the change in this notebook

Signed-off-by: Caleb Metz <[email protected]>
@cmetz100 cmetz100 marked this pull request as ready for review January 3, 2025 18:48
@cmetz100 cmetz100 requested a review from a team as a code owner January 3, 2025 18:48
&mut self,
process: Process,
aggr: &mut memory::smaps_rollup::Aggregator,
) -> Result<(), Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any codepath that returns an error? I'm not seeing any.

My thought here is that we need to explicitly handle errors when reading process information, ie, avoid the ? operator. So if we make this process not have a return type, we could avoid any accidental error handling that uses ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm yeah I think I see what you are saying. I chose this implementation specifically to handle ? without having to explicitly handle errors. Mostly just because the proc_exe(), proc_comm(), and proc_cmdline() all can error if the pid had exited and I thought it was cleaner to just blanket catch any of those failures rather then add messiness catching different failures that are all resulting from the same root cause (the pid exiting)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very open to changing this tho to catch every error explicitly, just thought this may be easier to add other code later in handle_process without having to risk observer crashes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proc_exe(), proc_comm(), and proc_cmdline() all can error if the pid had exited and I thought it was cleaner to just blanket catch any of those failures rather then add messiness catching different failures that are all resulting from the same root cause (the pid exiting)

What makes these failures different from, eg process.status() failing?

In my understanding, any and all errors that could be caused by a process exiting during execution of handle_process should be explicitly caught and handled by ignoring and return/return Ok(()).

If there are errors that can occur that are not caused by the process exiting early, then those would be a valid reason to bubble an error out of this function, but I don't think process-exit-errors should escape this function, otherwise what is the point of extracting into handle_process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah I see what you are saying. It is the same type of failure as process.status(). I will update the code to catch what I expect to be specific "pid-exit" errors. The only exception to that I see is the uptime poll since that is more broad than a specific process

}
}
// END pid loop
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop is now short enough you could cut these.

Comment on lines +121 to +122
// Update the process_info map to only hold processes seen by the current poll call.
self.process_info.retain(|pid, _| pids.contains(pid));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could potentially be a separate PR, since it's technically out of scope for the change description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it fits within the spirit of handling pids that exit I added a line to the change description

Signed-off-by: Caleb Metz <[email protected]>
@cmetz100 cmetz100 merged commit 03f26fa into main Jan 7, 2025
17 checks passed
@cmetz100 cmetz100 deleted the cmetz/extract_procfs_per_process_ops branch January 7, 2025 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants