Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find by jobid and output aggregate filter data #1322

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

milroy
Copy link
Member

@milroy milroy commented Dec 18, 2024

Errors in partial cancel are causing inability to schedule resources in production. While evidence points to orphaned (i.e., not cleared) vertex data such as aggregate filters, it is difficult to investigate and debug without being able to inspect the state of the vertices.

The PR adds capability to print vertex-specific data by jobid and an agfilter boolean. The state can be printed based on whether the jobid is in the allocations, reservations, job2span, and tags maps.

This PR is WIP because I need to add support for all writers.

Copy link
Member

@jameshcorbett jameshcorbett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just gave it a brief look-through, how are you feeling about getting this in by Tuesday which is I think the deadline for getting the RPM into the next TOSS version?

resource/planner/c/planner_c_interface.cpp Show resolved Hide resolved
@milroy
Copy link
Member Author

milroy commented Jan 13, 2025

how are you feeling about getting this in by Tuesday which is I think the deadline for getting the RPM into the next TOSS version?

I'd lie to have @trws feedback before merging this PR.

Problem: Partial cancellation is failing under rabbit-related
circumstances, possibly orphaning job-related data on allocated
vertices. Having a way to find all vertices with jobid-related data
would be very useful in diagnosing the problem. Furthermore, it is
likely the partial cancellation failures are related to aggregate
filter (pruning filter) counts.

Add capability to the evaulator to validate and evaluate by jobid, span,
tag, reservation, allocation, and agfilter.
Problem: a jobid and aggregate filter boolean are now supported in the
evaluator and validator. The evaluator needs support for returning the
parsed jobid and agfilter bool to the traverser, so it can change
behavior based on whether a jobid or agfilter boolean is present in the
evaluation string.

Add the capability to extract the jobid as a uint64_t and agfilter as a
bool and return them to the client.
Problem: the evaluator unit tests do not support extraction and the new
options.

Add support in the tests and test cases for the new options.
Problem: the issues with partial cancel have exposed the need for
printing the state of resource graph vertices based on the jobid and
aggregate filter.

Add traverser support for gathering agfilter utilization and total
counts for all resources via updated function calls.
Problem: a planner_multi C API function that returns the number of
resources utilized by a jobid by resource type is needed to write out
aggregate filter counts.

Add a new C API function to planner_multi that returns the number of
resources utilized by a jobid by calling the planner C API function for
the specified resource type index.
Problem: the writer needs support for outputting the aggregate filter
counts by type.

Add the capability to write out the aggregate filter state by passing
the data to map2json.
Problem: after extracting the jobid and/or agfilter boolean with the
evaluator, the traverser needs to fetch the aggregate filter counts by
jobid (if the ID is present in the find string) or total utilization.

Add conditional behavior to fetch appropriate aggregate filter data
depending on the presence of jobid and the value of the agfilter
boolean.
Problem: there is a typo in the planner C API function to reduce a span.

Fix the typo.
Problem: the testsuite doesn't have tests for the new find behavior

Add tests.
@milroy milroy changed the title [WIP] Find by jobid and output aggregate filter data Find by jobid and output aggregate filter data Jan 14, 2025
@milroy milroy requested a review from trws January 14, 2025 03:51
@milroy
Copy link
Member Author

milroy commented Jan 14, 2025

I updated the PR to include an alternative method of extracting (p, x) pairs of strings (in the spirit of what you suggested, @trws, and included that approach as a series of fixup commits. Those fixups can be squashed depending on which approach is better.

This PR is no longer WIP and is ready for review.

Copy link

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 86.30137% with 30 lines in your changes missing coverage. Please review.

Project coverage is 75.5%. Comparing base (5ae7459) to head (7d3b079).

Files with missing lines Patch % Lines
resource/evaluators/expr_eval_api.cpp 75.5% 12 Missing ⚠️
resource/traversers/dfu_impl.cpp 85.1% 7 Missing ⚠️
resource/planner/c/planner_multi_c_interface.cpp 54.5% 5 Missing ⚠️
resource/evaluators/expr_eval_vtx_target.cpp 91.3% 4 Missing ⚠️
resource/writers/match_writers.cpp 71.4% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1322     +/-   ##
========================================
+ Coverage    75.3%   75.5%   +0.2%     
========================================
  Files         111     111             
  Lines       16042   16253    +211     
========================================
+ Hits        12081   12276    +195     
- Misses       3961    3977     +16     
Files with missing lines Coverage Δ
resource/evaluators/expr_eval_target.hpp 100.0% <ø> (ø)
resource/evaluators/expr_eval_vtx_target.hpp 100.0% <ø> (ø)
resource/evaluators/test/expr_eval_test01.cpp 100.0% <100.0%> (ø)
resource/planner/c/planner_c_interface.cpp 74.3% <ø> (ø)
resource/traversers/dfu_impl.hpp 94.5% <ø> (ø)
resource/traversers/dfu_impl_update.cpp 75.3% <100.0%> (+0.1%) ⬆️
resource/writers/match_writers.hpp 69.2% <ø> (ø)
resource/writers/match_writers.cpp 59.7% <71.4%> (-0.1%) ⬇️
resource/evaluators/expr_eval_vtx_target.cpp 90.6% <91.3%> (+0.4%) ⬆️
resource/planner/c/planner_multi_c_interface.cpp 61.4% <54.5%> (-0.1%) ⬇️
... and 2 more

... and 1 file with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants