[Core] Unsafe parallel write in CheckAndCorrectZeroDiagonalValues in ublas_space.h #12768
Comments
For all those interested or who participated in the discussion: I created this issue for the problem we found in #12761. I would suggest trying to trigger the issue with a unit test. When running the test with
FYI @RiccardoRossi @roigcarlo @matekelemen @philbucher @loumalouomega @avdg81
Hi @rfaasse, I think we need to assume that after "ConstructMatrixStructure" the structure of the matrix cannot be modified. On my side, I would either throw an error (possibly in debug only) or silently do nothing (for example, it would make sense to do nothing when scaling a row if that row is completely empty).
Hi @RiccardoRossi, that would indeed prevent unsafe parallel writing, but it would significantly alter the behavior of this function. If I read it correctly, the only goal of the function is to loop over the rows and, if a row is empty, set the diagonal belonging to that row. I'm not too sure what the reasoning for this is. @loumalouomega, could you shed some light on this, since you're the original author of the function? If we need to retain the same functionality (so set the diagonal if the entire row is empty), I'd suggest doing a parallel read over the rows, storing the indices of empty rows, and then doing a sequential, safe write to the diagonals of these (empty) rows. Since the number of write operations is then only the number of empty rows, I don't expect performance problems, but that needs to be checked of course. If we don't need to retain the functionality, we can probably remove the function as a whole.
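The two-phase idea above (parallel read, then sequential write) could look roughly like the sketch below. It is not Kratos code: `SparseRows`, `FindEmptyRows` and `SetDiagonalsSequentially` are hypothetical names, the matrix is a simple vector of per-row maps, and a row is assumed to count as "empty" when it holds no non-zero stored value. Because each row owns its storage in this stand-in, it only illustrates the control flow, not the shared-storage hazard of the real sparse matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a sparse matrix: one (column -> value) map per row.
using SparseRows = std::vector<std::map<std::size_t, double>>;

// Phase 1: read-only scan of the matrix. Each iteration writes only to its
// own flag, so the loop can run in parallel when OpenMP is enabled.
std::vector<std::size_t> FindEmptyRows(const SparseRows& rA)
{
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(rA.size());
    std::vector<char> is_empty(rA.size(), 0); // char, because vector<bool> packs bits

#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        bool all_zero = true;
        for (const auto& entry : rA[row]) {
            if (entry.second != 0.0) { all_zero = false; break; }
        }
        is_empty[row] = all_zero ? 1 : 0;
    }

    // Collect the flagged rows sequentially, in ascending order.
    std::vector<std::size_t> empty_rows;
    for (std::size_t row = 0; row < is_empty.size(); ++row) {
        if (is_empty[row]) empty_rows.push_back(row);
    }
    return empty_rows;
}

// Phase 2: sequential write. An insertion may allocate, which is safe here
// because no other thread touches the matrix at this point.
void SetDiagonalsSequentially(SparseRows& rA,
                              const std::vector<std::size_t>& rRows,
                              const double Value)
{
    for (const std::size_t row : rRows) {
        assert(row < rA.size());
        rA[row][row] = Value;
    }
}
```

The number of sequential writes equals the number of flagged rows, which is the basis for the expectation that the cost stays small; that would still need to be measured on real cases.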
The main problem I see is that, despite the B&S being templated, in practice we only consider the Ublas sparse matrix. Eventually this is going to be replaced with a custom CSR matrix developed by @RiccardoRossi. So I would only think about fixing this if:
The original objective of that function was to add a non-zero diagonal entry in case there was a row of zeros. "Zeros" originally meant "existing entries of the matrix which have a value of zero". The problem here is triggered when there is a row in which the diagonal term is not present in the graph. I believe that this happens in practice in selected cases in which we make use of constraints, and that it is a byproduct of the (admittedly broken) mechanism that we use to take them into account. The current implementation is in any case wrong for two reasons:
Regarding a revamp of the strategies, which will move to using the "native" CSR matrix (which BY DESIGN would simply not allow doing the insert the way it is done now): @rubenzorrilla is on it, although it is currently paused due to an incoming deadline. My HOPE is to have a first demo, to be put in the "Experimental" section of the core, ready by the end of this year ...
Clarifying my answer: the problem is that a matrix with a row of zeros definitely has a zero determinant, so it cannot be inverted...
Well ... it looks like you do agree with my answer, but ... where do we go from here?
In my opinion, it makes sense if someone from the core development team picks this up, since this functionality is created and maintained in the core part of Kratos. In the previous PR, I created a quick fix for our pipelines (mainly for sequential writing) because it was hampering the geomechanics development, and I created this issue afterwards to document the problem as requested. Since fixing it properly is more fundamental, I think the issue should be picked up by someone from the core team.
@KratosMultiphysics/technical-committee FYI
The original code is mine; if @KratosMultiphysics/technical-committee agrees, I can take the lead.
@loumalouomega we would really appreciate it...
@rfaasse my question is: before starting to write any code, can you ensure that the structure of your matrix has the diagonal terms? A priori, that would be the simplest solution, because the other alternatives imply a cost, either in checking that the diagonal term is defined or in allocating values that are not yet allocated.
@loumalouomega I would propose to begin by throwing a meaningful error, at least in debug mode, if the diagonal is not there. My guess is that the error is thrown only in the tests ...
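A check of that kind might look like the following sketch. It assumes a minimal CSR layout; `CsrMatrix`, `HasDiagonalEntry` and `CheckDiagonalIsInGraph` are hypothetical names for illustration, not existing Kratos types or functions, and the exception type is just a standard-library placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical minimal CSR matrix (not a Kratos type).
struct CsrMatrix {
    std::size_t size;                 // number of rows
    std::vector<std::size_t> row_ptr; // size + 1 offsets into col_idx / values
    std::vector<std::size_t> col_idx; // column index of each stored entry
    std::vector<double> values;       // value of each stored entry
};

// Returns true if the graph of row `Row` contains the diagonal column.
bool HasDiagonalEntry(const CsrMatrix& rA, const std::size_t Row)
{
    assert(Row < rA.size);
    for (std::size_t k = rA.row_ptr[Row]; k < rA.row_ptr[Row + 1]; ++k) {
        if (rA.col_idx[k] == Row) return true;
    }
    return false;
}

// Throws with a descriptive message for the first row whose diagonal is
// missing from the graph. Read-only, so it never changes the structure.
void CheckDiagonalIsInGraph(const CsrMatrix& rA)
{
    for (std::size_t row = 0; row < rA.size; ++row) {
        if (!HasDiagonalEntry(rA, row)) {
            throw std::logic_error(
                "Row " + std::to_string(row) +
                " has no diagonal entry in the matrix graph; the structure "
                "must contain the diagonal before values are corrected.");
        }
    }
}
```

To restrict the check to debug builds, the call site could be wrapped in `#ifndef NDEBUG ... #endif`, or in whatever debug switch the project already uses.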
That is hard for me to say; I don't know the details of all the possible cases we set up. Considering we ran into the issue of the re-allocation in #12761, I expect this was not always the case. But I think we'll notice it quickly enough in the GitHub pipelines and our own pipelines if throwing an error makes any of our tests fail. If you have a branch ready, could you let me know? Then I'll make sure we run your branch in our pipelines to double-check nothing breaks before merging to master.
Picking up the index of the diagonal entry during a loop through the row shouldn't pose a big performance hit. |
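As a rough illustration of that single pass, the sketch below records the storage position of the diagonal while checking whether the row holds only zeros, and then writes in place. Everything here is an assumption for illustration: `CsrMatrix`, `RowScan`, `ScanRow`, `CorrectZeroRows` and `kNoDiagonal` are hypothetical names, and rows whose diagonal is missing from the graph are only counted, never inserted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR matrix (not a Kratos type).
struct CsrMatrix {
    std::size_t size;                 // number of rows
    std::vector<std::size_t> row_ptr; // size + 1 offsets into col_idx / values
    std::vector<std::size_t> col_idx; // column index of each stored entry
    std::vector<double> values;       // value of each stored entry
};

// Marker for "no diagonal entry stored in this row".
constexpr std::size_t kNoDiagonal = static_cast<std::size_t>(-1);

struct RowScan {
    bool all_zero;            // true if every stored value in the row is zero
    std::size_t diagonal_pos; // position of the diagonal in `values`, or kNoDiagonal
};

// One pass over the row: checks the values and picks up the diagonal position.
RowScan ScanRow(const CsrMatrix& rA, const std::size_t Row)
{
    assert(Row < rA.size);
    RowScan result{true, kNoDiagonal};
    for (std::size_t k = rA.row_ptr[Row]; k < rA.row_ptr[Row + 1]; ++k) {
        if (rA.values[k] != 0.0) result.all_zero = false;
        if (rA.col_idx[k] == Row) result.diagonal_pos = k;
    }
    return result;
}

// Sets the diagonal of all-zero rows in place and returns the number of
// all-zero rows whose diagonal is not part of the graph (left untouched).
std::size_t CorrectZeroRows(CsrMatrix& rA, const double Value)
{
    std::size_t missing_diagonals = 0;
    for (std::size_t row = 0; row < rA.size; ++row) {
        const RowScan scan = ScanRow(rA, row);
        if (!scan.all_zero) continue;
        if (scan.diagonal_pos == kNoDiagonal) {
            ++missing_diagonals;
            continue;
        }
        rA.values[scan.diagonal_pos] = Value; // existing slot, no allocation
    }
    return missing_diagonals;
}
```

Since each row writes only to its own existing slot, the outer loop could in principle be parallelized without touching the matrix structure; whether the extra comparison per entry is measurable would have to be benchmarked.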
Description
While fixing a pipeline failure regarding raw pointers pointing to invalidated memory (for more background information, see #12761), it was found that the function CheckAndCorrectZeroDiagonalValues in ublas_space.h is not thread-safe. In short, the function loops over the rows of the system matrix in parallel and changes the data of the matrix. The underlying value data may be re-allocated when writing to matrix entries that don't exist yet in a sparse matrix. When this happens on one thread while other threads are also reading and writing, it can lead to memory violations or undefined behavior.
Acceptance Criteria
Given a sparse system matrix A
When CheckAndCorrectZeroDiagonalValues is run in parallel
Then there are no memory issues or undefined behavior
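The re-allocation hazard from the Description can be illustrated with a small analogy. The sketch below uses `std::vector<double>` as a stand-in for the matrix value storage, which is an assumption made purely for illustration (the real container is different), and `AppendMovesStorage` is a hypothetical helper. It shows that growing a full contiguous buffer moves it to a new address, so any pointer or reference another thread still holds into the old buffer would dangle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if appending one element to a full vector moved its storage.
bool AppendMovesStorage(std::vector<double>& rValues)
{
    // Fill up to the current capacity so the next push_back has to reallocate.
    while (rValues.size() < rValues.capacity()) rValues.push_back(0.0);
    assert(rValues.size() == rValues.capacity());

    // Store the address as an integer so the stale pointer is never used.
    const auto before = reinterpret_cast<std::uintptr_t>(rValues.data());
    rValues.push_back(1.0); // exceeds the capacity: new buffer, old one freed
    const auto after = reinterpret_cast<std::uintptr_t>(rValues.data());

    return before != after;
}
```

In a single thread this is harmless because the container updates its own pointer. The problem only appears when other threads read or write through the old address during the move.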
Detailed Background
The function in question can be found here. The problem is found in line 737, where the matrix value rA(Index, Index) is changed. For a sparse matrix, this could lead to an insert_element into the value_data(). If, because of this insert, the underlying unbounded_array<T> gets re-allocated, this will cause problems when running in parallel.
Scope