[Core] Unsafe parallel write in CheckAndCorrectZeroDiagonalValues in ublas_space.h #12768

rfaasse · 2024-10-21T07:49:15Z

Description
While fixing a pipeline failure regarding raw pointers pointing to invalidated memory (for more background information, see #12761, it was found that the function CheckAndCorrectZeroDiagonalValues in ublas_space.h is not threadsafe. In short, the function loops over the rows of the system matrix in parallel and changes the data of the matrix. There is the possibility that the underlying value data is re-allocated when writing to matrix entries that don't exist yet in a sparse matrix. When this happens on one thread while other threads are also reading and writing, this could lead to memory violations or undefined behavior.

Acceptance Criteria
Given a sparse system matrix A
When the CheckAndCorrectZeroDiagonalValues is run in parallel
Then there are no memory issues or undefined behavior

Detailed Background
The the function in question can be found here. The problem is found in line 737, where the matrix value rA(Index, Index) is changed. For a sparse matrix, this could lead to an insert_element into the value_data():

BOOST_UBLAS_INLINE
        const_reference operator () (size_type i, size_type j) const {
            const_pointer p = find_element (i, j);
            if (p)
                return *p;
            else
                return zero_;
        }
        BOOST_UBLAS_INLINE
        reference operator () (size_type i, size_type j) {
#ifndef BOOST_UBLAS_STRICT_MATRIX_SPARSE
            size_type element1 (layout_type::index_M (i, j));
            size_type element2 (layout_type::index_m (i, j));
            if (filled1_ <= element1 + 1)
                return insert_element (i, j, value_type/*zero*/());
            pointer p = find_element (i, j);
            if (p)
                return *p;
            else
                return insert_element (i, j, value_type/*zero*/());
#else
            return reference (*this, i, j);
#endif
        }

If because of this insert, the underlying unbounded_array<T> gets re-allocated, this will give problems when running in parallel.

Scope

KratosCore

The text was updated successfully, but these errors were encountered:

rfaasse · 2024-10-21T07:57:23Z

For all those interested/who participated in the discussion: created the issue for the problem we found in #12761. I would suggest to try and trigger the issue with a unit test. When running the test with --gtest_repeat=N N being large enough, we should be able to consistently trigger some failures, e.g:
OMP_NUM_THREADS=16 ./KratosGeoMechanicsCoreTest --gtest_filter="*<TestName>*" --gtest_repeat=200

FYI @RiccardoRossi @roigcarlo @matekelemen @philbucher @loumalouomega @avdg81

RiccardoRossi · 2024-10-21T09:59:06Z

hi @rfaasse i think that we need to assume that after "ConstructMatrixStructure" the structure of the matrix cannot be modified.

on my side i would either throw an error (possibly in debug only) or silently doing nothing (for example it would make sense doing nothing in the scaling of a row if that row is completely empty)

rfaasse · 2024-10-22T14:51:05Z

Hi @RiccardoRossi, that would indeed prevent unsafe parallel writing, but it would significantly alter the behavior of this function. If I read it correctly, the only goal of the function is to loop over the rows and if a row is empty, set the diagonal belonging to that row.

I'm not too sure what the reasoning is for this, @loumalouomega could you shed some light on this, since you're the original author of the function?

If we would need to retain the same functionality (so set the diagonal if the entire row is empty), I'd suggest to do a parallel read over the rows, store the indices for empty rows and then do a sequential, safe write to the diagonals of these (empty) rows. Since the number of write operations is then only the number of empty rows, I don't expect performance problems, but that needs to be checked of course.

If we don't need to retain the functionality, we can probably remove the function as a whole.

loumalouomega · 2024-10-22T15:08:55Z

The main problem that I see is that despite being the B&S being templated, in fact we are only thinking in Ublas sparse matrix. Eventually this is going to be replaced with a custom CSR matrix developed by @RiccardoRossi . So I would only think on fix this if:

The change to new matrix is going to take some time, I don't know if @KratosMultiphysics/technical-committee has some estimation.
We are sure what behaviour we want to ensure. IMHO current implementation should be fine. As @RiccardoRossi a priori once the matrix structure is defined we should only modify those values, not add a new ones.

RiccardoRossi · 2024-10-23T06:55:55Z

the original objective of that function was to add a non zero diagonal entry in the case there was a row of zeros. With "zeros" it was originally ment "existing entries of the matrix which have a value of zero".

The problem here is triggered in the case there is a row in which the diagonal term is not there in the graph.
This is something that should never happen in a standard FEM calculation in the sense that the matrix you obtain by assembly should always contain the diagonal.

I believe that this is happening in the practice in selected cases in which we make use of constraints, and that this is a byproduct of the (Admittedly broken) mechanism that we use to take them into account.

The current implementation in any case is wrong for two reasons:
1 - it is simply not threadsafe ... so it can break at any moment
2 - even if we manage somehow to make it threadsafe (for example by posponing empty rows to the end), the cost is enormous (generating a matrix like this would be O(N^2))

Per regards a revamp of the strategies, which will move to using the "native" CSR matrix (which BY DESIGN would simply not allow doing the insert the way is done now) @rubenzorrilla is on it, although it is currently paused due to an incoming deadline. My HOPE is to have a first demo, to be put in the "Experimental" section of the core, ready by the end of this year ...

RiccardoRossi · 2024-10-23T08:06:38Z

clarifying my answer, the problem is that a matrix with a row of zeros has definitely zero determinant so it cannot be inverted...

RiccardoRossi · 2024-10-23T12:34:01Z

well ... it looks like you do agree with my answer, but ... where do we go from here?

rfaasse · 2024-10-23T13:56:54Z

In my opinion, it makes sense if someone from the core development team picks this up, since this functionality is created and maintained in the core part of Kratos. In the previous PR, I created a quick fix for our pipelines (mainly for sequential writing) because it was hampering the geomechanics development and created this issue afterwards to document the problem as requested. Since fixing it is more fundamental, I think it makes sense if the issue is picked up by someone from the core team.

RiccardoRossi · 2024-10-23T14:04:03Z

@KratosMultiphysics/technical-committee FYI

loumalouomega · 2024-10-23T14:14:12Z

In my opinion, it makes sense if someone from the core development team picks this up, since this functionality is created and maintained in the core part of Kratos. In the previous PR, I created a quick fix for our pipelines (mainly for sequential writing) because it was hampering the geomechanics development and created this issue afterwards to document the problem as requested. Since fixing it is more fundamental, I think it makes sense if the issue is picked up by someone from the core team.

The oginal code is mine, if @KratosMultiphysics/technical-committee agrees I can take the lead.

RiccardoRossi · 2024-10-23T14:23:11Z

@loumalouomega we would really appreciate it...

loumalouomega · 2024-10-24T07:36:19Z

@rfaasse my question is, before staring to do any code, can you ensure the structure of your matrix has the diagonal terms?, a priori it will be the simplest. Because other alternatives implies cost in checking diagonal term is defined or cost in non-allocated values allocation.

RiccardoRossi · 2024-10-24T07:51:57Z

@loumalouomega i would propose to begin by throwing a meaningful error, at least in debug mode, if the diagonal is not there.

my guess is that the error is thrown only in the tests ...

rfaasse · 2024-10-24T09:50:13Z

@rfaasse my question is, before staring to do any code, can you ensure the structure of your matrix has the diagonal terms?, a priori it will be the simplest. Because other alternatives implies cost in checking diagonal term is defined or cost in non-allocated values allocation.

That is hard for me to say, I don't know the details of all the possible cases we set up. Considering we ran into the issue of the re-allocation in #12761, I expect this was not always the case. But I think we'll notice it quickly enough in the github pipelines + our own pipelines if throwing an error will fail any of our tests.

If you have a branch ready, could you let me know? Then I'll make sure we run your branch in our pipelines to double-check nothing breaks before merging to master.

loumalouomega · 2024-10-24T12:01:50Z

#12782

matekelemen · 2024-10-24T12:12:07Z

Picking up the index of the diagonal entry during a loop through the row shouldn't pose a big performance hit.

rfaasse added Kratos Core Parallel-SMP Shared memory parallelism with OpenMP or C++ Threads labels Oct 21, 2024

loumalouomega mentioned this issue Oct 24, 2024

[Core] Adding debug check in CheckAndCorrectZeroDiagonalValues #12782

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Core] Unsafe parallel write in CheckAndCorrectZeroDiagonalValues in ublas_space.h #12768

[Core] Unsafe parallel write in CheckAndCorrectZeroDiagonalValues in ublas_space.h #12768

rfaasse commented Oct 21, 2024

rfaasse commented Oct 21, 2024

RiccardoRossi commented Oct 21, 2024

rfaasse commented Oct 22, 2024

loumalouomega commented Oct 22, 2024

RiccardoRossi commented Oct 23, 2024

RiccardoRossi commented Oct 23, 2024

RiccardoRossi commented Oct 23, 2024

rfaasse commented Oct 23, 2024

RiccardoRossi commented Oct 23, 2024

loumalouomega commented Oct 23, 2024

RiccardoRossi commented Oct 23, 2024

loumalouomega commented Oct 24, 2024

RiccardoRossi commented Oct 24, 2024

rfaasse commented Oct 24, 2024

loumalouomega commented Oct 24, 2024

matekelemen commented Oct 24, 2024 •

edited

Loading

[Core] Unsafe parallel write in CheckAndCorrectZeroDiagonalValues in ublas_space.h #12768

[Core] Unsafe parallel write in CheckAndCorrectZeroDiagonalValues in ublas_space.h #12768

Comments

rfaasse commented Oct 21, 2024

rfaasse commented Oct 21, 2024

RiccardoRossi commented Oct 21, 2024

rfaasse commented Oct 22, 2024

loumalouomega commented Oct 22, 2024

RiccardoRossi commented Oct 23, 2024

RiccardoRossi commented Oct 23, 2024

RiccardoRossi commented Oct 23, 2024

rfaasse commented Oct 23, 2024

RiccardoRossi commented Oct 23, 2024

loumalouomega commented Oct 23, 2024

RiccardoRossi commented Oct 23, 2024

loumalouomega commented Oct 24, 2024

RiccardoRossi commented Oct 24, 2024

rfaasse commented Oct 24, 2024

loumalouomega commented Oct 24, 2024

matekelemen commented Oct 24, 2024 • edited Loading

matekelemen commented Oct 24, 2024 •

edited

Loading