Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Unsafe parallel write in CheckAndCorrectZeroDiagonalValues in ublas_space.h #12768

Open
rfaasse opened this issue Oct 21, 2024 · 16 comments
Labels
Kratos Core Parallel-SMP Shared memory parallelism with OpenMP or C++ Threads

Comments

@rfaasse
Copy link
Contributor

rfaasse commented Oct 21, 2024

Description
While fixing a pipeline failure regarding raw pointers pointing to invalidated memory (for more background information, see #12761, it was found that the function CheckAndCorrectZeroDiagonalValues in ublas_space.h is not threadsafe. In short, the function loops over the rows of the system matrix in parallel and changes the data of the matrix. There is the possibility that the underlying value data is re-allocated when writing to matrix entries that don't exist yet in a sparse matrix. When this happens on one thread while other threads are also reading and writing, this could lead to memory violations or undefined behavior.

Acceptance Criteria
Given a sparse system matrix A
When the CheckAndCorrectZeroDiagonalValues is run in parallel
Then there are no memory issues or undefined behavior

Detailed Background
The the function in question can be found here. The problem is found in line 737, where the matrix value rA(Index, Index) is changed. For a sparse matrix, this could lead to an insert_element into the value_data():

BOOST_UBLAS_INLINE
        const_reference operator () (size_type i, size_type j) const {
            const_pointer p = find_element (i, j);
            if (p)
                return *p;
            else
                return zero_;
        }
        BOOST_UBLAS_INLINE
        reference operator () (size_type i, size_type j) {
#ifndef BOOST_UBLAS_STRICT_MATRIX_SPARSE
            size_type element1 (layout_type::index_M (i, j));
            size_type element2 (layout_type::index_m (i, j));
            if (filled1_ <= element1 + 1)
                return insert_element (i, j, value_type/*zero*/());
            pointer p = find_element (i, j);
            if (p)
                return *p;
            else
                return insert_element (i, j, value_type/*zero*/());
#else
            return reference (*this, i, j);
#endif
        }

If because of this insert, the underlying unbounded_array<T> gets re-allocated, this will give problems when running in parallel.

Scope

  • KratosCore
@rfaasse rfaasse added Kratos Core Parallel-SMP Shared memory parallelism with OpenMP or C++ Threads labels Oct 21, 2024
@rfaasse
Copy link
Contributor Author

rfaasse commented Oct 21, 2024

For all those interested/who participated in the discussion: created the issue for the problem we found in #12761. I would suggest to try and trigger the issue with a unit test. When running the test with --gtest_repeat=N N being large enough, we should be able to consistently trigger some failures, e.g:
OMP_NUM_THREADS=16 ./KratosGeoMechanicsCoreTest --gtest_filter="*<TestName>*" --gtest_repeat=200

FYI @RiccardoRossi @roigcarlo @matekelemen @philbucher @loumalouomega @avdg81

@RiccardoRossi
Copy link
Member

hi @rfaasse i think that we need to assume that after "ConstructMatrixStructure" the structure of the matrix cannot be modified.

on my side i would either throw an error (possibly in debug only) or silently doing nothing (for example it would make sense doing nothing in the scaling of a row if that row is completely empty)

@rfaasse
Copy link
Contributor Author

rfaasse commented Oct 22, 2024

Hi @RiccardoRossi, that would indeed prevent unsafe parallel writing, but it would significantly alter the behavior of this function. If I read it correctly, the only goal of the function is to loop over the rows and if a row is empty, set the diagonal belonging to that row.

I'm not too sure what the reasoning is for this, @loumalouomega could you shed some light on this, since you're the original author of the function?

If we would need to retain the same functionality (so set the diagonal if the entire row is empty), I'd suggest to do a parallel read over the rows, store the indices for empty rows and then do a sequential, safe write to the diagonals of these (empty) rows. Since the number of write operations is then only the number of empty rows, I don't expect performance problems, but that needs to be checked of course.

If we don't need to retain the functionality, we can probably remove the function as a whole.

@loumalouomega
Copy link
Member

The main problem that I see is that despite being the B&S being templated, in fact we are only thinking in Ublas sparse matrix. Eventually this is going to be replaced with a custom CSR matrix developed by @RiccardoRossi . So I would only think on fix this if:

  1. The change to new matrix is going to take some time, I don't know if @KratosMultiphysics/technical-committee has some estimation.
  2. We are sure what behaviour we want to ensure. IMHO current implementation should be fine. As @RiccardoRossi a priori once the matrix structure is defined we should only modify those values, not add a new ones.

@RiccardoRossi
Copy link
Member

the original objective of that function was to add a non zero diagonal entry in the case there was a row of zeros. With "zeros" it was originally ment "existing entries of the matrix which have a value of zero".

The problem here is triggered in the case there is a row in which the diagonal term is not there in the graph.
This is something that should never happen in a standard FEM calculation in the sense that the matrix you obtain by assembly should always contain the diagonal.

I believe that this is happening in the practice in selected cases in which we make use of constraints, and that this is a byproduct of the (Admittedly broken) mechanism that we use to take them into account.

The current implementation in any case is wrong for two reasons:
1 - it is simply not threadsafe ... so it can break at any moment
2 - even if we manage somehow to make it threadsafe (for example by posponing empty rows to the end), the cost is enormous (generating a matrix like this would be O(N^2))

Per regards a revamp of the strategies, which will move to using the "native" CSR matrix (which BY DESIGN would simply not allow doing the insert the way is done now) @rubenzorrilla is on it, although it is currently paused due to an incoming deadline. My HOPE is to have a first demo, to be put in the "Experimental" section of the core, ready by the end of this year ...

@RiccardoRossi
Copy link
Member

clarifying my answer, the problem is that a matrix with a row of zeros has definitely zero determinant so it cannot be inverted...

@RiccardoRossi
Copy link
Member

well ... it looks like you do agree with my answer, but ... where do we go from here?

@rfaasse
Copy link
Contributor Author

rfaasse commented Oct 23, 2024

In my opinion, it makes sense if someone from the core development team picks this up, since this functionality is created and maintained in the core part of Kratos. In the previous PR, I created a quick fix for our pipelines (mainly for sequential writing) because it was hampering the geomechanics development and created this issue afterwards to document the problem as requested. Since fixing it is more fundamental, I think it makes sense if the issue is picked up by someone from the core team.

@RiccardoRossi
Copy link
Member

@KratosMultiphysics/technical-committee FYI

@loumalouomega
Copy link
Member

In my opinion, it makes sense if someone from the core development team picks this up, since this functionality is created and maintained in the core part of Kratos. In the previous PR, I created a quick fix for our pipelines (mainly for sequential writing) because it was hampering the geomechanics development and created this issue afterwards to document the problem as requested. Since fixing it is more fundamental, I think it makes sense if the issue is picked up by someone from the core team.

The oginal code is mine, if @KratosMultiphysics/technical-committee agrees I can take the lead.

@RiccardoRossi
Copy link
Member

@loumalouomega we would really appreciate it...

@loumalouomega
Copy link
Member

@rfaasse my question is, before staring to do any code, can you ensure the structure of your matrix has the diagonal terms?, a priori it will be the simplest. Because other alternatives implies cost in checking diagonal term is defined or cost in non-allocated values allocation.

@RiccardoRossi
Copy link
Member

@loumalouomega i would propose to begin by throwing a meaningful error, at least in debug mode, if the diagonal is not there.

my guess is that the error is thrown only in the tests ...

@rfaasse
Copy link
Contributor Author

rfaasse commented Oct 24, 2024

@rfaasse my question is, before staring to do any code, can you ensure the structure of your matrix has the diagonal terms?, a priori it will be the simplest. Because other alternatives implies cost in checking diagonal term is defined or cost in non-allocated values allocation.

That is hard for me to say, I don't know the details of all the possible cases we set up. Considering we ran into the issue of the re-allocation in #12761, I expect this was not always the case. But I think we'll notice it quickly enough in the github pipelines + our own pipelines if throwing an error will fail any of our tests.

If you have a branch ready, could you let me know? Then I'll make sure we run your branch in our pipelines to double-check nothing breaks before merging to master.

@loumalouomega
Copy link
Member

#12782

@matekelemen
Copy link
Contributor

matekelemen commented Oct 24, 2024

Picking up the index of the diagonal entry during a loop through the row shouldn't pose a big performance hit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Kratos Core Parallel-SMP Shared memory parallelism with OpenMP or C++ Threads
Projects
None yet
Development

No branches or pull requests

4 participants