Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 [BUG] - <title>Misallocation of stations to slices #1745

Closed
eyalshimony opened this issue Oct 6, 2024 · 5 comments
Closed

🐛 [BUG] - <title>Misallocation of stations to slices #1745

eyalshimony opened this issue Oct 6, 2024 · 5 comments
Labels

Comments

@eyalshimony
Copy link

Description

Some stations (or sources for adjoint simulations) are misallocated to the wrong core when they are close to the boundary. I printed some station values as an example, for core number 4 on the machine I run on. In theory, it should handle 6000<x<10000 and -10000<y<-6000.
station # 52982 located in slice 4 x: 5999.96 y: -6708.24
station # 53984 located in slice 9 x: 6708.13 y: -6000.09
station # 53985 located in slice 4 x: 6708.79 y: -5999.34
Overall, there are 3 stations that are misallocated here. One station that should be in slice 3 is allocated to slice 4, and another station that should be in slice 9 is allocated to slice 4. There is also one station that should be in slice 4 that is allocated to slice 9.
I encountered the problem when I tried to migrate my code to use SU seismograms (and adjoint sources), and as a result of the bug there is a mismatch between the number of seismograms in the .adj file and the number of receivers allocated in the corresponding slice. This causes the read error in the image.
image

Affected SPECFEM3D version

Latest Development Version

Your software and hardware environment

gcc & gfortran 9.3.1, intel mpi 2021.1, on CentOS 7

Reproduction steps

I believe that the problem can be produced by entering the following entries in STATIONS_ADJOINT:
ST1        DS      -6708.235405744787      5999.964811656147       0       0.0
ST2        DS      -6000.086675107924      6708.126406918124       0       0.0
ST3        DS      -5999.341290654987      6708.793041840081       0       0.0
With a Mesh_Par_file with the following parameters:
LATITUDE_MIN                    = -10000
LATITUDE_MAX                    = 10000
LONGITUDE_MIN                   = -10000
LONGITUDE_MAX                   = 10000
DEPTH_BLOCK_KM                  = 20.d0
UTM_PROJECTION_ZONE             = 36
SUPPRESS_UTM_PROJECTION         = .true.
NEX_XI                          = 80
NEX_ETA                         = 80
NPROC_XI                        = 5
NPROC_ETA                       = 5
And the number of processors, by extension being 25.

Screenshots

No response

Logs

At line 344 of file src/specfem3D/compute_arrays_source.f90 (unit = 47, file = 'run0002/./OUTPUT_FILES/../SEM/0_dx_SU.adj')
Fortran runtime error: End of file
Error termination. Backtrace:
At line 344 of file src/specfem3D/compute_arrays_source.f90 (unit = 47, file = 'run0002/./OUTPUT_FILES/../SEM/4_dx_SU.adj')
Fortran runtime error: End of file
Error termination. Backtrace:
#0  0x2ac9c2c4edfd in ???
#1  0x2ac9c2c4f995 in ???
#2  0x2ac9c2c5017d in ???
#3  0x2ac9c2e53bb2 in ???
#4  0x2ac9c2e57d0f in ???
#5  0x2ac9c2e53721 in ???
#0  0x2b3c438dadfd in ???
#1  0x2b3c438db995 in ???
#2  0x2b3c438dc17d in ???
#3  0x2b3c43adfbb2 in ???
#4  0x2b3c43ae3d0f in ???
#5  0x2b3c43adf721 in ???
#6  0x417629 in compute_arrays_adjoint_source_su_
	at src/specfem3D/compute_arrays_source.f90:344
#6  0x417629 in compute_arrays_adjoint_source_su_
	at src/specfem3D/compute_arrays_source.f90:344
#7  0x40d46b in compute_add_sources_viscoelastic_
	at src/specfem3D/compute_add_sources_viscoelastic.F90:225
#7  0x40d46b in compute_add_sources_viscoelastic_
	at src/specfem3D/compute_add_sources_viscoelastic.F90:225
#8  0x428f14 in compute_forces_viscoelastic_calling_
	at src/specfem3D/compute_forces_viscoelastic_calling_routine.F90:275
#8  0x428f14 in compute_forces_viscoelastic_calling_
	at src/specfem3D/compute_forces_viscoelastic_calling_routine.F90:275
#9  0x4c6716 in iterate_time_undoatt_
	at src/specfem3D/iterate_time_undoatt.F90:557
#10  0x403909 in xspecfem3d
	at src/specfem3D/specfem3D.F90:412
#11  0x403909 in main
	at src/specfem3D/specfem3D.F90:365
#9  0x4c6716 in iterate_time_undoatt_
	at src/specfem3D/iterate_time_undoatt.F90:557
#10  0x403909 in xspecfem3d
	at src/specfem3D/specfem3D.F90:412
#11  0x403909 in main
	at src/specfem3D/specfem3D.F90:365
At line 344 of file src/specfem3D/compute_arrays_source.f90 (unit = 47, file = 'run0002/./OUTPUT_FILES/../SEM/20_dx_SU.adj')
Fortran runtime error: End of file

OS

Linux

@eyalshimony eyalshimony added the bug label Oct 6, 2024
@eyalshimony
Copy link
Author

I believe I found the reason for the problem. find_local_coordinates in locate_point allows to search points also slightly outside elements (up to 1 percent difference) by "extending" it, meaning that the closest point in the element to the target point (the station's point) can actually be outside of it, leading to zero distance. Therefore, two different elements (and as a result, two different slices) can have 0 as the distance between them and the stations, leading to (normally) the first of them to be chosen in locate_MPI_slice. Because of numerical accuracy, sometimes another slice will be chosen (the search for the closest point inside each slice is stopped when a point is found such that its square distance is 10^(-10), while in locate_MPI_slice it is just a straight comparison), so the behaviour is somewhat unexpected. I hope my understanding is correct.
This 1 percent difference is a feature of the code to allow for points that lie slightly outside of the mesh. However, it can create problems as the one I described. To me it seems that the correct solution will be a preferential treatment to slices for which the elements were not extended (the absolute values of the local coordinates are under 1) in locate_MPI_slice. This information is currently not passed to it (only the distance), so for such a feature to be added this information will need to be somehow passed along with the distance.

@danielpeter
Copy link
Member

sounds possible, but this will add additional checks and affect all other simulation setups and speeds which have been fine so far. and it's still unclear to me why the neighbor slice finds a better position if they should be within an element of the one you assume.

could attach the full output files, i.e., output_meshfem3D.txt, output_generate_databases.txt and output_solver.txt for checking?

@eyalshimony
Copy link
Author

eyalshimony commented Oct 7, 2024

error_output.txt
output_generate_databases_ev2.txt
output_meshfem3D_ev2.txt
output_solver_ev1.txt
output_solver_ev2.txt
output_generate_databases_ev1.txt
output_meshfem3D_ev1.txt
Hi, I attached the output files (two each as I run two events in parallel). I also attached a file named output_error, which contains both the error the occurs and some prints I did to understand it. Note that the error occurs on the adjoint simulation, so the output_solver files are for the adjoint while the other two are for the forward.
As far as I understand it, each core works independently. As the code allows to expand the elements by 1%, it is possible that two different elements will have a distance between them and the target point of less than 10^(-10), which is the limit after which the search stops. If it happens, then it is possible that the wrong slice will be chosen, under one of the two following conditions:

  1. If the distance calculated for the wrong slice is lower than the right one (can happen if both are under 10^(-10)).
  2. If the distances are equal (probably zero), then the first slice (numerically) is chosen, so if the right one is second, the wrong one will be chosen.

You can see the first behaviour in the output_error file, where the coordinates of the "found" point in slice 4 are actually out of the slice (the y_found is -5999.3412906549866, while the maximum y for this slice is -6000). Furthermore, you can see in the file that due to the limit of 10^(-10) the distance found for slice 4 is 0, while for slice 9 (which is the slice where the point actually lies) is 9*10^(-13). Then, slice 4 is chosen, even though it is the wrong slice.

@danielpeter
Copy link
Member

thanks, you could try if the recent devel version with PR #1756 fixes your problem.

@eyalshimony
Copy link
Author

It fixes the problem, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants