
The background of this question comes from computational areas such as Computational Fluid Dynamics (CFD). We often need a finer mesh/grid in some critical regions while the background mesh can be coarser. Examples are adaptive mesh refinement to track shock waves and nested domains in meteorology.

A Cartesian topology is used and the domain decomposition is shown in the following sketch. In this case, 4*2=8 processors are used. The single number is the processor's rank and (x,y) is its topological coordinate.

Fig 1. Basic topology
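
For reference, the mapping between a rank and its (x,y) coordinate can be queried with mpi_cart_coords; a minimal self-contained sketch (for illustration only, not part of my actual code, assuming the same 4*2 layout and 8 processes) looks like:

program cart_coords_sketch
    use mpi
    implicit none
    integer :: ierr, cart_comm, my_rank, coords(2)

    call mpi_init(ierr)
    ! 4*2 Cartesian topology, non-periodic, reordering allowed
    call mpi_cart_create(MPI_COMM_WORLD, 2, [4,2], [.false., .false.], &
                         .true., cart_comm, ierr)
    call mpi_comm_rank(cart_comm, my_rank, ierr)
    ! translate the rank in cart_comm into its (x,y) coordinate
    call mpi_cart_coords(cart_comm, my_rank, 2, coords, ierr)
    print '(a,i0,a,i0,a,i0,a)', 'rank ', my_rank, ' -> (', coords(1), ',', coords(2), ')'
    call mpi_finalize(ierr)
end program cart_coords_sketch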

Assume the mesh is refined in the regions owned by ranks 2, 3, 4, 5 (in the middle), with a local refinement ratio of R = D_coarse/D_fine = 2 in this case. Since the mesh is refined, the time advancement has to be refined as well: in the refined region the time levels t, t+dt/2 and t+dt must be computed, while only t and t+dt are computed in the global region. This requires a smaller communicator that only includes the ranks in the middle for the extra computation. The global ranks + coordinates and the corresponding local ones (in red) are shown in the following sketch:

Fig 2. Local refinement with global and local ranks and topological coordinates
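
To make the sub-cycling concrete, the intended time loop would look roughly like the sketch below; advance_coarse, advance_fine and the logical in_refined_region are placeholder names for illustration only, not parts of my actual code:

program subcycle_sketch
    implicit none
    integer :: step, nsteps
    logical :: in_refined_region
    double precision :: dt

    nsteps = 10                      ! number of coarse time steps
    dt = 1.0d-3                      ! coarse time step
    in_refined_region = .true.       ! placeholder: .true. only on ranks 2,3,4,5

    do step = 1, nsteps
        ! every rank advances the coarse/background solution from t to t+dt
        call advance_coarse(dt)

        ! ranks owning the refined region take R=2 sub-steps of dt/2,
        ! i.e. t -> t+dt/2 -> t+dt
        if (in_refined_region) then
            call advance_fine(0.5d0*dt)
            call advance_fine(0.5d0*dt)
        end if
    end do

contains

    subroutine advance_coarse(dt)    ! stub for the coarse solver
        double precision, intent(in) :: dt
    end subroutine advance_coarse

    subroutine advance_fine(dt)      ! stub for the fine solver
        double precision, intent(in) :: dt
    end subroutine advance_fine

end program subcycle_sketch

In the real code the in_refined_region test is exactly where the smaller communicator comes in: only its member ranks do the extra fine steps and exchange data with each other.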

However, I get an error in the implementation of this scheme. A code snippet in Fortran (not complete) is shown below:

integer :: global_comm, local_comm   ! global and local communicators
integer :: global_rank, local_rank   ! global and local ranks
integer :: global_grp,  local_grp    ! global and local groups
integer :: ranks(4)                  ! ranks in the refined region
integer :: dim                       ! dimension index
integer :: left(-2:2), right(-2:2)   ! ranks of neighbouring processors in 2 directions
integer :: ierr                      ! MPI error status
ranks=[2,3,4,5]

!---- Make global communicator and their topological relationship
call mpi_init(ierr)
call mpi_cart_create(MPI_COMM_WORLD, 2, [4,2], [.false., .false.], .true., global_comm, ierr)
call mpi_comm_rank(global_comm, global_rank, ierr)
do dim=1, 2
    call mpi_cart_shift(global_comm, dim-1, 1, left(-dim), right(dim), ierr)
end do


!---- make local communicator and its topological relationship
! Here I use group and create communicator

! create global group
call mpi_comm_group(MPI_COMM_WORLD, global_grp, ierr)

! extract 4 ranks from global group to make a local group
call mpi_group_incl(global_grp, 4, ranks, local_grp, ierr)

! make new communicator based on local group
call mpi_comm_create(MPI_COMM_WORLD, local_grp, local_comm, ierr)

! make topology for local communicator
call mpi_cart_create(global_comm, 2, [2,2], [.false., .false.], .true., local_comm, ierr)

! **** get rank for local communicator
call mpi_comm_rank(local_comm, local_rank, ierr)

! Do the same as before to build the topological relationship in the local communicator.
 ...

When I run the program, the problem comes from the '**** get rank for local communicator' step. My idea is to build two communicators, a global one and a local one embedded in it, and then create the corresponding topological relationship in each communicator. I do not know whether my concept is wrong or some syntax is wrong. Thank you very much if you can give me some suggestions.

The error message is

*** An error occurred in MPI_Comm_rank
 *** reported by process [817692673,4]
 *** on communicator MPI_COMM_WORLD
 *** MPI_ERR_COMM: invalid communicator
 *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
 ***    and potentially your MPI job)
Yongxin
  • You get the status of the operations in `ierr`; have you tried checking these after each operation and on each rank to verify that all operations up to this point are indeed working? – d_1999 Jul 12 '17 at 15:30
  • I have always used the coarse background mesh setup (only one global communicator) before and there was no problem. When I test the two-communicator case, the program works well if I comment out everything from the `! ****` step (mpi_comm_rank for the local communicator); otherwise the error message shown above appears. I also added a check that `ierr` equals 0 after the mpi_comm_rank call, but it is never executed because the program aborts inside mpi_comm_rank with the error. – Yongxin Jul 12 '17 at 15:48
  • why do you create `local_comm` twice ? (e.g. first with `mpi_comm_create()` and then with `mpi_cart_create()`) – Gilles Gouaillardet Jul 13 '17 at 01:21
  • @GillesGouaillardet mpi_comm_create() is used to create a smaller communicator which only has 4 ranks. The middle region has to do some independent computations, so it needs communication with neighbouring processors, and a Cartesian topology is built with mpi_cart_create() in the local communicator. – Yongxin Jul 13 '17 at 15:25
  • @GillesGouaillardet Indeed, I found what the problem is: I created local_comm twice. In the `mpi_cart_create()` step, I should use `local_comm` as the old communicator to create a new Cartesian communicator. Thanks – Yongxin Jul 13 '17 at 16:21

1 Answer


You are creating a 2x2 Cartesian topology from the group of the global communicator, which contains eight ranks. Therefore, on four of them the value of local_comm returned by MPI_Cart_create will be MPI_COMM_NULL, overwriting the communicator you had just obtained from MPI_Comm_create. Calling MPI_Comm_rank on the null communicator results in the error.

If I understand your logic correctly, you should instead do something like:

if (local_comm /= MPI_COMM_NULL) then
  ! make topology for local communicator
  call mpi_cart_create(local_comm, 2, [2,2], [.false., .false.], .true., &
                     local_cart_comm, ierr)

  ! **** get rank for local communicator
  call mpi_comm_rank(local_cart_comm, local_rank, ierr)

  ...
end if
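
For completeness, the same kind of sub-communicator can also be obtained without groups via mpi_comm_split, using MPI_UNDEFINED as the color for the ranks outside the refined region; a minimal sketch, reusing the ranks array from the question, could look like:

integer :: world_rank, color

call mpi_comm_rank(MPI_COMM_WORLD, world_rank, ierr)
! color 1 for the four refined-region ranks, MPI_UNDEFINED elsewhere;
! mpi_comm_split then returns MPI_COMM_NULL on the excluded ranks
if (any(ranks == world_rank)) then
  color = 1
else
  color = MPI_UNDEFINED
end if
call mpi_comm_split(MPI_COMM_WORLD, color, 0, local_comm, ierr)

followed by the same mpi_cart_create on local_comm as above.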
Hristo 'away' Iliev
  • Thanks. Indeed, the step that makes the local communicator from the local group gives `MPI_COMM_NULL` for `local_comm` on 4 ranks. If I directly call `mpi_cart_create(local_comm, 2, [2,2], [.false., .false.], .true., local_cart_comm, ierr)` it cannot be executed, so I use an `if (local_comm /= MPI_COMM_NULL)` condition before `mpi_cart_create()` to make `local_cart_comm`. – Yongxin Jul 13 '17 at 15:51
  • Of course, I fell for the same mistake I described in the first paragraph :) – Hristo 'away' Iliev Jul 13 '17 at 20:21