Re: qbox core dumps in MPI on BG/Q
Posted: Tue Apr 30, 2013 7:57 pm
by naromero
We have confirmed that this is a context exhaustion problem in MPICH and that the problem is not in Qbox. I don't have an ETA for a fix in MPICH, but I will post a comment here once it is fixed. Please let me know if you encounter this on production runs on BG/Q.
Re: qbox core dumps in MPI on BG/Q
Posted: Wed Jun 05, 2013 3:42 pm
by naromero
Hi Francois,
Unfortunately, the MPICH developers that I have been working with on this issue are no longer employed by ANL, which has slowed down progress.
I have attempted to reproduce this issue with a standalone test case based on BLACS, but without success.
I have been looking through the Qbox code to understand what is going on, and I believe that I had overlooked the following sequence of communicator calls. I will write them below in pseudocode; please let me know whether this is correct or not.
0. spin_context = BlacsGrid(nprow1,npcol1), created on MPI_COMM_WORLD
1. kpt_context = Blacs(1,npcol2), created on a subcommunicator associated with spin_context
2. sd_context = BlacsGrid(1,npcol3), created on a subcommunicator associated with kpt_context
3. column_context = loop icol = 1 to npcol3 which calls BlacsGrid(nprow4,npcol4) and creates/destroys this context if icol equals mycol, created on a subcommunicator associated with sd_context
Are there any other important communicator calls that I have overlooked? One thing I had trouble following is why there seemed to be two "types" of column context created. One appeared to actually be a column, but the other appeared to be a communicator with a single rank. It is this latter context creation call that is leading to problems, as far as I can tell.
For reference purposes, the test case is a spin-paired Gamma-point calculation on a box of water molecules.
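To make the pseudocode in steps 0 to 3 concrete, here is a toy Python model of the context bookkeeping as seen from a single process. It is only a sketch under stated assumptions: `ContextPool`, `build_hierarchy`, and the pool size are hypothetical names and numbers invented for illustration, it does not call BLACS or MPI, and it does not reproduce the actual MPICH behavior. It only shows how a finite pool of context IDs behaves when the column contexts in step 3 are recycled versus held.

```python
class ContextPool:
    """Toy stand-in for a finite pool of communicator context IDs.

    Hypothetical model; the real MPICH allocator is not reproduced here.
    """

    def __init__(self, size):
        self.size = size
        self.free_ids = list(range(size))
        self.peak_in_use = 0

    def create(self):
        # Hand out one context ID, or fail if the pool is empty.
        if not self.free_ids:
            raise RuntimeError("too many communicators")
        ctx = self.free_ids.pop()
        in_use = self.size - len(self.free_ids)
        self.peak_in_use = max(self.peak_in_use, in_use)
        return ctx

    def destroy(self, ctx):
        # Return the context ID to the pool so it can be reused.
        self.free_ids.append(ctx)


def build_hierarchy(pool, npcol3, free_column_contexts=True):
    """Mimic steps 0-3 and return the peak number of context IDs in use.

    Assumption: every loop iteration consumes one context ID on this process.
    """
    spin_context = pool.create()  # step 0
    kpt_context = pool.create()   # step 1
    sd_context = pool.create()    # step 2
    held = []
    for icol in range(npcol3):    # step 3
        column_context = pool.create()
        if free_column_contexts:
            pool.destroy(column_context)
        else:
            held.append(column_context)
    return pool.peak_in_use
```

In this model, `build_hierarchy(ContextPool(16), 100)` stays at a peak of 4 IDs because each column context is returned before the next is created, whereas passing `free_column_contexts=False` raises `RuntimeError` once the 16 IDs are used up. Whether the real failure corresponds to IDs not being recycled is exactly the open question.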
Re: qbox core dumps in MPI on BG/Q
Posted: Tue Jun 25, 2013 7:34 pm
by fgygi
Nichols,
It appears that the problem comes from the constructor of the SlaterDet class. This part relies on the ability of the BLACS and MPICH libraries to create and delete many contexts and communicators. Apparently, this is not something that we can count on.
I have now developed a Qbox branch in which the use of BLACS contexts is minimized (with the hope that they will eventually be entirely eliminated). In particular, the SlaterDet constructor now avoids creating and deleting many BLACS contexts. Early tests of this branch show that it is now possible to run 32k, 64k and 128k MPI tasks on Mira without hitting the "too many communicators" issue. Further testing is in progress to verify that the code still functions correctly with this new approach.
Francois
Re: qbox core dumps in MPI on BG/Q
Posted: Wed Oct 16, 2013 6:00 pm
by naromero
Hi Francois,
Thanks for the update. I think this is the best way to proceed. If you run into the bug again, let us know. I did take a look at the SlaterDet constructor and tried to reproduce the effect in a simple C code, but never managed to do it. I think this is because my simple C code was oversimplified relative to the SlaterDet constructor. As it is used in Qbox, it seems to have a hierarchy of nested communicators (k-points -> spins -> etc.), but my simplified code only ever simulated one of these layers.
Re: qbox core dumps in MPI on BG/Q
Posted: Sun Dec 15, 2013 10:40 pm
by fgygi
The recently posted release 1.58.0 includes the new implementation of the SlaterDet class that avoids the use of BLACS in the class constructor. The above problems should therefore be resolved. Please post new messages if you find problems with this version.
Re: qbox core dumps in MPI on BG/Q
Posted: Wed Jan 29, 2014 3:52 pm
by naromero
Francois,
Thanks for addressing this issue.