
MPI_Abort errors on different node counts for Cori KNL

Posted: Wed Jun 14, 2017 3:55 am
by tbbishop
Hi, I am new to Qbox, and I am running MD calculations on Cori KNL at NERSC for different numbers of nodes for benchmarking. I am using the Qbox module they provide, version 1.63.5. First, I do a ground state calculation that works fine most of the time, but on five nodes it gives an error:

Code: Select all

Rank 137 [Tue Jun 13 17:51:05 2017] [c11-4c0s0n2] application called MPI_Abort(MPI_COMM_WORLD, 2) - process 137
and many more after that. My input file is:

Code: Select all

set nrowmax 64
SnO.sys
set ecut 20
set xc PBE
set wf_dyn PSDA
randomize_wf
set scf_tol 1.e-8
run -atomic_density 0 20 20
save gs.xml
and SnO.sys is

Code: Select all

set cell 68.39864845 0 0 0 62.19542973 0 0 0 30
species tin Sn_ONCV_PBE-1.1.xml
species oxygen O_ONCV_PBE-1.0.xml
atom Sn_000 tin 0.0000000000 3.4553016519 7.2387095421
atom Sn_001 tin 3.7999249137 0.0000000000 11.6731783381
atom O_000 oxygen 0.0000000000 0.0000000000 9.6417027165
atom O_001 oxygen 3.7999249137 3.4553016519 9.2701976498
... it has 324 atoms.

In the output file, the <eigenvalue_sum> values fluctuate a lot; grep-ing for them gives

Code: Select all

  <eigenvalue_sum>  137553.601065 </eigenvalue_sum>
  <eigenvalue_sum>  -780899.179023 </eigenvalue_sum>
  <eigenvalue_sum>  -28447.778683 </eigenvalue_sum>
  <eigenvalue_sum>  -39040.139854 </eigenvalue_sum>
  <eigenvalue_sum>  134230.029446 </eigenvalue_sum>
  <eigenvalue_sum>  -4863865.300754 </eigenvalue_sum>
  <eigenvalue_sum>  -2070989.432315 </eigenvalue_sum>
There is more before that (too much to post; this is just the end), and the file ends with many lines of:

Code: Select all

DoubleMatrix::potrf, info=334
Subsequent MD calculations fail on all node counts too. The input is:

Code: Select all

set nrowmax 64
load gs.xml
set xc PBE
set wf_dyn PSDA
set scf_tol 1.e-6
set atoms_dyn MD
set dt 60
randomize_v 600
run 10 10
save md4.xml
And the end of the output is:

Code: Select all

<unit_cell_volume> 127622.500 </unit_cell_volume>
  <econst> inf </econst>
  <ekin_ion> 0.89641798 </ekin_ion>
  <temp_ion> 582.46993560 </temp_ion>
  total_electronic_charge: 3240.00000000
  <eigenvalue_sum>  -nan </eigenvalue_sum>
 DoubleMatrix::potrf, info=1
...
When I do this for a small cell of 4 atoms, everything works fine. Hopefully you will know how to fix this.
Thank you in advance for any help,
-Tyler Bishop

Re: MPI_Abort errors on different node counts for Cori KNL

Posted: Wed Jun 14, 2017 5:07 pm
by fgygi
Hi Tyler,
This error is caused by an instability of the PSDA algorithm in the ground state calculation when starting from an unfavorable initial wave function. The growth of oscillations during iterations causes a failure of the orthogonalization of the orbitals, leading to the crash. This sometimes happens in large unit cells when starting with randomized wave functions. The random numbers used to initialize the wave functions depend on the particular configuration of tasks (i.e. the total number of tasks and the value of nrowmax) because the random number generators are initialized differently on each node. For this reason, a calculation may not converge with one choice of nrowmax yet converge with another choice. This is obviously not an acceptable outcome, and we are working to improve this situation.

The JD algorithm (set wf_dyn JD) is more conservative than the PSDA algorithm, and it can be better when starting a ground state calculation in a large unit cell. The PSDA algorithm may then be used later in MD simulations when wave functions remain close to the ground state.
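
As a rough sketch of that two-stage approach (reusing only commands that appear elsewhere in this thread; file names and iteration counts are illustrative, not a tested recipe):

Code: Select all

# ground-state run: start from randomized wave functions with the more conservative JD solver
set wf_dyn JD
randomize_wf
run -atomic_density 0 50 10
save gs.xml
# subsequent MD run: wave functions remain close to the ground state, so PSDA can be used
load gs.xml
set wf_dyn PSDA
set atoms_dyn MD
run 10 10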

I have reproduced the problem you observed using 5 nodes on Cori, assuming a 324-atom 9x9x1 supercell of SnO. After modifying the value of nrowmax and using the JD algorithm, I was able to get the ground state calculation to converge.

I used the following job script:

Code: Select all

#!/bin/bash -l
#SBATCH -C haswell
#SBATCH -p debug
#SBATCH -N 10
#SBATCH --tasks-per-node=32
#SBATCH -t 00:30:00
#SBATCH --qos=premium
# default job name is job script file name
# To change job name, submit with: sbatch -n 64 -J jobname qbox_cori.job
module load qbox
echo $SLURM_JOB_NAME
exe=qb
export OMP_NUM_THREADS=2
infile=${SLURM_JOB_NAME}.i
outfile=${SLURM_JOB_NAME}.r
srun $exe $infile > $outfile
and the following Qbox input script:

Code: Select all

set nrowmax 160
sno324.sys
set ecut 60
set xc PBE
set wf_dyn JD
randomize_wf 
set scf_tol 1.e-8
run -atomic_density 0 50 10
save scratch_dir_path/gs.xml
Some other observations about this calculation:
1) Due to the presence of oxygen, it is necessary to use a larger plane wave cutoff. A value of 60 Ry would be recommended. The accuracy of your results should also be checked in a few cases with a larger ecut to confirm that it is sufficient.
2) Performance is usually better when the nrowmax parameter is comparable to the size of the Fourier grid (np2v) in the z direction. In this system, with a cell size of 30 bohr in the z direction and ecut=60 Ry, the value of np2v is 160 (it can be grep'd from the output file; see the sketch after this list). Therefore, a good partition size could be 160x1, 160x2, 160x3, etc., obtained by setting nrowmax=160.
3) Note that the Qbox version provided by the NERSC module runs on the Cori Haswell nodes. Use the "-C haswell" option in the job script.
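
As a quick way to check point 2 above (the output file name here is just a placeholder for whatever your run produced):

Code: Select all

# np2v is reported in the Qbox output file; substitute your own output file name
grep np2v sno324.r
# then set nrowmax in the input to match, e.g. set nrowmax 160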

I will try a few more tests on this system on Cori.

Hope this helps.
Francois

Re: MPI_Abort errors on different node counts for Cori KNL

Posted: Fri Jun 16, 2017 3:06 pm
by tbbishop
Hi Francois,

Thank you for your response. That seems to have worked for the ground state calculations, but the MD step is still failing, although two jobs are still running that may yet succeed.
My input for this is

Code: Select all

set nrowmax 160
load gs.xml
set xc PBE
set wf_dyn JD
set scf_tol 1.e-8
set atoms_dyn MD
set dt 60
randomize_v 600
run 10 10 
save md4.xml
and the output before it fails is

Code: Select all

<unit_cell_a_norm> 68.398648 </unit_cell_a_norm>
<unit_cell_b_norm> 62.195430 </unit_cell_b_norm>
<unit_cell_c_norm> 30.000000 </unit_cell_c_norm>
<unit_cell_alpha>  90.000 </unit_cell_alpha>
<unit_cell_beta>   90.000 </unit_cell_beta>
<unit_cell_gamma>  90.000 </unit_cell_gamma>
<unit_cell_volume> 127622.500 </unit_cell_volume>
  <econst> inf </econst>
  <ekin_ion> 0.89641798 </ekin_ion>
  <temp_ion> 582.46993560 </temp_ion>
  total_electronic_charge: 3240.00000000
  <eigenvalue_sum>  -nan </eigenvalue_sum>
 DoubleMatrix::potrf, info=1
 DoubleMatrix::potrf, info=1
...
I will try some more tests on smaller systems.
Thank you for your help,
Tyler

Re: MPI_Abort errors on different node counts for Cori KNL

Posted: Sat Jun 17, 2017 1:23 am
by fgygi
Hi Tyler,
It seems that this system is pushing the wave function solvers to their limits. Large unit cells with lots of empty space are challenging. I used "set ecutprec 10" to choose a more conservative preconditioner (the automatic preconditioner used when ecutprec=0 seems to be too aggressive here). The following script worked:

Code: Select all

set nrowmax 160
load scratch/gs.xml
set xc PBE
set wf_dyn JD
set ecutprec 10
set scf_tol 1.e-8
set atoms_dyn MD
set dt 60
randomize_v 600
run 10 10
save scratch/md.xml
I will also try changing the number of tasks per node and the number of threads per task. For example, using nrowmax=80 with 16 tasks per node and 4 threads per task seems to go a bit faster.
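
A sketch of the corresponding changes, showing only the lines that would differ from the job script and input above (these exact values are untested here):

Code: Select all

#SBATCH --tasks-per-node=16
export OMP_NUM_THREADS=4
# and in the Qbox input:
# set nrowmax 80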

Francois

Re: MPI_Abort errors on different node counts for Cori KNL

Posted: Mon Jun 19, 2017 4:03 pm
by tbbishop
Hi Francois,

Unfortunately, I still couldn't get it to work; I am checking whether I made an error in my inputs. However, everything runs fine for 8x8x4 supercells, which will work. Thank you very much for all your help. It is much appreciated.

Best,
Tyler