Problems with MVAPICH2

Posted: Sat Feb 16, 2013 8:31 pm
by fgygi
Building Qbox on stampede.tacc.utexas.edu with the following environment produces an executable that hangs when run on more than one task:
intel/13.0.2.146
mkl/11.0
mvapich2/1.9a2

Replacing mvapich2 with the following impi module solves the problem (on stampede, use the "module swap" command: module swap mvapich2 impi):

impi/4.1.0.030

This appears to be a problem with mvapich2. The BLACS tester program provided with the BLACS netlib distribution fails when built with mvapich2 but passes all tests when built with impi.
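To avoid building against mvapich2 by accident, a small guard can be run before make. This is only a sketch: the function name mpi_stack_ok is made up for illustration, and it assumes the module system exports the colon-separated LOADEDMODULES variable (Environment Modules and Lmod normally do).

```shell
#!/bin/sh
# Return 0 if no mvapich2 module is loaded, 1 otherwise.
# Assumption: LOADEDMODULES holds the loaded modules as a colon-separated
# list, e.g. "intel/13.0.2.146:mkl/11.0:impi/4.1.0.030".
# mpi_stack_ok is a hypothetical helper name, not part of Qbox or TACC tools.
mpi_stack_ok() {
    case ":${LOADEDMODULES:-}:" in
        *:mvapich2/*|*:mvapich2:*)
            # Report the fix suggested in this thread on stderr.
            echo "mvapich2 is loaded; run: module swap mvapich2 impi" >&2
            return 1
            ;;
    esac
    return 0
}
```

After defining the function, "mpi_stack_ok && make" only starts the build when mvapich2 is not in the loaded module list.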

In summary, the following modules should be used to build Qbox on stampede:

Code:

  1) TACC-paths       4) intel/13.0.2.146    7) cluster
  2) Linux            5) mkl/11.0            8) impi/4.1.0.030
  3) cluster-paths    6) TACC                9) fftw2/2.1.5
The following file (stampede.mk) should be used, and the environment variable TARGET should be set to "stampede" before invoking make:

Code:

# modules required: intel impi mkl fftw2
# Before using make, use:
# $ module load intel mkl fftw2
# $ module swap mvapich2 impi
#-------------------------------------------------------------------------------
#
# Copyright (c) 2013 The Regents of the University of California
#
# This file is part of Qbox
#
# Qbox is distributed under the terms of the GNU General Public License 
# as published by the Free Software Foundation, either version 2 of 
# the License, or (at your option) any later version.
# See the file COPYING in the root directory of this distribution
# or <http://www.gnu.org/licenses/>.
#
#-------------------------------------------------------------------------------
#
#  stampede.mk
#
#-------------------------------------------------------------------------------
#
 XERCESCDIR=$(HOME)/software/xerces/xerces-c-src_2_8_0

 PLTOBJECTS = 
 CXX=mpicxx
 LD=mpicxx

 PLTFLAGS += -DUSE_FFTW
 PLTFLAGS += -DUSE_MPI -DSCALAPACK -DADD_
 PLTFLAGS += -D__linux__
 PLTFLAGS += -DUSE_XERCES
 PLTFLAGS += -DUSE_DFFTW
 PLTFLAGS += -D_LARGEFILE64_SOURCE  -D_FILE_OFFSET_BITS=64
 PLTFLAGS += -DPARALLEL_FS
#PLTFLAGS += -DCHOLESKY_REMAP=16
 PLTFLAGS += -DMPICH_IGNORE_CXX_SEEK

 INCLUDE = -I$(TACC_FFTW2_INC) -I$(XERCESCDIR)/include

 CXXFLAGS= -O3 $(INCLUDE) $(PLTFLAGS) $(DFLAGS)

 LDFLAGS = $(LIBPATH) $(LIBS)
 LIBPATH =  

 LIBS =  -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64  \
         -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread \
         -lmkl_lapack95_lp64 -lm \
         -L$(TACC_FFTW2_LIB) -ldfftw \
         -luuid \
         -Bstatic \
         -L$(XERCESCDIR)/lib -lxerces-c \
         -Bdynamic

#-------------------------------------------------------------------------------
The Xerces-C 2.8.0 library was built using:

Code:

./runConfigure -p linux -r none -s

Re: Problems with MVAPICH2

Posted: Fri Mar 29, 2013 6:08 pm
by fgygi
I followed up on the mvapich2 issue with the XSEDE staff, and it appears that there is a bug in mvapich2:
FROM: Snead, Bryan
(Concerning ticket No. 226548)

Francois:

> The BLACS test suite fails when using mvapich2, as I indicated in the first
> message of this ticket. Is there a way to understand why? Should I conclude that
> mvapich2 is broken...

Turns out, I'm told, that there is a bug in mvapich2 that will be fixed in a
future update. Administrators worked for quite some time with mvapich2
developers to find it. (And, you're correct, impi works much better with BLACS.)

I would recommend using impi until an mvapich2 update is pushed.

Re: Problems with MVAPICH2

Posted: Fri Apr 05, 2013 9:04 pm
by naromero
Can we get more information?

In particular, which BLACS tests fail, or did all the tests simply fail?

Also, can we get more information about the mvapich2 fix, so that other sites that use a subset/superset of it can benefit?

Thanks,
Nick Romero

Re: Problems with MVAPICH2

Posted: Tue Apr 09, 2013 6:29 pm
by fgygi
The test is run on stampede.tacc.utexas.edu using mvapich2/1.9.
The test program xCbtest_MPI-LINUX-0 fails in the BSBR section for INTEGER, REAL, DOUBLE PRECISION, COMPLEX, and DOUBLE COMPLEX. A typical error message is:

PROCESS { 1, 1} REPORTS ERRORS IN TEST# 2162:
Invalid element at A( 2, 1): Expected=[-.775427163 ,0.447751433 ]; Received=[-.199999996E-01,-.199999996E-01]
PROCESS { 1, 1} DONE ERROR REPORT FOR TEST# 2162.


For each precision, 10 BSBR tests fail.
The complete output file is attached.
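For anyone who wants to tally failures in their own tester output, here is a rough sketch. It assumes failures are reported in the "REPORTS ERRORS IN TEST# n:" form quoted above; the function name count_failed_tests is made up for illustration, and the log file is whatever file the tester output was saved to.

```shell
#!/bin/sh
# Print the number of distinct failing test numbers found in a BLACS
# tester log. Assumes each failure is announced by a line of the form
#   PROCESS { r, c} REPORTS ERRORS IN TEST# <n>:
# count_failed_tests is a hypothetical helper, not part of BLACS.
count_failed_tests() {
    grep 'REPORTS ERRORS IN TEST#' "$1" |
        sed 's/.*TEST# *\([0-9][0-9]*\).*/\1/' |   # keep only the test number
        sort -u |                                  # several processes may report the same test
        wc -l | tr -d ' '                          # count, stripping wc padding
}
```

Calling count_failed_tests on the saved output file prints a single number; the same test reported by several processes is counted once.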