
Segmentation fault

Posted: Sat Dec 14, 2013 3:29 pm
by ckande
Hello there,

I am trying to compile qbox-1.58 with Intel 13.1 / Intel MKL / MVAPICH2-2-2.0b / XERCES-2.8.0. The compilation goes well and I get an executable, but the executable segfaults at runtime while trying to read the pseudopotential files. I am not sure what I am doing wrong; it looks like there is something wrong with the XERCES library.

I compiled XERCES with the following options: ./runConfigure -p linux -r none -s -c icc -x icpc -P $HOME/opt/xerces/2.8.0_32 and ./runConfigure -p linux -b 64 -r none -s -c icc -x icpc -P $HOME/opt/xerces/2.8.0_64. Both the 32-bit and the 64-bit builds segfault.

The .mk file and the segmentation fault output are below.

Code: Select all

#-------------------------------------------------------------------------------
#
# Copyright (c) 2008 The Regents of the University of California
#
# This file is part of Qbox
#
# Qbox is distributed under the terms of the GNU General Public License 
# as published by the Free Software Foundation, either version 2 of 
# the License, or (at your option) any later version.
# See the file COPYING in the root directory of this distribution
# or <http://www.gnu.org/licenses/>.
#
#-------------------------------------------------------------------------------
#
#  pencil x86_64_gcc.mk
#
#-------------------------------------------------------------------------------
# $Id: pavane.mk,v 1.2 2006/08/22 15:23:28 fgygi Exp $
#
 PLT=x86_64
#-------------------------------------------------------------------------------
# GCCDIR=/usr/lib/gcc/x86_64-redhat-linux/3.4.3
 XERCESCDIR=$(HOME)/opt/xerces/2.8.0_32
 PLTOBJECTS = readTSC.o

 CXX=mpic++
 LD=$(CXX)

 PLTFLAGS += -DIA32 -DUSE_FFTW -D_LARGEFILE_SOURCE \
             -D_FILE_OFFSET_BITS=64 -DUSE_MPI -DSCALAPACK -DADD_ \
             -DAPP_NO_THREADS -DXML_USE_NO_THREADS -DUSE_XERCES #-DXERCESC_3

 MKLROOT = $(HOME)/opt/intel/mkl

 FFTWDIR=$(MKLROOT)/lib/intel64
 BLASDIR=$(MKLROOT)/lib/intel64
 LAPACKDIR=$(MKLROOT)/lib/intel64
#BLASDIR=/usr/lib64
#LAPACKDIR=/usr/lib64 software/lapack/LAPACK

 INCLUDE = -I$(MPIDIR)/include -I$(MKLROOT)/include/fftw -I$(XERCESCDIR)/include

 CXXFLAGS= -O3 -D$(PLT) $(INCLUDE) $(PLTFLAGS) $(DFLAGS) -openmp -traceback -vec-report0

 LIBPATH = -L$(GCCDIR)/lib -L$(FFTWDIR) -L/usr/X11R6/lib \
           -L$(BLASDIR) -L$(LAPACKDIR) \
           -L$(XERCESCDIR)/lib

 LIBS =  $(PLIBS) \
         -lmkl_blas95_lp64 -lmkl_lapack95_lp64 \
         -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential \
         -lmkl_blacs_intelmpi_lp64 -lpthread -lm -lfftw2xf_double_intel \
         -lxerces-c
         #$(XERCESCDIR)/lib/libxerces-c.a

# Parallel libraries
 PLIBS = $(SCALAPACKLIB) $(CBLACSLIB)

 LDFLAGS = $(LIBPATH) $(LIBS) -openmp

Code: Select all

<release> 1.58.0 compass </release>
<user> cande </user>
<sysname> Linux </sysname>
<nodename> compass.phys.tue.nl </nodename>
<start_time> 2013-12-14T15:20:01Z </start_time>
<mpi_processes count="1">
<process id="0"> compass.phys.tue.nl </process>
</mpi_processes>
<omp_max_threads> 1 </omp_max_threads>
[qbox] <cmd>set cell 16 0 0  0 16 0  0 0 16</cmd>
<unit_cell 
    a="16.00000000  0.00000000   0.00000000  "
    b="0.00000000   16.00000000  0.00000000  "
    c="0.00000000   0.00000000   16.00000000 " />
[qbox] <cmd>species oxygen O_HSCV_PBE-1.0.xml</cmd>
  SpeciesCmd: defining species oxygen as O_HSCV_PBE-1.0.xml
[compass.phys.tue.nl:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
Segmentation fault

Re: Segmentation fault

Posted: Sun Dec 15, 2013 8:56 am
by fgygi
I suggest the following tests to narrow down the possible causes:
1) Build the Xerces-C sample programs (in the directory "samples") and try to run SAX2Count on a pseudopotential file. This test program counts the XML elements in the document and should verify that Xerces-C was properly compiled.

2) If the Xerces sample programs work, try compiling the program testSpecies.C in the Qbox src directory. This program simply reads a pseudopotential file and dumps it to the output.

3) Try running Qbox in a debugger, if possible, to locate exactly where the signal 11 occurs. (A command sketch of these steps is given below.)
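For reference, a minimal command sketch of these three checks (all paths, the pseudopotential file location, and the use of gdb are assumptions; adjust to your installation):

Code: Select all

# 1) Build the Xerces-C 2.8.0 sample programs and count the XML elements in a
#    pseudopotential file (XERCESCROOT points at the Xerces-C source tree)
cd $XERCESCROOT/samples
./runConfigure -p linux -c icc -x icpc && make
$XERCESCROOT/bin/SAX2Count /path/to/O_HSCV_PBE-1.0.xml

# 2) Build and run testSpecies from the Qbox src directory
cd ~/qbox-1.58.0/src
make testSpecies          # or compile testSpecies.C by hand with the same flags as qb
./testSpecies /path/to/O_HSCV_PBE-1.0.xml

# 3) Run Qbox under a debugger and get a backtrace at the crash
gdb --args ./qb input.i   # 'input.i' is a placeholder input script
# at the (gdb) prompt: run, then backtrace after the SIGSEGV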

Re: Segmentation fault

Posted: Sun Dec 15, 2013 6:36 pm
by fgygi
The following makefile works on the TACC Stampede cluster. It uses the Intel libraries and Intel MPI.

Code: Select all

# modules required: intel impi mkl fftw2
# Before using make, use:
# $ module load intel mkl fftw2
# $ module swap mvapich2 impi 
#-------------------------------------------------------------------------------
#
# Copyright (c) 2013 The Regents of the University of California
#
# This file is part of Qbox
#
# Qbox is distributed under the terms of the GNU General Public License 
# as published by the Free Software Foundation, either version 2 of 
# the License, or (at your option) any later version.
# See the file COPYING in the root directory of this distribution
# or <http://www.gnu.org/licenses/>.
#
#-------------------------------------------------------------------------------
#
#  stampede.mk
#
#-------------------------------------------------------------------------------
#
 XERCESCDIR=$(HOME)/software/xerces/xerces-c-src_2_8_0

 PLTOBJECTS = 
 CXX=mpicxx
 LD=mpicxx

 PLTFLAGS += -DUSE_FFTW
 PLTFLAGS += -DUSE_MPI -DSCALAPACK -DADD_
 PLTFLAGS += -D__linux__
 PLTFLAGS += -DUSE_XERCES
 PLTFLAGS += -DUSE_DFFTW
 PLTFLAGS += -D_LARGEFILE64_SOURCE  -D_FILE_OFFSET_BITS=64
 PLTFLAGS += -DPARALLEL_FS
#PLTFLAGS += -DCHOLESKY_REMAP=16
 PLTFLAGS += -DMPICH_IGNORE_CXX_SEEK

 INCLUDE = -I$(TACC_FFTW2_INC) -I$(XERCESCDIR)/include

 CXXFLAGS= -O3 -vec-report1 \
           $(INCLUDE) $(PLTFLAGS) $(DFLAGS)

 LDFLAGS = $(LIBPATH) $(LIBS)
 LIBPATH =  

 LIBS =  -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64  \
         -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread \
         -lmkl_lapack95_lp64 -lm \
         -L$(TACC_FFTW2_LIB) -ldfftw \
         -luuid \
         -Bstatic \
         -L$(XERCESCDIR)/lib -lxerces-c \
         -Bdynamic

#-------------------------------------------------------------------------------

Re: Segmentation fault

Posted: Sun Dec 15, 2013 7:38 pm
by ckande
XERCES-C installation works.

Code: Select all

cande@compass:~/tools/xerces-c-src_2_8_0/bin> ./SAXCount ~/tools/qbox-1.58.0/test/h2ogs/O_HSCV_PBE-1.0.xml 
/home/cande/tools/qbox-1.58.0/test/h2ogs/O_HSCV_PBE-1.0.xml: 2 ms (18 elems, 7 attrs, 0 spaces, 101856 chars)
However, testSpecies fails. With openmpi/1.6.5 it fails with the following:

Code: Select all

cande@compass:~/tools/qbox-1.58.0/src> ./testSpecies ../test/h2ocg/O_HSCV_PBE-1.0.xml 
--------------------------------------------------------------------------
[[20429,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: compass.phys.tue.nl

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
 s.uri() = 
 testSpecies: invoking SpeciesReader::uri_to_species:
 SpeciesReader::uri_to_species done
[compass:11802] *** Process received signal ***
[compass:11802] Signal: Segmentation fault (11)
[compass:11802] Signal code: Address not mapped (1)
[compass:11802] Failing at address: 0xfffffffffffffff1
[compass:11802] [ 0] /lib64/libpthread.so.0() [0x3eb940f500]
[compass:11802] [ 1] /home/cande/opt/intel/mkl/lib/intel64/libmkl_core.so(mkl_serv_free+0x10) [0x2b364d8b82b0]
[compass:11802] [ 2] /home/cande/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so(+0x346918) [0x2b364d42e918]
[compass:11802] [ 3] ./testSpecies(_Z5sinftiPd+0x19d) [0x4cf29d]
[compass:11802] [ 4] ./testSpecies(_ZN7Species10initializeEd+0xa59) [0x4ccfe9]
[compass:11802] [ 5] ./testSpecies(main+0x1c9) [0x4c9aa9]
[compass:11802] [ 6] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3eb881ecdd]
[compass:11802] [ 7] ./testSpecies() [0x4c9819]
[compass:11802] *** End of error message ***
Segmentation fault
With mvapich2/2-2.0b:

Code: Select all

cande@compass:~/tools/qbox-1.58.0/src> ./testSpecies ../test/h2ogs/O_HSCV_PBE-1.0.xml 
 s.uri() = 
 testSpecies: invoking SpeciesReader::uri_to_species:
 SpeciesReader::uri_to_species done
[compass.phys.tue.nl:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
Segmentation fault
I will try the mk file above and report again.

Re: Segmentation fault

Posted: Sun Dec 15, 2013 7:53 pm
by ckande
It also fails at the same place with stampede.mk. The only changes I made to stampede.mk are for using an Intel-compiled FFTW2 library.

The diff is as follows:

Code: Select all

cande@compass:~/tools/qbox-1.58.0/src> diff compass.mk stampede.mk 
23,25c23
< XERCESCDIR=$(HOME)/opt/xerces/2.8.0_32
< MKLROOT = $(HOME)/opt/intel/mkl
< FFTW2INC=$(MKLROOT)/include/fftw
---
> XERCESCDIR=$(HOME)/software/xerces/xerces-c-src_2_8_0
35c33
< PLTFLAGS += -DUSE_FFTW
---
> PLTFLAGS += -DUSE_DFFTW
41c39
< INCLUDE = -I$(FFTW2INC) -I$(XERCESCDIR)/include
---
> INCLUDE = -I$(TACC_FFTW2_INC) -I$(XERCESCDIR)/include
52c50
<          -lfftw2xf_double_intel \
---
>          -L$(TACC_FFTW2_LIB) -ldfftw \

Re: Segmentation fault

Posted: Sun Dec 15, 2013 10:31 pm
by fgygi
There have been cases where seg faults were caused by incorrectly linking to a single-precision FFTW library.
You can check that the FFTW library works by compiling the program test_fftw.C in the Qbox src directory:

Code: Select all

$ make test_fftw
To test it on a set of 50 transforms of length 128, use:

Code: Select all

$ ./test_fftw 128 50
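If the test crashes or the transforms look wrong, one quick (assumed) check is to see which FFTW library the binary actually picked up; note that a statically linked FFTW will not show up in ldd:

Code: Select all

$ ldd ./test_fftw | grep -i fftw              # dynamically linked FFTW, if any
$ nm ./test_fftw | grep -i fftw_create_plan   # FFTW symbols pulled in at link time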

Re: Segmentation fault

Posted: Mon Dec 16, 2013 4:33 pm
by ckande
Thanks for the help so far.

The good news is that I am now able to compile and run the parallel executable on a single node with 16 cores using Intel 13.1 / Intel MKL / MVAPICH2 2-2.0b / FFTW-2.1.5. As you indicated, there was in fact a problem with FFTW. I compiled the FFTW2 interface supplied with Intel MKL and tried to use that, but it did not work. It works when I compile and link against FFTW-2.1.5 from the FFTW website.
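For reference, a typical FFTW-2.1.5 build from source looks roughly like this (the prefix and compiler choices are assumptions; --enable-type-prefix is only needed if the link line expects -ldfftw, as in stampede.mk, rather than -lfftw):

Code: Select all

$ tar xzf fftw-2.1.5.tar.gz && cd fftw-2.1.5
$ env CC=icc F77=ifort ./configure --prefix=$HOME/opt/fftw/2.1.5 --enable-type-prefix
$ make && make install    # --enable-type-prefix installs libdfftw.a (double precision)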

But the executable still segfaults when I try to run on more than one node with MVAPICH2. If I use openmpi/1.6.5, it segfaults immediately, regardless of whether it runs on one node or more.

With MVAPICH2 test_fftw works on a single node and also on two nodes:

Code: Select all

cande@compass:~/tools/qbox-1.58.0/src> cat hosts 
compute-0-4
compute-0-8
cande@compass:~/tools/qbox-1.58.0/src> mpirun -n 2 ./test_fftw 128 50 > /dev/null && echo $?
0
cande@compass:~/tools/qbox-1.58.0/src> mpirun -f hosts -n 2 ./test_fftw 128 50 > /dev/null && echo $?
0
However, testSpecies works on a single node but fails when running on two nodes:

Code: Select all

cande@compass:~/tools/qbox-1.58.0/src> mpirun -n 2 ./testSpecies ../test/h2ogs/O_HSCV_PBE-1.0.xml > /dev/null && echo $?
0
cande@compass:~/tools/qbox-1.58.0/src> mpirun -f hosts -n 2 ./testSpecies ../test/h2ogs/O_HSCV_PBE-1.0.xml > /dev/null && echo $?
[cli_1]: aborting job:
Fatal error in MPI_Init:
Other MPI error
[mpiexec@compass.phys.tue.nl] HYDU_sock_read (utils/sock/sock.c:239): read error (Bad file descriptor)
[mpiexec@compass.phys.tue.nl] control_cb (pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy
[mpiexec@compass.phys.tue.nl] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@compass.phys.tue.nl] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@compass.phys.tue.nl] main (ui/mpich/mpiexec.c:331): process manager error waiting for completion
Finally, the executable qb itself runs without a problem on one node with 16 cores, but fails with a message similar to the one above when trying to run on two nodes.

With openmpi/1.6.5, both test_fftw and testSpecies run on one node and on two nodes.

Code: Select all

cande@compass:~/tools/qbox-1.58.0/src> mpirun -np 2 ./test_fftw 128 50 > /dev/null && echo $?
0
cande@compass:~/tools/qbox-1.58.0/src> mpirun --hostfile hosts -np 2 ./test_fftw 128 50 > /dev/null && echo $?
0
cande@compass:~/tools/qbox-1.58.0/src> mpirun -np 2 ./testSpecies ../test/h2ogs/O_HSCV_PBE-1.0.xml > /dev/null && echo $?
cande@compass:~/tools/qbox-1.58.0/src> mpirun --hostfile hosts -np 2 ./testSpecies ../test/h2ogs/O_HSCV_PBE-1.0.xml > /dev/null && echo $?
0
But the executable itself segfaults on both a single node and multiple nodes. Any guidance or diagnostics would be helpful.

Also, I just realized that the executable compiles on the head node but not on the slave nodes. I am not sure how important this is. When I try to compile on a slave node, the compilation fails saying that g++ is not found. I also don't understand why icpc is looking for g++.

Code: Select all

[cande@compute-0-4 src]$ make
mpicxx -O3 -I/home/cande/opt/fftw/2.1.5/include -I/home/cande/opt/xerces/2.8.0_32/include -DUSE_MPI -DSCALAPACK -DADD_ -DUSE_XERCES -DUSE_FFTW -D_LARGEFILE64_SOURCE  -D_FILE_OFFSET_BITS=64 -DMPICH_IGNORE_CXX_SEEK   -DTARGET='"compass"'   -c -o qb.o qb.C
icpc: error #10001: could not find directory in which g++ resides
make: *** [qb.o] Error 1

Re: Segmentation fault

Posted: Mon Dec 16, 2013 6:32 pm
by fgygi
Similar symptoms have in some cases been caused by incorrect environment settings on the head node and/or the compute nodes.
First, note that running test_fftw with mpirun does not actually test any MPI functionality (test_fftw does not use MPI); it only runs separate copies of the program.

I understand that you are testing two different MPI implementations (openmpi and mvapich2). Make sure that when switching from one to the other you also recompile Qbox and change the MKL libraries you link to (in particular, the BLACS library used must change).
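As a sketch of that point (library names as provided by MKL; the exact set depends on your MKL version), the BLACS library on the link line has to match the MPI implementation:

Code: Select all

# MVAPICH2 / Intel MPI (MPICH-derived), as in the makefiles above:
 LIBS += ... -lmkl_blacs_intelmpi_lp64 ...
# Open MPI:
 LIBS += ... -lmkl_blacs_openmpi_lp64 ...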

Regarding mvapich2, you may want to check the following topic: http://fpmd.ucdavis.edu/qbox-list/viewt ... ?f=4&t=215 . If the Intel MPI implementation is available on your cluster you may want to try it.

Getting openmpi to work can be challenging. The invocation of the mpirun command depends on environment variables set by various scripts. In particular, if you are using Intel's "compilervars.sh" script to define the PATH to icc and icpc (or if it is used in an initialization script such as .bashrc), then the script also changes the PATH so that Intel's mpirun program is invoked (i.e. overriding the openmpi mpirun command). Therefore, you have to use an explicit path when invoking mpirun, e.g. /opt/openmpi/bin/mpirun. There also seems to be a difference in environment between your head node and the compute nodes, since the mpicxx command seems to invoke different compilers.
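A quick (assumed) way to compare the two environments: icpc relies on a g++ installation in the PATH for the GNU headers and libstdc++, so check what the head node and a compute node each see:

Code: Select all

$ which mpicxx icpc g++
$ ssh compute-0-4 'which mpicxx icpc g++'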
These are most likely MPI configuration issues, and it is probably best to sort them out with a simple MPI C++ program such as the following:

Code: Select all

// hello.C
#include "mpi.h"
#include <iostream>
using namespace std;
int main( int argc, char *argv[])
{
    int myid, numprocs;
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    cout << myid << " of " << numprocs << endl;
    MPI_Finalize();
    return 0;
}
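
A possible way to build and run it across two nodes (the hostfile name and the explicit Open MPI path are assumptions, matching the commands used above):

Code: Select all

$ mpicxx hello.C -o hello
$ /opt/openmpi/bin/mpirun --hostfile hosts -np 2 ./hello    # Open MPI
$ mpirun -f hosts -n 2 ./hello                              # MVAPICH2 (Hydra)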

Re: Segmentation fault

Posted: Tue Dec 17, 2013 10:04 am
by ckande
I finally managed to compile with OpenMPI. As you pointed out in your last reply, I was using the wrong MKL BLACS library in the .mk file. Everything now works.

As for the MVAPICH2 executable not working on more than one node: the simple hello-world program also did not work, so I think there is some issue with the InfiniBand network on the cluster that will need further investigation.

Thanks for the help all along.