Fixing illegal instruction issues

From ParabolaWiki


1 Introduction

Sometimes, running a program results in a crash, and a message about illegal instruction(s). Here's an example from the bug report #2789:

$ prboom-plus
Illegal instruction (core dumped)

What this error message means is that prboom-plus contains some CPU instructions that the CPU running it doesn't understand.

This typically happens when a package like prboom-plus is compiled on a very recent computer whose CPU has more instructions (like SSE3, AVX, etc), and the PKGBUILD or the software's build system (autotools, cmake, etc) somehow ends up detecting those CPU instructions. This results in the package being compiled with instructions that are not supported on older CPUs, or even on CPUs of a different vendor or family.

This tends to happen more with Parabola packages that are based on AUR packages: as users typically compile AUR packages themselves and run them only on the computer that compiled them, the issue doesn't show up there.

2 Debugging it

2.1 Finding which instruction caused the issue

The way to debug this is to run the original package under gdb, find the instruction that triggers the illegal instruction error, and find where it comes from. It can also come from a library that the package depends on.

The next step is to understand which instruction set it comes from (SSE4, etc) and then find how that got enabled in the build.

Packages are supposed to run on every x86_64 CPU, so either the instruction set has to be detected at runtime, through libraries or special GCC support, or such optimizations have to be disabled in the PKGBUILD of the affected package(s) (which can also be dependencies) or in the build system (autotools, etc) of those packages.

Once it's fixed, the way to go is to try running the program again, in case there are still other illegal instructions coming from other places (other libraries, etc).

For all of that, we first need to find a (virtual or physical) machine that can reproduce the bug. If you don't have such a machine, it's probably easier to disassemble the program and look for specific instruction sets than to follow this tutorial.
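
If no machine reproduces the crash, the disassembly route could be sketched roughly as follows. This is only a heuristic, not an exact test: count_avx_candidates is a name made up for this example, and it simply counts disassembly lines whose mnemonic starts with 'v' and whose operands mention an xmm or ymm register, which is how AVX-style instructions usually show up in objdump output.

```shell
# Heuristic sketch: count disassembly lines that look like AVX instructions.
# Reads "objdump -d" output on stdin; count_avx_candidates is a made-up name.
count_avx_candidates() {
    # A mnemonic starting with 'v', followed by operands using xmm/ymm registers.
    grep -cE '[[:space:]]v[a-z0-9]+[[:space:]].*%[xy]mm'
}

# Example usage, only if objdump and the binary are both available:
if command -v objdump >/dev/null 2>&1 && [ -x /usr/bin/prboom-plus ]; then
    objdump -d /usr/bin/prboom-plus | count_avx_candidates
fi
```

A count of zero does not prove the package is fine, since the offending code can also live in one of its libraries.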

If running the program on that machine gives the following, it is good enough to start tracking the issue:

$ prboom-plus
Illegal instruction (core dumped)

As we might not find the exact same CPU, the program might crash at a different place. That's also good enough to start tracking the issue:

$ prboom-plus
M_LoadDefaults: Load system defaults.
 default file: /home/gnutoo/.prboom-plus/prboom-plus.cfg
 found /usr/share/games/doom/prboom-plus.wad

PrBoom-Plus v2.5.1.4 (http://prboom-plus.sourceforge.net/)
I_SetAffinityMask: manual affinity mask is 1
 found /usr/share/games/doom/freedoom2.wad
IWAD found: /usr/share/games/doom/freedoom2.wad
PrBoom-Plus (built Oct 28 2019 14:30:57), playing: DOOM 2: Hell on Earth
PrBoom-Plus is released under the GNU General Public license v2.0.
You are welcome to redistribute it under certain conditions.
It comes with ABSOLUTELY NO WARRANTY. See the file COPYING for details.
I_SignalHandler: Exiting on signal: Illegal instruction

Now that we have reproduced the bug, we can start looking at it with gdb.

To do that, we start by loading the program that crashes under gdb:

$ gdb prboom-plus
GNU gdb (GDB) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from prboom-plus...
(No debugging symbols found in prboom-plus)

Here we don't have the debugging symbols, so we'll do without them. This limits us to assembly debugging. Practically speaking, being able to look at the source code would be useful to understand which package the problem comes from: if a program like prboom-plus is crashing, the problem might come from prboom-plus itself, but it could also come from any of its libraries, or even both.

Once the program is loaded in gdb, we can run it to produce the crash again:

(gdb) run
Starting program: /usr/bin/prboom-plus 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
M_LoadDefaults: Load system defaults.
 default file: /home/gnutoo/.prboom-plus/prboom-plus.cfg
 found /usr/share/games/doom/prboom-plus.wad

PrBoom-Plus v2.5.1.4 (http://prboom-plus.sourceforge.net/)
I_SetAffinityMask: manual affinity mask is 1
 found /usr/share/games/doom/freedoom2.wad
IWAD found: /usr/share/games/doom/freedoom2.wad
PrBoom-Plus (built Oct 28 2019 14:30:57), playing: DOOM 2: Hell on Earth
PrBoom-Plus is released under the GNU General Public license v2.0.
You are welcome to redistribute it under certain conditions.
It comes with ABSOLUTELY NO WARRANTY. See the file COPYING for details.

Program received signal SIGILL, Illegal instruction.
0x56578f4a in ?? ()
(gdb)

So we can observe that the program crashes again as expected.

We can also print the backtrace ('bt'), but since we don't have debug symbols that doesn't tell us much:

(gdb) bt
#0  0x56578f4a in  ()
#1  0x566072f3 in  ()
#2  0x5656bf88 in  ()
#3  0x5656c1d2 in  ()
#4  0x56591bf3 in  ()
#5  0x565642b5 in main ()
(gdb) 

We can then ask gdb to display the instruction at the program counter, which is the one that crashed the program:

(gdb) display/i $pc
1: x/i $pc
=> 0x56578f4a:    vmovq  0x714(%edx),%xmm0

Here we can see that the instruction name is 'vmovq'. To someone who is used to reading x86 assembly, it looks like a SIMD instruction. However, you don't need that kind of knowledge to debug this issue, as we will find the instruction set later on.

So to recap:

  • We now know that the vmovq instruction is causing this specific crash
  • We don't know if that instruction is part of a library or from the prboom-plus program

Once we have fixed that:

  • We might need to repeat the process until we fix all illegal instructions
  • As we might not have the exact same CPU as the other people who have the issue, we probably need to check with them that it is fixed for them as well.
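
To answer the open question of whether the instruction sits in the program or in a library, the crashing address can be compared against the memory mappings of the process. In gdb, 'info sharedlibrary' and 'info proc mappings' print that information while the program is stopped. The same lookup can be sketched in shell against the /proc/PID/maps format; find_mapping is a made-up helper name and the sketch assumes the addresses fit in the shell's integer arithmetic.

```shell
# Sketch: find which mapping contains an address.
# Reads lines in /proc/PID/maps format on stdin; find_mapping is a made-up name.
find_mapping() {
    addr=$(( $1 ))
    while read -r range perms offset dev inode path; do
        start=$(( 0x${range%-*} ))   # text before the dash
        end=$(( 0x${range#*-} ))     # text after the dash
        if [ "$addr" -ge "$start" ] && [ "$addr" -lt "$end" ]; then
            # Anonymous mappings have no path column.
            printf '%s\n' "${path:-[anonymous]}"
            return 0
        fi
    done
    return 1
}

# Example usage while the crashed program is still stopped in gdb,
# from another terminal, with PID being the process id of prboom-plus:
#   find_mapping 0x56578f4a < /proc/PID/maps
```

If the printed path is a library rather than /usr/bin/prboom-plus, the investigation has to continue in the package that ships that library.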

2.2 Finding which instruction set has the problematic instruction

Now that we found the name of a problematic instruction, here 'vmovq', we need to find more information about that instruction.

We want to know:

  • Which extended instruction set it's part of. For instance it might be from SSE3, AVX, etc
  • Possibly which CPUs support it and which don't

One way to find the information is to use a search engine. Another way is to go straight to the authoritative information.

As I'm not good with search engines, I'll give an example using an Intel architecture manual. There are probably many other ways to do it, such as using other manuals, using other online resources, etc.

Here I used 325462-sdm-vol-1-2abcd-3abcd.pdf, which is the "Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4".

In the description of the "MOVD/MOVQ—Move Doubleword/Move Quadword" instruction in the "INSTRUCTION SET REFERENCE", there is a table that looks like the following. Its "CPUID feature flag" column tells us that VMOVQ is part of AVX:

Opcode/ Instruction [...] CPUID feature flag [...]
[...] [...] [...] [...]
VMOVQ xmm1, r64/m64 [...] AVX [...]
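
The second question, which CPUs support the instruction set, can at least be answered for the machine at hand. The following is a minimal sketch, assuming an x86 Linux system where /proc/cpuinfo has a "flags" line; cpu_has_flag is a made-up helper name. Note that the kernel uses its own lowercase flag names, which do not always match the manual (SSE3 shows up as pni, for instance).

```shell
# Sketch: check whether a CPU feature flag (e.g. avx) is advertised by the kernel.
# cpu_has_flag is a made-up helper name. The optional second argument is a
# flags string to check instead of the first "flags" line of /proc/cpuinfo.
cpu_has_flag() {
    flag=$1
    flags=${2-$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null)}
    # Pad with spaces so that "avx" does not match "avx2".
    case " $flags " in
        *" $flag "*) return 0 ;;
        *) return 1 ;;
    esac
}

if cpu_has_flag avx; then
    echo "this CPU advertises AVX"
else
    echo "no avx flag found on this machine"
fi
```

Running this on the machine that crashes and on the machine that built the package should show the difference between the two.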

2.3 Finding why it compiled for that extended instruction set

So now we need to understand why AVX gets enabled during compilation when the package is built on a CPU that has AVX.

A good place to start is the PKGBUILD of the package (here prboom-plus).

There we have:

build() {
  cd "prboom-plus-$pkgver"

  ./configure --prefix=/usr --without-dumb
  make
}

We don't see any --enable-avx or anything like that, so we then need to look into the prboom-plus source code.

makepkg lets us easily get the source code by running the following command:

$ makepkg -o

Then we can simply go into the source directory:

$ cd src/prboom-plus-2.5.1.4

And try to see if the software's build system has options like --enable-avx:

$ ./configure --help
`configure' configures PrBoom-Plus 2.5.1.4 to adapt to many kinds of systems.

Usage: ./configure [OPTION]... [VAR=VALUE]...

To assign environment variables (e.g., CC, CFLAGS...), specify them as
VAR=VALUE.  See below for descriptions of some of the useful variables.

Defaults for the options are specified in brackets.

Configuration:
  -h, --help              display this help and exit
      --help=short        display options specific to this package
      --help=recursive    display the short help of all the included packages
  -V, --version           display version information and exit
  -q, --quiet, --silent   do not print `checking ...' messages
      --cache-file=FILE   cache test results in FILE [disabled]
  -C, --config-cache      alias for `--cache-file=config.cache'
  -n, --no-create         do not create output files
      --srcdir=DIR        find the sources in DIR [configure dir or `..']

Installation directories:
  --prefix=PREFIX         install architecture-independent files in PREFIX
                          [/usr/local]
  --exec-prefix=EPREFIX   install architecture-dependent files in EPREFIX
                          [PREFIX]

By default, `make install' will install all the files in
`/usr/local/bin', `/usr/local/lib' etc.  You can specify
an installation prefix other than `/usr/local' using `--prefix',
for instance `--prefix=$HOME'.

For better control, use the options below.

Fine tuning of the installation directories:
  --bindir=DIR            user executables [EPREFIX/bin]
  --sbindir=DIR           system admin executables [EPREFIX/sbin]
  --libexecdir=DIR        program executables [EPREFIX/libexec]
  --sysconfdir=DIR        read-only single-machine data [PREFIX/etc]
  --sharedstatedir=DIR    modifiable architecture-independent data [PREFIX/com]
  --localstatedir=DIR     modifiable single-machine data [PREFIX/var]
  --runstatedir=DIR       modifiable per-process data [LOCALSTATEDIR/run]
  --libdir=DIR            object code libraries [EPREFIX/lib]
  --includedir=DIR        C header files [PREFIX/include]
  --oldincludedir=DIR     C header files for non-gcc [/usr/include]
  --datarootdir=DIR       read-only arch.-independent data root [PREFIX/share]
  --datadir=DIR           read-only architecture-independent data [DATAROOTDIR]
  --infodir=DIR           info documentation [DATAROOTDIR/info]
  --localedir=DIR         locale-dependent data [DATAROOTDIR/locale]
  --mandir=DIR            man documentation [DATAROOTDIR/man]
  --docdir=DIR            documentation root [DATAROOTDIR/doc/prboom-plus]
  --htmldir=DIR           html documentation [DOCDIR]
  --dvidir=DIR            dvi documentation [DOCDIR]
  --pdfdir=DIR            pdf documentation [DOCDIR]
  --psdir=DIR             ps documentation [DOCDIR]

Program names:
  --program-prefix=PREFIX            prepend PREFIX to installed program names
  --program-suffix=SUFFIX            append SUFFIX to installed program names
  --program-transform-name=PROGRAM   run sed PROGRAM on installed program names 

System types:
  --build=BUILD     configure for building on BUILD [guessed]
  --host=HOST       cross-compile to build programs to run on HOST [BUILD]
  --target=TARGET   configure for building compilers for TARGET [HOST]

Optional Features:
  --disable-option-checking  ignore unrecognized --enable/--with options
  --disable-FEATURE       do not include FEATURE (same as --enable-FEATURE=no)
  --enable-FEATURE[=ARG]  include FEATURE [ARG=yes]
  --enable-silent-rules   less verbose build output (undo: "make V=1")
  --disable-silent-rules  verbose build output (undo: "make V=0")
  --disable-maintainer-mode
                          disable make rules and dependencies not useful (and
                          sometimes confusing) to the casual installer
  --enable-dependency-tracking
                          do not reject slow dependency extractors
  --disable-dependency-tracking
                          speeds up one-time build
  --enable-debug          turns on various debugging features, like range
                          checking and internal heap diagnostics
  --enable-profile        turns on profiling
  --disable-cpu-opt       turns off cpu specific optimisations
  --disable-gl            disable OpenGL rendering code
  --disable-sdltest       Do not try to compile and run a test SDL program
  --disable-nonfree-graphics
                          build prboom.wad without non-free menu text lumps
  --disable-dogs          disables support for helper dogs
  --enable-heapcheck      turns on continuous heap checking (very slow)
  --enable-heapdump       turns on dumping the heap state for debugging

Optional Packages:
  --with-PACKAGE[=ARG]    use PACKAGE [ARG=yes]
  --without-PACKAGE       do not use PACKAGE (same as --with-PACKAGE=no)
  --with-waddir           Path to install prboom.wad and look for other WAD
                          files
  --with-dmalloc          use dmalloc, as in http://www.dmalloc.com
  --with-sdl-prefix=PFX   Prefix where SDL is installed (optional)
  --with-sdl-exec-prefix=PFX Exec prefix where SDL is installed (optional)
  --without-mixer         Do not use SDL_mixer even if available
  --without-net           Do not use SDL_net even if available
  --without-pcre          Do not compile with libpcre
  --without-mad           Do not use MAD mp3 library even when available
  --without-fluidsynth    Do not use fluidsynth library even when available
  --without-dumb          Do not use dumb tracker library even when available
  --without-vorbisfile    Do not use vorbisfile library even when available
  --without-portmidi      Do not use portmidi library even when available
  --without-image         Do not use SDL_image even if available
  --without-png           Do not use libpng even if available

Some influential environment variables:
  CC          C compiler command
  CFLAGS      C compiler flags
  LDFLAGS     linker flags, e.g. -L<lib dir> if you have libraries in a
              nonstandard directory <lib dir>
  LIBS        libraries to pass to the linker, e.g. -l<library>
  CPPFLAGS    (Objective) C/C++ preprocessor flags, e.g. -I<include dir> if
              you have headers in a nonstandard directory <include dir>
  CPP         C preprocessor

Use these variables to override the choices made by `configure' or to help
it to find libraries and programs with nonstandard names/locations.


Looking at it, I already see something suspicious:

  --disable-cpu-opt       turns off cpu specific optimisations
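
Since the help output is long, it can be filtered instead of read in full. The following is a small sketch; suspicious_options is a made-up name and the keyword list is only a guess at what CPU-related switches tend to be called, so an empty result does not mean there is nothing to find.

```shell
# Sketch: filter "./configure --help" output for CPU-related switches.
# suspicious_options is a made-up name; the keyword list is only a guess.
suspicious_options() {
    grep -iE 'cpu|simd|sse|avx|march|native|asm'
}

# Example usage, from inside the source directory:
if [ -x ./configure ]; then
    ./configure --help | suspicious_options
fi
```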

So we can try to find what disable-cpu-opt is really doing:

$ grep cpu-opt -r *
autotools/ac_cpu_optimisations.m4:AC_ARG_ENABLE(cpu-opt,AC_HELP_STRING([--disable-cpu-opt],[turns off cpu specific optimisations]),[
[...]

Here the ./configure script is generated by the autotools build system: configure.ac and the m4 files are used to generate ./configure.

In that m4 file we see:

AC_ARG_ENABLE(cpu-opt,AC_HELP_STRING([--disable-cpu-opt],[turns off cpu specific optimisations]),[],[
AC_MSG_CHECKING(whether compiler supports -march=native)
OLD_CFLAGS="$CFLAGS" 

That's already enough to cause illegal instructions. -march=native must not be used to build Parabola packages, as it enables all the optimizations (like AVX) that the CPU of the machine building the package supports. However, the machine that runs the package doesn't necessarily have AVX.
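
The same kind of search can be done directly for the compiler flags, which also helps with build systems that have no configure switch at all. This is a sketch under assumptions: find_cpu_flags is a made-up name, the patterns are examples rather than a complete list, and it relies on a grep that supports -r.

```shell
# Sketch: list places in a source tree that hard-code CPU-specific compiler flags.
# find_cpu_flags is a made-up name; the patterns are examples, not a complete list.
find_cpu_flags() {
    # -e is used so that patterns starting with '-' are not taken as options.
    grep -rnE -e '-m(arch|tune|cpu)=native' -e '-m(avx|sse[0-9])' "${1:-.}"
}

# Example usage, from inside the source directory:
#   find_cpu_flags .
```

Each hit still has to be read in context, since a flag may only be applied to code that is selected at runtime.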

2.4 Fixing the illegal instruction

So here we need to run configure with --disable-cpu-opt in the PKGBUILD.
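
Based on the PKGBUILD shown earlier, the fixed build() function could look like this. It is a sketch: the only change is the extra --disable-cpu-opt argument, and the comment records why it is there, going by the m4 excerpt above.

```shell
build() {
  cd "prboom-plus-$pkgver"

  # --disable-cpu-opt keeps configure from probing for -march=native,
  # which would tie the binary to the CPU of the build machine.
  ./configure --prefix=/usr --without-dumb --disable-cpu-opt
  make
}
```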

Once that is done, it is a good idea to test the result, and to ask other people to test it as they might have different CPUs.

If there are still illegal instructions, that process needs to be repeated.

Here the probability that something other than the optimisations turned off by --disable-cpu-opt causes illegal instructions is really low, so if there are still illegal instructions, it would be a good idea to look at the libraries that prboom-plus uses.
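
To get the list of libraries to look at, the output of ldd can be combined with pacman -Qo, which reports the package owning a file. The following is a sketch; lib_paths is a made-up helper name, and it only keeps the libraries that ldd resolved to an absolute path.

```shell
# Sketch: extract the resolved library paths from "ldd" output read on stdin.
# lib_paths is a made-up helper name.
lib_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Example usage: print each library together with the package that owns it.
if [ -x /usr/bin/prboom-plus ] && command -v pacman >/dev/null 2>&1; then
    ldd /usr/bin/prboom-plus | lib_paths | while read -r lib; do
        pacman -Qo "$lib"
    done
fi
```

Libraries loaded later at runtime with dlopen do not appear in ldd output, so the crashing address reported by gdb remains the more reliable pointer.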

2.5 To upstream or not to upstream

It is also a good idea to consider whether or not to send the patch upstream to the AUR. In the short term, sending the patch usually takes more time than fixing the PKGBUILD in Parabola, and the maintainers might refuse such patches if the justification is not written well enough or if they don't care about other distributions.

As some maintainers are willing to accept patches for things like that, it could save a lot of time in the long run, especially if the package changes often. If the patch is not upstream yet, it would be a good idea to document why we used --disable-cpu-opt in the PKGBUILD.

Example of successful upstreaming of patches:

The way to submit a patch is to paste it on the AUR page, as there is no other formal way to do it. Some maintainers don't manage to import it and ask you to send it again by mail, while others manage to import it fine.

If the package is unmaintained, you can probably take it over and fix it directly.