Fixing illegal instruction issues
1 Introduction
Sometimes a program crashes with a message about an illegal instruction. Here is an example from bug report #2789:
$ prboom-plus
Illegal instruction (core dumped)
This error message means that prboom-plus contains CPU instructions that the CPU running it doesn't understand.
This typically happens when a package like prboom-plus is compiled on a recent computer whose CPU has additional instructions (like SSE3, AVX, etc), and the PKGBUILD or the software's build system (autotools, cmake, etc) somehow ends up detecting and using them. This results in the package being compiled with instructions that are not supported on older CPUs, or even on CPUs of a different vendor or family.
This tends to happen more with Parabola packages that are based on AUR packages: as users typically compile AUR packages themselves and run them only on the computer that compiled them, the issue doesn't show up there.
2 Debugging it
2.1 Finding which instruction caused the issue
The way to debug this is to run the original package under gdb, find the instruction that triggers the illegal instruction error, and find where it comes from. It can also come from a library that the package depends on.
The next step is to understand which instruction set it belongs to (sse4, etc) and then find how that instruction set got enabled in the build.
Packages are supposed to run on every x86_64 CPU, so either the instruction set has to be detected at runtime, through libraries or special GCC support, or such optimizations have to be disabled in the PKGBUILD of the affected package(s) (which can also be dependencies) or in the build system (autotools, etc) of those package(s).
Once it's fixed, the program should be run again, in case there are still other illegal instructions coming from other places (other libraries, etc).
To do all that, we first need to find a (virtual or physical) machine that can reproduce the bug. If you don't have such a machine, it's probably easier to disassemble the program and look for specific instruction sets than to follow this tutorial.
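For the disassembly route mentioned above, a rough sketch could look like the following. It assumes objdump (from binutils) is installed and prints AT&T syntax, which is its default on x86. The function name count_vector_candidates is hypothetical, and the pattern is only a heuristic: it counts instructions whose mnemonic starts with 'v' and that use xmm or ymm registers, which is how AVX-style instructions usually show up in a disassembly.

```shell
# Heuristic sketch: count disassembly lines that look like AVX-style
# instructions. Reads `objdump -d` output on stdin.
# count_vector_candidates is a hypothetical helper name.
count_vector_candidates() {
    # A mnemonic starting with "v", followed by operands that mention an
    # xmm or ymm register (AT&T syntax, e.g. "vmovq 0x714(%edx),%xmm0").
    grep -E -c '[[:space:]]v[a-z0-9]+[[:space:]].*%[xy]mm[0-9]+'
}

# Usage:
#   objdump -d /usr/bin/prboom-plus | count_vector_candidates
```

A count of 0 does not prove that the program is safe, since the problematic instructions can also live in one of its libraries.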
If we have the following, it should be good enough to start tracking the issue:
$ prboom-plus
Illegal instruction (core dumped)
As we might not find the exact same CPU, the program might also crash in a different place. That is good enough to start tracking the issue too:
$ prboom-plus
M_LoadDefaults: Load system defaults.
default file: /home/gnutoo/.prboom-plus/prboom-plus.cfg
found /usr/share/games/doom/prboom-plus.wad
PrBoom-Plus v2.5.1.4 (http://prboom-plus.sourceforge.net/)
I_SetAffinityMask: manual affinity mask is 1
found /usr/share/games/doom/freedoom2.wad
IWAD found: /usr/share/games/doom/freedoom2.wad
PrBoom-Plus (built Oct 28 2019 14:30:57), playing: DOOM 2: Hell on Earth
PrBoom-Plus is released under the GNU General Public license v2.0.
You are welcome to redistribute it under certain conditions.
It comes with ABSOLUTELY NO WARRANTY. See the file COPYING for details.
I_SignalHandler: Exiting on signal: Illegal instruction
Now that we have reproduced the bug, we can start looking at it with gdb.
To do that, we start by loading the program that crashes under gdb:
$ gdb prboom-plus
GNU gdb (GDB) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from prboom-plus...
(No debugging symbols found in prboom-plus)
Here we don't have the debugging symbols, so we'll do without them. This limits us to assembly debugging. Practically speaking, being able to look at the source code would help to understand which package the problem comes from: if a program like prboom-plus is crashing, the problem might come from prboom-plus itself, but it could also come from any of its libraries, or even both.
Once the program is loaded in gdb, we can run it to produce the crash again:
(gdb) run
Starting program: /usr/bin/prboom-plus
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
M_LoadDefaults: Load system defaults.
default file: /home/gnutoo/.prboom-plus/prboom-plus.cfg
found /usr/share/games/doom/prboom-plus.wad
PrBoom-Plus v2.5.1.4 (http://prboom-plus.sourceforge.net/)
I_SetAffinityMask: manual affinity mask is 1
found /usr/share/games/doom/freedoom2.wad
IWAD found: /usr/share/games/doom/freedoom2.wad
PrBoom-Plus (built Oct 28 2019 14:30:57), playing: DOOM 2: Hell on Earth
PrBoom-Plus is released under the GNU General Public license v2.0.
You are welcome to redistribute it under certain conditions.
It comes with ABSOLUTELY NO WARRANTY. See the file COPYING for details.
Program received signal SIGILL, Illegal instruction.
0x56578f4a in ?? ()
(gdb)
So we can observe that the program crashes again as expected.
We can also print the backtrace ('bt'), but since we don't have debug symbols, that doesn't tell us much:
(gdb) bt
#0 0x56578f4a in ()
#1 0x566072f3 in ()
#2 0x5656bf88 in ()
#3 0x5656c1d2 in ()
#4 0x56591bf3 in ()
#5 0x565642b5 in main ()
(gdb)
We can then enable printing the instructions to make it print the last instruction (which is the one that crashed the program):
(gdb) display/i $pc
1: x/i $pc
=> 0x56578f4a: vmovq 0x714(%edx),%xmm0
Here we can see that the instruction name is 'vmovq'. To someone who is used to looking at x86 assembly, it looks like a SIMD instruction. However, you don't need this kind of knowledge to debug this issue, as we will find the instruction set later on.
So to recap:
- We now know that the vmovq instruction is causing this specific crash
- We don't know if that instruction is part of a library or from the prboom-plus program
Once we have fixed that:
- We might need to repeat the process until we fix all illegal instructions
- As we might not have the exact same CPU as the other people who have the issue, we probably need to check with them whether the issue is fixed for them as well.
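The interactive steps above can also be run in one go, which can be handy when asking other people to reproduce the crash on their CPU. The sketch below only uses standard gdb options (-batch, -ex, --args) and commands (run, bt, x/i, info sharedlibrary); the wrapper name sigill_report is hypothetical. The last command is an attempt at answering the open question from the recap: 'info sharedlibrary' lists the address range of each loaded library, so the crashing address can be compared against those ranges.

```shell
# Sketch: run a program under gdb non-interactively and print where it died.
# sigill_report is a hypothetical wrapper name; the gdb options are standard.
sigill_report() {
    # The single quotes keep the shell from expanding $pc.
    gdb -batch \
        -ex 'run' \
        -ex 'bt' \
        -ex 'x/i $pc' \
        -ex 'info sharedlibrary' \
        --args "$@"
}

# Usage:
#   sigill_report prboom-plus
```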
2.2 Finding which instruction set has the problematic instruction
Now that we found the name of a problematic instruction, here 'vmovq', we need to find more information about that instruction.
We want to know:
- Which extended instruction set it's part of. For instance it might be from SSE3, AVX, etc
- Maybe which CPUs support it and which don't
One way to find the information is to use a search engine. Another way is to go straight to the authoritative information.
As I'm not good with search engines, I'll give an example using an Intel architecture manual. There are probably many other ways to do it, such as using other manuals, using other online resources, etc.
Here I used the 325462-sdm-vol-1-2abcd-3abcd.pdf which is the "Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4" for that.
Here in the description of the "MOVD/MOVQ—Move Doubleword/Move Quadword" instruction in the "INSTRUCTION SET REFERENCE", I see a table that looks like that:
Opcode/ Instruction | [...] | CPUID feature flag | [...] |
---|---|---|---|
[...] | [...] | [...] | [...] |
VMOVQ xmm1, r64/m64 | [...] | AVX | [...] |
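Once the CPUID feature flag is known (here AVX), one way to check whether a given machine supports it is to look at the "flags" line of /proc/cpuinfo, where Linux lists the features in lower case on x86. A minimal sketch, assuming a Linux system; the function name has_cpu_flag and its optional second argument are hypothetical:

```shell
# Sketch: succeed if FLAG appears in the first "flags" line of a cpuinfo file.
# has_cpu_flag is a hypothetical helper; FILE defaults to /proc/cpuinfo.
has_cpu_flag() {
    flag="$1"
    file="${2:-/proc/cpuinfo}"
    # Take the first "flags" line, drop the "flags :" prefix, put one word
    # per line, then match FLAG exactly (so "avx" does not match "avx2").
    sed -n '/^flags/{s/^[^:]*://;p;q;}' "$file" \
        | tr -s ' \t' '\n\n' \
        | grep -qx -- "$flag"
}

# Usage:
#   has_cpu_flag avx && echo "AVX supported" || echo "AVX not supported"
```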
2.3 Finding why it compiled for that extended instruction set
So now we need to understand why AVX gets enabled during compilation when the package is built on a CPU that has AVX.
A good place to start is the PKGBUILD of the package (here prboom-plus).
So here we have:
build() {
  cd "prboom-plus-$pkgver"
  ./configure --prefix=/usr --without-dumb
  make
}
We don't see any --enable-avx or things like that, so we then need to look into prboom-plus source code.
makepkg enables us to easily get the source code by running the following command:
$ makepkg -o
Then we can simply go in the source:
$ cd src/prboom-plus-2.5.1.4
And try to see if the software build system has things like --enable-avx:
$ ./configure --help
`configure' configures PrBoom-Plus 2.5.1.4 to adapt to many kinds of systems.

Usage: ./configure [OPTION]... [VAR=VALUE]...

To assign environment variables (e.g., CC, CFLAGS...), specify them as
VAR=VALUE. See below for descriptions of some of the useful variables.

Defaults for the options are specified in brackets.

Configuration:
  -h, --help              display this help and exit
      --help=short        display options specific to this package
      --help=recursive    display the short help of all the included packages
  -V, --version           display version information and exit
  -q, --quiet, --silent   do not print `checking ...' messages
      --cache-file=FILE   cache test results in FILE [disabled]
  -C, --config-cache      alias for `--cache-file=config.cache'
  -n, --no-create         do not create output files
      --srcdir=DIR        find the sources in DIR [configure dir or `..']

Installation directories:
  --prefix=PREFIX         install architecture-independent files in PREFIX
                          [/usr/local]
  --exec-prefix=EPREFIX   install architecture-dependent files in EPREFIX
                          [PREFIX]

By default, `make install' will install all the files in
`/usr/local/bin', `/usr/local/lib' etc. You can specify
an installation prefix other than `/usr/local' using `--prefix',
for instance `--prefix=$HOME'.

For better control, use the options below.

Fine tuning of the installation directories:
  --bindir=DIR            user executables [EPREFIX/bin]
  --sbindir=DIR           system admin executables [EPREFIX/sbin]
  --libexecdir=DIR        program executables [EPREFIX/libexec]
  --sysconfdir=DIR        read-only single-machine data [PREFIX/etc]
  --sharedstatedir=DIR    modifiable architecture-independent data [PREFIX/com]
  --localstatedir=DIR     modifiable single-machine data [PREFIX/var]
  --runstatedir=DIR       modifiable per-process data [LOCALSTATEDIR/run]
  --libdir=DIR            object code libraries [EPREFIX/lib]
  --includedir=DIR        C header files [PREFIX/include]
  --oldincludedir=DIR     C header files for non-gcc [/usr/include]
  --datarootdir=DIR       read-only arch.-independent data root [PREFIX/share]
  --datadir=DIR           read-only architecture-independent data [DATAROOTDIR]
  --infodir=DIR           info documentation [DATAROOTDIR/info]
  --localedir=DIR         locale-dependent data [DATAROOTDIR/locale]
  --mandir=DIR            man documentation [DATAROOTDIR/man]
  --docdir=DIR            documentation root [DATAROOTDIR/doc/prboom-plus]
  --htmldir=DIR           html documentation [DOCDIR]
  --dvidir=DIR            dvi documentation [DOCDIR]
  --pdfdir=DIR            pdf documentation [DOCDIR]
  --psdir=DIR             ps documentation [DOCDIR]

Program names:
  --program-prefix=PREFIX            prepend PREFIX to installed program names
  --program-suffix=SUFFIX            append SUFFIX to installed program names
  --program-transform-name=PROGRAM   run sed PROGRAM on installed program names

System types:
  --build=BUILD     configure for building on BUILD [guessed]
  --host=HOST       cross-compile to build programs to run on HOST [BUILD]
  --target=TARGET   configure for building compilers for TARGET [HOST]

Optional Features:
  --disable-option-checking  ignore unrecognized --enable/--with options
  --disable-FEATURE       do not include FEATURE (same as --enable-FEATURE=no)
  --enable-FEATURE[=ARG]  include FEATURE [ARG=yes]
  --enable-silent-rules   less verbose build output (undo: "make V=1")
  --disable-silent-rules  verbose build output (undo: "make V=0")
  --disable-maintainer-mode
                          disable make rules and dependencies not useful (and
                          sometimes confusing) to the casual installer
  --enable-dependency-tracking
                          do not reject slow dependency extractors
  --disable-dependency-tracking
                          speeds up one-time build
  --enable-debug          turns on various debugging features, like range
                          checking and internal heap diagnostics
  --enable-profile        turns on profiling
  --disable-cpu-opt       turns off cpu specific optimisations
  --disable-gl            disable OpenGL rendering code
  --disable-sdltest       Do not try to compile and run a test SDL program
  --disable-nonfree-graphics
                          build prboom.wad without non-free menu text lumps
  --disable-dogs          disables support for helper dogs
  --enable-heapcheck      turns on continuous heap checking (very slow)
  --enable-heapdump       turns on dumping the heap state for debugging

Optional Packages:
  --with-PACKAGE[=ARG]    use PACKAGE [ARG=yes]
  --without-PACKAGE       do not use PACKAGE (same as --with-PACKAGE=no)
  --with-waddir           Path to install prboom.wad and look for other WAD
                          files
  --with-dmalloc          use dmalloc, as in http://www.dmalloc.com
  --with-sdl-prefix=PFX   Prefix where SDL is installed (optional)
  --with-sdl-exec-prefix=PFX
                          Exec prefix where SDL is installed (optional)
  --without-mixer         Do not use SDL_mixer even if available
  --without-net           Do not use SDL_net even if available
  --without-pcre          Do not compile with libpcre
  --without-mad           Do not use MAD mp3 library even when available
  --without-fluidsynth    Do not use fluidsynth library even when available
  --without-dumb          Do not use dumb tracker library even when available
  --without-vorbisfile    Do not use vorbisfile library even when available
  --without-portmidi      Do not use portmidi library even when available
  --without-image         Do not use SDL_image even if available
  --without-png           Do not use libpng even if available

Some influential environment variables:
  CC          C compiler command
  CFLAGS      C compiler flags
  LDFLAGS     linker flags, e.g. -L<lib dir> if you have libraries in a
              nonstandard directory <lib dir>
  LIBS        libraries to pass to the linker, e.g. -l<library>
  CPPFLAGS    (Objective) C/C++ preprocessor flags, e.g. -I<include dir> if
              you have headers in a nonstandard directory <include dir>
  CPP         C preprocessor

Use these variables to override the choices made by `configure' or to help
it to find libraries and programs with nonstandard names/locations.
And looking at it I already see suspicious things:
  --disable-cpu-opt       turns off cpu specific optimisations
So we can try to find what disable-cpu-opt is really doing:
$ grep cpu-opt -r *
autotools/ac_cpu_optimisations.m4:AC_ARG_ENABLE(cpu-opt,AC_HELP_STRING([--disable-cpu-opt],[turns off cpu specific optimisations]),[
[...]
Here the ./configure script is generated by the autotools build system from configure.ac and from m4 files.
So here we see in that m4:
AC_ARG_ENABLE(cpu-opt,AC_HELP_STRING([--disable-cpu-opt],[turns off cpu specific optimisations]),[],[
AC_MSG_CHECKING(whether compiler supports -march=native)
OLD_CFLAGS="$CFLAGS"
So that's already enough to cause illegal instructions. -march=native must not be used to build Parabola packages, as it enables all the optimizations (like AVX) that are available on the CPU of the machine that builds the package. However, the machine that runs the package doesn't necessarily have AVX.
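The same kind of search can be generalized to look for compiler flags that tie a build to the CPU of the build machine. The following is a sketch only: the function name find_cpu_specific_flags is hypothetical, and the list of patterns (-march=native, -mavx, -mssse3, -msse3, -msse4) is an assumption about what is worth looking for, not an exhaustive list.

```shell
# Sketch: list places in a source tree that mention compiler flags tying the
# build to a specific CPU. find_cpu_specific_flags is a hypothetical name and
# the pattern list is an assumption, not an exhaustive one.
find_cpu_specific_flags() {
    dir="${1:-.}"
    # -e protects the pattern, which starts with a dash.
    grep -rnE -e '-march=native|-mavx|-mssse3|-msse[34]' "$dir"
}

# Usage, from the directory where makepkg -o was run:
#   find_cpu_specific_flags src/prboom-plus-2.5.1.4
```

Each hit still has to be read in context, as a flag may only be used for a single file that is selected at runtime.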
2.4 Fixing the illegal instruction
So here we need to run configure with --disable-cpu-opt in the PKGBUILD.
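Applied to the build() function of the PKGBUILD shown earlier, the change could look like the sketch below. Only the --disable-cpu-opt flag and the comment are new, and the wording of the comment is just a suggestion.

```shell
build() {
  cd "prboom-plus-$pkgver"
  # --disable-cpu-opt: the build system otherwise tries -march=native, which
  # can produce binaries that crash with "Illegal instruction" on CPUs that
  # lack the instruction sets of the build machine.
  ./configure --prefix=/usr --without-dumb --disable-cpu-opt
  make
}
```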
Once that is done, it would be a good idea to test the result, and to ask other people to test it as they might have different CPUs.
If there are still illegal instructions, that process needs to be repeated.
Here the probability that something other than the missing --disable-cpu-opt causes illegal instructions in prboom-plus itself is really low, so if there are still illegal instructions, it would be a good idea to look at the libraries that prboom-plus uses.
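To get the list of libraries to look at, the output of ldd can be filtered down to file paths, and pacman -Qo can then tell which package owns each file. A sketch, where the function name ldd_paths is hypothetical; note that ldd only shows libraries that are linked at load time, not the ones a program opens later by itself.

```shell
# Sketch: extract library paths from `ldd` output read on stdin.
# ldd_paths is a hypothetical helper name.
ldd_paths() {
    # Keep lines of the form "libfoo.so.1 => /usr/lib/libfoo.so.1 (0x...)"
    # and print only the resolved path.
    awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Usage: print the package owning each library used by the program.
#   ldd /usr/bin/prboom-plus | ldd_paths | while read -r lib; do
#       pacman -Qo "$lib"
#   done
```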
2.5 To upstream or not to upstream
It would also be a good idea to consider whether or not to send the patch upstream to the AUR. In the short term, sending the patch usually takes more time than fixing the PKGBUILD in Parabola, and the maintainers might refuse patches like that if the justification is not written well enough or if they don't care about other distributions.
As some maintainers are willing to accept patches for things like that, it could save a lot of time in the long run, especially if the package changes often. If the patch is not upstream yet, it would be a good idea to document in the PKGBUILD why we used --disable-cpu-opt.
Example of successful upstreaming of patches:
- Asterisk patch to fix illegal instruction
- The patch was posted on the corresponding AUR page: https://aur.archlinux.org/packages/asterisk/?O=10&PP=10#comment-698015
- And the maintainer picked it up and merged it: https://aur.archlinux.org/cgit/aur.git/commit/PKGBUILD?h=asterisk&id=31e6bea3101c08130fced223138ec3a32d1b3943
The way to submit a patch is to paste it on the AUR page, as there is no other formal way to do it. Some maintainers then don't manage to import it and ask you to send it again by mail, while others manage to import it fine.
If the package is unmaintained, you can probably take it over and fix it directly.