Porting a software project to a new operating system is always interesting and fun, and Alpine Linux with its musl libc brings a unique set of challenges.
A while ago, I was tasked with porting OverOps’ native agent to Alpine Linux. In a previous blog post, I got started with Alpine Linux by setting up a fresh Alpine disk install and desktop environment on my laptop, stocked with C++ and Java development tools.
Then, it was time to pull our project’s source code and get it up and running on Alpine. Building was easy and smooth, but debugging and troubleshooting of the early crashes and issues turned out to be especially complicated. Even getting GDB up and running proved to be a major challenge!
In this post, I’ll describe the weird behavior I encountered when debugging the JVM process in GDB, which turned out to be an actual “GDBug” on Alpine Linux, and how I was able to eventually work around it. Plus, I’ll share some useful tips for debugging the JVM with GDB and the status of the most popular C++ debuggers on Alpine.
Enter GDB and the Unknown Signal
After building the Alpine OverOps agent for the first time, I was ready to try it out. For starters, I picked a simple test jar and launched Java with the -agentpath command line option for attaching the OverOps native agent. Not surprisingly, it crashed – so it was time for some debugging.
I already had the gdb package installed, so I fired up gdb, set the java command to execute and hit run. Instead of hitting the crashing code, though, I got the following error:
Thread 1 “java” received signal ?, Unknown signal.
Not quite what I was expecting. What is this unknown signal?
A quick Google search came up with a variety of possible reasons for this GDB error: an overflowed stack, an uncaught exception, and even GDB bugs seen on some platforms (Cygwin and macOS High Sierra, for example). Most threads suggested the issue may be shell related, so I tried some of the proposed quick solutions:
- Setting GDB’s startup-with-shell option to “off”;
- Starting GDB from different shells (ash, bash);
- Running GDB with root permissions.
None of these worked. I kept searching for solutions, but couldn’t find any other good leads, and none for Alpine Linux in particular.
Then, I recalled that I had forgotten an important GDB setup step needed for debugging the java process: instructing GDB to properly handle JVM signals.
Coping with JVM Signals
When debugging native code running on the JVM or the JVM itself, it’s common for segmentation faults (SIGSEGV) and other kinds of signals to be raised. In most cases, it is just the JVM doing its thing – using signals for normal operations, such as internal signaling and supporting JIT optimizations. For example, the JIT compiler may intentionally remove null checks from the generated machine code, so when the exceptional path occurs and the program attempts to access a null object, a SIGSEGV is raised, caught and handled by the JVM signal handler. This is beneficial in the common case in which the object is valid, since removing the explicit null check is likely to result in more efficient code.
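To make the idea concrete, here’s a minimal C sketch of the same pattern: a load with no explicit check, plus a SIGSEGV handler that recovers when the load faults. This is not HotSpot code. The helper names (unreadable_page, faults_on_read) are made up for illustration, and an inaccessible memory page stands in for the null object.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Where the signal handler resumes execution after a fault. */
static sigjmp_buf recovery_point;

/* SIGSEGV handler: rather than letting the process die, jump to the
 * recovery point, roughly as a VM would divert to its slow path. */
static void on_segv(int signo) {
    (void)signo;
    siglongjmp(recovery_point, 1);
}

/* Hypothetical stand-in for a null object: one page with no access
 * rights. Returns NULL if the mapping fails. */
void *unreadable_page(void) {
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Reads *addr without any explicit validity check.
 * Returns 0 if the read succeeded, 1 if it faulted and was recovered,
 * -1 if the handler could not be installed. */
int faults_on_read(const volatile int *addr) {
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) return -1;

    int faulted;
    if (sigsetjmp(recovery_point, 1) == 0) {
        int value = *addr;   /* fast path: plain load, no branch */
        (void)value;
        faulted = 0;
    } else {
        faulted = 1;         /* slow path: reached through the handler */
    }
    sigaction(SIGSEGV, &old, NULL);   /* restore the previous handler */
    return faulted;
}
```

Calling faults_on_read on an ordinary variable takes the fast path and never touches the handler. Only the rare faulting read pays for the signal, and that signal is exactly what a debugger sees and stops on by default.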
In the HotSpot JVM, 11 different signals are used by the implementation, out of the 31 standard signals supported by the Linux kernel (for Linux signal types, see the signal man page). These signals and their functions in the JVM are described in Oracle’s Java documentation.
Seeing different kinds of signals during JVM debugging is normal, but by default, each signal raised causes GDB to stop program execution. This greatly interferes with debugging. Fortunately, GDB lets us customize how signals are handled. For each signal, we can choose whether to print a message when the signal is raised, whether the program should stop or continue, and whether the signal should be passed on to the debugged program.
To configure GDB to mask out the signals used by the HotSpot JVM, we add the following commands to our .gdbinit file, one handle SIGNAL nostop noprint pass line per signal:
These five signals are the ones used during normal JVM operation (a number of other signals, used during JVM exit, were excluded). Let’s break down the handle SIGNAL command:
- nostop tells GDB to continue execution when the signal is caught.
- noprint tells GDB not to print any message when this signal occurs (it also implies nostop).
- pass is for allowing the signal to be passed on and letting the JVM handle it.
For more information on these GDB commands, refer to the Debugging with GDB: Signals page.
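What “pass” means is easier to see from the debugger’s side. On Linux, debuggers such as GDB are built on ptrace: the tracer is notified of every signal before the traced program sees it, and decides whether to deliver it. The sketch below is my own minimal illustration of that mechanism, not GDB code. The function name traced_child_saw_signal and the exit codes are arbitrary choices.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t handler_ran = 0;

static void on_usr1(int signo) {
    (void)signo;
    handler_ran = 1;
}

/* Forks a traced child that raises SIGUSR1 at itself. The parent plays
 * the debugger: with pass != 0 it forwards the signal, otherwise it
 * swallows it. Returns 1 if the child's handler ran, 0 if it did not,
 * -1 on error. */
int traced_child_saw_signal(int pass) {
    pid_t pid = fork();
    if (pid < 0) return -1;

    if (pid == 0) {
        /* Child: the "debugged program". */
        if (signal(SIGUSR1, on_usr1) == SIG_ERR) _exit(12);
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0) _exit(12);
        raise(SIGUSR1);                 /* the tracer is notified first */
        _exit(handler_ran ? 10 : 11);
    }

    for (;;) {
        int status;
        if (waitpid(pid, &status, 0) < 0) return -1;
        if (WIFEXITED(status)) {
            int code = WEXITSTATUS(status);
            return code == 10 ? 1 : code == 11 ? 0 : -1;
        }
        if (WIFSIGNALED(status)) return -1;
        if (WIFSTOPPED(status)) {
            /* The child is stopped with a signal pending. Resuming with
             * the signal number delivers it; resuming with 0 drops it. */
            int sig = WSTOPSIG(status);
            long deliver = (sig == SIGUSR1 && !pass) ? 0 : sig;
            if (ptrace(PTRACE_CONT, pid, NULL, (void *)deliver) != 0) {
                kill(pid, SIGKILL);
                waitpid(pid, &status, 0);
                return -1;
            }
        }
    }
}
```

With pass set, the child’s handler runs as if no debugger were attached, which is what the JVM needs for its own signals. Without it, the signal never reaches the program.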
With my .gdbinit file properly set up, I repeated the test and hoped the Unknown Signal error would go away. Unfortunately, it was still there!
Since I was able to successfully debug simple C++ applications and different Linux executables, I wasn’t entirely sure there was something wrong with GDB – more likely, something was wrong with my freshly built Alpine agent. To rule out an agent issue, I ran GDB again with a simple java -version command, without attaching any agent. The issue reproduced here, too! In other words, java and GDB didn’t play well together, regardless of whether an agent was attached or not.
Next, I wanted to check whether the issue was specific to certain GDB or Java versions. I tried different GDB package versions (8.0.1-r6, 8.0.1-r3, 7.12.1-r1) and different Java versions (7, 8 and even an 11 early access build), and the exact same symptom reproduced in all of them. Regardless of what caused this issue, it seemed like a long-standing problem.
At that point, I couldn’t think of any more simple troubleshooting steps, so I figured I’d have to clone the OpenJDK sources and build my own OpenJDK debug build to investigate the issue. Then, assuming the issue reproduced in my debug build as well, I’d run and step through the JVM’s source code to the point where GDB breaks.
Luckily, I was spared! It turns out building debug OpenJDK from source is not mandatory for JVM debugging, since Alpine ships with an OpenJDK debug symbols package.
The package name is openjdk8-dbg, found in the community repository, providing debug symbols to the corresponding openjdk8 package. With the debug symbols, we can get informative stack traces with function names and source line numbers in GDB and in JVM error reports, and it’s possible to attach the source code, set breakpoints and inspect variable values.
With JVM debug symbols, I launched GDB again with java -version and hit the Unknown signal error once more. But this time, I was able to extract some meaningful information. Running the info threads command showed the location of the error in the JVM source code, along with the complete call stack:
Now I could see the error was coming from the JVMInit function (#7 in the call stack, highlighted). This function is the entry point to VM initialization, meaning the VM had not finished initializing by the time the signal was raised.
I started browsing the relevant OpenJDK source code, and it seemed getting to the bottom of this issue would take a while. I was not happy about this, since I was eager to move forward with actual Alpine development and not having a functional debugger was painful. Some of the simple crashes I could solve using debug prints, but not much more than that. So, before committing myself entirely to investigating this issue further, I decided to try one other popular C++ debugger.
LLDB on Alpine Linux
LLDB is a C/C++ debugger based on the LLVM compiler project. It is highly performant, scriptable, and has a similar look and feel to GDB. Its main strength comes from leveraging the powerful LLVM infrastructure: for example, it features the most up-to-date C++ expression parsing, powered by Clang, LLVM’s C/C++ front end. LLDB is the default debugger in macOS’s Xcode, and is becoming increasingly popular on other operating systems, too.
On Alpine, LLDB is not yet officially supported, and currently only available from the experimental edge/testing repository (version 5.0.1-r0 at the time of writing). I was happy to try it out, but sadly that version didn’t seem to be functional and got stuck when I tried to run it on a hello world C++ project. After some tracing and googling, I found a similar but old report on Debian, but without a remedy.
Still having faith in LLDB, I thought I’d try a more recent version. LLDB 5.0.1 is quite old, and version 8.0.0 was already under development, so I decided to clone the latest tree and build the project from source. Building LLDB was fairly simple (though lengthy) and completed successfully on the first shot. This time I was able to successfully debug a simple C++ project – but it was too soon to be joyous. When I tried to run java -version in LLDB, it didn’t go through and exited with an error:
This looked like a problem of similar magnitude to the GDB one, so I decided to call it quits and move back to GDB. I guess JVM debugging support on Alpine has not fully matured just yet.
Back at square one, I was determined to finally get to the root cause of the problem. I cloned the OpenJDK 8 sources corresponding to my version from https://github.com/frohoff/jdk8u-jdk, and placed them in a folder for GDB to pick up. Then I launched GDB again with java -version and hit the same crashing stack trace as before.
According to the stack trace, the crash occurred in the JVMInit function – that’s the high-level function that performs JVM initialization. Going deeper, JVMInit calls ContinueInNewThread, which creates a new thread for launching the JavaMain function, which in turn starts the VM and ultimately calls the main function of the Java program. I set a breakpoint in JavaMain, and from that point, carefully stepped through the program up to the point of explosion.
After a few hours of careful debugging, I managed to zoom in on the faulty stack trace:
Note the highlighted functions, setrlimit and __synccall.
I noticed that the Unknown signal error happens right after stepping over the __synccall function call, which is made from setrlimit. What is so special about this function that it causes GDB to break?
Setrlimit and musl libc
The setrlimit function is used for adjusting Linux resource limits. It’s the C equivalent of the ulimit shell command, which is useful for setting limit values for resource usage metrics like the number of open files, number of processes, process memory usage and core dump size.
In the JVM initialization code, setrlimit is called from os::init_2 (itself called from Threads::create_vm), the function that performs Linux/Unix specific initializations. Let’s look at the code calling setrlimit:
Note the call to setrlimit inside the else clause. It raises the open files limit so that the JVM doesn’t run out of file descriptors during operation (more on this later).
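For readers who haven’t used this API, here’s a rough C sketch of the pattern: read the current limits with getrlimit, copy the hard limit into the soft limit, and write them back with setrlimit. This is my own illustration, not the HotSpot source, and the function name raise_fd_limit_to_max is made up.

```c
#include <sys/resource.h>

/* Raises the soft limit on open files to the hard limit.
 * Returns 0 on success, -1 on failure. */
int raise_fd_limit_to_max(void) {
    struct rlimit lim;

    /* Read the current soft (rlim_cur) and hard (rlim_max) limits. */
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return -1;

    /* An unprivileged process may raise its soft limit up to the hard
     * limit, but no further. */
    lim.rlim_cur = lim.rlim_max;

    return setrlimit(RLIMIT_NOFILE, &lim) == 0 ? 0 : -1;
}
```

From the shell, ulimit -n shows the soft limit and ulimit -Hn the hard one, so the effect of a call like this is easy to check.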
Digging further, let’s look at the setrlimit implementation of musl, Alpine’s C runtime library:
Note the call to __synccall. And, here’s the code for __synccall – it’s quite large, so I won’t paste it in full here.
This is what’s happening:
- setrlimit stores its parameters in a struct and passes it to the __synccall function, along with a pointer to the do_setrlimit function.
- __synccall blocks all outside signals.
- __synccall reads the list of threads of the process from /proc/self/task, then loops over the threads and signals each one with SIGSYNCCALL, using __syscall(SYS_tkill, td->tid, SIGSYNCCALL).
- After all threads have been signaled, the callback function (do_setrlimit) is called.
- do_setrlimit sets the actual limits.
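The steps above can be approximated in a few lines of C. To be clear, this is a heavily simplified sketch of the pattern and not musl’s implementation: it skips the signal blocking and the synchronization with each thread’s handler, and the names broadcast_then_call, broadcast_deliveries and set_flag are invented here. The caller picks the signal. In my own tests I use SIGRTMIN as a stand-in, since SIGSYNCCALL is private to musl.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Number of times the broadcast signal handler has run. */
static volatile sig_atomic_t deliveries = 0;

static void on_broadcast(int signo) {
    (void)signo;
    deliveries++;
}

int broadcast_deliveries(void) {
    return (int)deliveries;
}

/* Hypothetical callback: sets the int that ctx points to. */
void set_flag(void *ctx) {
    *(int *)ctx = 1;
}

/* Signals every thread listed in /proc/self/task with signo, then runs
 * the callback. Returns the number of threads signaled, or -1 on error. */
int broadcast_then_call(int signo, void (*callback)(void *), void *ctx) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_broadcast;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, NULL) != 0) return -1;

    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL) return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Each numeric directory name is a thread ID; "." and ".."
         * fail the parse and are skipped. */
        char *end;
        long tid = strtol(entry->d_name, &end, 10);
        if (*end != '\0' || tid <= 0) continue;

        /* Same system call as in the musl line quoted above. */
        if (syscall(SYS_tkill, tid, signo) == 0) count++;
    }
    closedir(dir);

    callback(ctx);
    return count;
}
```

A debugger attached to this process is notified of every one of those tkill deliveries, one per thread, which is where GDB gets its chance to trip over a signal number it doesn’t know.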
Eureka! Stepping again through the code, I could see that the line with SIGSYNCCALL signaling is what’s causing GDB to explode. So after all, the “Unknown signal” is actually a SIGSYNCCALL signal! Finally, the root cause was revealed.
SIGSYNCCALL is not among the advertised Linux signals, and a quick check showed that it’s specific and internal to musl libc. It seemed GDB was not recognizing this signal and bailed out when it was encountered. Since GDB is open source, I tried to verify that assumption. The signals.c source file is where the different signal kinds are handled. Each signal type is handled individually – and indeed, there’s no mention whatsoever of SIGSYNCCALL.
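To picture how a per-signal lookup ends up printing “?, Unknown signal”, here’s a toy version in C. It is a hypothetical illustration and not GDB’s code: the table contents, the struct and the function name are all mine.

```c
#include <signal.h>
#include <stddef.h>

struct signal_info {
    int number;
    const char *name;
    const char *description;
};

/* A deliberately short table of signals this toy "debugger" knows. */
static const struct signal_info known_signals[] = {
    { SIGILL,  "SIGILL",  "Illegal instruction" },
    { SIGBUS,  "SIGBUS",  "Bus error" },
    { SIGFPE,  "SIGFPE",  "Arithmetic exception" },
    { SIGSEGV, "SIGSEGV", "Segmentation fault" },
    { SIGPIPE, "SIGPIPE", "Broken pipe" },
};

/* Fallback entry for any number that is missing from the table. */
static const struct signal_info unknown_signal = {
    -1, "?", "Unknown signal"
};

/* Translates a host signal number into a table entry. Numbers that
 * were never listed (a libc-private signal, for instance) fall through
 * to the "unknown" entry. */
const struct signal_info *lookup_signal(int number) {
    size_t n = sizeof known_signals / sizeof known_signals[0];
    for (size_t i = 0; i < n; i++) {
        if (known_signals[i].number == number) return &known_signals[i];
    }
    return &unknown_signal;
}
```

With a table like this, lookup_signal(SIGSEGV) yields a proper name, while any number the table’s author never anticipated comes back as “?”. Nothing is wrong with the signal itself. The debugger simply has no entry for it.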
Now that I knew the culprit, I began searching for a workaround.
Ultimately, I wanted to disable the call to setrlimit entirely. During JVM initialization, setrlimit is used for setting the number of open files to the maximum allowed. In Alpine 3.8 (and other Linuxes, Ubuntu 16.04 for example), the default open files limit is set to 1024, which is already quite high. I figured disabling the call to setrlimit would probably be OK – and hey, if it caused any issues, I could just increase the limit from the shell using ulimit -n.
In the os::init_2 function, the setrlimit call and surrounding code are enclosed in an if block, if (MaxFDLimit), where the MaxFDLimit boolean is always true. If there were a way to set it to false, the problem would be solved.
It turns out there’s an option for controlling this very flag, -XX:-MaxFDLimit, but sadly it is only supported on Solaris, and ignored on all other platforms (see the Java HotSpot VM Options page). The reasoning behind a Solaris-only flag for toggling MaxFDLimit is explained in David Holmes’s comment on issue JDK-8010126 in the JDK bug tracker:
“Back in the Solaris 8 days, the default soft-limit of 256 was woefully inadequate, so we bumped it to the hard limit of 1024. Older linuxes already tended to have a hard and soft limit of 1024, so it had no affect there. These days some linuxes have a hard limit of 4K. Solaris 10 has a soft limit of 16K and a hard limit of 64K.”
Now I was completely sure disabling the setrlimit call was OK. But I needed to find a different way of doing this… After some late night thinking and a good night’s sleep, it hit me: I could just break on the os::init_2 function, set MaxFDLimit to 0, and continue debugging normally!
Enter the HotSpot Alpine GDB debugging workaround:
- Start GDB
- Set a breakpoint on os::init_2 (break os::init_2)
- Run java with the desired command line arguments
- When the breakpoint is hit, set MaxFDLimit to 0 (set var MaxFDLimit=0)
- Continue execution and debug away.
Voilà. Now, I could finally debug the OverOps native agent on Alpine OpenJDK!
Note, this workaround has to be taken with a grain of salt. While it saved my day, it is far from being a satisfying solution. To apply the workaround, GDB has to be started and the commands entered manually. Worse, it is only applicable with a debuggable JDK, and not suitable for official OpenJDK builds that don’t have debug symbols. But it’s enough for now…
After several days of head scratching and hair pulling, I was finally successful in debugging the JVM process on Alpine Linux. It was a great learning experience for me, and perhaps also a good demonstration of the subtle issues that can arise when porting complex GNU-libc applications to musl libc (granted, there were more to come!).
Now that I realised what the problem was exactly, I could do some more online research. It turns out, this issue was not only seen with the JVM but also with the CoreCLR. I couldn’t find an open GDB ticket for this, so I added my own. Unfortunately, it has not received much attention. It’s possible the GDB fix may not be a difficult one, so with some time on my hands, I hope I’ll be able to propose a fix, or at least a patch that could be applied to local Alpine GDB builds.
That’s it for today. Thanks for reading, and stay tuned for more Alpine adventures!
Since writing this blog post, and after a short pause from Alpine development, I came back to meet this issue again. This time it was especially painful, and there was no update on my GDB ticket, so I decided to give it a try and patch GDB myself. Happily, as I suspected, this wasn’t very difficult!
If you’re facing this issue and looking for a workaround, you could try building GDB yourself and apply this quick-and-dirty patch.
Or, if you’re willing, try my locally built GDB (use at your own risk).
All that’s left is masking out SIGSYNCCALL using the GDB command: handle SIGSYNCCALL nostop noprint pass.