Debugging crashes/hangs on Linux

We do our best to make the server as stable and reliable as possible. Nonetheless, there may be conditions that never occur in our tests, be it different timings, printer responses or configurations, that cause the server to crash. Just to be clear, this does not mean a printer that stops responding or a reconnect in your web browser. It means the backend server process itself crashes. If that happens, you will see something like this in /var/log/syslog:

Jul 13 23:14:59 Repetier-Server kernel: [85421.242500] Alignment trap: not handling instruction e19c2f9f at [<76db67d4>]
Jul 13 23:14:59 Repetier-Server kernel: [85421.242515] Unhandled fault: alignment exception (0x001) at 0x72740061
Jul 13 23:14:59 Repetier-Server kernel: [85421.242529] pgd = b1820000
Jul 13 23:14:59 Repetier-Server kernel: [85421.242545] [72740061] *pgd=318c0835, *pte=3abf679f, *ppte=3abf6e7f
Jul 13 23:14:59 Repetier-Server systemd[1]: RepetierServer.service: Main process exited, code=killed, status=6/ABRT
Jul 13 23:14:59 Repetier-Server systemd[1]: RepetierServer.service: Unit entered failed state.
Jul 13 23:14:59 Repetier-Server systemd[1]: RepetierServer.service: Failed with result 'signal'.
Jul 13 23:14:59 Repetier-Server systemd[1]: RepetierServer.service: Service has no hold-off time, scheduling restart.
Jul 13 23:14:59 Repetier-Server systemd[1]: Stopped Repetier-Server 3D Printer Server.
Jul 13 23:14:59 Repetier-Server systemd[1]: Starting Repetier-Server 3D Printer Server...
Jul 13 23:15:00 Repetier-Server systemd[1]: Started Repetier-Server 3D Printer Server.

 

This can abort a running print or cause a brief flash in the web browser. As you can see, the server is restarted automatically after a crash, so without checking syslog you will never know whether it was a crash or a connection problem.
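A quick way to check for such a crash is to search the log for the systemd message shown in the excerpt. This is only a sketch: it assumes your system writes to /var/log/syslog and that the unit is called RepetierServer.service as in the excerpt, and the function name find_server_crashes is made up for illustration.

```shell
# List the syslog lines where systemd reports that the server process died.
# Log path and unit name are taken from the excerpt above and may differ on
# your system; find_server_crashes is an illustrative helper name.
find_server_crashes() {
    logfile=${1:-/var/log/syslog}
    grep -E 'RepetierServer\.service: Main process exited' "$logfile"
}

# Usage:
#   find_server_crashes                      # reads /var/log/syslog
#   find_server_crashes /var/log/syslog.1    # an older, rotated log
```

On systems that only log to the systemd journal, journalctl -u RepetierServer.service should show the same messages.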

To help us fix the problem in the next release, we need to find the location where the crash happened. For this the server needs to run inside the debugger. The steps below describe how to do that and which information we need to find the cause. All of this is done in a Linux console. If you use a Raspberry Pi or similar, you can do this over an SSH connection using e.g. PuTTY. On regular Linux desktops you can simply open the terminal application.

First you need to have the GNU debugger gdb installed. On Debian-based systems you can install it with

sudo apt-get update
sudo apt-get install gdb

 

Once it is installed, we start gdb and attach it to the running Repetier-Server. To do so we need the PID of the server, which we get like this:

root@FriendlyELEC:/var/lib/Repetier-Server/configs# ps aux | grep tier
repetie+   928  0.6  2.9 260784 29236 ?        Ssl  Jul13   9:28 /usr/local/Repetier-Server/bin/RepetierServer -c /usr/local/Repetier-Server/etc/RepetierServer.xml --daemon
pi       23206  0.0  0.0   1376   376 pts/1    S+   12:37   0:00 tail -f /var/lib/Repetier-Server/logs/server.log
root     27250  0.0  0.0   2064   508 pts/2    S+   12:53   0:00 grep --color=auto tier

 

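Look at the line containing /usr/local/Repetier-Server/bin/RepetierServer and note the first number, here 928, which is the PID. Now run gdb as root (the server runs under its own user, so a normal user is not allowed to attach) and attach to that PID: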

gdb
attach 928
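If you prefer not to read the PID by eye, the lookup can be scripted. The following is a minimal sketch that assumes the "ps aux" column layout shown above, where the PID is the second column and the command is the eleventh; server_pid is a name invented for this example.

```shell
# Print the PID (second column of "ps aux") of the first process whose
# command is the RepetierServer binary. server_pid is an illustrative name.
server_pid() {
    awk '$11 ~ /\/RepetierServer$/ { print $2; exit }'
}

# Usage (as root, or prefix gdb with sudo):
#   pid=$(ps aux | server_pid)
#   gdb -p "$pid"     # same as starting gdb and entering "attach <pid>"
```

Matching on the command column means the grep and tail lines from the listing above are skipped automatically.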

 

At this point the process is halted and you can analyse it. What you do next depends on the problem you want to debug. If you want to catch a crash that might happen later, simply continue by entering “c”; once the server crashes, it will stop in the debugger. It is important to keep the console open. If it closes, the server might be paused or even stopped. With SSH this means the computer running the SSH client (e.g. Windows) must not go into sleep mode. If the server is running but unresponsive, so you think it is hanging for some reason, continue directly with the analysis:

The first step is to check the active thread by entering bt:

(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0xf749f4ae in do_sigwait (set=<optimized out>, set@entry=0xffe3ef04, sig=sig@entry=0xffe3ef84) at ../sysdeps/unix/sysv/linux/sigwait.c:61
#2  0xf749f50a in __sigwait (set=0xffe3ef04, sig=0xffe3ef84) at ../sysdeps/unix/sysv/linux/sigwait.c:96
#3  0x0050e0a0 in Poco::Util::ServerApplication::waitForTerminationRequest() ()
#4  0x002fc080 in repetier::RepetierServerApplication::main(std::vector<std::string, std::allocator<std::string> > const&) ()
#5  0x00501886 in Poco::Util::Application::run() ()
#6  0x0050e3e0 in Poco::Util::ServerApplication::run(int, char**) ()
#7  0x002fabc6 in main ()

 

In case of a crash this shows where it crashed and the chain of function calls that led there.

Especially in the case of a hang it is also important to get the backtraces of all threads, so you run

(gdb) thread apply all bt

This returns a very long list with information about all threads, especially if you did many reloads.
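Because the output is long, it can be easier to let gdb write it straight to a file instead of copying it from the console. The sketch below uses gdb's batch mode, which runs the given commands and then exits, detaching from the server. The function name dump_backtraces, the "wait" argument and the GDB override variable are assumptions made for this example, not part of Repetier-Server.

```shell
# Write the backtrace of the active thread and of all threads to a file.
# With "wait" as third argument the server first runs on until it stops
# (crash case); without it the threads are dumped immediately (hang case).
# dump_backtraces is an illustrative name; setting GDB lets you point the
# function at a different gdb binary.
dump_backtraces() {
    pid=$1
    outfile=$2
    mode=${3:-now}
    if [ "$mode" = "wait" ]; then
        # "continue" blocks until the process stops, e.g. on a fatal signal.
        "${GDB:-gdb}" -p "$pid" -batch -ex "continue" \
            -ex "bt" -ex "thread apply all bt" > "$outfile" 2>&1
    else
        "${GDB:-gdb}" -p "$pid" -batch \
            -ex "bt" -ex "thread apply all bt" > "$outfile" 2>&1
    fi
}

# Usage (as root):
#   dump_backtraces 928 hang.txt          # server is hanging right now
#   dump_backtraces 928 crash.txt wait    # wait for the next crash
```

Note that gdb also stops on some signals that are not fatal, so in the "wait" case check the signal reported at the top of the file before assuming it was the crash.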

To find and resolve the problem you experience, we need:

  • What you were doing when it happened
  • Any message gdb shows about why it stopped
  • Backtrace of the active thread
  • Backtraces of all threads
  • If it only happens while processing specific files, these files as well
  • The last 100 lines of /var/lib/Repetier-Server/logs/server.log
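
The log excerpt from the last item can be saved to a file like this. It is a small sketch that assumes the log lives at the path visible in the "ps aux" listing earlier; adjust it if your installation differs. The function name and the default output file name are invented for this example.

```shell
# Save the last 100 lines of the server log so they can be attached to a
# report. collect_server_log and the default output name are illustrative;
# the default log path is the one shown in the process listing above.
collect_server_log() {
    logfile=${1:-/var/lib/Repetier-Server/logs/server.log}
    outfile=${2:-server-log-tail.txt}
    tail -n 100 "$logfile" > "$outfile"
}

# Usage:
#   collect_server_log                       # default path and output name
#   collect_server_log /path/to/server.log report-log.txt
```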

Send all this to repetierdev@gmail.com and we will try to find the cause and fix it for the next release.