5

I have a few functions in my code that randomly cause a SegmentationFault error. I've identified them by enabling faulthandler. I'm a bit stuck and have no idea how to reliably eliminate this problem.
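
For context, this is roughly all it takes to turn faulthandler on (it only needs to be enabled once at startup):

```python
# Enabling faulthandler makes the interpreter dump the Python traceback
# when a fatal signal such as SIGSEGV arrives, which is how the crashing
# functions were located.
import faulthandler

faulthandler.enable()
print(faulthandler.is_enabled())  # True
```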

I'm thinking about a workaround. Since the functions crash randomly, I could retry them after a failure. The problem is that there's no way to recover from a SegmentationFault crash.
The best idea I have for now is to rewrite these functions a bit and run them via subprocess. That way a crashed function won't take down the whole application and can be retried.

Some of the functions are quite small and executed often, so this will significantly slow down my app. Is there any method to execute a function in a separate context, faster than a subprocess, that won't crash the whole program in case of a segfault?
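
A minimal sketch of the subprocess workaround described above (run_isolated is a hypothetical helper written for illustration, not code from the actual app):

```python
import subprocess
import sys

def run_isolated(code, retries=3):
    """Run *code* in a fresh interpreter so a segfault only kills the
    child process; retry up to *retries* times if the child dies."""
    for _ in range(retries):
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return result.stdout
    raise RuntimeError("child kept crashing after %d attempts" % retries)

print(run_isolated("print(2 + 2)"))  # returns "4\n" from the child
```

Paying interpreter start-up cost on every call is exactly what makes this too slow for small, frequently executed functions.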

Djent
  • 1,311
  • 4
  • 24
  • 48
  • Maybe you can open spare processes ahead of time, so that if/when one crashes the next one takes over immediately – Yuri Feldman Nov 02 '20 at 11:15
  • You mean, constantly forking the app before entering a dangerous function? I don't know how I would do that in practice... – Djent Nov 02 '20 at 11:20
  • 4
    I have written a lot of Python and never had a SegmentationFault. You are doing something wrong. Show one of the functions that cause a SegmentationFault – rioV8 Nov 02 '20 at 11:26
  • My code crashes mostly when there are some C modules involved, e.g. Jinja templates resolving. There's nothing fancy there. – Djent Nov 02 '20 at 11:32
  • No, I mean maintaining a pool of processes in the background, waiting to take over - preferably using a separate process to manage them. But as @rioV8 wrote - maybe you would be better off figuring out why the segfault happens in the first place. – Yuri Feldman Nov 02 '20 at 12:42
  • I don't see how it could work. What will be executed in the subprocess? Do you mean a worker process that will be communicating via ZeroMQ or something similar to the main one and perform dangerous tasks? Also, I've tried to resolve the `SegmentationFault` problems multiple times, I can't even reproduce it reliably, so I will never be sure that they are actually gone. – Djent Nov 02 '20 at 12:50
  • 5
    Where in your code do you get `SegmentationFault`? Yes, it can take a long time to find the error. Reduce your code to the bare minimum and add line by line till you get SegmentationFaults. Reliable code does NOT have SegmentationFaults – rioV8 Nov 02 '20 at 14:48
  • Thanks to faulthandler I know in which function the `SegmentationFault` occurs. The error is raised e.g. by Jinja's `from_string()` function. I can remove it, and my code would be stable, but it won't resolve templates. How will that help? I have a few such places which involve modules with bindings to C that cause segfaults. There's nothing wrong with the code, and I can't isolate it to post anywhere to be reproduced. I even have a problem reproducing it in a real environment. I've already given up here, and now I'm looking for a workaround. – Djent Nov 02 '20 at 19:37
  • I have even tried to run the crashing function in a subprocess a million times (literally a million times - in a loop) to see if it crashes - it won't. It crashes only in my app, which I can't post anywhere. – Djent Nov 02 '20 at 19:38
  • @MisterMiyagi - what do you mean by that comment? I've already said that I can't fix this problem, nor ask anywhere, as it can't be reproduced reliably. I'm not saying it's Jinja - it probably isn't. I'm not asking for help in resolving the problem, I just need a workaround as I've given up already. – Djent Nov 09 '20 at 08:38

2 Answers

10

I had some unreliable C extensions throw segfaults every once in a while and, since there was no way I was going to be able to fix that, what I did was create a decorator that would run the wrapped function in a separate process. That way you can stop segfaults from killing the main process.

Something like this: https://gist.github.com/joezuntz/e7e7764e5b591ed519cfd488e20311f1

Mine was a bit simpler, and it did the job for me. Additionally, it lets you choose a timeout and a default return value in case there was a problem:

#! /usr/bin/env python3

# std imports
import logging
import multiprocessing as mp


def parametrized(dec):
    """This decorator can be used to create other decorators that accept arguments"""

    def layer(*args, **kwargs):
        def repl(f):
            return dec(f, *args, **kwargs)

        return repl

    return layer


@parametrized
def sigsev_guard(fcn, default_value=None, timeout=None):
    """Used as a decorator with arguments.
    The decorated function will be called with its input arguments in another process.

    If the execution lasts longer than *timeout* seconds, it will be considered failed.

    If the execution fails, *default_value* will be returned.
    """

    def _fcn_wrapper(*args, **kwargs):
        q = mp.Queue()
        # Note: a lambda target only works with the 'fork' start method
        # (the Linux default); under 'spawn' the target must be picklable.
        p = mp.Process(target=lambda q: q.put(fcn(*args, **kwargs)), args=(q,))
        p.start()
        p.join(timeout=timeout)
        exit_code = p.exitcode

        if exit_code == 0:
            return q.get()

        logging.warning('Process did not exit correctly. Exit code: {}'.format(exit_code))
        return default_value

    return _fcn_wrapper

So you would use it like:

@sigsev_guard(default_value=-1, timeout=60)
def your_risky_function(a, b, c, d):
    ...

imochoa
  • 636
  • 2
  • 8
  • Hmm, multiprocessing should indeed be faster than subprocess; I didn't know it would protect against segfaults though. – Djent Nov 05 '20 at 08:50
  • I'm not sure about the speed (the main reason I used multiprocessing is to use their nice Queue/Process workflow). But I can confirm that the main process will survive a SEGFAULT in the child process and return the default_value instead – imochoa Nov 05 '20 at 13:02
4

tl;dr: You can write C code using signal, setjmp and longjmp.

You have multiple choices to handle SIGSEGV:

  • spawning a subprocess using the subprocess library
  • forking using the multiprocessing library
  • writing a custom signal handler

Subprocess and fork have already been described, so I will focus on the signal-handler approach.

Writing signal handler

From the kernel's perspective, there is no difference between SIGSEGV and any other signal like SIGUSR1, SIGQUIT, SIGINT, etc. In fact, some runtimes (like the JVM) use them as a means of communication.
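
For asynchronous signals, handling works fine even from Python itself, as a quick sketch shows (POSIX only; SIGUSR1 does not exist on Windows):

```python
import os
import signal

received = []

def on_usr1(signum, frame):
    received.append(signum)

# Installing a handler for an asynchronous signal is fine from Python:
signal.signal(signal.SIGUSR1, on_usr1)
os.kill(os.getpid(), signal.SIGUSR1)
print(received == [signal.SIGUSR1])  # True

# But, as the docs quoted below explain, this does not extend to the
# synchronous SIGSEGV raised by faulty C code.
```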

Unfortunately, you can't override the SIGSEGV handler from Python code. See the docs:

It makes little sense to catch synchronous errors like SIGFPE or SIGSEGV that are caused by an invalid operation in C code. Python will return from the signal handler to the C code, which is likely to raise the same signal again, causing Python to apparently hang. From Python 3.3 onwards, you can use the faulthandler module to report on synchronous errors.

This means error management must be done in C code.

You can write a custom signal handler and use setjmp and longjmp to save and restore the stack context.

For example, here is a simple CPython C extension:

#include <signal.h>
#include <setjmp.h>

#define PY_SSIZE_T_CLEAN
#include <Python.h>

static sigjmp_buf jmpctx;

void handle_segv(int signo)
{
    /* siglongjmp (rather than longjmp) also restores the signal mask
       saved by sigsetjmp, so a later SIGSEGV can be caught again. */
    siglongjmp(jmpctx, 1);
}

static PyObject *
install_sig_handler(PyObject *self, PyObject *args)
{
    signal(SIGSEGV, handle_segv);
    Py_RETURN_TRUE;
}

static PyObject *
trigger_segfault(PyObject *self, PyObject *args)
{
    if (!sigsetjmp(jmpctx, 1))
    {
        // Writing through a NULL pointer triggers a segfault
        int *x = NULL;
        *x = 42;

        Py_RETURN_TRUE; // Will never be called
    }

    Py_RETURN_FALSE;
}

static PyMethodDef SpamMethods[] = {
    {"install_sig_handler", install_sig_handler, METH_VARARGS, "Install SIGSEGV handler"},
    {"trigger_segfault", trigger_segfault, METH_VARARGS, "Trigger a segfault"},
    {NULL, NULL, 0, NULL},
};

static struct PyModuleDef spammodule = {
    PyModuleDef_HEAD_INIT,
    "crash",
    "Crash and recover",
    -1,
    SpamMethods,
};

PyMODINIT_FUNC
PyInit_crash(void)
{
    return PyModule_Create(&spammodule);
}
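
To try this out, the extension can be compiled with a minimal build script (assuming the C source above is saved as crash.c next to it):

```python
# Minimal build script for the extension above; "crash" must match the
# module name passed to PyModule_Create.
from setuptools import setup, Extension

setup(
    name="crash",
    version="0.1",
    ext_modules=[Extension("crash", sources=["crash.c"])],
)
```

Build in place with `python setup.py build_ext --inplace`, then run the caller script below from the same directory.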

And the caller app:

import crash

print("Install custom sighandler")
crash.install_sig_handler()

print("bad_func: before")
retval = crash.trigger_segfault()
print("bad_func: after (retval:", retval, ")")

This produces the following output:

Install custom sighandler
bad_func: before
bad_func: after (retval: False )

Pros and cons

Pros:

  • From an OS perspective, the app just catches SIGSEGV like a regular signal, so error handling is fast.
  • It does not require forking (not always possible if your app holds various kinds of file descriptors, sockets, ...).
  • It does not require spawning subprocesses (not always possible, and a much slower method).

Cons:

  • Might cause memory leaks.
  • Might hide undefined / dangerous behavior.

Keep in mind that a segmentation fault is a really serious error! Always try to fix it first instead of hiding it.


arthurlm
  • 308
  • 1
  • 7
  • As I explained, I am only trying to find an alternative solution faster than forking / sub-processing. This method can be used to build a more complex wrapper above raw Python code, but I have tried to keep my example as simple as possible. – arthurlm Nov 08 '20 at 10:53
  • Here you can find a more complex example which wraps the call https://gist.github.com/arthurlm/d358e04781b6308e83be7fcbd1d7e05f – arthurlm Nov 08 '20 at 12:03
  • That's a very interesting solution, I'm a bit concerned about the undefined behavior, especially that my code uses threads. I'll try it out. – Djent Nov 09 '20 at 08:53
  • Threads, heap management, GC and file descriptors are the main resources which may suffer from undefined behaviour. If you have threads, I suggest using a lock on the Python side; it will avoid overly complex C code. Remember, this method should mainly be used to trigger a clean exit / restart of the whole application. – arthurlm Nov 09 '20 at 09:47
  • The multiprocessing approach seems to be the safest solution; however, for my small functions that are executed quite often and don't use any external resources, this solution might actually be good for just retrying the call. – Djent Nov 09 '20 at 10:07