Why is segmentation fault not recoverable?-CodePudding

Following a previous question of mine, most comments say "just don't, you are in a limbo state, you have to kill everything and start over". There is also a "safeish" workaround.

What I fail to understand is WHY is segmentation fault inherently non recoverable.

The moment in which writing to protected memory is caught - otherwise, the SIGSEGV would not be sent.

If the moment of writing to protected memory can be caught, I don't see why - in theory - it can't be reverted, at some low level, and have the SIGSEGV converted to a standard software exception.

Please explain why after a segmentation fault the program is in an undetermined state, as very obviously, the fault is thrown BEFORE memory was actually changed (I am probably wrong and don't see why). Had it been thrown after, one could create a program that changes protected memory, one byte at a time, getting segmentation faults, and eventually re-programming the kernel - A security risk that is not present, as we can see the world still stands.

When exactly does segmentation fault happen (=when is SIGSEGV sent)?
Why is the process in undefined behavior state after that point?
Why is it not recoverable?
Why does this solution avoid that unrecoverable state? Does it even?

CodePudding user response：

When exactly does segmentation fault happen (=when is SIGSEGV sent)?

When you attempt to access memory you don't have access to, such as accessing an array out of bounds or de-referencing an invalid pointer. The signal SIGSEGV is standardized but different OS might implement it differently. "Segmentation fault" is mainly a term used in *nix systems, Windows calls it "access violation".

Why is the process in undefined behavior state after that point?

Because one or several of the variables in the program didn't behave as expected. Lets say you have some array that is supposed to store a number of values, but you didn't allocate enough room for all them. So only those you allocated room for gets written correctly, the rest written out of bounds of the array can hold any values. How exactly is the OS to know how critical those out of bounds values are for your application to function? It knows nothing of their purpose.

Furthermore, writing outside allowed memory can often corrupt other unrelated variables, which is obviously dangerous and can cause any random behavior. Such bugs are often hard to track down. Stack overflows for example are such seg faults prone to overwrite adjacent variables, unless the error was caught by protection mechanisms.

If we look at the behavior of "bare metal" microcontroller systems with no OS and no virtual memory features, just raw physical memory - they will just silently do exactly as told - for example overwriting unrelated variables and keep on going. Which in turn could cause disastrous behavior in case the application is mission-critical.

Why is it not recoverable?

Because the OS doesn't know what your program is supposed to be doing.

Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren't allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.

Why does this solution avoid that unrecoverable state? Does it even?

That solution is just ignoring the error and keeps on going, it doesn't fix the problem that caused it. It's a very dirty patch and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.

We have to realize that a seg fault is a symptom of a bug, not the cause.

CodePudding user response：

This might not be a complete answer, and it is by no means complete or accurate, but it doesn't fit into a comment

So a SIGSEGV can occur when you try to access memory in a way that you should not (like writing to it when it is read-only or reading from an address range that is not mapped). Such an error alone might be recoverable if you know enough about the environment.

But how do you want to determine why that invalid access happened in the first place.

In one comment to another answer you say:

short-term practice, I think my errors are only access to null and nothing more.

No application is error-free so why do you assume if null pointer access can happen that your application does not e.g. also have a situation where a use after free or an out of bounds access to "valid" memory locations happens, that doesn't immediately result in an error or a SIGSEGV.

A use-after-free or out-of-bounds access could also modify a pointer into pointing to an invalid location or into being a nullptr, but it could also have changed other locations in the memory at the same time. If you now only assume that the pointer was just not initialized and your error handling only considers this, you continue with an application that is in a state that does not match your expectation or one of the compilers had when generating the code.

In that case, the application will - in the best case - crash shortly after the "recovery" in the worst case some variables have faulty values but it will continue to run with those. This oversight could be more harmful for a critical application than restarting it.

If you however know that a certain action might under certain circumstances result in a SIGSEGV you can handle that error, e.g. that you know that the memory address is valid, but that the device the memory is mapped to might not be fully reliable and might cause a SIGSEGV due to that then recovering from a SIGSEGV might be a valid approach.

CodePudding user response：

At the machine-code level, many platforms would allow programs that are "expecting" segmentation faults in certain circumstances to adjust the memory configuration and resume execution. This may be useful for implementing things like stack monitoring. If one needs to determine the maximum amount of stack that was ever used by an application, one could set the stack segment to allow access only to a small amount of stack, and then respond to segmentation faults by adjusting the bounds of the stack segment and resuming code execution.

At the C language level, however, supporting such semantics would greatly impede optimization. If one were to write something like:

void test(float *p, int *q)
{
  float temp = *p;
  if (*q  = 1)
    function2(temp);
}

a compiler might regard the read of *p and the read-modify-write sequence on *q as being unsequenced relative to each other, and generate code that only reads *p in cases where the initial value of *q wasn't -1. This wouldn't affect program behavior anything if p were valid, but if p was invalid this change could result in the segment fault from the access to *p occurring after *q was incremented even though the access that triggered the fault was performed before the increment.

For a language to efficiently and meaningfully support recoverable segment faults, it would have to document the range of permissible and non-permissible optimizations in much more detail than the C Standard has ever done, and I see no reason to expect future versions of the C Standard to include such detail.

CodePudding user response：

It is recoverable, but it is usually a bad idea. For example Microsoft C compiler has option to turn segfaults into exceptions.

You can see the Microsoft SEH documentation, but even they do not suggest using it.

CodePudding user response：

Please explain why after a segmentation fault the program is in an undetermined state

I think this is your fundamental misunderstanding -- the SEGV does not cause the undetermined state, it is a symptom of it. So the problem is (generally) that the program is in an illegal, unrecoverable state WELL BEFORE the SIGSEGV occurs, and recovering from the SIGSEGV won't change that.

When exactly does segmentation fault happen (=when is SIGSEGV sent)?

The only standard way in which a SIGSEGV occurs is with the call raise(SIGSEGV);. If this is the source of a SIGSEGV, then it is obviously recoverable by using longjump. But this is a trivial case that never happens in reality. There are platform-specific ways of doing things that might result in well-defined SEGVs (eg, using mprotect on a POSIX system), and these SEGVs might be recoverable (but will likely require platform specific recovery). However, the danger of undefined-behavior related SEGV generally means that the signal handler will very carefully check the (platform dependent) information that comes along with the signal to make sure it is something that is expected.

Why is the process in undefined behavior state after that point?

It was (generally) in undefined behavior state before that point; it just wasn't noticed. That's the big problem with Undefined Behavior in both C and C -- there's no specific behavior assocaited with it, so it might not be noticed right away.

Why does this solution avoid that unrecoverable state? Does it even?

It does not, it just goes back to some earlier point, but doesn't do anything to undo or even identify the undefined behavior that cause the problem.

CodePudding user response：

So far, answers and comments have responded through the lens of a higher-level programming model, which fundamentally limits the creativity and potential of the programmer for their convenience. Said models define their own semantics and do not handle segmentation faults for their own reasons, whether simplicity, efficiency or anything else. From that perspective, a segfault is an unusual case that is indicative of programmer error, whether the userspace programmer or the programmer of the language's implementation. The question, however, is not about whether or not it's a good idea, nor is it asking for any of your thoughts on the matter.

In reality, what you say is correct: segmentation faults are recoverable. You can, as any regular signal, attach a handler for it with sigaction. And, yes, your program can most certainly be made in such a way that handling segmentation faults is a normal feature.

One obstacle is that a segmentation fault is a fault, not an exception, which is different in regards to where control flow returns to after the fault has been handled. Specifically, a fault handler returns to the same faulting instruction, which will continue to fault indefinitely. This isn't a real problem, though, as it can be skipped manually, you may return to a specified location, you may attempt to patch the faulting instruction into becoming correct or you may map said memory into existence if you trust the faulting code. With proper knowledge of the machine, nothing is stopping you, not even those spec-wielding knights.