How do i know when a function body ends in assembly-CodePudding

I want to know when a function body end in assemby, for example in c you have this brakets {} that tell you when the function body start and when it ends but how do i know this in assembly?

Is there a parser that can extract me all the functions from assembly and start line and endline of their body?

CodePudding user response：

I wan't to know when a function body ends in assembly, [...]

There are mainly four ways that the execution of a stream of (userspace) instructions can "end":

An unconditional jump like jmp or a conditional one like Jcc (je,jnz,jg ...)
A ret instruction (meaning the end of a subroutine) which probably comes closest to the intent of your question (including the ExitProcess "ret" command)
The call of another "function"
An exception. Not a C style exception, but rather an exception like "Invalid instruction" or "Division by 0" which terminates the user space program

[...] for example in c you have this brakets {} that tell you when the function body start and when it ends but how do i know this in assembly?

Simple answer: you don't. On the machine level every address can (theoretically) be an entry point to a "function". So there is no unique entry point to a "function" other than defined - and you can define anything.

On a tangent, this relates to self-modifying code and viruses, but it must not. The exit/end is as described in the first part above.

Is there a parser that can extract me all the functions from assembly and start line and endline of their body?

Disassemblers create some kind of "functions" with entry and exit points. But they are merely assumed. No way to know if that assumption is correct. This may cause problems.

The usual approach is using a disassembler and the work to recombinate the stream of instructions to different "functions" remains to the person that mandated this task (vulgo: you). Some tools exist that claim to simplify this, but I cannot judge their efficacy.

From the perspective of a high level language, there are decompilers that try to reverse the transformation from (for example) C to assembly/machine code that try to automatize that task and will work more or less or in some cases.

CodePudding user response：

There's no foolproof way, and there might not even be a well-defined correct answer in hand-written asm.

Usually (e.g. in compiler-generated code) you know a function ends when you see the next global symbol, like objdump does to decide when to print a new "banner". But without all function-start symbols being visible, there's no unambigious way. That's why some object file formats have room for size metadata associated with a symbol. Like .size foo, . - foo in GAS syntax.

It's not as easy as looking for a ret; some functions end with a jmp tail-call to another function. And some call a noreturn function like abort or __stack_chk_fail (not tailcall because they want to push a return address for a backtrace.) Or just fall off into whatever's next because that path had undefined behaviour in the source so the compiler assumed it wasn't reachable and stopped generating instructions for it, e.g. a C non-void function where execution can/does fall off the end without a return.

In general, assembly can blur the lines of what a function is.
Asm has features you can use to implement the high-level concept of a function, but you're not restricted to that.

e.g. multiple asm functions could all return by jumping to a common block of code that pops some registers before a ret. Is that shared tail a separate function that's called with a tail-called with a special calling convention?

Compilers don't usually do that, but humans could.

As for function entry points, usually some other code somewhere in the program will contain a call to it. But not necessarily; it might only be reachable via a table of function pointers, and you don't know that a block of .rodata holds function pointers until you find some code loading from it and calling or jumping.

But that doesn't work if the lowest-address instruction of the function isn't its entry point. See Does a function with instructions before the entry-point label cause problems for anything (linking)? for an example

Compilers don't generate code like that, but humans can. (It's a handy trick sometimes for https://codegolf.stackexchange.com/ questions.)

Or in the general case, a function might have multiple entry points. Or you could describe that as multiple functions with overlapping implementations. Sometimes it's as simple as one tailcalling another by falling into it without needing a jmp, i.e. it starts a few instructions before another.