Extracting instructions from an ELF file

For a program I'm working on, I need to extract the instructions from an ELF binary compiled for the RISC-V architecture. This is how I'm trying to extract them:

void dumpCode(FILE *file, Elf32_Phdr *segm, Elf32_Ehdr *header)
{
    char *fileptr;
    struct stat statbuf;
    int *opcode_ptr;
    unsigned int i, vaddr, offset;

    int fd = fileno(file);
    if (fstat(fd, &statbuf)) {
        fprintf(stderr, "[-] Error while stating the file!\n");
        goto fail;
    }

    fileptr = (char *)mmap(0, statbuf.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    if (MAP_FAILED == fileptr) {
        fprintf(stderr, "[-] Error mapping the file!\n");
        goto fail;
    }

    offset = (0 == segm->p_offset ? header->e_ehsize + header->e_phnum * header->e_phentsize : segm->p_offset); // Mark 1

    opcode_ptr = (int *)(fileptr + offset);

    vaddr = (0 == segm->p_offset ? segm->p_vaddr + header->e_ehsize + header->e_phnum * header->e_phentsize : segm->p_vaddr); // Mark 2

    for (i = 0; i < segm->p_filesz / 4; i++, vaddr += 4) { // Mark 3
        unsigned char *opcode = getOpcode(*opcode_ptr++);
        if (1 == disas(opcode, vaddr)) {
            free(opcode);
            break;
        }
        free(opcode);
    }

    munmap(fileptr, statbuf.st_size);
fail:
    close(fd);
}

To test my function, first I wrote a simple assembly program:

.global _start

_start:
    addi a0, x0, 1
    la a1, str
    addi a2, x0, 6
    addi a7, x0, 64
    ecall

    addi a0, x0, 0
    addi a7, x0, 93
    ecall

.data
    str: .ascii "Hello\n"

For the second test file I wrote a different program, this time in C:

#include <stdio.h>
#include <math.h>

int main(void)
{
    printf("%.5f", sqrt(2.0));
    return 0;
}

The first test file was assembled and linked using:

riscv32-linux-gnu-as -o test1.o test1.s
riscv32-linux-gnu-ld -o test1 test1.o

The second test file was compiled directly with gcc:

riscv32-linux-gnu-gcc -o test2 test2.c -lm

Returning to the dumpCode function, I've marked three lines.

The first marked line computes the offset at which the segment is placed inside the file. When p_offset is 0, if I'm not wrong, I need to add (header->e_ehsize + header->e_phnum * header->e_phentsize) bytes in order to start dumping at the right place. This approach works with my test1 example, but not with the second test file. The second mark follows the same approach to compute the right virtual address.
The third mark is the for loop I use to iterate over the instructions. Using (segm->p_filesz / 4) as the instruction count gives me many more instructions than are actually present in the binary. As far as I know, the extra data is padding, but I would like to know whether I can stop processing when I reach the padding.

How can I calculate the right number of bytes to process from the ELF file? If that number includes padding, can I ignore the padding somehow?

Answer:

offset = (0 == segm->p_offset
    ? header->e_ehsize + header->e_phnum * header->e_phentsize
    : segm->p_offset); // Mark 1

This is wrong. The documentation is very specific:

  p_offset
         This member holds the offset from the beginning of the
         file at which the first byte of the segment resides.

The data starts at segm->p_offset. No ifs or buts.

If you are seeing 0, I suspect it is either because the segment's type (p_type) is not PT_LOAD, in which case it may have no contents in the file at all (and a file offset is meaningless), or because the segment is supposed to contain the ELF header (so offset 0 isn't wrong).
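To make the point about p_offset concrete, here is a minimal sketch that takes the program header at face value and only checks that the range fits inside the file. The helper name segment_file_range and its return convention are hypothetical, not part of the question's code; it assumes a Linux <elf.h> and a file that has already been mapped or read, with file_size being its length in bytes.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* Hypothetical helper: locate the bytes of one segment inside the file.
 * Returns 0 and fills *start and *len on success, or -1 if the segment
 * is not loadable, has no file contents, or lies outside the file. */
static int segment_file_range(const Elf32_Phdr *ph, size_t file_size,
                              size_t *start, size_t *len)
{
    assert(ph != NULL && start != NULL && len != NULL);

    if (ph->p_type != PT_LOAD)   /* p_type is the segment type, not a flag */
        return -1;
    if (ph->p_filesz == 0)       /* nothing is stored in the file */
        return -1;
    if (ph->p_offset > file_size || ph->p_filesz > file_size - ph->p_offset)
        return -1;               /* header points past the end of the file */

    *start = ph->p_offset;       /* used as-is, even when it is 0 */
    *len = ph->p_filesz;
    return 0;
}
```

With this, a segment whose p_offset is 0 simply starts at the ELF header, and its virtual address p_vaddr corresponds to that same first byte, so neither value needs adjusting.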

The CPU makes no distinction between instructions and non-instructions. Any 4 bytes could be fetched as an instruction; even 00000000 is a bit pattern the CPU will try to decode (on RISC-V it is defined as an illegal instruction). An instruction is whatever the program counter points to. You could try to work out every address the program counter can reach, but in general that is equivalent to the halting problem and therefore undecidable.
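As a small illustration of why bytes alone cannot tell you what is code, the sketch below pulls the opcode and rd fields out of any 4-byte word. The helper names read_le32, rv_opcode and rv_rd are made up for this example; the field positions are those of the standard 32-bit RISC-V instruction formats (opcode in bits 6..0, rd in bits 11..7).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one little-endian 32-bit word without assuming alignment. */
static uint32_t read_le32(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Major opcode: the low 7 bits of a 32-bit instruction word. */
static unsigned rv_opcode(uint32_t word)
{
    return word & 0x7f;
}

/* Destination register: bits 11..7 of the instruction word. */
static unsigned rv_rd(uint32_t word)
{
    return (word >> 7) & 0x1f;
}
```

Feeding it the bytes 13 05 10 00 (addi a0, x0, 1) gives opcode 0x13 and rd 10, but feeding it the first four bytes of the "Hello\n" string works just as mechanically and yields an opcode field of 0x48. Nothing in the bytes themselves marks the second word as data.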

There may be debug information or symbols that say which part of the file is padding, but since the CPU doesn't care, neither does the main part of the ELF file.