How to simplify algorithm of pasting addresses of data to immediate fields in machine code?-CodePudding

When developing my assembler I got one problem. In assembly language we can define data values (e. g. msg db 'Hi') and paste address of this data anywhere, even above this data. However when assembling code assembler don't know address of data until is not processed the line with those data.

Of course, assembler can remember addresses of machine code where we use address of defined data and after processing a code replace values in remembered addresses to address of our data. But if we define data in 2-byted address (e. g. at 01 F1) assembler would be to replace 1 bytes in remembered addresses to 2 bytes of address (01 F1) and so immediate field size will be changed (imm8 -> imm16) and assembler shall be rewrite instruction in same address (change bits w and s at opcode and maybe set prefix 0x66). If assembler will set prefix 0x66 and our data definition be after this instruction it shall be rewrite immediate field bytes (increment address value).

Illustration of this algoritm :

The following code:

mov dh, 09h
add dx, msg

;...

msg db 'Hello$'

will be assembled in the following principle:

preparing the code:

Comment :                       |===> Remember address of this byte (0x0004)
Comment :           |  ADD DX,MSG  |
Address : 0000 0001 |0002 0003 0004| ... 01F1 01F2 01F3 01F4 01F5 01F6
Code    :  B4   09  | 83   C2   00 | ...  48   65   6C   6C   6F   24
Comment :           ----------------      H    e    l    l    o    $

rewriting code in remebered addresses:

Comment :                         |=============|-This address (msg)
Comment :           |    ADD DX,01F1    |       v  
Address : 0000 0001 |0002 0003 0004 0005| ... 01F2 01F3 01F4 01F5 01F6 01F7
Code    :  B4   09  | 83   C2   F1   01 | ...  48   65   6C   6C   6F   24
Comment :           ---------------------      H    e    l    l    o    $

rewriting instruction's opcode 83h -> 81h (10000011b -> 10000001b: bit s=0):

Comment :                         |=============|-This address (msg)
Comment :           |    ADD DX,01F1    |       v  
Address : 0000 0001 |0002 0003 0004 0005| ... 01F2 01F3 01F4 01F5 01F6 01F7
Code    :  B4   09  | 81   C2   F1   01 | ...  48   65   6C   6C   6F   24
Comment :           ---------------------      H    e    l    l    o    $

write to immediate field the new address of data (0x01F2):

Comment :                         |=============|-This address (msg)
Comment :           |    ADD DX,01F2    |       v  
Address : 0000 0001 |0002 0003 0004 0005| ... 01F2 01F3 01F4 01F5 01F6 01F7
Code    :  B4   09  | 81   C2   F2   01 | ...  48   65   6C   6C   6F   24
Comment :           ---------------------      H    e    l    l    o    $

I think that this algorithm is difficult. Is it possible to simplify its?

CodePudding user response：

If the assembler isn't emitting a flat binary (i.e. also being a linker), the assembler has to assume the symbol address might be 2 bytes, because the final absolute address won't be known until link time, after the assembler is done. (So it will just leave space for a 2-byte address and a relocation for the linker to fill it in).

But if you are assembling directly into a flat binary and want to do this optimization, presumably you'd treat it like branch displacements with a start-small algorithm and do multi-pass optimization until everything fits. In fact you'd do this as part of the same passes that look at jmp/jcc rel8 vs. jmp/jcc rel16. (Why is the "start small" algorithm for branch displacement not optimal? - it is optimal if you don't have stuff like align 8, otherwise there are corner cases where it does ok but not optimal.)

These optimization passes are just looping over internal data-structures that represent the code, not actually writing final machine-code at each step. There's no need to calculate or look up the actual opcodes and ModRM encodings until after the last optimization pass.

You just need your optimizer to know rules for instruction size, e.g. that add reg, imm8 is 3 bytes, add reg, imm16 is 4 bytes (except for AX where add ax, imm16 has a 3-byte special encoding, same as add ax, imm8, so add to AX doesn't need to be part of the multi-pass optimization at all, it can just choose an encoding when we reach it after all symbol addresses are known.)

Note that it's much more common to use addresses as immediates for mov which doesn't allow narrow immediates at all (mov reg, imm16 is always 3 bytes). But this optimization is also relevant for disp8 vs. disp16 in addressing modes, e.g. for xor cl, [di msg] could use reg disp8 for small addresses so it's worth having this optimization.

So again, your optimizer passes would know that [di disp8] takes 1 byte after the ModRM, [di disp16] takes 2 extra.

And [msg] always takes 2 bytes after ModRM, there is no [disp8] encoding. The asm source would need to have a zeroed register if it wanted to take advantage of disp8 for small addresses.

Of course a simplistic or one-pass assembler could always just assume addresses are 16-bit and encode the other parts of the machine code accordingly, only going back to fill in numeric addresses once unresolved symbols are seen. (Or emit relocation info at the end for a linker to do it.)