The code below takes 20 bytes. Yet, there’s a way to make it even smaller through interrupts. How?
A
MOV AH,9
MOV DX,108
INT 21
RET
DB 'HELLO WORLD$'
R CX
14
N MYHELLO.COM
W
CodePudding user response:
Print a shorter message, like db 'hi$' :P
Or, as Vitsoft suggests, take the string as an argument like the Unix echo command, so it doesn't take up space in your program.
Or depend on values that some DOS versions leave in registers when your program starts, if you don't care about portability or about relying only on documented guarantees.
Almost certainly int 21h / ah=9 is the most compact way to print multiple bytes of text. You need to get AH=9 and DX=pointer somehow. Without relying on existing bytes in convenient places in registers or memory that some DOS version might happen to leave lying around, that takes a 2-byte mov ah,9 and a 3-byte mov dx, imm16.
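To make the byte accounting concrete, here is a rough sketch that hand-assembles the 20-byte baseline from the question. The opcode bytes are the standard 16-bit encodings; the variable names are mine, just for illustration.

```python
# Hand-assembled bytes for the 20-byte baseline (16-bit real-mode encodings).
ORG = 0x100  # a .COM image is loaded at offset 100h of its segment

code = bytes([
    0xB4, 0x09,        # mov ah, 9       (2 bytes)
    0xBA, 0x08, 0x01,  # mov dx, 0108h   (3 bytes, imm16 is little-endian)
    0xCD, 0x21,        # int 21h         (2 bytes)
    0xC3,              # ret             (1 byte)
])
text = b"HELLO WORLD$"
image = code + text

# The string starts right after the code, which is where DX has to point.
string_offset = ORG + len(code)

print(len(code), len(text), len(image))  # 8 12 20
print(hex(string_offset))                # 0x108
```

The total of 20 is the 14 (hex) that the Debug script loads into CX before writing the file.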
You can set DX=0 with xor dx,dx, but even the very start of your file is at offset 100h in a .com program. (And that would mean letting the ASCII text execute as machine code without a jmp over it!)
call label / db "text" / label: pop dx would be 4 bytes total to get the pointer into DX (a 3-byte call rel16 plus a 1-byte pop dx), so it's one byte worse than mov dx, imm16.
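A quick sketch of why the call / pop dx trick yields the right pointer: call pushes the address of the byte after itself, which is exactly where the text sits. The layout below is only an illustration of the pointer setup (it leaves out the mov ah,9 / int 21h part), and the names are made up for this example.

```python
ORG = 0x100  # load offset of a .COM image
text = b"HELLO WORLD$"

# call rel16: E8 followed by a 16-bit displacement measured from the end of the call.
rel16 = len(text)                                    # jump over the text to the label
call = bytes([0xE8]) + rel16.to_bytes(2, "little")   # call label
pop_dx = bytes([0x5A])                               # label: pop dx  (58h + 2 for DX)
image = call + text + pop_dx

return_address = ORG + len(call)        # what call pushes, and what pop dx retrieves
text_address = ORG + image.index(text)  # where the string actually lives
pointer_cost = len(call) + len(pop_dx)  # bytes spent on getting DX = pointer

print(hex(return_address), hex(text_address), pointer_cost)  # 0x103 0x103 4
```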
Use uninitialized register values left by some known DOS version.
http://www.fysnet.net/yourhelp.htm, linked from Tips for golfing in x86/x64 machine code on codegolf.SE, lists the startup register values across an array of DOS versions. This is not standardized, AFAIK, so it only happens to work. Later versions of FreeDOS became more and more similar to MS-DOS, presumably because some existing software relied on those values (on purpose, by accident, or because some people didn't know that "works on my machine" isn't the same thing as "guaranteed future-proof and portable"), but various other DOS versions differ. This is not something you should rely on for production use, only for silly computer tricks like code golf or "demo scene" programs.
Most DOS versions happen to leave SI=0100h at program startup. So if we can put our string there without messing up the machine (or SI), we can mov dx, si (2 bytes) instead of mov dx, 108h or 107h (3 bytes). But lea dx, [si+8] is 3 bytes (opcode + modrm + disp8), so no saving unless we let the string execute.
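To double-check those sizes, here is a small hand-rolled encoder for just the three instruction forms being compared. It's a sketch covering only these forms with 16-bit operands, not a general assembler, and the helper names are my own.

```python
# 16-bit register numbers as used in the reg and r/m fields.
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}
# r/m field values for 16-bit memory addressing modes (with mod = 01 or 10).
RM16_MEM = {"bx+si": 0, "bx+di": 1, "bp+si": 2, "bp+di": 3,
            "si": 4, "di": 5, "bp": 6, "bx": 7}


def modrm(mod, reg, rm):
    """Pack the three ModRM fields into one byte."""
    return (mod << 6) | (reg << 3) | rm


def mov_r16_r16(dst, src):
    """89 /r: mov r/m16, r16 with a register destination (mod = 11)."""
    return bytes([0x89, modrm(0b11, REG16[src], REG16[dst])])


def lea_r16_base_disp8(dst, base, disp):
    """8D /r: lea r16, [base + disp8] (mod = 01)."""
    return bytes([0x8D, modrm(0b01, REG16[dst], RM16_MEM[base]), disp & 0xFF])


def mov_r16_imm16(dst, imm):
    """B8+rw iw: mov r16, imm16."""
    return bytes([0xB8 + REG16[dst]]) + imm.to_bytes(2, "little")


for name, enc in [("mov dx, si", mov_r16_r16("dx", "si")),
                  ("lea dx, [si+8]", lea_r16_base_disp8("dx", "si", 8)),
                  ("mov dx, 108h", mov_r16_imm16("dx", 0x108))]:
    print(f"{name:16} {enc.hex(' ').upper():10} {len(enc)} bytes")
```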
Or even better, is there something on the stack you could use for pop dx? Or popa, if you're extremely lucky also setting AX=09xx. But I don't know if any DOS versions happen to leave any known stuff on the stack, other than the "return" address which points at an int 20h instruction or something. Popping that would mean exiting manually with int 20h instead of ret, costing 1 more byte.
Actually, xchg ax, reg is only 1 byte, so if any register starts with 09xx, we can use that. MS-DOS 4.0 and later, FreeDOS 1.0, and IBM PC-DOS 4.0 and later, all start with BP=09xxh. So we can save a byte in AH init by using xchg ax, bp, separate from anything with DX. (Fun fact: This is where the 90h NOP encoding comes from: it's just a special case of xchg ax,ax, until x86-64 had to document it as an actual NOP because in 64-bit mode it doesn't zero-extend EAX into RAX like xchg eax,eax would).
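The one-byte xchg ax, reg encodings are just 90h plus the register number, which is easy to tabulate. A minimal sketch (the function name is mine):

```python
# Register numbers 0..7 in encoding order.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def xchg_ax(reg):
    """One-byte xchg ax, r16: opcode 90h + register number."""
    return bytes([0x90 + REG16.index(reg)])


for reg in REG16:
    print(f"xchg ax, {reg}: {xchg_ax(reg).hex().upper()}")
```

So xchg ax, bp is the single byte 95, and xchg ax, ax lands on 90, the NOP encoding.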
Letting the text execute as code
The only hope I see for saving any more bytes is text that happens to decode as instructions which let execution come out the other side without messing up SI, so you can put the text in the path of execution. But at best you're saving 1 byte, unless the text also contains instructions that do something useful.
But your message doesn't work for that. I checked how 3 capitalizations of it would disassemble, using nasm to make a flat binary and ndisasm -b16 to disassemble the result. (I used align 16 so I could find the boundary, and to give a sort of nop slide: if the last byte of the string was not the last byte of an instruction, it would consume some of that padding instead of changing the decoding of the next string.) I don't have DOS or a debug.exe, so I'm using trailing-h syntax on hex numbers. In DOS Debug, all numbers are implicitly hex, which is why int 21 is the right number there. I also haven't tested these; I'm not that interested in obsolete 16-bit stuff, but the x86 machine code shenanigans are fun. Although true code-golf challenge questions are off-topic on Stack Overflow, this kind of single-language optimization question is a better fit here than on https://codegolf.stackexchange.com/
; just to look at disassembly, to see if there's any hope of letting them execute
DB 'HELLO WORLD$'
align 16
db 'Hello World$'
align 16
db 'hello world$'
;; DB 'HELLO WORLD$'
00000000 48 dec ax ; early-alphabet upper-case
00000001 45 inc bp ; is all single-byte inc/dec
00000002 4C dec sp ; the same opcodes x86-64 repurposed as REX
00000003 4C dec sp ;; Modified SP breaks RET
00000004 4F dec di
00000005 20574F and [bx+0x4f],dl ;; step on part of the PSP
00000008 52 push dx ; 'R' also modifies SP
00000009 4C dec sp ; 'L'
0000000A 44 inc sp ; 'D' cancel each other's effect on SP
0000000B 2490 and al,0x90
0000000D 90 nop
0000000E 90 nop
0000000F 90 nop
Nifty. So it doesn't actually do anything fatal to the machine (depending on where BX is pointing). With BX=0 from the DOS versions that give the BP and SI values we want, that would mask away some bits in [ds:4fh], which is in the reserved part of the PSP (program segment prefix). This may be fine if nothing else ever looks there before we exit, or during the DOS exit call.
But note the and al, 0x90 ending: the string itself ended with 24h, aka '$', as the start of an instruction. That's the opcode for and al, imm8, so it consumes 1 byte of whatever's next as part of that instruction.
So you'd need a byte of padding after it before you could put the start of a useful instruction. That would kill the 1-byte size saving.
And it messes up SP, so we can't ret anymore. We'd need int 20h to exit, unless you can bail out with CC int3 or something. Not sure what DOS does on that exception.
;; 20 bytes, cancelling out saving from xchg ax,bp
DB 'HELLO WORLD$' ; executes as machine code without doing anything too bad
nop ; but this is needed. It's actually consumed as an immediate for 24h
mov dx, si
xchg ax, bp ; AH=09 on some DOS versions, in 1 byte instead of 2.
int 21h
int 20h ; larger than ret, making this a net loss.
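Tallying the bytes of that attempt shows why it's no improvement. This is a sketch with the instructions hand-encoded; it only counts bytes and doesn't try to run anything.

```python
text = b"HELLO WORLD$"
tail = bytes([
    0x90,        # nop: swallowed as the imm8 of the 24h ('$') and al, imm8
    0x89, 0xF2,  # mov dx, si
    0x95,        # xchg ax, bp
    0xCD, 0x21,  # int 21h
    0xCD, 0x20,  # int 20h (2 bytes, where ret would have been 1)
])
image = text + tail

print(len(text), len(tail), len(image))    # 12 8 20
print(image[11:13].hex(" ").upper())       # 24 90 -> decodes as and al, 0x90
```

The byte saved by xchg ax, bp and the byte saved by mov dx, si are both spent again on the padding nop and the longer exit, so it's back to 20.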
Other capitalizations are a problem: lower-case 'l' is 6C insb, an I/O instruction (https://www.felixcloutier.com/x86/ins:insb:insw:insd), and similarly 'o' is 6F outsw.
;; db 'Hello World$'
00000010 48 dec ax
00000011 656C gs insb ; big problem, IO could crash the machine
00000013 6C insb ; port=DX, stores the byte to [ES:DI]
00000014 6F outsw ; port=DX, data from [DS:SI]
00000015 20576F and [bx+0x6f],dl
00000018 726C jc 0x86 ; conditional branch, but AND always clears CF so this will be not-taken
0000001A 642490 fs and al,0x90 ; FS prefix was new with 386
0000001D 90 nop ; the align 16 padding, including previous immediate
0000001E 90 nop
0000001F 90 nop
;; db 'hello world$'
00000020 68656C push word 0x6c65
00000023 6C insb
00000024 6F outsw
00000025 20776F and [bx+0x6f],dh
00000028 726C jc 0x96
0000002A 64 fs
0000002B 24 db 0x24
Again the 24h byte is left dangling, as the start of an instruction. If there'd been a nop after it, ndisasm would have decoded it as fs and al, 0x90 like in the previous block.
Looks like ello is a problem, with IO instructions.
We need the 2nd-last byte of the string to be something else, like the start of a 2-byte instruction, ideally something like 3C ib cmp al, imm8. That's ASCII <.
And we need it not to mess up SP. If it decrements it, we need to increment or pop into dummy registers, so it's once again pointing at the return address.
18 byte version, printing a modified string of same length
;; 18 bytes
DB 'HELLO_WOLD<$' ; executes as machine code, returning SP to original position without overwriting return address
mov dx, si ; mov dx,0100h MS-DOS (all versions), FreeDOS 1.0, many other DOSes
xchg ax, bp ; mov ah,9 MS-DOS 4.0 and later, and FreeDOS 1.0
int 21h
ret
Disassembles as
00000000 48 dec ax
00000001 45 inc bp ; affects BP, which we want to use later
00000002 4C dec sp
00000003 4C dec sp ; SP offset by -2
00000004 4F dec di
00000005 5F pop di ; restore SP
00000006 57 push di ; SP offset by -2
00000007 4F dec di
00000008 49 dec cx
00000009 59 pop cx ; restore SP
0000000A 3C24 cmp al,0x24 ; '<' consumes the '$' as an imm8
0000000C 89F2 mov dx,si ; instructions from the source, as written.
0000000E 95 xchg ax,bp
0000000F CD21 int 0x21
00000011 C3 ret
This does inc bp, modifying one of the registers we're relying on for an initial value. But unless the low byte was FF to start with, it won't wrap and change the 09 in the high half. On FreeDOS 1.0 specifically, the initial BP value is 091Eh. On MS-DOS versions from Win9x, it's 0912h. On DOS from Win-NT derived versions, it's 09xxh, which doesn't rule out 09FFh.
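The carry concern is simple arithmetic, sketched below for the initial BP values mentioned above plus the worst case. The function name is made up for the example.

```python
def ah_after_inc_bp(initial_bp):
    """AH after 'inc bp' followed by 'xchg ax, bp': the high byte of BP + 1."""
    bp = (initial_bp + 1) & 0xFFFF  # inc bp wraps at 16 bits
    return bp >> 8


for bp in (0x091E, 0x0912, 0x09FF):
    print(f"BP={bp:04X}h -> AH={ah_after_inc_bp(bp):02X}h")
```

Only a starting value of 09FFh carries into the high byte and turns AH into 0Ah, which would select a different DOS function.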
I had to mangle the string pretty seriously to balance the stack, with an even number of dec sp instructions and pops to balance that. The 1-byte pop reg encodings (58+rw) include some of the late upper-case alphabet letters (X, Y, Z).
Also had to avoid add [si], sp or things like that, since the initial SI points at our string. (The initial BX typically doesn't.)
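HELLO_WOLD< has one too many pushes: the two dec sp plus push di lower SP by 4, while the single pop di raises it by only 2. The 'LD' part just cancels itself out (dec sp / inc sp), so that spelling would leave SP 2 below the return address and break ret. The listing above is balanced because its bytes 49 59 ('IY', dec cx / pop cx) sit where 'LD' would be, so the string that actually assembles to it is 'HELLO_WOIY<$'. When you do pair dec sp with inc sp, keep them in that order so you don't temporarily leave part of your return address below SP, where an interrupt or debugger could clobber it.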
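Checking stack balance by hand is error-prone, so here is a rough sketch that tracks only the net change to SP for a candidate string. It understands just the one-byte inc/dec/push/pop opcodes (40h to 5Fh) and the two AL-with-imm8 opcodes that show up here (24h, 3Ch), and it ignores everything else a real CPU would do. The function name is hypothetical.

```python
def net_sp_change(text: bytes) -> int:
    """Net SP change after executing text, for the small opcode subset handled here."""
    sp = 0
    i = 0
    while i < len(text):
        op = text[i]
        if 0x40 <= op <= 0x47:        # inc r16
            if op == 0x44:            # inc sp
                sp += 1
        elif 0x48 <= op <= 0x4F:      # dec r16
            if op == 0x4C:            # dec sp
                sp -= 1
        elif 0x50 <= op <= 0x57:      # push r16
            sp -= 2
        elif 0x58 <= op <= 0x5F:      # pop r16
            if op == 0x5C:            # pop sp loads SP from memory: can't track it
                raise ValueError("pop sp is not modelled")
            sp += 2
        elif op in (0x24, 0x3C):      # and al, imm8 / cmp al, imm8
            i += 1                    # skip the immediate byte
        else:
            raise ValueError(f"opcode {op:02X}h is not modelled")
        i += 1
    return sp


print(net_sp_change(b"HELLO_WOLD<$"))  # -2: ret would pop the wrong word
print(net_sp_change(b"HELLO_WOIY<$"))  # 0: matches the bytes in the listing above
```

With SP back at its starting value, ret still finds the return address that DOS left on the stack.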
If you really wanted to get serious about coming up with a string that visually looked more like HELLO WORLD, you'd want to make a table of ASCII characters and the corresponding instructions. Many upper-case ASCII characters are opcodes for single-byte instructions, either inc/dec or push/pop.
You could use a good assembler like NASM with a %rep / %assign i i+1 / db i / %endrep block, and run the output through a disassembler. Or write a program to output a binary file and disassemble that.
Or look at http://ref.x86asm.net/coder32.html and match it up with https://asciitable.com/
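For the single-byte range you don't even need a disassembler, since 40h to 5Fh is four groups of eight registers. A minimal sketch that builds that part of the table (it deliberately covers only those 32 characters, and the names are mine):

```python
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
GROUPS = {0x40: "inc", 0x48: "dec", 0x50: "push", 0x58: "pop"}


def single_byte_table():
    """Map each ASCII character in 40h..5Fh to its 16-bit-mode instruction."""
    table = {}
    for base, mnemonic in GROUPS.items():
        for number, reg in enumerate(REG16):
            table[chr(base + number)] = f"{mnemonic} {reg}"
    return table


table = single_byte_table()
for ch in sorted(table):
    print(f"{ord(ch):02X}  {ch!r:5} {table[ch]}")
```

Everything outside this range (space, lower case, punctuation such as '$' and '<') needs a real opcode map, because those bytes start multi-byte instructions or are prefixes.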
Can we use the text as machine code to at least exit the program? Unlikely; ret is c3, int 20h is CD 20, so neither of those opcodes will appear in ASCII text.
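The reason is just that 7-bit ASCII stops at 7Fh, while both exit opcodes have the high bit set. As a trivial sketch:

```python
def is_ascii(encoding: bytes) -> bool:
    """True if every byte fits in 7-bit ASCII (00h..7Fh)."""
    return all(b < 0x80 for b in encoding)


RET = bytes([0xC3])          # ret
INT_20H = bytes([0xCD, 0x20])  # int 20h

print(is_ascii(RET), is_ascii(INT_20H), is_ascii(b"HELLO WORLD$"))  # False False True
```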
AFAIK, you can't tail-call a DOS routine to have it print and then exit without needing a ret or equivalent in your own code. Or if you can, it would be a 3-byte jmp rel16, or more likely a far jmp, which would take more bytes than the 2+2 for mov ah, 9 / int 21h if we're talking about the jmp ptr16:16 form.
CodePudding user response:
You can shorten it to 8 bytes if you don't mind providing the text as its argument:
R:\>debug
-A
0DB2:0100 MOV AH,9
0DB2:0102 MOV DX,82
0DB2:0105 INT 21
0DB2:0107 RET
0DB2:0108
-R CX
CX 0000 :8
-N HELLO8.COM
-W
Writing 0008 bytes
-Q
R:\>HELLO8.COM HELLO WORLD$
HELLO WORLD
R:\>
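As a rough illustration of why DX=82 works, the sketch below models the command tail in the PSP, assuming the usual layout: a length byte at offset 80h, then the tail text starting with the separating space at 81h, terminated by a carriage return. The helper name is made up, and the 8 program bytes are hand-encoded.

```python
program = bytes([
    0xB4, 0x09,        # mov ah, 9
    0xBA, 0x82, 0x00,  # mov dx, 0082h
    0xCD, 0x21,        # int 21h
    0xC3,              # ret
])


def command_tail(args: str) -> bytes:
    """Model of PSP bytes from offset 80h: length byte, tail text, then CR."""
    tail = (" " + args).encode("ascii")  # DOS keeps the space after the program name
    return bytes([len(tail)]) + tail + b"\r"


psp_from_80h = command_tail("HELLO WORLD$")
start = 0x82 - 0x80                                   # DX=82h skips the length and the space
printed = psp_from_80h[start:psp_from_80h.index(b"$")]  # AH=9 prints up to the '$'

print(len(program), printed.decode("ascii"))  # 8 HELLO WORLD
```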