Is there a speed difference between call gates and trap gates?

I understand the difference between call gates and trap gates. I am wondering if anyone knows of a significant difference in speed between the two gates.

Also is there a difference in speed if you leave the DwordCount field as 0 (meaning no parameters are copied from the outer stack to the inner stack)?

CodePudding user response:

In general, a call gate only accesses the GDT, while a trap gate accesses the IDT and then the GDT. This makes trap gates likely slower in practice regardless of CPU, due to worse locality (a higher chance of TLB misses and cache misses). Note that this is relatively important because these gates are typically used for kernel APIs, where you can assume the CPU spent enough time in user space for all of the kernel's data to become "not recently used" and be evicted from the caches and TLBs.

For specific speeds, ages ago I benchmarked both in a loop (with everything cached) on a Pentium 4 and found the trap gate was slower than a "call gate with no parameters" by about 15%, but I don't remember the exact costs in cycles. I haven't tested any other microarchitectures.

I would also expect that a non-zero parameter count increases the cost of a call gate (because the CPU has to copy the parameters from the outer stack to the inner stack), and that this may make it slower than a trap gate.
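To make the parameter count concrete, here is a minimal sketch of how a 32-bit call gate descriptor is laid out, with the count in the low 5 bits of byte 4. The names `call_gate32` and `make_call_gate` are made up for illustration and are not from any particular kernel.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit call gate descriptor as stored in the GDT or LDT (8 bytes). */
typedef struct {
    uint16_t offset_low;   /* bits 15..0 of the handler entry point */
    uint16_t selector;     /* code segment selector of the handler */
    uint8_t  param_count;  /* bits 4..0: dwords copied to the inner stack */
    uint8_t  access;       /* P (bit 7), DPL (bits 6..5), S = 0, type = 0xC */
    uint16_t offset_high;  /* bits 31..16 of the handler entry point */
} call_gate32;

/* Hypothetical helper: builds a present 32-bit call gate. */
call_gate32 make_call_gate(uint32_t offset, uint16_t selector,
                           unsigned dpl, unsigned dword_count)
{
    assert(dpl <= 3);           /* privilege levels are 0..3 */
    assert(dword_count <= 31);  /* the count field is only 5 bits wide */

    call_gate32 g;
    g.offset_low  = (uint16_t)(offset & 0xFFFFu);
    g.selector    = selector;
    g.param_count = (uint8_t)(dword_count & 0x1Fu);
    g.access      = (uint8_t)(0x80u | (dpl << 5) | 0x0Cu);
    g.offset_high = (uint16_t)(offset >> 16);
    return g;
}
```

With `dword_count` set to 0 nothing is copied between the stacks, which is the case that only works if all parameters are passed in registers.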

However, an "int n" software interrupt is 2 bytes of code, while the instruction that uses a call gate (a far call) is much larger: 7 bytes in 32-bit code (one opcode byte, a 4-byte offset and a 2-byte selector), and the offset isn't even used. Raw execution speed is not the only thing that affects performance; things like loading a program from disk, dynamic linking and instruction cache footprint also matter. With this in mind, the fastest strategy is likely to support both, and to use software interrupts in rarely used code ("executed once" initialization code, error handling) and call gates in frequently executed code.
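As a rough illustration of the size difference, the sketch below emits the raw bytes of both instructions. It assumes 32-bit protected mode code with no prefixes, and the function names `encode_int_n` and `encode_far_call32` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes "int n": opcode 0xCD followed by the vector. Returns bytes written. */
size_t encode_int_n(uint8_t *out, uint8_t vector)
{
    assert(out != NULL);
    out[0] = 0xCD;
    out[1] = vector;
    return 2;
}

/* Encodes a direct far call "call selector:offset" for 32-bit code:
 * opcode 0x9A, a 4-byte little-endian offset, then a 2-byte selector.
 * When the selector refers to a call gate the CPU ignores the offset. */
size_t encode_far_call32(uint8_t *out, uint16_t selector, uint32_t offset)
{
    assert(out != NULL);
    out[0] = 0x9A;
    out[1] = (uint8_t)(offset & 0xFFu);
    out[2] = (uint8_t)((offset >> 8) & 0xFFu);
    out[3] = (uint8_t)((offset >> 16) & 0xFFu);
    out[4] = (uint8_t)((offset >> 24) & 0xFFu);
    out[5] = (uint8_t)(selector & 0xFFu);
    out[6] = (uint8_t)(selector >> 8);
    return 7;
}
```

So every call site that goes through a call gate costs 5 more bytes than the equivalent "int n", which adds up in code that is executed rarely but present in every program.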

Note that typically (for kernel APIs) the caller places a function number in a register (EAX), and the call gate handler or trap handler does some preliminary work (setting up calling conventions, etc.) and then does a "call [table + eax*4]" to transfer control to the requested function (or to an "invalid function number" function). This means that (assuming parameters are passed in registers) it's trivial for a kernel to support both, and to use the same table of function pointers for both. Further, this same strategy can be extended to support SYSCALL and SYSENTER, which are faster (if supported by the CPU) and can be emulated in the invalid opcode exception handler (if not supported by the CPU).
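The shared table idea can be sketched in C roughly like this. Everything here is hypothetical (the function names, the two-argument signature, and the entry stubs, which in a real kernel would be small assembly routines that save state before calling the common dispatcher).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*kernel_fn)(uint32_t arg0, uint32_t arg1);

/* Hypothetical kernel API functions. */
int32_t sys_get_version(uint32_t a, uint32_t b) { (void)a; (void)b; return 1; }
int32_t sys_add(uint32_t a, uint32_t b) { return (int32_t)(a + b); }

/* Target for any function number that is out of range. */
int32_t sys_invalid(uint32_t a, uint32_t b) { (void)a; (void)b; return -1; }

/* One table of function pointers, shared by every entry path. */
static const kernel_fn kernel_table[] = { sys_get_version, sys_add };
#define KERNEL_FN_COUNT (sizeof kernel_table / sizeof kernel_table[0])

/* C equivalent of "call [table + eax*4]", plus the range check. */
int32_t kernel_dispatch(uint32_t function_number, uint32_t arg0, uint32_t arg1)
{
    kernel_fn fn = sys_invalid;
    if (function_number < KERNEL_FN_COUNT)
        fn = kernel_table[function_number];
    assert(fn != NULL);
    return fn(arg0, arg1);
}

/* Stand-ins for the two entry points; both end up in the same dispatcher. */
int32_t call_gate_entry(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    return kernel_dispatch(eax, ebx, ecx);
}

int32_t trap_gate_entry(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    return kernel_dispatch(eax, ebx, ecx);
}
```

Because the parameters arrive in registers either way, adding a SYSCALL or SYSENTER entry point would just be a third stub in front of the same `kernel_dispatch`.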

Finally, for kernel APIs, the absolute fastest way is to get rid of that "call [table + eax*4]" (which is likely to cause TLB misses, cache misses and branch mispredictions) by having a call gate (or trap gate) for each specific function. This can't work for SYSCALL or SYSENTER, and for trap gates it's limited by the number of usable IDT entries, so it really only makes sense for call gates.
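With one call gate per function, the "function number" effectively moves into the selector of the far call. A minimal sketch of that mapping, assuming the gates sit in consecutive GDT slots starting at a made-up index (`FIRST_GATE_INDEX` and `gate_selector_for` are hypothetical names):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: first GDT slot reserved for per-function call gates. */
enum { FIRST_GATE_INDEX = 16 };

/* A selector is (descriptor index << 3) | table indicator | RPL.
 * The table indicator bit (bit 2) is left at 0, which selects the GDT. */
uint16_t gate_selector_for(unsigned function_number, unsigned rpl)
{
    assert(rpl <= 3);

    unsigned index = FIRST_GATE_INDEX + function_number;
    assert(index < 8192);  /* a GDT holds at most 8192 descriptors */

    return (uint16_t)((index << 3) | rpl);
}
```

The caller then does a far call to that selector directly, and each gate's offset points straight at the function, so there is no table lookup or indirect branch on the kernel side.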