Are IA-32 segment descriptors that do not cover the full 4GiB linear address space slower?


I am wondering if using segment descriptors that do not cover the whole linear address space is slower than using ones that do.

I am hoping that there is no difference in speed.

CodePudding user response:

Technically no: having a smaller limit isn't slower AFAIK.

But having a non-zero base increases load-use latency by 1 cycle on modern mainstream CPUs like Skylake. (i.e. an extra cycle for address generation.) CPU hardware has an optimistic fast path to skip that addition in the normal case of it being zero, otherwise it's part of the critical path for load latency.

That will be most noticeable in pointer-chasing scenarios like linked lists and (binary) trees, and in other cases where a load address being ready is part of the critical-path dependency chain for the data that depends on that load. Otherwise out-of-order exec can mostly hide it.

Load-use latency for GP-integer regs on modern Intel CPUs is 5 cycles (with a zero segment base) for an L1d hit, so an extra cycle is significant for workloads sensitive to that. (Or 4 cycles in pointer-chasing cases, when the base reg itself came from a load, on some Intel CPUs: see Is there a penalty when base+offset is in a different page than the base? for details about their optimistic TLB lookup using the base reg instead of waiting for the proper AGU result. But I think they dropped that for Ice Lake, so it's always 5 cycles.)
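
The usual way to see this directly is a pointer-chasing microbenchmark where each load's address comes from the previous load. Here's a minimal sketch of just the timing loops (NASM, 32-bit); it assumes you've already built a data descriptor with a non-zero base and loaded it into FS (e.g. via modify_ldt on Linux), and that edi starts out pointing at a dword that contains its own offset so each chain just spins in place. Iteration counts and labels are made up for illustration.

        mov     ecx, 100000000
    chase_ds:                        ; baseline: DS with a zero base
        mov     edi, [edi]           ; each load's address comes from the
                                     ; previous load, so latency dominates
        dec     ecx
        jnz     chase_ds

        mov     ecx, 100000000
    chase_fs:                        ; same chain, but through FS with a
        mov     edi, [fs:edi]        ; non-zero base: expect ~1 extra
        dec     ecx                  ; cycle per iteration
        jnz     chase_fs

Time each loop with rdtsc or perf stat; on CPUs with the zero-base fast path, the FS loop should settle at roughly one extra cycle per iteration.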

Non-zero segment bases are still used for thread-local storage, so CPUs do still support them without disastrous performance penalties. (e.g. not trapping to a microcode assist like sub-normal FP values in some cases.) But it does cost some performance.
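
For concreteness, this is roughly what TLS access looks like in compiler output: loads through GS (32-bit Linux) or FS (x86-64 Linux) at small fixed offsets, each paying the non-zero-base cost. The exact offsets below are just illustrative:

        mov     edx, [gs:0]          ; many ABIs keep a pointer to the
                                     ; thread block itself at offset 0
        mov     eax, [gs:0x14]       ; e.g. glibc's stack-protector canary
                                     ; on 32-bit Linux sits at a fixed
                                     ; gs-relative offset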


See Agner Fog's microarch guide (https://agner.org/optimize/), and other links in https://stackoverflow.com/tags/x86/info. Also I think Intel's optimization manual mentions this for Intel CPUs; I haven't checked AMD's.

For example for AMD K8/K10, Agner writes:

AMD manuals say that the branch misprediction penalty is 10 clock cycles if the code segment base is zero and 12 clocks if the code segment base is nonzero. In my measurements, I have found a minimum branch misprediction penalty of 12 and 13 clock cycles, respectively.

And re: data load/stores on K8/K10:

The time it takes to calculate an address and read from that address in the level-1 cache is 3 clock cycles if the segment base is zero and 4 clock cycles if the segment base is nonzero, according to my measurements. Modern operating systems use paging rather than segmentation to organize memory. You can therefore assume that the segment base is zero in 32-bit and 64-bit operating systems (except for the thread information block which is accessed through FS or GS). The segment base is almost always nonzero in 16-bit systems in protected mode as well as real mode.

I don't actually see a mention of the extra latency on Intel CPUs in Agner's guide, since it's irrelevant for most modern uses. So check Intel's optimization guide.


If you're using a segmented memory model, you'll probably end up using segment-override prefixes occasionally; some CPUs have limits on the number of prefixes on a single instruction that they can decode efficiently. pmovzxbd xmm0, [es: eax] will use 3 prefixes (2 mandatory as part of an SSE4.1 instruction, plus ES) and an escape byte, which would be a problem for early Silvermont-family CPUs (but not, I think, for later low-power cores). In 32-bit mode you don't have REX prefixes, which helps stay under that limit.
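
For reference, here's how the bytes stack up for that example (NASM syntax; the breakdown is shown just to make the prefix/escape count concrete):

        pmovzxbd xmm0, [es:eax]      ; encodes as: 26 66 0F 38 31 00
        ; 26        ES segment-override prefix
        ; 66 0F 38  mandatory-prefix / escape bytes of the SSE4.1 encoding
        ; 31        pmovzxbd opcode
        ; 00        ModRM: xmm0, [eax]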

Agner Fog's microarch guide says there's no special penalty for decoding a segment override on most CPUs.

Also, Intel CPUs (Core 2, Nehalem, and Sandybridge-family at least) do seem to rename segment registers, so modifying one with something like mov es, eax doesn't have to serialize out-of-order exec. But it's still not cheap: about 10 fused-domain uops for the front-end on Skylake, with one per 18 cycles throughput (though it can pipeline with writes to other segment regs). See Is a mov to a segmentation register slower than a mov to a general purpose register? for more details and test results.
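
If you want to check that cost on your own CPU, a back-to-back segment-write loop timed with rdtsc or perf stat is enough. A minimal sketch, assuming it's fine to reload ES with the selector it already holds (architecturally it is, though the CPU still does the full descriptor load):

        mov     ax, es               ; reuse the current selector so every
                                     ; write is guaranteed to be valid
        mov     ecx, 10000000
    seg_write:
        mov     es, ax               ; the cited ~10 uops / ~18c throughput
        mov     es, ax               ; is per write on Skylake
        dec     ecx
        jnz     seg_write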
