I have replaced my old home server (i3-6100, 51W TDP, 3.7GHz, SSE4.1, SSE4.2, AVX2) with a thin client (Celeron J4105, 10W TDP, 1.5/2.5GHz turbo, SSE4.2).
Can Apache make use of CPU AVX instructions?
CodePudding user response:
Glibc automatically uses AVX/AVX2 if available for memcpy, memcmp, strlen, and stuff like that, which is nice for small to medium-length strings hot in L1d or L2 cache. (e.g. maybe twice as fast for strings of 100B to 128KiB). For shorter strings, startup and cleanup overhead are a significant fraction. Hopefully apache doesn't spend a lot of time looping over strings.
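If you're curious what that run-time dispatch looks like, here's a minimal sketch (not glibc's actual code) of picking an implementation based on CPUID, using GCC/Clang's __builtin_cpu_supports builtin. glibc does something equivalent once, at dynamic-link time, via ifunc resolvers. The function names below are made up for illustration:

/* Minimal sketch of CPUID-based dispatch, similar in spirit to glibc's
 * ifunc resolvers for memcpy/strlen.  Only __builtin_cpu_init and
 * __builtin_cpu_supports are real GCC/Clang builtins; the variants are fake. */
#include <stdio.h>

static void copy_avx2_variant(void) { puts("would run the AVX2 code path"); }
static void copy_sse2_variant(void) { puts("would run the baseline SSE2 code path"); }

int main(void)
{
    __builtin_cpu_init();   /* harmless here; required if checking features before constructors run */
    if (__builtin_cpu_supports("avx2"))
        copy_avx2_variant();
    else
        copy_sse2_variant();
    return 0;
}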
There might possibly be some auto-vectorized loops inside apache itself if you compile with -O3 -march=native, but unlikely.
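For reference, this is the kind of loop a compiler can auto-vectorize. Nothing like it is known to be hot in Apache; it's only a hypothetical example of what -O3 -march=native could turn into AVX2 code on an AVX2-capable machine:

/* Hypothetical reduction loop (not from Apache).  With
 *   gcc -O3 -march=native
 * on an AVX2 machine, GCC typically emits widening loads plus vpaddd over
 * 32-byte vectors; without -O3 (on older GCC) you get a scalar byte loop. */
#include <stddef.h>

unsigned sum_bytes(const unsigned char *buf, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += buf[i];
    return total;
}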
I doubt there's anything in Apache that would be worth manually dispatching based on CPUID (except for libc functions), so you probably won't find any AVX instructions in the apache binary on your i3 server if you check with a disassembler, unless it was specifically compiled for that machine or for AVX-capable machines. If the whole binary was compiled with AVX enabled, even scalar FP math would use instructions like vmovsd / vucomisd instead of movsd / ucomisd, so if you see any movsd it wasn't compiled that way.
See How to check if compiled code uses SSE and AVX instructions? and note the difference between SIMD (packed) and scalar instructions.
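As a quick self-contained check (these functions are just examples, not Apache code), compile something like this twice and compare the disassembly; the mnemonics tell you whether AVX code generation was enabled:

/* Compile and disassemble:
 *   gcc -O2 -c scalar.c        && objdump -d scalar.o   -> movsd/mulsd/addsd, comisd
 *   gcc -O2 -mavx -c scalar.c  && objdump -d scalar.o   -> vmovsd/vmulsd/vaddsd, vcomisd
 * Even purely scalar double math switches to VEX-encoded (v-prefixed)
 * instructions when AVX code generation is enabled. */
double scaled_sum(double x, double y)
{
    return x * 3.0 + y;     /* mulsd + addsd, or vmulsd + vaddsd with -mavx */
}

int is_greater(double x, double y)
{
    return x > y;           /* comisd (or ucomisd) + setcc, v-prefixed with -mavx */
}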
One interesting feature of AVX that's relevant for multithreaded programs: Intel recently documented that the AVX feature flag implies 16-byte aligned load/store is guaranteed atomic. (And I think AMD is planning to do so if they haven't already, since it's also true in practice on their CPUs.) Previously the only support for 16-byte lock-free atomics was via lock cmpxchg16b, meaning that a pure load cost as much as an RMW. GCC-compiled code can take advantage of this via libatomic, including via updates to a shared libatomic which dispatches to more efficient load/store functions on CPUs with AVX.
So anyway, cheaper lock-free atomics for objects the size of two pointers in 64-bit mode. Not a game-changer for code that doesn't spend a ton of time communicating between threads. And it doesn't help the kernel because you can't take advantage of it with -mgeneral-regs-only; a 16-byte load/store requires an XMM register, unless cmpxchg16b without a lock prefix counts. But that could do a non-atomic RMW if the compare succeeds, so it's unusable.
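In C, a 16-byte object like the pointer-plus-counter pair below compiles (with GCC, linked against -latomic) to calls into libatomic such as __atomic_load_16 / __atomic_store_16, which is where the run-time choice between a plain 16-byte load/store and a lock cmpxchg16b loop can be made. A rough sketch with made-up struct and variable names:

/* Sketch: a 16-byte atomic object in C11.  Build with:  gcc -O2 atomics16.c -latomic
 * GCC routes the 16-byte operations through libatomic; on AVX-capable CPUs a
 * recent libatomic can use a single 16-byte load/store instead of lock cmpxchg16b. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct seq_ptr {            /* two-pointer-sized payload on x86-64: pointer + generation counter */
    void     *ptr;
    uint64_t  seq;
};

static _Atomic struct seq_ptr shared;   /* zero-initialized, 16 bytes on x86-64 */

int main(void)
{
    struct seq_ptr next = { .ptr = &shared, .seq = 1 };
    atomic_store(&shared, next);                  /* __atomic_store_16 */
    struct seq_ptr cur = atomic_load(&shared);    /* __atomic_load_16 */
    printf("seq=%llu, lock_free=%d\n",
           (unsigned long long)cur.seq, (int)atomic_is_lock_free(&shared));
    return 0;
}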
Probably more relevant is that AVX2 support comes with faster memcpy inside the kernel, for copy_to_user (from the pagecache) for read system calls. rep movsb can work in 32-byte chunks internally in microcode, vs. 16-byte chunks on CPUs whose load/store data paths are only 16 bytes wide.
(AVX can be implemented on CPUs with 16-byte load/store paths, like Zen 1 and Ivy Bridge, but your i3 with AVX2 has 32-byte datapaths between execution units and L1d cache. https://www.realworldtech.com/haswell-cpu/5/)
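For what it's worth, this is roughly what a rep movsb copy looks like when written by hand in GNU C inline asm (the kernel's copy_to_user fast path on CPUs with ERMS uses rep movsb as well); the internal chunk width is a microcode/hardware detail, not something the code chooses:

/* Sketch of a rep-movsb copy in x86-64 GNU C inline asm (illustrative only).
 * rep movsb copies RCX bytes from [RSI] to [RDI]; on ERMS/FSRM CPUs the
 * microcode moves the data in wide internal chunks (e.g. 32 bytes on CPUs
 * with 32-byte load/store data paths). */
#include <stddef.h>

static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *d = dst;
    const void *s = src;
    asm volatile ("rep movsb"
                  : "+D" (d), "+S" (s), "+c" (n)
                  :
                  : "memory");
    return dst;
}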