Do I need to use smp_mb() after binding my program to a single CPU core?

Suppose my system is multicore. If I bind my program to one CPU core, do I still need smp_mb() to keep the CPU from reordering instructions?

I ask because I know that smp_mb() is not necessary on single-core systems, but I'm not sure whether that reasoning applies here.

CodePudding user response：

You rarely need a full barrier anyway, usually acquire/release is enough. And usually you want to use C11 atomic_load_explicit(&var, memory_order_acquire), or in Linux kernel code, use one of its functions for an acquire-load, which can be done more efficiently on some ISAs (notably AArch64 or 32-bit ARMv8) than a plain load and an acquire barrier.
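As a rough illustration of the acquire/release pattern in user-space C11, here is a minimal sketch. The names `payload`, `ready`, `publish` and `try_consume` are made up for this example and do not come from the question; kernel code would use the kernel's own primitives rather than `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                 /* plain data, published via `ready` */
static atomic_bool ready = false;   /* hypothetical flag name */

/* Producer: write the data first, then publish it with a release store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: returns true and fills *out once the data has been published. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;   /* ordered after the acquire load that saw `ready` */
    return true;
}
```

No full barrier appears anywhere: the release store and the acquire load carry all the ordering this pattern needs.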

But yeah, if all threads are sharing the same logical core, run-time memory reordering cannot be observed; only compile-time reordering matters. So you just need a compiler memory barrier like asm("" ::: "memory"), not the Linux kernel's SMP memory barrier (smp_mb() is x86 mfence or equivalent, or ARM dmb ish, for example).
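A sketch of what that looks like in GNU C, assuming every thread that touches these variables is pinned to the same core. The macro name `compiler_barrier` and the variables are hypothetical; in portable C11, `atomic_signal_fence(memory_order_seq_cst)` should serve the same compiler-only purpose.

```c
#include <stdbool.h>

/* Emits no instruction; only stops the compiler from moving memory
 * accesses across this point (GNU C inline asm, GCC/Clang). */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

static int shared_data;          /* hypothetical payload */
static volatile int data_ready;  /* volatile: each access is really performed */

/* Writer: the store to shared_data cannot be sunk below the flag store. */
void produce(int value)
{
    shared_data = value;
    compiler_barrier();
    data_ready = 1;
}

/* Reader: the load of shared_data cannot be hoisted above the flag check. */
bool consume(int *out)
{
    if (!data_ready)
        return false;
    compiler_barrier();
    *out = shared_data;
    return true;
}
```

This is only safe under the single-core assumption. If the threads can ever run on different cores, the run-time ordering has to come from real acquire/release operations or barriers.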

See "Why memory reordering is not a problem on single core/processor machines?" for more details about the fact that all instructions on the same core observe memory effects to have happened in program order, regardless of interrupts. For example, a later load must see the value from an earlier store, otherwise the CPU is not maintaining the illusion of instructions on that core running in program order.

And if you can convince your compiler to emit atomic RMW instructions without the x86 lock prefix, for example, they'll be atomic wrt. context switches (and interrupts in general). Or use gcc -Wa,-momit-lock-prefix=yes to have GAS remove lock prefixes for you, so you can use <stdatomic.h> functions efficiently. At least on x86; for RISC ISAs, there's no way to do a read-modify-write of a memory location in a single instruction.

Or if there is (ARMv8.1), it implies an atomic RMW that's SMP-safe, like x86 lock add [mem], eax. But on a CISC like x86, we have instructions like add [mem], eax or whatever which are just like separate load / ADD / store glued into a single instruction, which either executes fully or not at all before an interrupt. (Note that "executing" a store just means writing into the store buffer, not globally visible cache, but that's sufficient for later code on the same core to see it.)
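To make the `<stdatomic.h>` route concrete, here is a small sketch of a counter written with ordinary relaxed atomics. The file name `counter.c` and the names `event_count`, `count_event` and `read_count` are assumptions for illustration. The code itself is portable C11; dropping the lock prefix is purely a build-time choice on x86 with GCC and GNU as, and only makes sense if every thread really is confined to one core.

```c
#include <stdatomic.h>

/* Possible x86 build, ONLY if all threads are pinned to a single core:
 *   gcc -O2 -Wa,-momit-lock-prefix=yes counter.c
 * Without that flag this is a normal SMP-safe atomic counter. */
static atomic_long event_count;   /* hypothetical name; zero-initialized */

/* Atomically increment the counter and return the new value. */
long count_event(void)
{
    return atomic_fetch_add_explicit(&event_count, 1,
                                     memory_order_relaxed) + 1;
}

/* Read the current value with a relaxed load. */
long read_count(void)
{
    return atomic_load_explicit(&event_count, memory_order_relaxed);
}
```

On x86 the increment is a single RMW instruction either way, so an interrupt or context switch cannot land in the middle of it; the lock prefix is what adds atomicity with respect to other cores.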

See also "Is x86 CMPXCHG atomic, if so why does it need LOCK?" for more about non-locked use-cases.