How to measure latency of thread migration away from a busy core by Linux's scheduler?

I have the task of measuring the time it takes the Linux scheduler to migrate a thread, in programs written in C or C++.

I'm running a single thread on an idle core (with no affinity); then, to force the scheduler to load-balance, I pin a second thread to that same core. As soon as the new thread is created, the scheduler should rebalance and migrate the first thread away from that core to the least busy one (the second thread can't move, because it's pinned).
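Roughly, my setup looks like this (simplified sketch; the main thread plays the role of the free-running first thread, and I use the GNU extension pthread_attr_setaffinity_np to pin the second thread from the start):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *pinned_fn(void *arg)   // the second thread: just keeps the core busy
{
   (void)arg;
   for (;;) ;
}

int main(void)
{
   int core = sched_getcpu();       // the core the free-running thread starts on

   cpu_set_t set;
   CPU_ZERO(&set);
   CPU_SET(core, &set);

   pthread_attr_t attr;
   pthread_attr_init(&attr);
   pthread_attr_setaffinity_np(&attr, sizeof(set), &set);  // pinned from creation

   pthread_t intruder;
   pthread_create(&intruder, &attr, pinned_fn, NULL);

   for (;;) ;                       // keep running; the scheduler should migrate us away
}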

I'd like to measure this thread migration time with a high-resolution timer. I tried perf sched, but it only records the migration events themselves. To clarify, by latency I mean the difference between the time the second thread is created on the core and the time the first thread has completely migrated to a new core.

How can I measure this latency?

CodePudding user response:

Kernelshark is your friend for this task. Take a look at The Graph Window, and especially the Task Plots (about half way down). You have to have the right options compiled into your kernel (FTRACE for a start), and a certain amount of finesse is required to control the data capture so that it encompasses the event you want, but then you can see in very fine detail what's going on in the whole machine.

Basically, you can see the whole fate of every thread in the system: what CPU it was running on, when, why, etc. There's what seems to be a useful tutorial here.

This kind of tool is brilliant for seeing exactly what's happened in a system. Other things show up too, like CPU clock-speed and power-mode switches (everything can stop stock-still for no apparent reason), etc.

CodePudding user response:

If you're using a modern x86 CPU, the rdtscp instruction is probably useful: it gets a nanosecond-precise timestamp and a core ID as a single asm instruction, so both come from the same core. Linux sets the IA32_TSC_AUX MSR to a small integer matching the core numbering that it uses for affinity, e.g. run under taskset -c 3 ./a.out and you get ID = 3 from the following test program:

#include <x86intrin.h>
#include <stdio.h>

int main(void){
   unsigned id;
   unsigned long long tsc = __rdtscp(&id);   // timestamp and core ID from one instruction
   printf("tsc = %llu    ID = %u\n", tsc, id);
}

See also How to get the CPU cycle count in x86_64 from C++? for more details on the TSC in general, and on the fact that most modern x86 systems have TSCs synchronized across cores. (Especially single-socket multi-core CPUs, but well-designed multi-socket motherboards can also do it.) Note also that the TSC runs at a fixed reference frequency, not in core clock cycles, so it's a proxy for wall-clock time.
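(As an aside: if you want to convert TSC deltas into nanoseconds, one quick-and-dirty approach is to calibrate the TSC against CLOCK_MONOTONIC. A rough sketch, assuming a constant-rate TSC as described above:)

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

// Estimate TSC ticks per nanosecond by bracketing a sleep with both clocks.
int main(void)
{
   struct timespec ts0, ts1, req = { 0, 200 * 1000 * 1000 };  // ~200 ms
   clock_gettime(CLOCK_MONOTONIC, &ts0);
   uint64_t tsc0 = __rdtsc();
   nanosleep(&req, NULL);
   uint64_t tsc1 = __rdtsc();
   clock_gettime(CLOCK_MONOTONIC, &ts1);

   double ns = (ts1.tv_sec - ts0.tv_sec) * 1e9 + (ts1.tv_nsec - ts0.tv_nsec);
   printf("TSC ticks per ns: %.6f\n", (double)(tsc1 - tsc0) / ns);
   return 0;
}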

So we can collect a core-ID with a timestamp as part of a single asm instruction, meaning that both definitely came from the same core. (An interrupt can't come in the middle of a single instruction.) This rules out some kinds of problems.

You might write a loop that spins on __rdtscp until it sees the core ID change, then record or print the last TSC value from the old core and the first TSC value from the new core. (And the delta, but you want the absolute TSC value to compare against a TSC from before starting a new thread or setting thread affinity.)

#include <stdint.h>
#include <x86intrin.h>

// Spin until rdtscp reports a different core ID, then return the TSC delta
// across the migration. The last TSC value seen on the old core and the two
// core IDs are reported through the output args.
uint64_t spin_until_migration(uint64_t *last_tsc_on_old_core,
                              unsigned *old_core, unsigned *new_core)
{
   unsigned old_id, new_id;
   uint64_t old_tsc;
   uint64_t new_tsc = __rdtscp(&old_id);
   do {
       old_tsc = new_tsc;
       new_tsc = __rdtscp(&new_id);
   } while (old_id == new_id);    // keep spinning while still on the same core
   *last_tsc_on_old_core = old_tsc;
   *old_core = old_id;
   *new_core = new_id;
   return new_tsc - old_tsc;
}

(While spinning on the same core, the deltas will typically be around 32 core clock cycles, e.g. on Skylake (https://uops.info/). If the CPU's current frequency is near its TSC reference frequency (often near its "rated" sticker frequency, e.g. 4008 MHz on my i7-6700k 4.0GHz with 4.2GHz turbo), that's also approximately 32 TSC ticks. Bigger deltas will happen when interrupt handlers or other things temporarily stop user-space from running.)

Note that this code only measures the last time this thread got a timeslice on the old core. The scheduler's decision to migrate the thread may come later, after the pinned thread has already been running on that core for some time.

You'll want to record TSC timestamps at various steps of your competing load, e.g. before and after starting the new thread, or before/after making a CPU-affinity system call that migrates an existing thread. I haven't thought through the full details.
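Putting the pieces together, something like this rough, untested sketch might work. The main thread is the free-running victim; it assumes the TSC_AUX core numbering matches the affinity numbering as described above, and uses the GNU extension pthread_attr_setaffinity_np:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

static void *pinned_fn(void *arg)
{
   (void)arg;
   for (;;) ;                    // keep the contested core busy forever
}

int main(void)
{
   unsigned old_id, new_id;
   uint64_t last_old, now;

   now = __rdtscp(&old_id);      // which core are we on, and when

   cpu_set_t set;                // pin the intruder to our current core
   CPU_ZERO(&set);
   CPU_SET(old_id, &set);        // assumes TSC_AUX matches affinity numbering
   pthread_attr_t attr;
   pthread_attr_init(&attr);
   pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

   uint64_t t_create = __rdtsc();   // timestamp just before the intruder exists
   pthread_t intruder;
   pthread_create(&intruder, &attr, pinned_fn, NULL);

   do {                          // spin until our core ID changes
       last_old = now;
       now = __rdtscp(&new_id);
   } while (new_id == old_id);

   printf("TSC before creating pinned thread: %llu\n",
          (unsigned long long)t_create);
   printf("last TSC seen on core %u:  %llu\n", old_id,
          (unsigned long long)last_old);
   printf("first TSC seen on core %u: %llu\n", new_id,
          (unsigned long long)now);
   printf("creation -> running on new core: %llu TSC ticks\n",
          (unsigned long long)(now - t_create));
   return 0;                     // exiting main also ends the pinned thread
}

Compile with gcc -O2 -pthread; the final delta is the latency you described, in TSC reference ticks.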

CodePudding user response:

Could you use a hardware performance counter? You didn't say exactly what your CPU is, and honestly I can't help a lot with specifics, but most CPUs have a cycle counter in there somewhere. You would write code into your threads to cache the value of the counter, and later you can print it out for cycle-accurate timing between two events. Obviously you can't print it right away or the I/O will affect the timing.
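For example, a minimal sketch of that caching idea (assuming x86 and using the TSC via __rdtsc as the counter; substitute whatever cycle counter your CPU provides):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define MAX_SAMPLES 1024

static uint64_t samples[MAX_SAMPLES];
static unsigned n_samples;

static inline void stamp(void)   // cheap: one counter read plus one store
{
   if (n_samples < MAX_SAMPLES)
       samples[n_samples++] = __rdtsc();
}

int main(void)
{
   stamp();                      // event of interest #1
   /* ... the code being measured ... */
   stamp();                      // event of interest #2

   // print only after the measured region so the I/O doesn't skew the timing
   for (unsigned i = 1; i < n_samples; i++)
       printf("delta %u: %llu ticks\n", i,
              (unsigned long long)(samples[i] - samples[i-1]));
   return 0;
}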

https://en.wikipedia.org/wiki/Hardware_performance_counter

How to get the CPU cycle count in x86_64 from C++?
