perf stat displays interesting statistics gathered from hardware and software counters. In my research, I couldn't find any reliable information about what counts as a context switch in perf stat, and despite my efforts I was unable to understand the relevant kernel code in its entirety.
Suppose my InfiniBand network application calls a blocking read system call in event mode 2,000 times and perf stat counts 1,241 context switches. Do the context switches refer to the schedule-in of the process, the schedule-out, or both?
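The relevant part of the application is essentially a loop of blocking reads. Here is a simplified sketch, not the real InfiniBand code; event_fd is just a placeholder for the descriptor the events arrive on:

#include <stdint.h>
#include <unistd.h>

/* Simplified sketch of the measured loop; event_fd is a placeholder
 * for the descriptor the events actually arrive on. */
static int wait_for_events(int event_fd)
{
    uint64_t ev;

    for (int i = 0; i < 2000; i++) {
        /* Each read() can block, so the thread may be descheduled here. */
        if (read(event_fd, &ev, sizeof(ev)) < 0)
            return -1;
    }
    return 0;
}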
The __schedule() function (kernel/sched/core.c) increments the switch_count counter whenever prev != next.
It seems that perf stat's context-switches include involuntary switches as well as voluntary switches. It also seems to me that only deschedule events are counted, since it is the current context that runs the scheduler code and increments the nvcsw and nivcsw counters in its task_struct.
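For what it's worth, those nvcsw / nivcsw counters are visible from user space as the voluntary_ctxt_switches / nonvoluntary_ctxt_switches lines in /proc/<pid>/status, so they can be compared against what perf reports. A minimal sketch that prints them for the current process:

#include <stdio.h>
#include <string.h>

/* Print this process's voluntary/involuntary context-switch counters
 * as exported by the kernel in /proc/self/status. */
static void print_ctxt_switch_counters(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "voluntary_ctxt_switches", 23) == 0 ||
            strncmp(line, "nonvoluntary_ctxt_switches", 26) == 0)
            fputs(line, stdout);
    }
    fclose(f);
}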
Output from perf stat -- my_application:
1,241 context-switches
Meanwhile, if I count only the sched:sched_switch event, the output is close to the expected number.
Output from perf stat -e sched:sched_switch -- my_application:
2,168 sched:sched_switch
Is there a difference between context-switches and the sched_switch event?
CodePudding user response:
I think you only get a count for context-switches if a different task actually runs on a core that was running one of your threads. A read() that blocks, but resumes before any user-space code from any other task runs on the core, probably won't count.
Just entering the kernel at all for a system call clearly doesn't count; perf stat ls only counts one context-switch in a largish directory for me, or zero if I ls a smaller directory like /. I get much higher counts, like 711, for a recursive ls of a directory that I hadn't accessed recently, on a magnetic HDD. So it spent significant time waiting for I/O, and maybe running bottom-half interrupt handlers.
The fact that the count can be odd means it's not counting both deschedule and re-schedule separately; since I'm looking at counts for a single-threaded process that eventually exited, if it was counting both the count would have to be even.
I expect the counting is done when schedule() decides that current should change to point to a new task that isn't this one. (current is the Linux kernel's per-core variable that points to the task_struct of the current task, e.g. a user-space thread.) So every time that happens to a thread that's part of your process, you get 1 count.
Indeed, the OP helpfully tracked down the source code; it's in __schedule in kernel/sched/core.c. For example, in Linux 6.1:
static void __sched notrace __schedule(unsigned int sched_mode)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    // and some other declarations I omitted
    ...
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);                 // rq stands for run queue
    prev = rq->curr;
    ...
    switch_count = &prev->nivcsw;     // either Num InVoluntary CSWs, I think
    ...
    if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
        ...
        switch_count = &prev->nvcsw;  // or Num Voluntary CSWs
    }

    next = pick_next_task(rq, prev, &rf);
    ...
    if (likely(prev != next)) {
        ...
        ++*switch_count;              // INCREMENT THE SELECTED COUNTER
        ...
        trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state);
        // then make some function calls to actually do the context switch
        ...
    }
I would guess the context-switches perf event sums both involuntary and voluntary switches away from a thread. (Assuming that's what the nv and niv in nvcsw / nivcsw stand for.)
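One quick way to sanity-check that guess (just a sketch, not something I've verified here): have the program report its own counters via getrusage(2) right before it exits, and compare ru_nvcsw + ru_nivcsw against what perf stat reports for context-switches.

#include <stdio.h>
#include <sys/resource.h>

/* Print the process's voluntary/involuntary switch counts from getrusage()
 * so their sum can be compared with perf stat's context-switches count. */
static void report_ctxt_switches(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("voluntary=%ld involuntary=%ld total=%ld\n",
               ru.ru_nvcsw, ru.ru_nivcsw, ru.ru_nvcsw + ru.ru_nivcsw);
}

If the context-switches event really is that sum, the total printed just before exit should come out close to perf's number for a single-threaded run.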