Linux PREEMPT_RT: SCHED_OTHER performs better than SCHED_FIFO

I'm experimenting with the realtime capabilities of the Raspberry Pi 3/4. I've written the following C++ program to test them.


// Compile with:
// g++ realtime_task.cpp -o realtime_task -lrt && sudo setcap CAP_SYS_NICE+ep realtime_task

#include <cstdio>
#include <cstdint>   // uint32_t
#include <limits>    // numeric_limits
#include <sched.h>
#include <unistd.h>
#include <fcntl.h>
#include <chrono>
#include <algorithm>

using namespace std;
using namespace chrono;
using namespace chrono_literals;

int main(int argc, char **argv)
{   
    // allocate this process to the 4th core (core 3)
    pid_t pid = getpid();
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(3, &cpuset);
    int result = 0;
    result = sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset);
    if (result != 0)
    {
        perror("`sched_setaffinity` failed");
    }

    struct sched_param param;

    // Use SCHED_FIFO
    param.sched_priority = 99;
    result = sched_setscheduler(pid, SCHED_FIFO, &param);

    // Use SCHED_OTHER
    // param.sched_priority = 0;
    // result = sched_setscheduler(pid, SCHED_OTHER, &param);

    if (result != 0)
    {
        perror("`sched_setscheduler` failed");
    }

    uint32_t count = 0;
    uint64_t total_loop_time_us = 0; // 64-bit: a 32-bit microsecond total would wrap after roughly 71 minutes
    uint32_t min_loop_time_us = numeric_limits<uint32_t>::max();
    uint32_t avg_loop_time_us = 0;
    uint32_t max_loop_time_us = 0;
    
    while(true)
    {
        count++;

        auto start = steady_clock::now();

        auto loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
        while(loop_time_us < 500)
        { 
            loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
        }

        min_loop_time_us = min((uint32_t)loop_time_us, (uint32_t)min_loop_time_us);
        total_loop_time_us += loop_time_us;
        avg_loop_time_us = total_loop_time_us / count;
        max_loop_time_us = max((uint32_t)loop_time_us, (uint32_t)max_loop_time_us);

        if ((count % 1000) == 0)
        {
            printf("%u %u %u\r", min_loop_time_us, avg_loop_time_us, max_loop_time_us);
            fflush(stdout);
        }
    }

    return 0;
}

I've patched the kernel with the PREEMPT_RT patch. uname reports 5.15.84-v8 #1613 SMP PREEMPT and everything runs fine.

Kernel command line arguments have isolcpus=3 irqaffinity=0-2 to isolate the 4th core (core 3) and reserve it for the program above. I can see in htop that my program is the only process running on the 4th core (core 3).

When using the SCHED_FIFO policy, it reports the following minimum, average, and maximum loop times...

MIN AVG MAX
500 522 50042

... and htop reports:

CPU▽ CTXT     PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME   Command
  3     1   37238 pi         RT   0  4672  1264  1108 R 97.5  0.0  0:07.57 ./realtime_task

When using the SCHED_OTHER policy, it reports the following minimum, average, and maximum loop times...

MIN AVG MAX
500 500 524

... and htop reports:

CPU▽ CTXT     PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME   Command
  3     0   36065 pi         20   0  4672  1260  1108 R 100.  0.0  1:30.16 ./realtime_task

This is the opposite of what I expect. I expect SCHED_FIFO to give me the lower maximum loop time and fewer context switches. Why am I getting these results?

Answer:

I have a guess here. If you run under SCHED_OTHER but set the priority of your process to something other than the default, it is promoted over every other default-priority process. Your task is then effectively the only one ahead of all the others, so it holds the CPU even longer. This is what the documentation says: https://man7.org/linux/man-pages/man7/sched.7.html

SCHED_OTHER: Default Linux time-sharing scheduling
SCHED_OTHER can be used at only static priority 0 (i.e., threads under real-time policies always have priority over SCHED_OTHER processes). SCHED_OTHER is the standard Linux time-sharing scheduler that is intended for all threads that do not require the special real-time mechanisms.
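
To see what the quoted passage means in practice, here is a minimal sketch that asks the kernel for the valid static priority range of each policy. `PriorityRange`, `priority_range`, and `print_priority_ranges` are hypothetical helper names made up for this illustration; only `sched_get_priority_min` and `sched_get_priority_max` are real calls.

```cpp
#include <cassert>
#include <cstdio>
#include <sched.h>

// Valid static priority range for one scheduling policy.
struct PriorityRange
{
    int min;
    int max;
};

// Queries the kernel for the range; both calls return -1 for an unknown policy.
PriorityRange priority_range(int policy)
{
    const int lo = sched_get_priority_min(policy);
    const int hi = sched_get_priority_max(policy);
    assert(lo != -1 && hi != -1);
    return {lo, hi};
}

// Prints the ranges for the two policies compared in the question.
void print_priority_ranges()
{
    const PriorityRange other = priority_range(SCHED_OTHER);
    const PriorityRange fifo = priority_range(SCHED_FIFO);
    printf("SCHED_OTHER: %d..%d\n", other.min, other.max);
    printf("SCHED_FIFO:  %d..%d\n", fifo.min, fifo.max);
}
```

On Linux, sched(7) documents a single value of 0 for SCHED_OTHER and 1 to 99 for SCHED_FIFO, so there is no non-default static priority to choose under SCHED_OTHER; only the nice value can be adjusted there.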

Answer:

The problem turned out to be realtime throttling. When throttling occurs, a message appears in the dmesg output.

Once disabled with echo -1 > /proc/sys/kernel/sched_rt_runtime_us, the SCHED_FIFO policy worked as expected. When a stressor program is introduced on cores 0~2, then SCHED_FIFO performs much better than SCHED_OTHER.
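
The throttling numbers line up with the measurements in the question. Assuming the kernel defaults documented in sched(7) (sched_rt_period_us = 1000000 and sched_rt_runtime_us = 950000; I have not confirmed these were the values on the Pi), realtime tasks are held off for the last 50000 µs of every second, which is close to the 50042 µs maximum reported under SCHED_FIFO. A minimal sketch of that arithmetic follows; `throttled_gap_us`, `read_sysctl`, and `current_throttled_gap_us` are hypothetical helper names, not part of any library.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Microseconds per period during which realtime tasks are not allowed to run.
// A runtime of -1 means throttling is disabled, so the gap is 0.
constexpr std::int64_t throttled_gap_us(std::int64_t period_us, std::int64_t runtime_us)
{
    return (runtime_us < 0 || runtime_us >= period_us) ? 0 : period_us - runtime_us;
}

// Reads one integer from a sysctl file; returns `fallback` if it cannot be read.
std::int64_t read_sysctl(const std::string &path, std::int64_t fallback)
{
    std::ifstream file(path);
    std::int64_t value = 0;
    return (file >> value) ? value : fallback;
}

// Gap for the running system. The fallbacks are the documented defaults and
// are an assumption used only when the /proc files are unavailable.
std::int64_t current_throttled_gap_us()
{
    const std::int64_t period = read_sysctl("/proc/sys/kernel/sched_rt_period_us", 1000000);
    const std::int64_t runtime = read_sysctl("/proc/sys/kernel/sched_rt_runtime_us", 950000);
    return throttled_gap_us(period, runtime);
}
```

After writing -1 to sched_rt_runtime_us, `current_throttled_gap_us()` should return 0, which corresponds to the large maximum disappearing.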

However, there is a better way to avoid realtime throttling without having to disable it for the entire system. For the program listed in the original question, change this code ...

auto loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
while(loop_time_us < 500)
{ 
    loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
}

... to this code ...

// Requires #include <thread> for this_thread::sleep_for.
this_thread::sleep_for(400us);
auto loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
while(loop_time_us < 500)
{
    loop_time_us = duration_cast<microseconds>(steady_clock::now() - start).count();
}

The this_thread::sleep_for call prevents the process from consuming its entire realtime budget (/proc/sys/kernel/sched_rt_runtime_us out of every /proc/sys/kernel/sched_rt_period_us) and thus prevents realtime throttling. Since sleep_for is not very precise, you don't sleep for the full 500 microseconds; instead, the while(loop_time_us < 500) loop fills the remaining 100 microseconds with a more precise spin-wait.

This method also prevents the realtime core from turning into a space heater.
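
The sleep-then-spin pattern above can be wrapped in a small helper so the two durations are explicit. This is only a sketch: `sleep_then_spin` is a hypothetical name, and the 400 µs / 500 µs split is the one used in this answer, not a tuned value.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using namespace std::chrono;

// Waits until `total` has elapsed since `start`.
// Sleeps for `coarse` first (imprecise, but yields the CPU), then spin-waits
// for the remainder (precise, but burns CPU). Returns the measured elapsed time.
microseconds sleep_then_spin(steady_clock::time_point start,
                             microseconds total,
                             microseconds coarse)
{
    assert(coarse <= total);
    std::this_thread::sleep_for(coarse);

    auto elapsed = duration_cast<microseconds>(steady_clock::now() - start);
    while (elapsed < total)
    {
        elapsed = duration_cast<microseconds>(steady_clock::now() - start);
    }
    return elapsed;
}
```

In the main loop this would replace the wait with something like `auto loop_time_us = sleep_then_spin(start, 500us, 400us).count();`. If the sleep overshoots the total, the spin loop is skipped and the overshoot shows up in the returned value, so the statistics still capture it.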