I have a character that should "eat" for 200 microseconds, "sleep" for 200 microseconds, and repeat, until they die, which happens if they haven't eaten for time_to_die
microseconds.
In the code below, the struct timeval variable time_to_die (used in main)
has its tv_usec member
set to 1000 microseconds, and I expect the program to loop forever.
After some time, however, one execution of the function busy_wait
takes around 5 times longer than it should (long enough to kill the character), and the character dies. I want to know why and how to fix it.
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

struct timeval time_add_microseconds(struct timeval time, long long microseconds)
{
    time.tv_usec += microseconds;
    while (time.tv_usec >= 1000000)
    {
        time.tv_sec += 1;
        time.tv_usec -= 1000000;
    }
    return (time);
}

short time_compare(struct timeval time_one, struct timeval time_two)
{
    if (time_one.tv_sec != time_two.tv_sec)
    {
        if (time_one.tv_sec > time_two.tv_sec)
            return (1);
        else
            return (-1);
    }
    else
    {
        if (time_one.tv_usec > time_two.tv_usec)
            return (1);
        else if (time_one.tv_usec == time_two.tv_usec)
            return (0);
        else
            return (-1);
    }
}

// Wait until interval in microseconds has passed or until death_time is reached.
void busy_wait(int interval, struct timeval last_eaten_time, struct timeval time_to_die)
{
    struct timeval time;
    struct timeval end_time;
    struct timeval death_time;

    gettimeofday(&time, NULL);
    end_time = time_add_microseconds(time, interval);
    death_time = time_add_microseconds(last_eaten_time, time_to_die.tv_sec * 1000000ULL + time_to_die.tv_usec);
    while (time_compare(time, end_time) == -1)
    {
        gettimeofday(&time, NULL);
        if (time_compare(time, death_time) >= 0)
        {
            printf("%llu died\n", time.tv_sec * 1000000ULL + time.tv_usec);
            exit(1);
        }
    }
}

int main(void)
{
    struct timeval time;
    struct timeval time_to_die = { .tv_sec = 0, .tv_usec = 1000 };
    struct timeval last_eaten_time = { .tv_sec = 0, .tv_usec = 0 };

    while (true)
    {
        gettimeofday(&time, NULL);
        printf("%llu eating\n", time.tv_sec * 1000000ULL + time.tv_usec);
        last_eaten_time = time;
        busy_wait(200, last_eaten_time, time_to_die);
        gettimeofday(&time, NULL);
        printf("%llu sleeping\n", time.tv_sec * 1000000ULL + time.tv_usec);
        busy_wait(200, last_eaten_time, time_to_die);
    }
}
Note: other than the system functions I already used in my code, I'm only allowed to use usleep, write, malloc, and free.
Thank you for your time.
CodePudding user response:
"After some time, one execution of the function busy_wait takes around 5 times more than it is supposed to (enough to kill the character), and the character dies. I want to know why and how to fix it."
There are multiple possibilities. Many of them revolve around the fact that there is more going on in your computer while the program runs than just the program running. Unless you're running on a realtime operating system, the bottom line is that you can't fix some of the things that could cause such behavior.
For example, your program shares the CPU with the system itself and with all the other processes running on it. That may be more processes than you think: right now, there are over 400 live processes on my 6-core workstation. When there are more processes demanding CPU time than there are CPUs to run them on, the system will split the available time among the contending processes, preemptively suspending processes when their turns expire.
If your program happens to be preempted during a busy wait, then chances are good that substantially more than 200 μs of wall time will elapse before it is next scheduled on a CPU. Time slices are usually measured in milliseconds, and on a general-purpose OS there is no upper (or lower) bound on the time between the end of one of your program's time slices and the start of its next one.
As I noted in comments, you are using gettimeofday to measure elapsed time, yet that is not on your list of allowed system functions. One possible resolution of that inconsistency is that you're not meant to perform measurements of elapsed time at all, but rather to assume / simulate them. For example, usleep() is on the list, so perhaps you're meant to usleep() instead of busy-waiting, and assume that the sleep time is exactly what was requested (a sketch of that idea follows). Or perhaps you're meant to just adjust an internal time counter instead of actually pausing execution at all.
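A minimal sketch of that simulate-with-usleep idea might look like the following. It is only an illustration, not necessarily what your assignment intends, and it replaces the struct timeval plumbing from your code with plain microsecond counters (the names simulated_wait and simulated_now_us are mine; the 200 µs / 1000 µs values are the ones from your question):

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

/* Simulated clock: instead of measuring real elapsed time, assume every
 * usleep() slept exactly as long as requested and keep a running counter. */
static long long simulated_now_us = 0;

static void simulated_wait(long long interval_us, long long last_eaten_us, long long time_to_die_us)
{
    usleep(interval_us);                 /* pause, but don't measure it...   */
    simulated_now_us += interval_us;     /* ...just assume it took interval_us */
    if (simulated_now_us - last_eaten_us >= time_to_die_us)
    {
        printf("%lld died\n", simulated_now_us);
        exit(1);
    }
}

int main(void)
{
    while (1)
    {
        long long last_eaten = simulated_now_us;

        printf("%lld eating\n", simulated_now_us);
        simulated_wait(200, last_eaten, 1000);
        printf("%lld sleeping\n", simulated_now_us);
        simulated_wait(200, last_eaten, 1000);
    }
}

Because the clock is simulated, a delayed usleep() only delays the wall-clock output; it can no longer kill the character.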
CodePudding user response:
Why
Ultimately: because an interrupt or trap is delivered to the CPU core executing your program, which transfers control to the operating system.
Some common causes:
1. The operating system runs its process scheduling off a hardware timer which fires at regular intervals, i.e. the OS is running some kind of fair scheduler and has to check whether your process's time is up for now.
2. Some device in your system needs to be serviced. E.g. a packet arrived over the network, your sound card's output buffer is running low and must be refilled, etc.
3. Your program voluntarily makes a request to the operating system that transfers control to it. Basically: any time you make a syscall, the kernel may have to wait for I/O, or it may decide that it's time to schedule a different process, or both. In your case, the calls to printf will at some point result in a write(2) syscall that ends up performing some I/O.
What to do
Cause 3 can be avoided by ensuring that no syscalls are made, i.e. never trapping into the OS (one way to keep them out of the timed part of the program is sketched below).
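As a rough illustration of that (setvbuf is not on your allowed list, so with only write/malloc/free you would instead collect messages in your own buffer and write() them once outside the timed section): giving stdout a large, fully buffered buffer means printf() normally just copies bytes into memory, and the underlying write(2) only happens when the buffer is flushed.

#include <stdio.h>

int main(void)
{
    static char big_buffer[1 << 20];

    /* Make stdout fully buffered with a 1 MiB buffer: printf() now only
     * fills this buffer, and write(2) is issued when it fills up or when
     * we flush explicitly, outside the timing-critical part. */
    if (setvbuf(stdout, big_buffer, _IOFBF, sizeof big_buffer) != 0)
        return 1;

    /* ... timing-critical loop with printf() calls goes here ... */

    fflush(stdout);   /* a single write(2), after the critical section */
    return 0;
}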
Causes 1 and 2 are very difficult to completely get rid of. You're essentially looking for a real-time operating system (RTOS). An OS like Linux can approximate that by placing processes in different scheduling classes (SCHED_FIFO / SCHED_RR); a sketch follows below. If you're willing to switch to a kernel that is tailored towards real-time applications, you can get even further. You can also look into topics like "CPU isolation".
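For example, a rough sketch of the SCHED_FIFO route on Linux (it needs root or CAP_SYS_NICE, and it reduces scheduling jitter rather than eliminating it):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Ask the kernel to run this process under the SCHED_FIFO real-time
     * policy, so ordinary time-sharing processes cannot preempt it. */
    struct sched_param sp = { .sched_priority = 50 };   /* priority value is arbitrary here */

    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)   /* 0 = the calling process */
    {
        perror("sched_setscheduler");
        return 1;
    }
    /* ... run the timing-sensitive loop here ... */
    return 0;
}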
CodePudding user response:
Just to illustrate the printf timings, but also the gettimeofday timings, that I mentioned in comments, I tried two things.
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

int main(void)
{
    struct timeval time;
    long long histo[5000];

    for(int i=0; i<5000; i++){
        gettimeofday(&time, NULL);
        histo[i] = time.tv_sec * 1000000ULL + time.tv_usec;
        printf("%lld eating\n", histo[i]); /* one printf per iteration, as in the question; the exact message doesn't matter */
    }
    long long min = 1000000000;
    long long max = 0;
    for(int i=1; i<5000; i++){
        long long dt = histo[i] - histo[i-1];
        if(dt < min) min = dt;
        if(dt > max) max = dt;
        if(dt > 800) printf("%d %lld\n", i, dt);
    }
    printf("avg: %f min=%lld max=%lld\n", (histo[4999]-histo[0])/5000.0, min, max);
}
So all it does is loop over 5000 printf/gettimeofday iterations, and then measure (after the loop) the mean, min and max of the deltas.
On my X11 terminal (Sakura), the average is 8 μs per iteration, with min 1 μs and max 3790 μs! (Other measurements I made show that this ~3000 μs outlier is also the only delta over 200 μs. In other words, it never goes over 200 μs, except when it does so "bigly".) So, on average, everything goes well. But once in a while, a printf takes almost 4 ms, which is not enough for a human user to even notice unless it happens several times in a row, but is way more than needed to make your code fail.
On my console (non-X11) terminal (an 80x25 terminal that may or may not use the text mode of my graphics card, I was never sure), the mean is 272 μs, min 193 μs, and max 1100 μs. Which, in retrospect, is not surprising: this terminal is slow, but simpler, so less prone to "surprises". But, well, it fails faster, because the probability of going over 200 μs is very high; even if it is never far over, more than half of the iterations take more than 200 μs.
I also tried measurements on a loop without printf.
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

int main(void)
{
    struct timeval time;
    long long old = 0;
    long long ntot = 0;
    long long nov10 = 0;
    long long nov100 = 0;
    long long nov1000 = 0;

    for(int i=0;; i++){
        gettimeofday(&time, NULL);
        long long t = time.tv_sec * 1000000ULL + time.tv_usec;
        if(old){
            long long dt = t - old;
            ntot++;
            if(dt > 10){
                nov10++;
                if(dt > 100){
                    nov100++;
                    if(dt > 1000) nov1000++;
                }
            }
        }
        if(i % 10000 == 0){ /* print a progress line every so often; the exact period is not critical */
            printf("tot=%lld >10=%lld >100=%lld >1000=%lld\n", ntot, nov10, nov100, nov1000);
            old = 0; /* don't count the iteration that did the printf */
        }else{
            old = t;
        }
    }
}
So, it measures something that I could pompously call a "logarithmic histogram" of timings, this time independently of the terminal (I reset old to 0 each time I print something, so that those iterations don't count).
Result
tot=650054988 >10=130125 >100=2109 >1000=2
So, sure, 99.98% of the time, gettimeofday takes less than 10 μs. But around 3 times per million calls (which, in your code, means every few seconds), it takes more than 100 μs. And twice in my experiment, it took more than 1000 μs. Just gettimeofday, not printf.
Obviously, it's not gettimeofday itself that took 1 ms. Simply, something more important occurred on my system, and this process had to wait 1 ms to get some CPU time from the scheduler.
And bear in mind that this is on my computer, where your code runs fine (well, these measurements show that it would eventually have failed too, had I let it run as long as I let the measurements run).
On your computer, those numbers (the two deltas over 1000 μs) are certainly much higher, so it fails very quickly.
Preemptive multitasking OSes are simply not made to guarantee execution times at the microsecond scale. You have to use a real-time OS for that (RT-Linux for example, if it still exists; I haven't used it since 2002).