C++: measure small elapsed times with precision


In C++, with chrono, I need to measure small elapsed times with precision.

Googling this didn't help me much, as the topic seems messy...

As this stayed unclear, I came to write my own dummy code with two things in mind:

Here is the code (test_elapse_measure.cpp):

#include <cstdlib>   // std::atoi
#include <iostream>
#include <thread>
#include <chrono>

int main(int argc, char ** argv) {
    int n = (argc == 2) ? std::atoi(argv[1]) : 1;

    long double time_resolution = (long double) std::chrono::high_resolution_clock::period::num;
    time_resolution /= (long double) std::chrono::high_resolution_clock::period::den;
    std::cout << "time_resolution: " << time_resolution << " s" << std::endl;

    std::chrono::high_resolution_clock::time_point start, stop;
    {
        start = std::chrono::high_resolution_clock::now();
        std::this_thread::sleep_for(std::chrono::seconds(n));
        stop = std::chrono::high_resolution_clock::now();
        auto elaps = std::chrono::duration_cast<std::chrono::seconds>(stop - start);
        long double count = elaps.count(); // precision: long double.
        std::cout << "sleep: " << count << " s" << std::endl;
    }
    {
        start = std::chrono::high_resolution_clock::now();
        std::this_thread::sleep_for(std::chrono::milliseconds(n));
        stop = std::chrono::high_resolution_clock::now();
        auto elaps = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
        long double count = elaps.count(); // precision: long double.
        std::cout << "sleep: " << count << " ms" << std::endl;
    }
    {
        start = std::chrono::high_resolution_clock::now();
        std::this_thread::sleep_for(std::chrono::nanoseconds(n));
        stop = std::chrono::high_resolution_clock::now();
        auto elaps = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
        long double count = elaps.count(); // precision: long double.
        std::cout << "sleep: " << count << " ns" << std::endl;
    }
}

Running the code gives:

>> g++ -o test_elapse_measure test_elapse_measure.cpp

>> ./test_elapse_measure 
time_resolution: 1e-09 s
sleep: 1 s
sleep: 1 ms
sleep: 94129 ns

>> ./test_elapse_measure 10
time_resolution: 1e-09 s
sleep: 10 s
sleep: 10 ms
sleep: 82093 ns

>> ./test_elapse_measure 100
time_resolution: 1e-09 s
sleep: 100 s
sleep: 100 ms
sleep: 70069 ns

Ideally, I need to measure elapsed times from 1 nanosecond up to several seconds. The timer resolution seems to say that I could measure nanoseconds (the clock seems precise enough), but all the nanosecond elapsed-time measurements are wrong?!...

Why is that? What did I miss?

EDIT

Or is there no way to measure nanosecond elapsed times?

Found this afterwards: measuring precise time in nanoseconds C++.

Is the best one can reliably measure with C++ a millisecond?

EDIT

Adding here a bit of context.

I work on an algorithm that has 3 parts in it (in an iterative loop) and also has several parameters. Overall, the algorithm takes a "long" time, mostly because there may be a large number of iterations of the loop, but the 3 parts inside the loop may each be "fast". It turns out that with some parameters, each part in the loop takes from several seconds down to milliseconds, while with other parameters each part takes from several milliseconds down to nanoseconds (the millisecond count prints 0!). I was trying to understand where the time was spent, so I started using chrono more deeply. Unluckily, it's a one-shot algorithm and I wanted to understand where time is spent "inside" it, so it's quite difficult/impossible to loop over each part (itself inside a loop...) to get an averaged timing. I thought about that first, but it is not realistic: the questions only arise when large graphs are passed as inputs, which makes the algorithm tricky/intractable to debug, so profiling seemed like a way to understand what's going on and where.

From the answers below, I understand this is not easy, for good reasons. I actually expected that, but I wanted to have external points of view.

At this point, say in a case where the overall time is 3000 ms: when I time each part in the loop and sum the timings across iterations, I get 300 ms! That is 10 times less than 3000. I would have accepted a factor of 1.x, but 10 is a whole order of magnitude lost. So I thought: "OK, maybe I am adding up many, many 0's because I use millisecond counters, and some parts take less than a millisecond so they are counted as 0." For now I am still not sure what's going on: my timings are not consistent (the overall magnitude is very different from the accumulated-parts magnitude). Bugs are still possible, although I wrote the code as carefully as possible. As far as I can tell there is no bug (the generated results match the expected results), but I still don't get why the timings are inconsistent (overall vs. parts).

Ideally, I was looking for a way, in all cases (even when measuring nanoseconds), to get a sensible upper estimate of the elapsed time (even if not exact, for reasons well explained in the answers here); up to now, in nanoseconds, I get outer-space timings. I guess there is no way to get that...
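
As an aside on the "adding up many 0 ms" issue described above: one common way around it is to accumulate the raw chrono durations (which keep the clock's full resolution) and convert to milliseconds only once at the end. A minimal sketch, with illustrative names and a placeholder loop body:

#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::steady_clock;
    clock::duration part1_total = clock::duration::zero();  // keeps the clock's full (ns) resolution
    for (int i = 0; i < 1000; ++i) {
        auto t0 = clock::now();
        // ... part 1 of the loop body would go here ...
        auto t1 = clock::now();
        part1_total += t1 - t0;           // no per-iteration truncation
    }
    // Convert only once, at the end, so sub-millisecond parts are not lost.
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(part1_total);
    std::cout << "part 1 total: " << ms.count() << " ms" << std::endl;
}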

CodePudding user response:

Ideally, I need to measure elapsed from 1 nanoseconds up to several seconds.

It is not possible to measure something with a granularity of 1 ns on any mainstream processor. The most precise way to measure a very short piece of code is to use the RDTSC instruction on x86-64 processors. This instruction itself typically takes several dozen cycles. Moreover, modern processors execute instructions in parallel and out of order, so measuring the time of only a few short instructions is simply not possible. In fact, the RDTSCP instruction can be used to limit the out-of-order execution of the processor a bit in such a case, but this is not enough, and that instruction is also a bit slower. The only way to measure a short set of instructions is to repeat it in a loop. Micro-benchmarking is far from easy and requires a deep understanding of the target processor (which is becoming insanely complex nowadays).
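
To illustrate the loop-repetition point, here is a minimal sketch (the loop body and iteration count are only illustrative): repeat the short operation many times, measure the total once, and divide to get the per-iteration cost:

#include <chrono>
#include <iostream>

int main() {
    constexpr long N = 10000000;          // repeat the short operation many times
    volatile long sink = 0;               // volatile keeps the loop from being optimized away
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i)
        sink += i;                        // the "short code" under test
    auto stop = std::chrono::steady_clock::now();
    auto total_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::cout << "total: " << total_ns << " ns, per iteration: "
              << static_cast<double>(total_ns) / N << " ns" << std::endl;
}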

Note that RDTSC provides a cycle count, not a time. Modern processors use frequency scaling, so it can be difficult to convert a number of cycles to a given time. That being said, AFAIK, recent x86-64 processors provide an invariant TSC whose rate is independent of the current core frequency.
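
For reference, a minimal RDTSC sketch, assuming GCC or Clang on x86-64 (where the __rdtsc() intrinsic is available via <x86intrin.h>). Note that it yields TSC cycles, not seconds:

#include <x86intrin.h>   // __rdtsc() intrinsic (GCC/Clang, x86-64)
#include <cstdint>
#include <iostream>

int main() {
    volatile long sink = 0;
    uint64_t t0 = __rdtsc();              // read the time-stamp counter
    for (int i = 0; i < 1000; ++i)
        sink += i;                        // code under test
    uint64_t t1 = __rdtsc();
    // The difference is in TSC cycles; converting it to time requires
    // knowing the (invariant) TSC frequency of the machine.
    std::cout << "elapsed: " << (t1 - t0) << " TSC cycles" << std::endl;
}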

all nanosecond elapsed time measure are wrong?!

Mainstream operating systems do not guarantee that the sleeping time is precise. The precision is generally around a few milliseconds, because the OS schedules threads with a quantum of a few milliseconds (8 ms on my machine). The scheduler can be more precise depending on what the other running threads (typically of other applications) are doing. Still, one should not expect timings to be more precise than a few microseconds due to the time required to perform context switches (since sleeping causes a context switch).

CodePudding user response:

You should account for the time it takes to make the measurements themselves, i.e. the overhead of the call that gets the time. Time a loop of now() calls to figure out the cost of a single call. This isn't significant for times over one second, but when you are measuring nanoseconds, it matters.
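
A minimal sketch of that idea (the iteration count is arbitrary): time a tight loop of now() calls and divide by the number of calls to estimate the per-call overhead:

#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::high_resolution_clock;
    constexpr int N = 1000000;
    auto start = clock::now();
    for (int i = 0; i < N; ++i)
        (void)clock::now();               // the call whose cost we want to estimate
    auto stop = clock::now();
    auto total_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::cout << "now() overhead: about "
              << static_cast<double>(total_ns) / N << " ns per call" << std::endl;
}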

However, this alone does not explain why your 100 ns sleep is measured so long. The reason for that is: The sleep really is that long! It is not the measurement, but the sleep, that is off.

Sleep is just a minimum time. The granularity with which the operating system re-schedules a thread after it sleeps is not 1 nanosecond. On Linux, give the thread real-time priority (SCHED_FIFO or SCHED_RR, as sketched below); it will then be run much more promptly when it becomes runnable. It still won't be 1, 10, or 100 nanoseconds, but maybe 10,000 instead of 100,000.
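
A minimal Linux-only sketch of giving the current thread real-time priority (this typically requires root or CAP_SYS_NICE; the priority value is illustrative):

#include <pthread.h>
#include <sched.h>
#include <cstdio>

int main() {
    sched_param sp{};
    sp.sched_priority = 10;               // illustrative SCHED_FIFO priority
    int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setschedparam failed (error %d)\n", rc);
    // Sleeps performed after this point are typically honored with much
    // smaller overshoot, though still nowhere near 1 ns.
}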

Finally, one nanosecond is a very short time. Light travels one foot in a nanosecond. It's 4 cycles for a CPU running at 4 GHz. How many instructions do you think it takes to put a thread to sleep and wake up? More than 4!

You can't measure times as short as one nanosecond, or even tens of nanoseconds, accurately from software on a general-purpose computer. You need specialized hardware for that. Once you time how long the now() call takes, you'll see that it is both more than 1 ns and that the variation in the time it takes is also more than 1 ns.

CodePudding user response:

The services that you are using encapsulate Linux system calls. Using a tool like strace shows the following on my machine:

$ strace ./test_elapse_measure
[...]
write(1, "time_resolution: 1e-09 s\n", 25time_resolution: 1e-09 s
) = 25
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffcb6727f50) = 0
write(1, "sleep: 1 s\n", 11sleep: 1 s
)            = 11
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, 0x7ffcb6727f50) = 0
write(1, "sleep: 1 ms\n", 12sleep: 1 ms
)           = 12
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1}, 0x7ffcb6727f50) = 0
write(1, "sleep: 274136 ns\n", 17sleep: 274136 ns
)      = 17
exit_group(0)                           = ?
    exited with 0    

The above output shows that the C++ std::this_thread::sleep_for() function relies on the clock_nanosleep() system call with the CLOCK_REALTIME clock identifier. Hence, the measured elapsed time includes the time necessary to run the system call itself (switching from user to kernel space and back). The smaller the requested delay, the more visible the cost of the system call becomes.

But there is no choice here, as putting a process to sleep requires switching to the kernel side to change the process state and put it in the sleep queue.
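
For illustration, this is roughly the call the sleep ends up in on Linux (per the strace output above); even a 1 ns request pays for the user/kernel round trip:

#include <time.h>    // clock_nanosleep() (POSIX/Linux)
#include <cstdio>

int main() {
    timespec req{};
    req.tv_sec = 0;
    req.tv_nsec = 1;                      // ask for a 1 ns sleep, as in the test program
    int rc = clock_nanosleep(CLOCK_REALTIME, 0, &req, nullptr);
    if (rc != 0)
        std::fprintf(stderr, "clock_nanosleep failed (error %d)\n", rc);
    // The actual elapsed time is dominated by the context switch and the
    // scheduler, not by the 1 ns that was requested.
}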

We can notice that the now() method of the clock does not trigger any system call visible in the strace output. It most probably uses clock_gettime(), which is a system call, but one mapped into the vDSO:

Why does the vDSO exist at all? There are some system calls the kernel provides that user-space code ends up using frequently, to the point that such calls can dominate overall performance. This is due both to the frequency of the call as well as the context-switch overhead that results from exiting user space and entering the kernel.

Only a few system calls are mapped into the vDSO. Concerning timers, gettimeofday() and clock_gettime() are typically available through this mechanism. One trick to know whether a system call goes through the vDSO is to use the strace tool: if the call appears in the captured system calls, then the vDSO is not used for that service. As we saw above, clock_nanosleep() is not part of the vDSO since it appears in the strace output, but clock_gettime(), which is normally used by now(), is part of the vDSO.
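
For completeness, a minimal Linux sketch of calling this clock directly: clock_getres() reports the clock's resolution, and clock_gettime() is the call that now() typically ends up in (served from the vDSO, so no kernel entry):

#include <time.h>
#include <cstdio>

int main() {
    timespec res{}, now{};
    clock_getres(CLOCK_MONOTONIC, &res);   // reported resolution of the clock
    clock_gettime(CLOCK_MONOTONIC, &now);  // cheap: answered from the vDSO
    std::printf("resolution: %ld ns, now: %lld.%09ld s\n",
                res.tv_nsec,
                static_cast<long long>(now.tv_sec), now.tv_nsec);
}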

So, your problem does not come from the now() method but from the choice to measure the duration of sleep_for(), which triggers the time-consuming clock_nanosleep() system call.

If we re-run the program under strace with timestamps (the -tt option), we get a rough idea of the cost of the clock_nanosleep() system call:

$ strace -tt ./test_elapse_measure
[...]
18:09:24.462959 write(1, "time_resolution: 1e-09 s\n", 25time_resolution: 1e-09 s
) = 25
18:09:24.463214 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fff359ecf70) = 0
18:09:25.463808 write(1, "sleep: 1 s\n", 11sleep: 1 s
) = 11
18:09:25.464043 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, 0x7fff359ecf70) = 0
18:09:25.465512 write(1, "sleep: 1 ms\n", 12sleep: 1 ms
) = 12
18:09:25.465705 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1}, 0x7fff359ecf70) = 0
18:09:25.465891 write(1, "sleep: 180681 ns\n", 17sleep: 180681 ns
) = 17
18:09:25.466072 exit_group(0)           = ?
18:09:25.466623     exited with 0    

The timestamps show:

  • 1.000594 elapsed seconds for a sleep_for(1 second);
  • 0.001469 elapsed seconds for a sleep_for(1 millisecond);
  • 0.000186 elapsed seconds for a sleep_for(1 nanosecond).