When we talk about updating a clock every second in Linux, I think that something similar to the following code is what comes to mind.
while :; do date +%T; sleep 1; done
This piece of code always bugged me, since there's an infinite loop running two commands every second, which means context switching that produces a slight spike in processor usage.
With that in mind, I'd like to know: is this really the best way to do this? Is there a more clever way to do it? If I were to reproduce this in a low-level language like C, for example, would the only way to do it still be an infinite loop with a printf to show the clock and a one-second sleep? That is, is there a way to avoid such context switching and use the CPU in a smarter way?
CodePudding user response:
I doubt there's a way to avoid the context switch — or if there were a way, it would be more wasteful than techniques involving sleep.
The real problem with techniques epitomized by your
while :; do date +%T; sleep 1; done
is that they lose time. For example, if I run this modification, incorporating my own dateexpr program that has, among other things, the ability to work with subseconds:
while :; do dateexpr %H:%M:%.2S now; sleep 1; done
this is what I see:
10:13:48.40
10:13:49.41
10:13:50.43
10:13:51.44
10:13:52.46
10:13:53.47
10:13:54.49
10:13:55.50
So it looks like the "context switches" — the overhead of firing up each sleep and date or dateexpr process — are taking 10-20 ms.
I've written a program (in C) to get around this. It continually monitors the time, and computes a value of slightly less than a second to sleep for, so that it can invoke a subcommand exactly once per second, on the second. It looks like this:
$ synchro dateexpr %H:%M:%.2S now
10:17:11.01
10:17:12.01
10:17:13.01
10:17:14.01
10:17:15.01
10:17:16.01
10:17:17.01
There's still that 10ms error in starting up the invoked process, but at least it doesn't accumulate.
But in order to do its job, my synchro program is having to make a bunch more system calls, so there are actually more context switches, not fewer.
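The heart of such a program is simple. Here's a minimal sketch of the idea (this is not my actual synchro source, just an illustration of the technique: measure where you are in the current second, sleep the remainder, then run the subcommand):

/* sketch only: sleep until the next whole second, then run the given command */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    for (;;) {
        struct timespec now;
        clock_gettime(CLOCK_REALTIME, &now);            /* where are we within the current second? */
        long ns = 1000000000L - now.tv_nsec;            /* time left until the next whole second */
        struct timespec delay = { ns / 1000000000L, ns % 1000000000L };
        nanosleep(&delay, NULL);                        /* sleep just long enough */

        pid_t pid = fork();
        if (pid == 0) {
            execvp(argv[1], &argv[1]);                  /* run the subcommand, e.g. date or dateexpr */
            _exit(127);
        }
        waitpid(pid, NULL, 0);                          /* reap it before computing the next delay */
    }
}

Because the remaining delay is recomputed from the clock on every iteration, the startup overhead of the subcommand doesn't accumulate the way it does in the plain shell loop.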
But, of course, in general calling something like sleep is the right thing to do when you want to pause for a while, because you are explicitly relinquishing control, and the OS knows it doesn't have to schedule your process to run at all, so you place minimal load on the rest of the system while you're sleeping. Yes, there are a couple of context switches involved, but they seem minimal, a small price to pay, and as I said, I don't think you can get around them.
I've wondered if there was a way to run a clock or timer entirely in user space, and perhaps that's what you're asking, too. But I doubt there's a way to, because there's nothing [Footnote 1] you can get your hands on in user space that gives you any information about time or clocks — that information is all over in the kernel, meaning it's going to take a system call to get to it.
(Here I'm thinking exclusively about a process running under a conventional, multitasking OS, of course. If you were writing embedded code for a microprocessor with an RTC, there's no question you could do exactly what you want, with no context switches at all.)
There's one slim possibility: under at least some (perhaps these days most?) versions of Linux, there's a mechanism called vDSO which enables certain system calls to be carried out in user space, without the need for a context switch. The premier candidate for a system call to receive this special treatment is gettimeofday and related calls. So, on a system using vDSO, you could write a program with a busy-wait loop, repeatedly calling gettimeofday (or time or clock_gettime, if those use vDSO also) until the desired time arrived, and because of vDSO, you'd be doing this without context switches. But of course busy-waiting is an almost irredeemably horrible idea, so I'm not seriously recommending this. (That's what I meant at the beginning of this answer when I said "if there were a way, it would be more wasteful than techniques involving sleep.")
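For concreteness, the kind of busy-wait I mean (and am not recommending) would be something like the following sketch. It avoids context switches while waiting, but only by keeping a CPU core spinning the whole time:

/* sketch only: spin on clock_gettime (vDSO, no kernel entry) until the next second */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    time_t next = now.tv_sec + 1;

    for (;;) {
        do {
            clock_gettime(CLOCK_REALTIME, &now);    /* no context switch, but burns the CPU */
        } while (now.tv_sec < next);
        next = now.tv_sec + 1;

        struct tm tm;
        char buf[16];
        localtime_r(&now.tv_sec, &tm);
        strftime(buf, sizeof buf, "%T", &tm);
        puts(buf);                                  /* the write is still a real system call */
    }
}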
Footnote 1. I said there's "nothing you can get your hands on in user space that gives you any information about time or clocks", but that's not quite true. As the comments from Peter Cordes remind us, Intel processors, at least, give us the "Time Stamp Counter" and the rdtsc instruction to read it. This is a potentially vital — but also hugely problematic! — tool for writing certain high-precision timing applications, but I've never used it so I won't try to explain it or its caveats.
CodePudding user response:
You don't want to avoid context switches entirely; you want to let the kernel run other stuff during the 99% of the second where it's not running /usr/bin/date to format time into a string and write(2) it to stdout. (Or put this CPU core to sleep, saving power. But that actually doesn't count as a context switch, because software never changed page-tables or saved/restored FP registers. Entering the kernel at all, even for a system call, saves/restores integer registers, though, and with software Meltdown mitigation enabled on Intel CPUs that don't have a hardware fix for it, the kernel will actually change page-tables. And Spectre mitigation clearing branch-prediction history is even more expensive.)
(A context switch is necessary, to your terminal emulator or sshd or whatever is controlling the master side of the pseudo-terminal, if you aren't running this on a Linux text console, like Ctrl+Alt+F2. Only in the latter case would writing to video RAM actually happen in the write(1, buf, len) system call made by date, i.e. in the context of that process.)
If you want to minimize context switches (and system calls in general), you need to do the sleeping and writing from within a single process. But that's not possible in bash; it doesn't have a sleep builtin. (Bash does have printf '%(%T)T\n' $EPOCHSECONDS to print the current time, but busy-waiting around that would be terrible.) You'd want to write a program in C that just did sleeps and time-printing.
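A minimal sketch of what I mean (illustration only, not code from any existing tool), using clock_nanosleep with an absolute deadline so that error doesn't accumulate:

/* sketch only: one sleep syscall and one write per second, no fork/exec at all */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec next;
    clock_gettime(CLOCK_REALTIME, &next);
    next.tv_sec += 1;
    next.tv_nsec = 0;                               /* aim at the next whole second */

    for (;;) {
        /* absolute deadline, so startup and formatting overhead can't drift the schedule */
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &next, NULL);

        struct tm tm;
        char buf[16];
        localtime_r(&next.tv_sec, &tm);
        strftime(buf, sizeof buf, "%T\n", &tm);
        fputs(buf, stdout);
        fflush(stdout);                             /* one write(2) per second */

        next.tv_sec += 1;
    }
}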
A loop using a fixed 1-second delay will accumulate error, since it doesn't start the next second until after date has started and exited, and the shell has forked/execed /usr/bin/sleep for the next iteration (plus startup overhead within the sleep executable).
Without writing your own C program, you can get this down to just one fork/exec per second (and a bunch of other system calls) by using watch -p -t --exec, which runs a given command at an interval, directly with fork/exec instead of /bin/sh -c.
-t tells it not to print a header (which includes the time).
-p (precise) has it query the current time with clock_gettime and use nanosleep to avoid error accumulation, aiming for the same target time within a second every time. (The default is to sleep for a fixed interval between runs of your command, no matter how long it took.)
We can trace its system calls to see what it does. (I used a shorter sleep interval so I didn't have to leave it sitting as long.) Note that clock_gettime doesn't show up in strace because it doesn't enter the kernel; the glibc wrapper calls into the vDSO implementation. That code (mapped by the kernel into every user-space process) reads data the kernel exports: a coarse time updated by the kernel's timer interrupts, and a scale factor/offset for rdtsc to interpolate an offset from the current coarse time, since modern x86-64 systems have a precise constant-frequency counter accessible from user space.
(watch actually prints on the "alternate" screen, so the output is gone from your terminal when it exits; that part of the output was faked for example purposes. The rest is copy/pasted from a terminal emulator, with ## comments added.)
# use strace -f ... to trace into child processes, and see all the syscalls from date
$ strace -o foo.tr watch -p -t -n 0.5 --exec date +%T
22:31:54
control-C
$ less foo.tr
... startup stuff from watch, including some terminal-size ioctl
pipe([3, 4]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0f744fba10) = 3832377
# Linux implements fork() in terms of clone(2)
close(4) = 0
fcntl(3, F_GETFL) = 0 (flags O_RDONLY)
newfstatat(3, "", {st_mode=S_IFIFO|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
# (IDK why it's doing an fstat on the pipe FD)
read(3, "22:16:45\n", 4096) = 9
read(3, "", 4096) = 0
# reads from the pipe until EOF
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3832377, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
close(3) = 0
# then closes it
wait4(3832377, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 3832377
rt_sigaction(SIGTSTP, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f0f7453ada0}, {sa_handler=0x7f0
f746f4790, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f0f7453ada0}, 8) = 0
# and waits for the child PID
write(1, "\33[?1049h\33[22;0;0t\33[1;42r\33(B\33[m\33["..., 46) = 46
# clears the screen and moves cursor to the top left
write(1, "22:16:45\33[42;134H", 17) = 17
# and copies what it read from the pipe earlier.
rt_sigaction(SIGTSTP, {sa_handler=0x7f0f746f4790, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f0f7453ada0}, NULL, 8) =
0
## There's a clock_gettime() somewhere, probably here,
## but the vDSO implementation avoids entering the kernel so strace doesn't see it.
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=498451000}, NULL) = 0
# After calculating the exact time until the next event
# tell the kernel we're done until then
# Then the cycle starts over again when it wakes
pipe([3, 4]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0f744fba10) = 3832378
close(4) = 0
fcntl(3, F_GETFL) = 0 (flags O_RDONLY)
...
watch without -t will print the current time as part of its header. So if that's what you want, you don't need date anymore.
But it doesn't have an option to not run any program. It stats /etc/localtime every time in case the current timezone has changed.
You could use /bin/true, but that still has to get forked/execed and run its dynamic linker startup overhead. Or you could use watch --exec /non-existant and let it print an execve error every time. But even then it would still fork before trying to exec, creating a new PID and context-switching to it.