If you define THR, the code will do the same job but in another thread. I only measured the time spent on the write call. Running the code with ./some-file-name > /dev/null, this is the result I get, which is the accumulated clock cycles.
THR not defined:
1 48930106
2 43946464
3 44669126
4 45918011
5 44108477
6 43608789
7 45104427
8 49676889
9 44682305
10 47516931
THR defined:
1 108347418
2 101670307
3 101726085
4 100531554
5 100137343
6 85837022
7 105556754
8 104681843
9 110303338
10 104666783
Why is write so much slower when called from another thread?
The system is Fedora Linux.
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <semaphore.h>
#include <fcntl.h>
#include <unistd.h>
#include <immintrin.h>
#ifdef __cplusplus
#include <atomic>
using namespace std;
#else
#include <stdatomic.h>
#endif

#define SIZE 0x100000

static unsigned long long rdtscp() {
    unsigned _;
    return __rdtscp(&_);
}

static char b[SIZE];
static atomic_ullong oc;

#ifdef THR
static sem_t s[2];

void *out(void *_) {
    for (;;) {
        sem_wait(s);
        unsigned long long c = rdtscp();
        write(1, b, SIZE);
        oc += rdtscp() - c;
        sem_post(s + 1);
    }
    return _;
}
#endif

int main() {
    memset(b, 'a', SIZE);
#ifdef THR
    sem_init(s, false, 0);
    sem_init(s + 1, false, 0);
    pthread_t t;
    pthread_create(&t, NULL, out, NULL);
#endif
    for (int i = 1;; i++) {
#ifdef THR
        sem_post(s);
        sem_wait(s + 1);
#else
        unsigned long long c = rdtscp();
        write(1, b, SIZE);
        oc += rdtscp() - c;
#endif
        const int d = 100000;
        if (!(i % d)) {
            unsigned long long _oc = atomic_exchange(&oc, 0);
            fprintf(stderr, "%d %llu\n", i / d, _oc);
        }
    }
}
Not sure if this is okay, but I made the code compile in both C and C++ so I could add the C++ tag. I will roll back if this is inappropriate.
CodePudding user response:
In one case, the data to be written is hot in the core's cache. There's no synchronization or dispatch overhead.
In the other case, the data to be written is modified in some other core's cache. There's both synchronization and dispatch overhead.
It's not surprising it takes more clock cycles to process the data when caches are cold, cores have to be synchronized, and a new task has to be dispatched rather than an existing task simply continuing.
A common anti-pattern is to convey data from thread to thread rather than processing it to completion in a single thread.