Unexpected I/O improvement with OpenMP

I have 5 vectors of different sizes stored in a vector (levelFreqs), and I wrote a sequential version and a parallel version of code that stores these vectors on a single disk (no parallel file system). What I see is that OpenMP improves the writing time 3x with 5 cores, and I cannot understand why. Any thoughts?

My sequential code:

for(int f = 0; f < levelFreqs.size(); f++){
    string fileName = "/output/" + to_string(f) + ".txt";
    ofstream output_file(fileName);
    ostream_iterator<long long int> output_iterator(output_file, "|");
    copy(levelFreqs[f].begin(), levelFreqs[f].end(), output_iterator);
    output_file.close();
}

My parallel code:

omp_set_num_threads(5);
#pragma omp parallel
{
    int th_num = omp_get_thread_num();
    for(int f = 0; f < levelFreqs[th_num].size(); f++){
        string fileName = "/output/" + to_string(th_num) + "_" + to_string(f) + ".csv";
        ofstream output_file(fileName);
        ostream_iterator<long long int> output_iterator(output_file, "|");
        copy(levelFreqs[th_num][f].begin(), levelFreqs[th_num][f].end(), output_iterator);
        output_file.close();
    }
}

All the data has already been prepared in levelFreqs, so there should not be any improvement, yet the second version runs 3 times faster than the first!

Edit: Note that in the parallel version, each vector has been divided into 5 partitions and each thread takes a partition. Could writing smaller files be the reason? Overall, the parallel version produces 25 files: each thread writes five files, and each file contains 1/5 of one of the vectors.

CodePudding user response:

Synchronous file I/O is horrible. Disks don't work on the same time scales as registers and CPUs do, and sequential code (both as humans write it and as languages/compilers design for it) is mostly built around that faster time scale and scheduling pattern.

When you naively write blocks of data to a disk, you call a function with the data, wait for it to be written, and repeat until everything is done. If it is a spinning disk, there is a delay as the drive aligns the head with the right track, then a delay as it waits for the platter to spin to the right spot, then a wait while the write occurs and is confirmed, and finally the result is relayed back. At that point the disk could start on a different task.

Meanwhile, your writing process wakes up, restores its execution state, and is told the write is complete. It bundles up the next data and sends another request. But the disk head is no longer in the right place, so it has to wait all over again.

Ideally this track-wait-write cycle would write all of the data in one fell swoop. But naive synchronous APIs can't do that; instead, the code needs to provide a constant stream of data to write, either in one large buffer or supplied on demand, just-in-time style.
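
For illustration, here's a rough sketch of the one-large-buffer idea in C++ (the names writeAllAtOnce and values are mine, not from your code): build the whole payload in memory first, then hand it over in a single call.

#include <fstream>
#include <string>
#include <vector>

// Sketch: serialize everything into one in-memory buffer, then hand it to
// the stream in a single write() call instead of many small synchronous writes.
void writeAllAtOnce(const std::vector<long long>& values, const std::string& path)
{
    std::string buffer;
    buffer.reserve(values.size() * 12);   // rough guess at the final size
    for (long long v : values) {
        buffer += std::to_string(v);
        buffer += '|';
    }
    std::ofstream out(path, std::ios::binary);
    out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
}   // the destructor flushes and closes the file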

Because we know that naive I/O is this bad, libraries lie to you. They'll buffer your write commands and not actually perform them until they have a solid amount of data to write. These buffers come in multiple layers: the hardware may have buffers, the OS may have buffers, the standard library functions you use may have buffers, all "pretending" to write to the disk when they are really just queueing up work.
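
One of those layers, the stream's own user-space buffer, is actually under your control. A sketch, with the caveat that on common implementations pubsetbuf() only takes effect if called before the file is opened (the path is made up):

#include <fstream>
#include <vector>

// Sketch: enlarging the user-space layer of the buffering stack for a
// std::ofstream. pubsetbuf() must precede open() to reliably take effect.
void openWithBigBuffer()
{
    std::vector<char> buf(1 << 20);                 // 1 MiB stream buffer
    std::ofstream out;
    out.rdbuf()->pubsetbuf(buf.data(),
                           static_cast<std::streamsize>(buf.size()));
    out.open("/output/0.txt");
    // writes now accumulate in buf before being handed to the OS
}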

Any of these buffers could get bigger with your change. Any of them could perform better with the different access patterns. Or worse.

The sequential version could be filling up one buffer, forcing your code to wait until the write actually finishes, while the threaded writes get more space across the many buffers described above, and so "finish" long before the data is on disk. Or the buffering could be failing to feed enough data to your output in the single-threaded case, causing stalls as it writes to disk or letting other disk I/O use up more disk time. Or the disk you are writing to might support parallel access to distinct data blocks, so that multiple files can be distributed physically over the hardware address space, giving you more bandwidth; writes issued strictly one after another can't be spread out in the same way.

The layers of abstraction between you and the bits on disk are extremely thick. Predicting behaviour is going to be hard.

I can say that if you care about performance, you shouldn't be using ostreams, but rather the lower-level async I/O options provided by your OS, or a good file I/O library. std is not a high-performance file I/O library; an attempt to get one into the standard is ongoing.
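
As a taste of what "lower level" means, here is a minimal POSIX sketch (still synchronous, error handling omitted, names mine) that at least skips the iostream machinery:

#include <fcntl.h>
#include <string>
#include <unistd.h>

// Sketch: going one level below iostreams with raw POSIX calls: one open(),
// one write() of a preformatted buffer, one close(). Real code should check
// errno and the return value of each call.
void writeRaw(const std::string& data, const char* path)
{
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return;
    ::write(fd, data.data(), data.size());
    ::close(fd);
}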

CodePudding user response:

There are many factors to consider. Here are some.

Memory bandwidth

File operations are buffered by default. This means that data is not directly sent to the disk but copied to an in-memory disk cache. Additionally, the FILE or std::ostream object also buffers small write operations.

A single CPU is not capable of exhausting the main memory bandwidth with simple memory copies. Therefore, anything that primarily involves copies can be expected to get faster with more threads. I would expect main memory to become the primary bottleneck with 2 to 4 threads doing unformatted I/O; more on a server system with multiple memory channels.
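
You can see how much of a "write" is really just a memory copy by timing the write() separately from an fsync() that forces the data out of the cache. A rough sketch, assuming POSIX and a made-up file name:

#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Sketch: time a buffered write (returns once the data is in the page
// cache) versus the fsync() that actually waits for the disk.
int main()
{
    std::vector<char> data(64 * 1024 * 1024, 'x');  // 64 MiB payload
    int fd = ::open("/tmp/io_test.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    auto t0 = std::chrono::steady_clock::now();
    ::write(fd, data.data(), data.size());          // copy into disk cache
    auto t1 = std::chrono::steady_clock::now();
    ::fsync(fd);                                    // force it onto the disk
    auto t2 = std::chrono::steady_clock::now();
    ::close(fd);

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("write: %.1f ms, fsync: %.1f ms\n",
                ms(t1 - t0).count(), ms(t2 - t1).count());
}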

Application performance

Your example uses std::ofstream. The C++ I/O stream library is notoriously slow. Additionally, you use formatted output with int-to-string conversions. These take even more time. It is almost certain that most time is spent in the conversion routine and not doing any actual I/O.
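
If the conversions are the bottleneck, std::to_chars from C++17 is a much cheaper formatter than operator<<, since it skips locale handling entirely. A sketch (formatFreqs is an illustrative name, not part of your code):

#include <charconv>
#include <string>
#include <vector>

// Sketch: format long long values with std::to_chars into one buffer,
// which can then be written to the file in a single call.
std::string formatFreqs(const std::vector<long long>& values)
{
    std::string out;
    out.reserve(values.size() * 12);
    char tmp[24];                        // plenty for a 64-bit integer
    for (long long v : values) {
        auto res = std::to_chars(tmp, tmp + sizeof tmp, v);
        out.append(tmp, res.ptr);
        out += '|';
    }
    return out;                          // write this buffer in one go
}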

Ordering guarantees

Filesystems may provide certain guarantees to guard against accidental data loss, for example guaranteeing that operations happen in order. Especially with regard to metadata operations such as file creation, this may delay operations and let the disk go idle. Multiple threads can help hide these latencies by giving the file system layer an independent set of operations to work with.

I/O scheduling

Keeping the I/O layer busy with multiple requests usually improves throughput. The I/O layer tries to aggregate multiple requests so that the disk can execute them without excessive seeking (and even SSDs benefit from sequential I/O). This is particularly important when operating on many small files. For example the metadata updates to create new file entries on disk could happen in one go.

More important for reading than writing: Multiple requests hide the latency between the disk executing an operation and the application issuing the next one.

Summary

In your example, I'm certain that the single-threaded application bottlenecks in the int-to-string conversion. If you change this to something simpler such as writing preformatted strings, you will either bottleneck on the C++ stream library (if the strings are small) or the other factors (if the strings are large).
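
A micro-benchmark along these lines could separate the conversion cost from the stream cost; the sizes and paths here are made up, and the preformatted string is built outside the timed region:

#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Sketch: per-element formatted output vs. writing one preformatted
// string. Run on a quiet machine and repeat for stable numbers.
int main()
{
    std::vector<long long> v(10'000'000, 1234567890123LL);

    std::string pre;                     // preformat outside the timing
    pre.reserve(v.size() * 15);
    for (long long x : v) { pre += std::to_string(x); pre += '|'; }

    auto t0 = std::chrono::steady_clock::now();
    { std::ofstream f("/tmp/formatted.txt"); for (long long x : v) f << x << '|'; }
    auto t1 = std::chrono::steady_clock::now();
    { std::ofstream f("/tmp/preformatted.txt");
      f.write(pre.data(), static_cast<std::streamsize>(pre.size())); }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("formatted: %.0f ms, preformatted: %.0f ms\n",
                ms(t1 - t0).count(), ms(t2 - t1).count());
}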
