How to read large files in segments?

Time:03-08

I'm using small files currently for testing and will scale up once it works.

I made a file bigfile.txt that contains:

ABCDEFGHIJKLMNOPQRSTUVWXYZ

I'm running this to segment the data that is being read from the file:

#include <iostream>
#include <fstream>
#include <memory>
using namespace std;

int main()
{
    ifstream file("bigfile.txt", ios::binary | ios::ate);
    cout << file.tellg() << " Bytes" << '\n';

    ifstream bigFile("bigfile.txt");
    constexpr size_t bufferSize = 4;
    unique_ptr<char[]> buffer(new char[bufferSize]);
    while (bigFile)
    {
        bigFile.read(buffer.get(), bufferSize);
        // print the buffer data
        cout << buffer.get() << endl;
    }
}

This gives me the following result:

26 Bytes
ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZWX

Notice how, in the last line, the characters 'WX' are repeated after 'Z'?

How do I get rid of it so that it stops after reaching the end?

CodePudding user response:

cout << buffer.get() uses the const char* overload, which prints a null-terminated C string.

But your buffer isn't null-terminated, and istream::read() can read fewer characters than the buffer size. So when you print the buffer, you end up printing old characters that were already there, and the output keeps going until a null character happens to be encountered (reading past the end of the buffer is undefined behavior).

Use istream::gcount() to determine how many characters were read, and print exactly that many characters. For example, using std::string_view (C++17):

#include <iostream>
#include <fstream>
#include <memory>
#include <string_view>
using namespace std;

int main()
{
    ifstream file("bigfile.txt", ios::binary | ios::ate);
    cout << file.tellg() << " Bytes" << "\n";
    file.seekg(0, std::ios::beg); // rewind to the beginning

    constexpr size_t bufferSize = 4;
    unique_ptr<char[]> buffer = std::make_unique<char[]>(bufferSize);
    while (file)
    {
        file.read(buffer.get(), bufferSize);
        auto bytesRead = file.gcount();
        if (bytesRead == 0) {
            // EOF
            break;
        }
        // print the buffer data
        cout << std::string_view(buffer.get(), bytesRead) << endl;
    }
}

Note also that there's no need to open the file again - you can rewind the original one to the beginning and read it.
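
If the same loop is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: forEachChunk and collectChunks are hypothetical names (not part of the standard library), and the handler signature (pointer plus byte count) is an assumption. It takes any std::istream, so it can be tried on a std::istringstream before pointing it at a large file.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads `in` in pieces of at most `chunkSize` bytes and
// passes each piece to `handle(data, size)`. Returns the total bytes read.
template <typename Handler>
std::size_t forEachChunk(std::istream& in, std::size_t chunkSize, Handler handle)
{
    assert(chunkSize > 0);
    std::vector<char> buffer(chunkSize);
    std::size_t total = 0;
    while (in)
    {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize bytesRead = in.gcount();
        if (bytesRead == 0)
            break; // nothing left to read
        // Hand over only the bytes this call actually filled in.
        handle(buffer.data(), static_cast<std::size_t>(bytesRead));
        total += static_cast<std::size_t>(bytesRead);
    }
    return total;
}

// Usage example on an in-memory string: collects every chunk as a string.
std::vector<std::string> collectChunks(const std::string& text, std::size_t chunkSize)
{
    std::istringstream in(text);
    std::vector<std::string> chunks;
    forEachChunk(in, chunkSize, [&chunks](const char* data, std::size_t size) {
        chunks.emplace_back(data, size);
    });
    return chunks;
}
```

For a really large file the handler would process or print each chunk directly instead of storing it, so only one buffer's worth of data is in memory at a time.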

CodePudding user response:

The problem is that the buffer's old content isn't overwritten on the last read. Here's what your code does:

  • It reads the file four characters at a time, and each full read overwrites the whole buffer.
  • When it reaches 'YZ', only two characters are left, so the read overwrites only the buffer's first two characters ('U' and 'V') and leaves 'W' and 'X' in place.

One easy fix is to clear the buffer before each file read:

#include <iostream>
#include <fstream>
#include <array>

int main()
{
    std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
    int fileSize = bigFile.tellg();
    std::cout << bigFile.tellg() << " Bytes" << '\n';

    bigFile.seekg(0);
    
    constexpr size_t bufferSize = 4;
    std::array<char, bufferSize> buffer;
    
    while (bigFile)
    {
        for (size_t i = 0; i < bufferSize; ++i)
            buffer[i] = '\0';
        bigFile.read(buffer.data(), bufferSize);
        // Print the buffer data
        std::cout.write(buffer.data(), bufferSize) << '\n';
    }
}

I also changed:

  • The std::unique_ptr<char[]> to a std::array, since we don't need dynamic allocation here and a std::array is safer than a C-style array
  • The printing instruction to std::cout.write, because printing a buffer that is not null-terminated with std::cout << caused undefined behavior (see @paddy's comment). std::cout << prints a null-terminated string (a sequence of characters terminated by a '\0' character), whereas std::cout.write prints a fixed number of characters
  • The second file opening to a call to the std::istream::seekg method (see @rustyx's answer).
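
One thing to keep in mind with the fixed-size std::cout.write above: it always emits bufferSize bytes, so a short final read is padded with '\0' bytes, and a file whose size is an exact multiple of bufferSize gets one extra all-zero chunk. A minimal sketch of the difference, assuming in-memory string streams in place of the file (copyInChunks and kBufferSize are hypothetical names used only for this illustration):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

constexpr std::size_t kBufferSize = 4; // same chunk size as above

// Hypothetical helper: copies `input` chunk by chunk into a string.
// trimToGcount == false writes the whole zero-filled buffer every time;
// trimToGcount == true writes only the bytes the last read() delivered.
std::string copyInChunks(const std::string& input, bool trimToGcount)
{
    std::istringstream in(input);
    std::ostringstream out;
    std::array<char, kBufferSize> buffer{};
    const auto fullSize = static_cast<std::streamsize>(buffer.size());
    while (in)
    {
        buffer.fill('\0');
        in.read(buffer.data(), fullSize);
        assert(in.gcount() <= fullSize); // read() never reports more than requested
        out.write(buffer.data(), trimToGcount ? in.gcount() : fullSize);
    }
    return out.str();
}
```

With the input "ABCDEF", the fixed-size variant produces 8 bytes (the last two are '\0'), while the gcount() variant reproduces the 6 input bytes. On a terminal the padding is usually invisible, but it matters if the output is redirected to a file.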

Another way of doing this, which avoids clearing the buffer before every read, is to read the file character by character, put each character in the buffer, and print the buffer whenever it is full. After the main for loop, we print the buffer one last time if it holds a partial chunk that hasn't been printed yet.

#include <iostream>
#include <fstream>
#include <array>

int main()
{
    std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
    int fileSize = bigFile.tellg();
    std::cout << bigFile.tellg() << " Bytes" << '\n';

    bigFile.seekg(0);
    
    constexpr size_t bufferSize = 4;
    std::array<char, bufferSize> buffer;
    
    int bufferIndex = -1; // stays -1 if the file is empty
    for (int i(0); i < fileSize; ++i)
    {
        // Add one character to the buffer
        bufferIndex = i % bufferSize;
        buffer[bufferIndex] = bigFile.get();
        // Print the buffer data
        if (bufferIndex == bufferSize - 1)
            std::cout.write(buffer.data(), bufferSize) << '\n';
    }
    // Overwrite the characters which haven't been already (in this case 'W' and 'X')
    for (++bufferIndex; bufferIndex < bufferSize; ++bufferIndex)
        buffer[bufferIndex] = '\0';
    // Print the buffer for the last time if it hasn't been already
    if (fileSize % bufferSize /* != 0 */)
        std::cout.write(buffer.data(), bufferSize) << '\n';
}
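
The character-by-character idea can also be written without index arithmetic or padding. A rough sketch, assuming it is acceptable to collect the groups in a std::vector<std::string> (groupByChar is a hypothetical name, not something from the answers above):

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads `in` one character at a time and groups the
// characters into strings of at most `groupSize` characters.
std::vector<std::string> groupByChar(std::istream& in, std::size_t groupSize)
{
    assert(groupSize > 0);
    std::vector<std::string> groups;
    std::string current;
    char c;
    while (in.get(c)) // stops as soon as no character could be extracted
    {
        current.push_back(c);
        if (current.size() == groupSize)
        {
            groups.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        groups.push_back(current); // partial final group, no padding needed
    return groups;
}
```

Because the final group is simply shorter, there are no leftover characters to overwrite. For a really large file you would print or process each group as it fills up instead of storing all of them.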