C++ program gives up when reading large binary file

Time:07-16

I'm using a file from the MNIST website as an example; specifically, t10k-images-idx3-ubyte.gz. To reproduce my problem, download that file and unzip it, and you should get a file named t10k-images.idx3-ubyte, which is the file we're reading from.

My problem is that when I try to read bytes from this file in one big block, it seems to read a bit of it and then sort of gives up. Below is a bit of C++ code that attempts to read (almost) all of the file at once, and then dumps it into a text file for debugging purposes. (Excuse me for the unnecessary #includes.) For context, the file is a binary file whose first 16 bytes are magic numbers, which is why I seek to byte 16 before reading it. Bytes 16 to the end are raw greyscale pixel values of 10,000 images of size 28 x 28.

#include <array>
#include <iostream>
#include <fstream>
#include <string>
#include <exception>
#include <vector>

int main() {
  try {
    std::string path = R"(Drive:\Path\To\t10k-images.idx3-ubyte)";
    std::ifstream inputStream {path};
    inputStream.seekg(16);  // Skip the magic numbers at the beginning.
    char* arrayBuffer = new char[28 * 28 * 10000];  // Allocate memory for 10,000 greyscale images of size 28 x 28.
    inputStream.read(arrayBuffer, 28 * 28 * 10000);
    std::ofstream output {R"(Drive:\Path\To\PixelBuffer.txt)"};  // Output file for debugging.
    for (size_t i = 0; i < 28 * 28 * 10000; i++) {
      output << static_cast<short>(arrayBuffer[i]);
      // Below prints a new line after every 28 pixels.
      if ((i + 1) % 28 == 0) {
        output << "\n";
      }
      else {
        output << " ";
      }
    }
    std::cout << inputStream.good() << std::endl;
    std::cout << "WTF?" << std::endl;  // LOL. I just use this to check that everything's actually been executed, because sometimes the program shits itself and quits silently.
    delete[] arrayBuffer;
  } catch (const std::exception& e) {
    std::cout << e.what() << std::endl;
  } catch (...) {
    std::cout << "WTF happened!?!?" << std::endl;
  }
  return 0;
}

When you run the code (after modifying the paths appropriately) and check the output text file, you will see that it initially contains legitimate byte values from the file (integers between -128 and 127, but mostly 0). As you scroll down, though, you will find that after some legitimate values the printed values all become the same (in my case either all 0 or all -51 for some reason). What you see may be different on your computer, but in any case I assume they are uninitialised bytes. So it seems that ifstream::read() works for a while, then gives up and stops reading very quickly. Am I missing something basic? Like, is there a limit I don't know about on the number of bytes I can read at once?

EDIT Oh by the way I'm on Windows.

CodePudding user response:

Concerning the OP's code to open the binary file:

std::ifstream inputStream {path};

It should be:

std::ifstream inputStream(path, std::ios::binary);

It's a common trap on Windows:

A file stream should be opened with std::ios::binary to read or write binary files.

cppreference.com has a nice explanation concerning this topic:

Binary and text modes

A text stream is an ordered sequence of characters that can be composed into lines; a line can be decomposed into zero or more characters plus a terminating '\n' (“newline”) character. Whether the last line requires a terminating '\n' is implementation-defined. Furthermore, characters may have to be added, altered, or deleted on input and output to conform to the conventions for representing text in the OS (in particular, C streams on Windows OS convert '\n' to '\r\n' on output, and convert '\r\n' to '\n' on input).

Data read in from a text stream is guaranteed to compare equal to the data that were earlier written out to that stream only if each of the following is true:

  • The data consist of only printing characters and/or the control characters '\t' and '\n' (in particular, on Windows OS, the character '\0x1A' terminates input).
  • No '\n' character is immediately preceded by space characters (such space characters may disappear when such output is later read as input).
  • The last character is '\n'.

A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream always equal the data that were earlier written out to that stream, except that an implementation is allowed to append an indeterminate number of null characters to the end of the stream. A wide binary stream doesn't need to end in the initial shift state.

It's a good idea to use std::ios::binary for stream I/O of binary files on any platform. It has no effect on platforms where text and binary modes behave the same (e.g. Linux).
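To see the difference for yourself, it may help to check gcount() after the read, since it reports how many bytes the last read() actually delivered. Below is a minimal sketch, not a definitive test: the helper names writeSampleFile and countBytesRead and the sample file are made up for this illustration. It writes a small file with a 0x1A byte in the middle and then tries to read it back in one call.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: writes `size` zero bytes to `path`, with a single
// 0x1A byte (Ctrl-Z) placed in the middle.
void writeSampleFile(const std::string& path, std::size_t size) {
  assert(size > 0);
  std::vector<char> bytes(size, 0);
  bytes[size / 2] = 0x1A;
  std::ofstream out(path, std::ios::binary);
  out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}

// Hypothetical helper: tries to read `size` bytes in one call using the
// given open mode and returns how many bytes actually arrived.
std::size_t countBytesRead(const std::string& path,
                           std::ios_base::openmode mode,
                           std::size_t size) {
  std::ifstream in(path, mode);
  std::vector<char> buffer(size);
  in.read(buffer.data(), static_cast<std::streamsize>(size));
  return static_cast<std::size_t>(in.gcount());
}
```

After writeSampleFile("gcount_demo.bin", 1000), calling countBytesRead("gcount_demo.bin", std::ios::in | std::ios::binary, 1000) should return 1000. With plain std::ios::in on Windows, I would expect a smaller number, because the text-mode stream treats the 0x1A byte as the end of input; on Linux both modes should give 1000.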

CodePudding user response:

To read binary files, you need to use std::ifstream inputStream(path, std::ios_base::binary); otherwise the application may not read the data correctly.

So, the correct code is:

#include <array>
#include <iostream>
#include <fstream>
#include <string>
#include <exception>
#include <vector>

int main() {
  try {
    std::string path = R"(Drive:\Path\To\t10k-images.idx3-ubyte)";
    std::ifstream inputStream (path, std::ios_base::binary);
    inputStream.seekg(16);  // Skip the magic numbers at the beginning.
    char* arrayBuffer = new char[28 * 28 * 10000];  // Allocate memory for 10,000 greyscale images of size 28 x 28.
    inputStream.read(arrayBuffer, 28 * 28 * 10000);
    std::ofstream output {R"(Drive:\Path\To\PixelBuffer.txt)"};  // Output file for debugging.
    for (size_t i = 0; i < 28 * 28 * 10000; i++) {
      output << static_cast<short>(arrayBuffer[i]);
      // Below prints a new line after every 28 pixels.
      if ((i + 1) % 28 == 0) {
        output << "\n";
      }
      else {
        output << " ";
      }
    }
    std::cout << inputStream.good() << std::endl;
    std::cout << "WTF?" << std::endl;  // LOL. I just use this to check that everything's actually been executed, because sometimes the program shits itself and quits silently.
    delete[] arrayBuffer;
  } catch (const std::exception& e) {
    std::cout << e.what() << std::endl;
  } catch (...) {
    std::cout << "WTF happened!?!?" << std::endl;
  }
  return 0;
}

This fix does not depend on the platform (e.g. Windows, Linux).
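As a possible follow-up, here is a rough sketch of a loader that opens the file in binary mode, reads the 16-byte header instead of skipping it, and fails loudly on a short read rather than leaving part of the buffer uninitialised. It assumes the header follows the IDX layout described on the MNIST page (four big-endian 32-bit integers: magic number 0x00000803, image count, rows, columns) and that pixels are stored row by row. The names IdxImages, readBigEndian32, loadIdxImages and pixelAt are made up for this example. Storing the pixels as unsigned char also avoids the negative values seen in the question's output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <istream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the decoded file.
struct IdxImages {
  std::uint32_t count = 0;
  std::uint32_t rows = 0;
  std::uint32_t cols = 0;
  std::vector<unsigned char> pixels;  // count * rows * cols bytes, each 0..255
};

// Reads one big-endian 32-bit unsigned integer from the stream.
std::uint32_t readBigEndian32(std::istream& in) {
  unsigned char b[4] = {};
  in.read(reinterpret_cast<char*>(b), 4);
  if (in.gcount() != 4) throw std::runtime_error("truncated header");
  return (std::uint32_t{b[0]} << 24) | (std::uint32_t{b[1]} << 16) |
         (std::uint32_t{b[2]} << 8) | std::uint32_t{b[3]};
}

// Loads every image from an IDX3 file, assuming the header layout above.
IdxImages loadIdxImages(const std::string& path) {
  std::ifstream in(path, std::ios::binary);  // binary mode is the actual fix
  if (!in) throw std::runtime_error("cannot open " + path);

  if (readBigEndian32(in) != 0x00000803u) {
    throw std::runtime_error("unexpected magic number");
  }
  IdxImages images;
  images.count = readBigEndian32(in);
  images.rows = readBigEndian32(in);
  images.cols = readBigEndian32(in);

  images.pixels.resize(static_cast<std::size_t>(images.count) *
                       images.rows * images.cols);
  in.read(reinterpret_cast<char*>(images.pixels.data()),
          static_cast<std::streamsize>(images.pixels.size()));
  // gcount() tells us how many bytes the last read() really delivered.
  if (static_cast<std::size_t>(in.gcount()) != images.pixels.size()) {
    throw std::runtime_error("short read: fewer pixel bytes than expected");
  }
  return images;
}

// Returns the pixel at (row, col) of image `index`.
unsigned char pixelAt(const IdxImages& images, std::size_t index,
                      std::size_t row, std::size_t col) {
  assert(index < images.count && row < images.rows && col < images.cols);
  return images.pixels[(index * images.rows + row) * images.cols + col];
}
```

For the file in the question, loadIdxImages(path) would be expected to report a count of 10000 with 28 rows and 28 columns, and pixelAt(images, 0, 0, 0) would give the top-left pixel of the first image.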
