Missing bytes when reading from file

Time:10-05

I am writing some code to combine two .txt files containing test data captured for the same equipment, but taken on separate occasions. The data is stored in a .csv format.

EDIT: To be precise, the files are saved as .txt (UTF-8 with BOM), but their contents are formatted like a CSV file.

Before tackling the combining part, I was sorting out some issues with reading the files (I am relatively inexperienced with C++) when I noticed a mismatch of several thousand bytes between the file size reported by several methods and the number of bytes that could actually be read before reaching EOF. Does anyone know what may be causing this?

Methods used to check file size before reading in:

  1. Constructing a std::filesystem::directory_entry object for the file in question, then calling its .file_size() method. Returns 733435 bytes.
  2. Constructing an fstream object for the file, then running the following code:
#include <iostream>
#include <fstream>

int main() {
    std::fstream data_file(path_to_file, std::ios::in);
    int file_size = 0; // EDIT: Was in the wrong scope

    if (data_file.is_open()) {
        

        data_file.seekg(0, std::ios_base::end);
        file_size = data_file.tellg();
        data_file.seekg(0, std::ios_base::beg);
    }
    
    std::cout << file_size << std::endl; // --> 733435 bytes

}
  3. Checking the properties of the file in File Explorer. File size = 733435 bytes, size on disk = 737280 bytes.

Then when I read in the file as follows:

#include <iostream>
#include <fstream>

int main() {
    std::fstream data_file(path_to_file, std::ios::in);
    
    if (data_file.is_open()) {
        int file_size, chars_read;

        data_file.seekg(0, std::ios_base::end);
        file_size = data_file.tellg();
        data_file.seekg(0, std::ios_base::beg);

        std::cout << "File size: " << file_size << std::endl;
        // |--> "File size: 733435"
    
        char* buffer = new char[file_size];

        // This sets both the eofbit & failbit flags for the stream
        // As is expected if the stream runs out of characters to read in
        // Before n characters are read in. (istream::read(char* s, streamsize n))
        data_file.read(buffer, file_size);

        // We can check the number of chars read in using istream::gcount()
        chars_read = data_file.gcount();

        std::cout << "Chars read: " << chars_read << std::endl;
        // |--> "Chars read: 716153"

        delete[] buffer;
        data_file.close();
    }

}

The mystery deepens somewhat when you look at the contents that are read in. The file is read in using two slightly different methods.

  1. Reading the data line by line into a std::vector<std::string> directly from the file stream.
std::fstream stream(path_to_file, std::ios::in);
std::vector<std::string> v;
std::string s;

while (getline(stream, s, '\n')) {
    v.push_back(s);
}
  2. Reading in the data using fstream::read(...) as above, then converting it to lines using a stringstream object.
//... data read into char* buffer;
std::stringstream ss(buffer, std::ios::in);
std::vector<std::string> v2;
while (getline(ss, s, '\n')) {
    v2.push_back(s);
}

As far as I can tell, these should have the same contents. But...

std::cout << v.size() << std::endl;  //  --> 17283
std::cout << v2.size() << std::endl; // --> 17688

EDIT: The file itself has 17283 lines, the last of which is empty

In conclusion, the reported file size exceeds the number of bytes actually read by just over 17000, and the two methods of processing produce different line counts, so I have no idea what's going on.

Any suggestions are helpful, including more ways to test what's going on.

CodePudding user response:

fstream opens the file in "text" mode by default. On many platforms, this makes no difference, but specifically on Windows systems, text mode will automatically perform character conversion. \r\n on the filesystem will be read as simply \n.
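
As a rough check, the numbers in the question are consistent with this explanation: 733435 - 716153 = 17282, which is one byte per line ending for a file of 17283 lines whose last line is empty. The sketch below computes how many bytes a Windows text-mode read would be expected to return for a given raw byte string. The helper names count_crlf and text_mode_size are hypothetical, and the sketch assumes the only translation applied is \r\n to \n.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts "\r\n" pairs in a raw byte string.
std::size_t count_crlf(const std::string& bytes) {
    std::size_t pairs = 0;
    for (std::size_t i = 0; i + 1 < bytes.size(); ++i) {
        if (bytes[i] == '\r' && bytes[i + 1] == '\n') {
            ++pairs;
            ++i;  // skip the '\n' that belongs to this pair
        }
    }
    return pairs;
}

// Hypothetical helper: number of characters left after each "\r\n"
// has been collapsed to a single '\n' (assumed to be the only
// translation that text mode performs on this data).
std::size_t text_mode_size(const std::string& bytes) {
    const std::size_t pairs = count_crlf(bytes);
    assert(2 * pairs <= bytes.size());  // each pair occupies two bytes
    return bytes.size() - pairs;
}
```

Applied to the raw contents of the file in question, text_mode_size would be expected to return 716153 if every line ends in \r\n.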

See "Difference between opening a file in binary vs text" for more discussion. One of the answers there discusses the allowable use of seek() and tell() in text mode.

An easy thing to try is opening the file in binary mode: combine std::ios::binary with your std::ios::in flag using bitwise OR, i.e. std::ios::in | std::ios::binary.
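
A minimal sketch of that suggestion, assuming the goal is to get every byte on disk into memory. The helper names read_all_bytes and split_lines are hypothetical and are not part of the original code. The second helper builds its string from an explicit length. Constructing a std::stringstream directly from a char* buffer, as in the question, relies on a terminating '\0' that read() does not add, which may be why v2 reported extra lines.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the whole file as raw bytes, with no
// newline translation. Returns an empty vector if the file cannot
// be opened or is empty.
std::vector<char> read_all_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) {
        return {};
    }
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);
    if (size <= 0) {
        return {};
    }
    std::vector<char> buffer(static_cast<std::size_t>(size));
    in.read(buffer.data(), static_cast<std::streamsize>(size));
    const std::streamsize got = in.gcount();
    assert(got <= size);  // read() never returns more than requested
    buffer.resize(static_cast<std::size_t>(got));
    return buffer;
}

// Hypothetical helper: splits the bytes into lines. The string is
// built from the exact byte range, so no null terminator is needed.
// A trailing '\r' left over from a "\r\n" ending is removed.
std::vector<std::string> split_lines(const std::vector<char>& bytes) {
    const std::string text(bytes.begin(), bytes.end());
    std::istringstream ss(text);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(ss, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        lines.push_back(line);
    }
    return lines;
}
```

With the file from the question, read_all_bytes(path_to_file).size() would be expected to match the 733435 bytes reported by file_size(). Because the file is saved as UTF-8 with BOM, the first line may still begin with the three BOM bytes.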
