Reading custom file formats in C++

I read configuration files of the following format into my C++ code:

# name score
Marc 19.7
Alex 3.0
Julia 21.2

So far, I have adapted a solution found here: Parse (split) a string in C++ using string delimiter (standard C++). For example, the following code snippet reads the file line by line, skips the first line, and calls parseDictionaryLine for every other line. That function splits the string as described in the original thread and inserts the values into a (self-implemented) hash table.

void parseDictionaryLine(std::string &line, std::string &delimiter, hash_table &table) {
    size_t position = 0;
    std::string name;
    float score;

    while((position = line.find(delimiter)) != std::string::npos) {
        name = line.substr(0, position);
        line.erase(0, position + delimiter.length());
        score = stof(line);
        table.hinsert(name, score);
    }
}

void loadDictionary(const std::string &path, hash_table &table) {
    std::string line;
    std::ifstream fin(path);
    std::string delimiter = " ";
    int lineNumber = 0;
    if(fin.is_open()) {
        while(getline(fin, line)) {
            if(lineNumber++ < 1) {
                continue; // first line
            }
            parseDictionaryLine(line, delimiter, table);
        }
        fin.close();
    }
    else {
        std::cerr << "Unable to open file." << std::endl;
    }
}

My question is: is there a more elegant way in C++ to achieve this task? In particular, is there (1) a better split function, as for example in Python, (2) a better method to test whether a line is a comment line (starting with #), like startsWith, and (3) potentially even an iterator that handles files similarly to a context manager in Python and makes sure the file will actually be closed? My solution works for the simple cases shown here but becomes clunkier with more complicated variations, such as several comment lines at unpredictable positions and more parameters. Also, it worries me that my solution does not check whether the file actually agrees with the prescribed format (two values per line, the first a string, the second a float). Implementing these checks with my method seems very cumbersome.

I understand there is JSON and other file formats with libraries made for this use case, but I am dealing with legacy code and cannot go there.

CodePudding user response：

You can use operator>> to split at delimiters for you, like this:

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

std::istringstream input{
"# name score\n"
"Marc 19.7\n"
"Alex 3.0\n"
"Julia 21.2\n"
};


auto ReadDictionary(std::istream& stream)
{
    // unordered_map has O(1) average lookup, map has O(log n) lookup
    // so I prefer unordered maps as dictionaries.
    std::unordered_map<std::string, double> dictionary;
    std::string header;

    // read the first line from input (the comment line or header)
    std::getline(stream, header);

    std::string name;
    std::string score;

    // read name and score from line (>> will split at delimiters for you)
    while (stream >> name >> score)
    {
        dictionary.insert({ name, std::stod(score) });
    }

    return dictionary;
}


int main()
{
    auto dictionary = ReadDictionary(input); // todo replace with file stream

    // range based for loop : https://en.cppreference.com/w/cpp/language/range-for
    // structured binding : https://en.cppreference.com/w/cpp/language/structured_binding
    for (const auto& [name, score] : dictionary)
    {
        std::cout << name << ": " << score << "\n";
    }

    return 0;
}
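
The question also mentions comment lines at unpredictable positions and checking the format. A minimal sketch of how the same operator>> idea could be applied line by line, assuming that every data line must consist of exactly one word followed by one number. The name ReadDictionaryChecked and the choice to throw std::runtime_error are my own assumptions, not part of the code above:

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical variant of ReadDictionary: skips empty lines and '#' comment
// lines wherever they occur and rejects lines that are not "<name> <score>".
std::unordered_map<std::string, double> ReadDictionaryChecked(std::istream& stream)
{
    std::unordered_map<std::string, double> dictionary;
    std::string line;
    int lineNumber = 0;

    while (std::getline(stream, line))
    {
        ++lineNumber;

        // skip empty lines and comment lines
        if (line.empty() || line.front() == '#')
            continue;

        std::istringstream fields{ line };
        std::string name;
        double score = 0.0;
        std::string extra;

        // exactly two fields: a word and a number, nothing after them
        if (!(fields >> name >> score) || (fields >> extra))
            throw std::runtime_error("malformed line " + std::to_string(lineNumber) + ": " + line);

        dictionary[name] = score;
    }

    return dictionary;
}
```

It is called like ReadDictionary above, for example ReadDictionaryChecked(input), and the caller can catch the exception to report which line was wrong.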

CodePudding user response：

I will try to answer all your questions.

First, for splitting a string, you should not rely on the top answers of the linked question. It is from 2010 and rather outdated. The more modern answers are at the very bottom of that page.

In C++, many things are done with iterators, because a lot of algorithms and constructors in the standard library work with iterators. So, the better approach for splitting a string is to use iterators. This will then always result in a one-liner.

Background: a std::string is also a container, and you can iterate over elements in it, for example words or values. In the case of space-separated values, you can use the std::istream_iterator on a std::istringstream. But for years there has been a dedicated iterator for iterating over patterns in a string:

The std::sregex_token_iterator. And because it is specifically designed for that purpose, it should be used.

And if it is used for splitting strings, the overhead of using regexes is also minimal. So, you may split on strings, commas, colons or whatever. Example:

#include <iostream>
#include <string>
#include <vector>
#include <regex>

const std::regex re(";");

int main() {

    // Some test string to be split
    std::string test{ "Label;42;string;3.14" };

    // Split and store whatever number of elements in the vector. One Liner
    std::vector data(std::sregex_token_iterator(test.begin(), test.end(), re, -1), {});

    // Some debug output
    for (const std::string& s : data) std::cout << s << '\n';
}

So, regardless of the number of patterns, it will copy all data parts into the std::vector.
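
If you need this in several places, the one-liner can be wrapped in a small function. This is only a sketch. The name split and the decision to pass the delimiter as a regex string are assumptions of mine:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` at every match of the regex `delimiter`.
std::vector<std::string> split(const std::string& text, const std::string& delimiter)
{
    const std::regex re(delimiter);

    // -1 selects the parts between the matches instead of the matches themselves
    return std::vector<std::string>(
        std::sregex_token_iterator(text.begin(), text.end(), re, -1),
        std::sregex_token_iterator());
}
```

Because the delimiter is a regex, a call like split(line, "[;,]\\s*") splits on either a semicolon or a comma, followed by optional whitespace. Characters with a special meaning in regexes, such as '.' or '|', would have to be escaped.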

So, now you have a one liner solution for splitting strings.
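
For space-separated values as in the question, the std::istream_iterator approach mentioned above could look roughly like this. The function name splitWhitespace is just for illustration:

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a line at runs of whitespace.
std::vector<std::string> splitWhitespace(const std::string& line)
{
    std::istringstream stream{ line };

    // The default-constructed iterator is the end-of-stream marker
    return std::vector<std::string>(std::istream_iterator<std::string>(stream),
                                    std::istream_iterator<std::string>());
}
```

Leading, trailing and repeated spaces are skipped, because the iterator uses operator>> internally.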

To check whether the first character of a string is a '#', you may use

the index operator (if (string[0] == '#'))
or, the std::string's front function (if (string.front() == '#'))
or again a regex

But here you need to be careful: the string must not be empty. So, better write: if (not string.empty() and string.front() == '#')
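
If you prefer something that reads like startsWith from other languages, a small helper is enough. A sketch, where startsWith and isCommentLine are hypothetical names (since C++20, std::string also has a starts_with member function, if your compiler supports it):

```cpp
#include <string>

// Hypothetical helper: true if `text` begins with `prefix`.
bool startsWith(const std::string& text, const std::string& prefix)
{
    // compare() with a position and a length only looks at the leading characters
    return text.size() >= prefix.size() && text.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: true for lines whose first character is '#'.
bool isCommentLine(const std::string& line)
{
    return startsWith(line, "#");
}
```

The size check makes the empty string safe, so no separate empty() test is needed at the call site.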

Closing file or iterating over files.

If you use a std::ifstream, then the constructor will open the file for you and the destructor will automatically close it when the stream variable goes out of scope. The typical pattern here is:

// Open the file and check if it could be opened
if (std::ifstream fileStream{"test.txt"}; fileStream) {
    
    // Do things

}  // <-- This will close the file automatically for you
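
The same holds when the stream is a local variable of a function, including early returns. A small sketch, where countLines is a made-up example function and -1 is an arbitrary choice for signalling "could not open":

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: count the lines of a file, or return -1 if it cannot be opened.
long countLines(const std::string& path)
{
    std::ifstream fileStream{ path };   // the constructor opens the file
    if (!fileStream)
        return -1;

    long count = 0;
    for (std::string line; std::getline(fileStream, line); )
        ++count;

    return count;
}   // fileStream is destroyed here, which closes the file on every return path
```

No explicit close() call is needed. This is the closest C++ equivalent to a context manager in Python.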

Then, in general, you should use a more object-oriented approach. Data, and the methods operating on this data, should be encapsulated in one class. You would then overload the extraction operator >> and the insertion operator << to read and write the data, because only the class should know how to handle its data. And if you decide to use a different mechanism, you modify your class and the rest of the outside world will still work.

In your example, input and output are so simple that the simplest I/O will work. No splitting of strings is necessary.

Please see the following example.

And note especially the only few statements in main.

If you change something inside the classes, it will simply continue to work.

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>

// Data in one line
struct Data {
    // Name and score
    std::string name{};
    double score{};

    // Extractor and inserter
    friend std::istream& operator >> (std::istream& is, Data& d) { return is >> d.name >> d.score; }
    friend std::ostream& operator << (std::ostream& os, const Data& d) { return os << d.name << '\t' << d.score; }
};

// Database, so all data from the source file
struct DataBase {
    std::vector<Data> data{};

    // Extractor
    friend std::istream& operator >> (std::istream& is, DataBase& d) {
        // Clear old data
        d.data.clear(); Data element{};

        // Read all lines from source stream
        for (std::string line{}; std::getline(is, line);) {

            // Ignore empty and comment lines
            if (not line.empty() and line.front() != '#') {

                // Call extractor from Data class and get the data.
                // Save the new data in the database only if the line could be parsed
                if (std::istringstream(line) >> element)
                    d.data.push_back(std::move(element));
            }
        }
        return is;
    }
    // Inserter. Output all data
    friend std::ostream& operator << (std::ostream& os, const DataBase& d) {
        std::copy(d.data.begin(), d.data.end(), std::ostream_iterator<Data>(os, "\n"));
        return os;
    }
};

int main() {

    // Open file and check, if it is open
    if (std::ifstream ifs{ "test.txt" }; ifs) {

        // Our database
        DataBase db{};

        // Read all data
        ifs >> db;

        // Debug output show all data
        std::cout << db;
    }
    else std::cerr << "\nError: Could not open source file\n";
}
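
To illustrate the point about encapsulation: if the file later gets more parameters per line, only the struct changes. A sketch, assuming a made-up third integer column called attempts that is not part of the format in the question. The names DataWithAttempts and readAll are hypothetical as well:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical extension of Data: one more column per line
struct DataWithAttempts {
    std::string name{};
    double score{};
    int attempts{};   // assumed extra column, not in the original file format

    // Extractor and inserter know about the third column, nobody else does
    friend std::istream& operator >> (std::istream& is, DataWithAttempts& d) {
        return is >> d.name >> d.score >> d.attempts;
    }
    friend std::ostream& operator << (std::ostream& os, const DataWithAttempts& d) {
        return os << d.name << '\t' << d.score << '\t' << d.attempts;
    }
};

// Reading code in the style of DataBase. It does not mention the columns at all
std::vector<DataWithAttempts> readAll(std::istream& is) {
    std::vector<DataWithAttempts> result{};

    for (std::string line{}; std::getline(is, line);) {

        // Ignore empty and comment lines
        if (line.empty() || line.front() == '#') continue;

        // Keep only lines that the extractor accepts, drop the others silently
        std::istringstream lineStream{ line };
        DataWithAttempts element{};
        if (lineStream >> element) result.push_back(std::move(element));
    }
    return result;
}
```

The reading loop is the same as for two columns. Whether malformed lines should be dropped silently, as here, or reported is a design decision that depends on your use case.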