Reading a file in C++ and comparing between lines

Time:11-06

Assume a file.txt contains random file names as follows:

a.cpp
b.txt
c.java
d.cpp
...

The idea is that I want to separate the file extension from the file name as a substring, and then compare the extensions to look for duplicates.
Here is my code:

#include<iostream>
#include<fstream>
#include<string>
using namespace std;

    int main()
    {

    ifstream infile;   
    infile.open("file.txt"); 

    string str,sub;
    int count,pos=0;

    while(infile>>str)  
    {
    pos=str.find(".");
    sub=str.substr(pos+1);
    
    
    if(sub==?)
        // I stopped here
        count++;
    
    }
    cout<<count;    
    return 0;
    }

I'm new to C++, so I don't know which function to use to jump to the next line. I searched a lot to figure it out, but found nothing.

CodePudding user response:

You can use the following program to print the count corresponding to each extension in the input file. The program uses std::map to keep track of the count.


#include <iostream>
#include <map>
#include <fstream>
#include <string>

int main()
{
   
    std::ifstream inputFile("input.txt");
    
    std::map<std::string, int> countExtOccurence; //this will count how many times each extension occurred
    
    std::string name, extension;
    if(inputFile)
    {
        while(std::getline(inputFile, name, '.')) //this will read up to the next '.'
        {
            std::getline(inputFile, extension, '\n'); //the rest of the line is the extension
            countExtOccurence[extension]++; //increase the count corresponding to a given extension
        }
    }
    else 
    {
        std::cout<<"input file cannot be opened"<<std::endl;
    }
    inputFile.close();
    
    //let's print out how many times each extension occurred in the file
    for(const std::pair<std::string, int> &pairElem: countExtOccurence)
    {
        std::cout<<pairElem.first<<" occurred: "<<pairElem.second<<" time"<<std::endl;
    }
    return 0;
}

The output of the above program can be seen here.

CodePudding user response:

OK, you want to read file names stored in a file, and then get the count for the extensions.

This seems very simple, but it is not. The reason is that file names nowadays can contain all kinds of special characters. There may be spaces in them and also multiple dots ('.'). Depending on the file system, there may be slashes '/' (as in Unix/Linux) or backslashes '\' (as in Windows) as separators. There are also file names without extensions and special files starting with a period (like ".profile"). So, basically, not so easy.

Even with plain file names, the minimum you should do is search for the dot '.' that (maybe) denotes the file's extension from the right end of the string, never from the left. So, in your case, you should use rfind instead of find.
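
To illustrate the rfind hint, here is a minimal sketch. The helper name extensionOf is my own invention and not part of your code, and it assumes the string holds a bare file name without any directory part:

```cpp
#include <string>

// Hypothetical helper (not from the original posts): returns the text after
// the last '.' of a plain file name, or an empty string if there is no dot.
// Assumption: 'name' contains no directory separators.
std::string extensionOf(const std::string& name) {
    const std::string::size_type pos = name.rfind('.');  // search from the right end
    if (pos == std::string::npos) return {};              // no dot, so no extension
    return name.substr(pos + 1);                          // everything after the last dot
}
```

For "archive.tar.gz" this yields "gz", whereas a search with find from the left would yield "tar.gz". Note that such a simple helper still treats ".profile" as having the extension "profile", which is one of the caveats mentioned above.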

Now, to your question: how to read the next line. Your approach, using a formatted input function, will work for the file names shown in the example source file, but it will not work if there are spaces in a file name. Your statement infile>>str stops reading at the first white space character.

Example: if the file name is "Hello World.txt", then "str" would contain only "Hello", and the next read would yield "World.txt". Therefore you should read a complete line with the dedicated function std::getline. Please read a description here.

With that you can read line by line: while(std::getline(inputFile,str)).
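
The difference between the two ways of reading can be sketched as follows. The function names readTokens and readLines are made up for this illustration, and both take any std::istream so that they can be tried with a std::istringstream instead of a real file:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads whitespace-separated tokens, the way `infile >> str` does.
std::vector<std::string> readTokens(std::istream& in) {
    std::vector<std::string> result;
    for (std::string token; in >> token; ) result.push_back(token);
    return result;
}

// Reads complete lines, the way std::getline does.
std::vector<std::string> readLines(std::istream& in) {
    std::vector<std::string> result;
    for (std::string line; std::getline(in, line); ) result.push_back(line);
    return result;
}
```

Given the two input lines "Hello World.txt" and "b.txt", readTokens produces three strings ("Hello", "World.txt", "b.txt"), while readLines produces the two file names as intended.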

Then, later, you can split off the extension and count it.

For splitting off the extension, I already gave you a hint and some caveats. But, and this is very good, C++ has a ready-to-use solution for you: the filesystem library (available since C++17) described here. It has everything that you need.

Especially useful is the path type, which has a member function extension. It does all the nitty-gritty work for you. Note that the returned extension includes the leading dot, for example ".cpp".

And because it does that, I strongly recommend using it.

Now, life becomes simple. See the following example:

#include <iostream>
#include <string>
#include <filesystem>

// Namespace alias to save a lot of typing work . . .
namespace fs = std::filesystem;

int main() {
    // Read any kind of filename from the user
    std::string line{};
    std::getline(std::cin, line);

    // Print the extension
    std::cout << fs::path{ line }.extension().string();
}

So, no worries about operating systems and any type of filenames. It will simply do all the groundwork for you.
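
To see how the path type deals with the special cases mentioned earlier, here is a small sketch. The wrapper name extensionViaPath is hypothetical; it only forwards to fs::path::extension and needs a compiler in C++17 mode:

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical wrapper: returns the extension of 'name' as reported by
// std::filesystem::path, including the leading dot (or "" if there is none).
std::string extensionViaPath(const std::string& name) {
    return fs::path{ name }.extension().string();
}
```

With this, "a.cpp" gives ".cpp", "archive.tar.gz" gives ".gz", "dir/sub/c.java" gives ".java", and both "README" and ".profile" give an empty string, because a leading period alone does not start an extension.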


Next, counting.

There is a more or less standard approach for counting something in a container or in some input, and then optionally ranking the result, that is, sorting it by frequency of occurrence.

For the counting part we can use an associative container like a std::map or a std::unordered_map. Here we associate a "key", in this case the extension, with a value, in this case the count of that specific extension.

Luckily, and this is basically the reason for selecting such an approach, both maps have a very nice index operator[]. It looks for the given key and, if found, returns a reference to the associated value. If the key is not found, it creates a new entry with that key (the extension) and a value-initialized count of 0, and returns a reference to the new value. So, in both cases, we get a reference to the value used for counting. And then we can simply write:

std::unordered_map<std::string, int> counter{};
counter[extension]++;

And that looks really intuitive.
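
As a slightly larger sketch of the same idea, the following function counts a list of strings. The name countOccurrences and the use of a std::vector as input are assumptions made only for this illustration:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: counts how often each string occurs in 'items'.
std::unordered_map<std::string, int> countOccurrences(const std::vector<std::string>& items) {
    std::unordered_map<std::string, int> counter;
    for (const std::string& item : items)
        ++counter[item];  // a missing key is first inserted with value 0, then incremented
    return counter;
}
```

Called with the extensions "cpp", "txt", "java", "cpp", the returned map holds three entries, with a count of 2 for "cpp" and 1 for each of the others.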

After this operation, you already have the frequency table: either sorted by the key (the extension) when using a std::map, or unsorted but with faster access when using a std::unordered_map.

In your case, where you are just interested in a count, a std::unordered_map is advisable, because there is no need to sort data in a std::map by its key and later make no use of this sorting.


Then, maybe you want to sort according to the frequency/count. If you do not want to do that, then skip the following:

Sorting a map by its values is unfortunately not possible, because a major property of the map container family is that elements are organized by their key and not by their value or count.

Therefore we need a second container: either a std::vector or similar, which we can then sort using std::sort with any given predicate, or a container like a std::multiset that implicitly orders its elements, into which we copy the values. And because the latter is just a one-liner, it is the recommended solution.
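
For completeness, the std::vector plus std::sort alternative could look roughly like this. The function name sortByCount is made up, and the tie-breaking rule (alphabetical order for equal counts) is my own choice for the sketch:

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: copies the counter into a vector and sorts it by
// descending count; entries with equal counts are ordered alphabetically.
std::vector<std::pair<std::string, int>> sortByCount(
        const std::unordered_map<std::string, int>& counter) {
    std::vector<std::pair<std::string, int>> ranked(counter.begin(), counter.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
                  return (a.second != b.second) ? a.second > b.second : a.first < b.first;
              });
    return ranked;
}
```

This needs a few more lines than the multiset variant, but the vector gives random access to the ranked entries, which can be handy if only the top few are of interest.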

Additionally, because writing all these long names for the std containers is tedious, we create alias names with the using keyword.

After we have the rank of the extensions, that is, the list of extensions sorted by their count, we can use iterators and loops to access the data and output them.


Because you want to read from a file, I would also like to give some additional information regarding opening and closing a file (a stream).

If you read about the ifstream, then you will see that it has a constructor which takes a filename as input, and a destructor which will automatically close the file for you. Opening a file via the constructor gives you a stream object that has a state. This is, by the way, true for any stream.

The background is that its bool conversion operator is overloaded and returns the state of the stream (true if no error occurred). The not-operator ! is overloaded as well and can also be used. Because of those overloaded operators you can write something like if (filestream) to see if a file could be opened.
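
A tiny sketch of such a state check, wrapped in a hypothetical helper canOpen that is not part of the solution further down:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reports whether 'filename' could be opened for reading.
bool canOpen(const std::string& filename) {
    std::ifstream stream(filename);  // the constructor tries to open the file
    if (!stream) return false;       // operator! is true if the stream is in a failed state
    return true;
}                                    // the destructor closes the file, if it was opened
```

No explicit open or close call is needed here; the lifetime of the stream variable takes care of both.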

Additionally, since C++17, we have an extended if statement, where you can put an init-statement in front of the condition. This is important because it lets us define a variable that is checked in the condition but whose scope is limited to the if statement (including its else branch). In most cases that is very much recommended. Example:

// Open a file and check, if it could be opened
if (std::ifstream infile("file.txt"); infile) {

   // ....   Do things with file stream

} // Here the file will be closed automatically by the destructor

Much better than unnecessary open and close statements.


And now, after we have thought about the design, we can start to write code. Not before.

So, we now get:

#include <iostream>
#include <fstream>
#include <string>
#include <filesystem>
#include <unordered_map>
#include <set>
#include <type_traits>
#include <utility>

// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<std::string, unsigned int>;

// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;

// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Sorter = std::multiset<Pair, Comp>;

// Namespace alias
namespace fs = std::filesystem;
// ------------------------------------------------------------


int main() {

    // Open the source file and check, if it could be opened
    if (std::ifstream inFileStream{ "r:\\file.txt" }; inFileStream) {

        // Here we will count the extensions of the file names
        Counter counter{};

        // Read source file strings and count the extensions
        std::string line{};
        // Read all lines from file
        while (std::getline(inFileStream, line))

            // Get extensions and count them
            counter[ fs::path{ line }.extension().string() ]++;

        // Show result to the user. 
        for (const Pair& p : counter) std::cout << p.first << " --> " << p.second << '\n';

    } // File will be closed here
    else {
        // file could not be opened
        std::cerr << "\n\n*** Error: Input file could not be opened\n\n";
    }
}

With only 8 statements in function main, we can do the whole task, including all kinds of path formats and error handling.

More optimization is possible.

As I mentioned, it is always a good idea to keep the scope of a variable narrow. In the above code, the variable "line" is defined in the scope surrounding the while loop. That is not necessary. And because a for loop and a while loop are basically the same, we can better use a for loop, because it has an initializer part.

Instead of

        std::string line{};
        // Read all lines from file
        while (std::getline(inFileStream, line))

we can write

        for (std::string line{}; std::getline(inFileStream, line); )

We could even exaggerate a little bit and do the counting in the iteration expression of the for loop, so that the whole reading and counting happens in just one for statement:

        for (std::string line{}; std::getline(inFileStream, line); counter[fs::path{ line }.extension().string()]++)
            ;

So, do the complete reading of the file and the complete counting of all kinds of extensions in one statement in one line of code. Wow!

But readability is a little bit lower and we will not use that.


In the output statement, we could make things more readable. Basically, Pair with .first and .second is not that nice and understandable. C++ has a solution for that as well (since C++17). It is called structured binding.

With all the above, we come now to the final implementation, including output sorted by the frequency:

#include <iostream>
#include <fstream>
#include <string>
#include <filesystem>
#include <unordered_map>
#include <set>
#include <type_traits>
#include <utility>

// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<std::string, unsigned int>;

// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;

// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Sorter = std::multiset<Pair, Comp>;

// Namespace alias
namespace fs = std::filesystem;
// ------------------------------------------------------------


int main() {

    // Open the source file and check, if it could be opened
    if (std::ifstream inFileStream{ "r:\\file.txt" }; inFileStream) {

        // Here we will count the extensions of the file names
        Counter counter{};

        // Read all lines from file
        for (std::string line{}; std::getline(inFileStream, line); )

            // Get extensions and count them
            counter[ fs::path{ line }.extension().string() ]++;

        Sorter sorter(counter.begin(), counter.end());

        // Show result to the user. 
        for (const auto& [extension, count] : sorter) std::cout << extension << " --> " << count << '\n';

    } // File will be closed here
    else {
        // file could not be opened
        std::cerr << "\n\n*** Error: Input file could not be opened\n\n";
    }
}