Assuming a file.txt contains random file names as follows:
a.cpp
b.txt
c.java
d.cpp
...
The idea is that I want to separate the file extension from the file name as a substring, and then compare the extensions to look for duplicates.
Here is my code:
#include<iostream>
#include<fstream>
#include<string>
using namespace std;
int main()
{
ifstream infile;
infile.open("file.txt");
string str,sub;
int count=0,pos=0;
while(infile>>str)
{
pos=str.find(".");
sub=str.substr(pos+1);
if(sub==?)
// I stopped here
count++;
}
cout<<count;
return 0;
}
I'm new to C++, so I don't know which function to use to jump to the next line. I searched a lot to figure it out, but found nothing.
CodePudding user response:
You can use the following program to print the count for each extension in the input file. The program uses std::map to keep track of the counts.
#include <iostream>
#include <map>
#include <fstream>
#include <string>
int main()
{
std::ifstream inputFile("input.txt");
std::map<std::string, int> countExtOccurence; //counts how many times each extension occurred
std::string name, extension;
if(inputFile)
{
while(std::getline(inputFile, name, '.')) //this will read up to the next '.'
{
std::getline(inputFile, extension, '\n');
{
countExtOccurence[extension]++; //increase the count corresponding to a given extension
}
}
}
else
{
std::cout<<"input file cannot be opened"<<std::endl;
}
inputFile.close();
//let's print out how many times each extension occurred in the file
for(const std::pair<std::string, int> &pairElem: countExtOccurence)
{
std::cout<<pairElem.first<<" occurred: "<<pairElem.second<<" time"<<std::endl;
}
return 0;
}
The output of the above program can be seen here.
CodePudding user response:
OK, you want to read file names stored in a file, and then get the count for the extensions.
This seems very simple, but it is not. The reason is that file names nowadays can contain all kinds of special characters. There may be spaces in them and also multiple dots ('.'). Depending on the file system, there may be slashes '/' (as in Unix/Linux) or backslashes '\' (as in Windows) as separators. There are also file names without extensions and special files starting with a period (like ".profile"). So it is basically not that easy.
Even with plain file names, the minimum you should do is search for the dot '.' that (maybe) denotes the file's extension from the right end of the string, never from the left. So, in your case, you should use rfind instead of find.
Now, to your question of how to read the next line. Your approach, using a formatted input function, will work for the file names shown in the example source file, but will not work if there are spaces in a file name. For example, your statement infile>>str will stop reading at the first whitespace character.
Example: if the file name is "Hello World.txt", then "str" would contain only "Hello", and the next read would yield "World.txt". Therefore you should read a complete line with the dedicated function std::getline. Please read its documentation.
With that you can read line by line: while(std::getline(inputFile,str)).
Then, later, you can split off the extension and count it.
For splitting off the extension, I already gave you a hint and some caveats. But, and this is very good, C++ has a ready-to-use solution for you: the filesystem library (available since C++17). It has everything that you need, ready to use.
Especially useful is the path type, which has a member function extension. This will do all the nitty-gritty work for you.
And because it does that, I strongly recommend using it.
Now, life becomes simple. See the following example:
#include <iostream>
#include <string>
#include <filesystem>
// Namespace alias to save a lot of typing work . . .
namespace fs = std::filesystem;
int main() {
// Read any kind of filename from the user
std::string line{}; std::getline(std::cin, line);
// Print the extension
std::cout << fs::path{ line }.extension().string();
}
So, no worries about operating systems and any type of filenames. It will simply do all the groundwork for you.
Next, counting.
There is a more or less standard approach for counting something in a container or given by input, and maybe then additionally getting and showing its rank, that is, sorting by the frequency of occurrence.
For the counting part we can use an associative container like a std::map or a std::unordered_map. Here we associate a "key", in this case the extension, with a value, in this case the count of that specific extension.
And luckily, and this is basically the reason for selecting such an approach, both maps have a very nice index operator[]. This will look for the given key and, if found, return a reference to its value. If not found, it will create a new entry with the key (the extension) and a count initialized to 0, and return a reference to that new value. So, in both cases, we get a reference to the value used for counting. And then we can simply write:
std::unordered_map<std::string, int> counter{};
counter[extension]++;
And that looks really intuitive.
After this operation, you already have the frequency table: either sorted by the key (the extension), by using a std::map, or unsorted, but with faster access, by using a std::unordered_map.
In your case, where you are just interested in a count, a std::unordered_map is advisable, because there is no need to sort the data by key in a std::map and then make no use of this sorting.
Then, maybe you want to sort according to the frequency/count. If you do not want to do that, skip the following:
Sorting a map by its values is unfortunately not possible, because a major property of the map container family is that elements are organized by their key and not by their value or count.
Therefore we need a second container: either a std::vector or similar, which we can then sort using std::sort with any given predicate, or a container like a std::multiset that implicitly orders its elements, into which we copy the values. And because the latter is just a one-liner, it is the recommended solution.
Additionally, because writing all these long names for the std containers is tedious, we create alias names with the using keyword.
After we have the rank of the extensions, that is, the list of extensions sorted by count, we can use iterators and loops to access the data and output them.
Because you want to read from a file, I would also like to give some additional information regarding opening and closing a file (a stream).
If you read about std::ifstream, you will see that it has a constructor, which takes a file name as input, and a destructor, which will automatically close the file for you. Opening a file via the constructor gives you a file stream variable which has a state. This is, by the way, true for any stream.
The background is that its bool operator is overloaded and returns the state of the stream. The not operator ! is overloaded as well and can also be used. Because of those overloaded operators you can write something like if (filestream) to see whether a file could be opened.
Additionally, since C++17, we have an extended if statement, where you can put an init-statement in front of the condition. This is important because it allows us to define a variable that is checked in the condition, but with a scope limited to the if statement (including its else branch), which in most cases is very much recommended. Example:
// Open a file and check, if it could be opened
if (std::ifstream infile("file.txt"); infile) {
// .... Do things with the file stream
} // Here the file will be closed automatically by the destructor
Much better than unnecessary open and close statements.
And now, after we have thought about the design, we can start to write code. Not before.
So, we now get:
#include <iostream>
#include <fstream>
#include <string>
#include <filesystem>
#include <unordered_map>
#include <set>
#include <type_traits>
#include <utility>
// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<std::string, unsigned int>;
// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Sorter = std::multiset<Pair, Comp>;
// Namespace alias
namespace fs = std::filesystem;
// ------------------------------------------------------------
int main() {
// Open the source file and check, if it could be opened
if (std::ifstream inFileStream{ "r:\\file.txt" }; inFileStream) {
// Here we will count the extensions of the file names
Counter counter{};
// Read source file strings and count the extensions
std::string line{};
// Read all lines from file
while (std::getline(inFileStream, line))
// Get extensions and count them
counter[ fs::path{ line }.extension().string() ]++;
// Show result to the user.
for (const Pair& p : counter) std::cout << p.first << " --> " << p.second << '\n';
} // File will be closed here
else {
// file could not be opened
std::cerr << "\n\n*** Error: Input file could not be opened\n\n";
}
}
With only 8 statements in function main, we can do the whole task, including all kinds of path formats and error handling.
There is more optimization possible.
As I mentioned, it is always a good idea to keep the scope of a variable narrow. In the above code, the variable "line" is defined in the scope outside the while loop. That is not necessary. And because a for and a while loop are basically the same, we can better use a for loop, because it has an initializer part.
Instead of
std::string line{};
// Read all lines from file
while (std::getline(inFileStream, line))
we can write
for (std::string line{};std::getline(inFileStream, line); )
We could even exaggerate a little bit and do the counting in the iteration expression of the for loop, and so do the whole reading and counting in just one for statement:
for (std::string line{}; std::getline(inFileStream, line); counter[fs::path{ line }.extension().string()]++)
;
So, do the complete reading of the file and the complete counting of all kinds of extensions in one statement in one line of code. Wow!
But readability is a little bit lower and we will not use that.
In the output statement, we could make things more readable. Basically, Pair with .first and .second is not that nice and understandable. C++ also has a solution for that. It is called structured binding.
With all the above, we come now to the final implementation, including output sorted by the frequency:
#include <iostream>
#include <fstream>
#include <string>
#include <filesystem>
#include <unordered_map>
#include <set>
#include <type_traits>
#include <utility>
// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<std::string, unsigned int>;
// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Sorter = std::multiset<Pair, Comp>;
// Namespace alias
namespace fs = std::filesystem;
// ------------------------------------------------------------
int main() {
// Open the source file and check, if it could be opened
if (std::ifstream inFileStream{ "r:\\file.txt" }; inFileStream) {
// Here we will count the extensions of the file names
Counter counter{};
// Read all lines from file
for (std::string line{}; std::getline(inFileStream, line); )
// Get extensions and count them
counter[ fs::path{ line }.extension().string() ]++;
Sorter sorter(counter.begin(), counter.end());
// Show result to the user.
for (const auto& [extension, count] : sorter) std::cout << extension << " --> " << count << '\n';
} // File will be closed here
else {
// file could not be opened
std::cerr << "\n\n*** Error: Input file could not be opened\n\n";
}
}