I have a lot of text files, around 10GB in total. What should I use in my program to merge them into one file without duplicates? I want to make sure each line in my output file is unique.
I was thinking about building some kind of hash tree and using MPI. I want it to be efficient.
CodePudding user response:
1. Build a table of files, so you can give every filename simply a number (a std::vector<std::string> works just fine for that).
2. For each file in the table: open it and do the following:
3. Read a line. Hash the line.
4. Have a std::map that maps line hashes (step 3) to std::pair<uint32_t filenumber, size_t byte_start_of_line>. If your new line hash is already in the hash table, open the specified file, seek to the specified position, and check whether your new line and the old line are identical or just share the same hash.
5. If identical, skip; if different or not yet present: add a new entry to the map and write the line to the output file.
6. Read the next line (i.e., go to step 3).
This only takes the RAM needed for the longest line, plus the table of filenames and their numbers, plus the space for the map and its overhead, which should be far less than the actual lines. Since 10GB isn't really much text, hash collisions are relatively unlikely (assuming a hash of at least 64 bits). So if you only need a sufficiently high probability, rather than certainty, that every distinct line ends up in your output, you might as well skip the "check against the existing file" part.
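The steps above could be sketched roughly as follows. This is a minimal sketch under some assumptions, not a tuned implementation: the names LineRef, read_line_at and merge_unique are made up for illustration, std::hash<std::string> stands in for whatever hash you pick, and a std::unordered_multimap is swapped in for the std::map so that two different lines sharing a hash can both be recorded. Files that cannot be opened are silently skipped here.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Where an already-accepted line lives: index into the file table plus the
// byte offset at which the line starts in that file.
struct LineRef {
    std::uint32_t file_number;
    std::streamoff offset;
};

// Re-read the line stored at `ref` so it can be compared with a candidate.
static std::string read_line_at(const std::vector<std::string>& files,
                                const LineRef& ref) {
    std::ifstream in(files[ref.file_number], std::ios::binary);
    in.seekg(ref.offset);
    std::string line;
    std::getline(in, line);
    return line;
}

// Hypothetical helper: merges `files` into `out_path`, keeping only the first
// occurrence of each line. Returns the number of lines written.
std::size_t merge_unique(const std::vector<std::string>& files,
                         const std::string& out_path) {
    std::unordered_multimap<std::size_t, LineRef> seen; // hash -> location
    std::ofstream out(out_path, std::ios::binary);
    std::size_t written = 0;

    for (std::uint32_t n = 0; n < files.size(); ++n) {
        std::ifstream in(files[n], std::ios::binary);
        std::string line;
        while (true) {
            const std::streamoff start = in.tellg(); // offset of this line
            if (!std::getline(in, line)) break;

            const std::size_t h = std::hash<std::string>{}(line);
            bool duplicate = false;
            // Same hash is not proof of equality: compare with the stored text.
            auto [first, last] = seen.equal_range(h);
            for (auto it = first; it != last && !duplicate; ++it)
                duplicate = (read_line_at(files, it->second) == line);
            if (duplicate) continue;

            seen.emplace(h, LineRef{n, start});
            out << line << '\n';
            ++written;
        }
    }
    return written;
}
```

Calling merge_unique({"a.txt", "b.txt"}, "merged.txt") would write each distinct line once, in first-seen order. Reopening the source file for every hash hit keeps the sketch short; if hits are frequent, keeping the input streams open would likely be the first thing to change.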
CodePudding user response:
If you don't have requirements to keep the memory usage low, you could just read all the lines from all the files into a std::set or std::unordered_set. An unordered_set is, as the name implies, not ordered in any particular way, while a set is (lexicographical sort order). I've chosen a std::set here, but you can try a std::unordered_set to see if that speeds things up a little.
Example:
#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <string_view>
#include <vector>

int cppmain(std::string_view program, std::vector<std::string_view> args) {
    if(args.empty()) {
        std::cerr << "USAGE: " << program << " files...\n";
        return 1;
    }

    std::set<std::string> result; // to store all the unique lines

    // loop over all the filenames the user supplied
    for(auto& filename : args) {
        // try to open the file (data() is null-terminated here because each
        // string_view was built from an argv entry)
        if(std::ifstream ifs(filename.data()); ifs) {
            std::string line;
            // read all lines and put them in the set:
            while(std::getline(ifs, line)) result.insert(line);
        } else {
            std::cerr << filename << ": " << std::strerror(errno) << '\n';
            return 1;
        }
    }

    // iterate by const reference to avoid copying every line
    for(const auto& line : result) {
        // ... manipulate the unique line here ...
        std::cout << line << '\n'; // and print the result
    }
    return 0;
}

int main(int argc, char* argv[]) {
    // skip argv[0] (the program name) and pass the rest as arguments
    return cppmain(argv[0], {argv + 1, argv + argc});
}