How do I find identical lines in two files in C?

Time:10-06

I need to write a program for a school project that compares the lines of two large files, one of about 1.5 GB (40 million lines) and the other of about 5 GB (100 million lines), finds the lines that appear in both, and writes those lines to a new file.

I've already tried writing this program in Node.js and Python, but neither could get through the files; in Python it took about 30 minutes to compare a single line. Perhaps I was doing something wrong.

I wonder whether C could handle this task with ease. What is the fastest way to compare these files? Any suggestions?

CodePudding user response:

You have multiple options to go about this and none of them are pretty.

I believe one of the more efficient options goes something like this (note that it is C++, not plain C):

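#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main()
{
    std::ifstream firstFile("firstFile");
    std::ifstream secondFile("secondFile");
    std::ofstream resultFile("resultFile");   // Placeholder name for the output file.

    if (firstFile.is_open() && secondFile.is_open() && resultFile.is_open())
    {
        std::string line;
        while (secondFile)
        {
            // Load the next chunk of up to 500k lines from the second file.
            // The bool records whether the line was also seen in the first file.
            std::map<std::string, bool> stringMap;
            for (int i = 0;
                i < 500000 &&
                std::getline(secondFile, line);
                ++i
            )
                stringMap[line] = false;

            if (stringMap.empty())
                break;              // Nothing left to read from the second file.

            // Test the getline result itself, so the last line is not skipped.
            while (std::getline(firstFile, line))
            {
                std::map<std::string, bool>::iterator it = stringMap.find(line);
                if (it != stringMap.end())
                    it->second = true;
            }

            firstFile.clear();      // Clears the eof flag.
            firstFile.seekg(0);     // Rewinds to the beginning of the first file.

            // Iterate over the map and write all matches to the result file.
            for (std::map<std::string, bool>::const_iterator it = stringMap.begin();
                it != stringMap.end(); ++it)
                if (it->second)
                    resultFile << it->first << '\n';
        }
    }

    return 0;
}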

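This (kinda pseudo) code reads the second file in chunks of 500k lines, checks each chunk against every line of the first file, and gives you a list of all lines that are in both files. Note that the first file is re-read from the start once per chunk, so a larger chunk size means fewer passes at the cost of more memory.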

Completely untested though.
