What's the fastest way of appending text from one file to another with huge files


So I have 5 text files that are 50 GB each, and I'd like to combine all of them into one text file and then call the LINQ method .Distinct() so that there are only unique entries in the new file.

The way I'm doing it now is like so:

foreach (var file in files)
{
    if (Path.GetExtension(file) == ".txt")
    {
        // Reads the whole file into memory at once
        var lines = File.ReadAllLines(file);
        // Dedupes within this file only, not across all files
        var distinctLines = lines.Distinct();
        // `clear` is the path of the combined output file
        File.AppendAllLines(clear, distinctLines);
    }
}

The issue is that the application now loads the entire text file into memory, pushing my RAM usage up to 100%. This solution might have worked if I had 64 GB of RAM, but I only have 16 GB. What's the best option for me to achieve what I'm trying to accomplish? Should I utilize the cores on my CPU? Running a 5900x.

CodePudding user response:

If maintaining order is not important, and if the set of possible starting characters is limited (e.g. A-Z), one possibility would be to say, "OK, let's start with the As".

So you go through each file, line by line, until you find a line starting with 'A'. When you find one, add it to a new file and to a HashSet. Each time you find another line starting with 'A', check whether it is already in the HashSet, and if not, add it to both the new file and the HashSet. Once you've processed all the files, discard the HashSet and move on to the next letter (B).

You're going to iterate through the files 26 times this way.
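To make that concrete, here is a minimal sketch of the per-letter pass, assuming `files` holds the input paths (as in the question) and that `DedupeByFirstLetter` and `outputPath` are hypothetical names for illustration. File.ReadLines streams lazily, so only the current letter's HashSet sits in memory at any one time:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static void DedupeByFirstLetter(IEnumerable<string> files, string outputPath)
{
    using var writer = new StreamWriter(outputPath, append: true);

    foreach (char letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    {
        // Only lines starting with the current letter are held in memory
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (var file in files.Where(f => Path.GetExtension(f) == ".txt"))
        {
            // File.ReadLines streams one line at a time,
            // unlike File.ReadAllLines which loads the whole 50 GB file
            foreach (var line in File.ReadLines(file))
            {
                // HashSet.Add returns false if the line was already seen
                if (line.Length > 0 && line[0] == letter && seen.Add(line))
                    writer.WriteLine(line);
            }
        }
        // seen goes out of scope here, freeing that letter's memory
        // before the next pass starts
    }
}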

Of course you can optimise it even further. Check how much memory is available and divide the possible characters into ranges, so that on the first iteration, for example, your HashSet might hold anything starting with A-D.
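A rough sketch of that range variant, reusing the `files` and `writer` names from the sketch above; the ranges below are arbitrary examples, and in practice you would size them from the available memory:

// One pass per character range instead of one per letter
var ranges = new[] { ('A', 'D'), ('E', 'K'), ('L', 'R'), ('S', 'Z') };

foreach (var (lo, hi) in ranges)
{
    // One HashSet per range, sized so it fits comfortably in RAM
    var seen = new HashSet<string>(StringComparer.Ordinal);

    foreach (var file in files)
    {
        foreach (var line in File.ReadLines(file))
        {
            // Keep the line if its first character falls in the current
            // range and it hasn't been seen before in this pass
            if (line.Length > 0 && line[0] >= lo && line[0] <= hi && seen.Add(line))
                writer.WriteLine(line);
        }
    }
}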
