C# - Concatenating files until process memory is full then delete duplicates


I'm currently working on a C# form. Basically, I have a lot of log files, and most of them share duplicate lines. This form is supposed to concatenate many of those files into one file and then delete all the duplicates in it, so that I end up with a single log file without duplicates. I've already made this work by taking two files, concatenating them, deleting all the duplicates, and repeating the process until there are no more files. Here is the function I made for this:

private static void DeleteAllDuplicatesFastWithMemoryManagement(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1, BackgroundWorker backgroundWorker1)
{
    for (int j = 0; j < path_list.Length; j++)
    {
        HashSet<string>.Enumerator em = path_list[j].GetEnumerator();
        List<string> LogFile = new List<string>();
        while (em.MoveNext())
        {
            var secondLogFile = File.ReadAllLines(em.Current);
            LogFile = LogFile.Concat(secondLogFile).ToList();
            LogFile = LogFile.Distinct().ToList();
            backgroundWorker1.ReportProgress(1);
        }
        LogFile = LogFile.Distinct().ToList();
        string new_path = parent_path + "/new_data/probe." + j + ".log";
        File.WriteAllLines(new_path, LogFile.Distinct().ToArray());
    }
}

path_list contains all the paths to the files I need to process: path_list[0] contains the paths of all the probe.0.log files, path_list[1] contains the paths of all the probe.1.log files, and so on.

Here is the idea I have for my problem, but I have no idea how to code it:

private static void DeleteAllDuplicatesFastWithMemoryManagement(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1, BackgroundWorker backgroundWorker1)
{
    for (int j = 0; j < path_list.Length; j++)
    {
        HashSet<string>.Enumerator em = path_list[j].GetEnumerator();
        List<string> LogFile = new List<string>();
        while (em.MoveNext())
        {
            // how I see it
            if (currentMemoryUsage + newfile.Length > maximumProcessMemory)
            {
                LogFile = LogFile.Distinct().ToList();
            }
            // end
            var secondLogFile = File.ReadAllLines(em.Current);
            LogFile = LogFile.Concat(secondLogFile).ToList();
            LogFile = LogFile.Distinct().ToList();
            backgroundWorker1.ReportProgress(1);
        }
        LogFile = LogFile.Distinct().ToList();
        string new_path = parent_path + "/new_data/probe." + j + ".log";
        File.WriteAllLines(new_path, LogFile.Distinct().ToArray());
    }
}

I think this method would be much quicker, and it would adjust to any computer's specs. Can anyone help me make this work, or tell me if I'm wrong?
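For reference, here is roughly how I imagine the memory check could look. The PrivateMemorySize64 call, the FileInfo length, and the hard-coded 2 GB budget are just placeholders I'm guessing at, not a settled design:

// Rough sketch only: .NET does not expose a single "maximum process memory" value,
// so the budget below is something you pick yourself.
long maximumProcessMemory = 2L * 1024 * 1024 * 1024; // e.g. a 2 GB budget (assumption)

long currentMemoryUsage = System.Diagnostics.Process.GetCurrentProcess().PrivateMemorySize64;
long nextFileLength = new FileInfo(em.Current).Length;

if (currentMemoryUsage + nextFileLength > maximumProcessMemory)
{
    // Deduplicate what we have so far to free memory before reading the next file.
    LogFile = LogFile.Distinct().ToList();
}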

CodePudding user response:

You are creating far too many intermediate lists, arrays, and Distinct() calls.

Just combine everything into a HashSet<string>, then write it out:

private static void CombineNoDuplicates(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1, BackgroundWorker backgroundWorker1)
{
    var logFile = new HashSet<string>(1000);   // pre-size your hashset to a suitable size
    for (int j = 0; j < path_list.Length; j++)
    {
        logFile.Clear();
        foreach (var path in path_list[j])
        {
            var lines = File.ReadLines(path);
            logFile.UnionWith(lines);
            backgroundWorker1.ReportProgress(1);
        }
        string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
        File.WriteAllLines(new_path, logFile);
    }
}

Ideally you should use async/await instead of BackgroundWorker, which is effectively a legacy API. This also means you don't need to hold a whole file's lines in memory as an array at any point; only the combined set of unique lines is kept.

private static async Task CombineNoDuplicatesAsync(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1)
{
    var logFile = new HashSet<string>(1000);   // pre-size your hashset to a suitable size
    for (int j = 0; j < path_list.Length; j++)
    {
        logFile.Clear();
        foreach (var path in path_list[j])
        {
            using (var sr = new StreamReader(path))
            {
                string line;
                while ((line = await sr.ReadLineAsync()) != null)
                {
                    logFile.Add(line);
                }
            }
        }
        string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
        await File.WriteAllLinesAsync(new_path, logFile);
    }
}

If you are willing to risk a colliding hash code, you can cut memory usage even further by storing only the strings' hashes in a HashSet<int>; then you can fully stream all files.

Caveat: colliding hash codes are a distinct possibility, especially with many strings. Analyze your data to see if you can risk this.

private static async Task CombineNoDuplicatesAsync(HashSet<string>[] path_list, string parent_path, ProgressBar pBar1)
{
    var hashes = new HashSet<int>(1000);   // pre-size your hashset to a suitable size
    for (int j = 0; j < path_list.Length; j++)
    {
        hashes.Clear();
        string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
        using (var output = new StreamWriter(new_path))
        {
            foreach (var path in path_list[j])
            {
                using (var sr = new StreamReader(path))
                {
                    string line;
                    while ((line = await sr.ReadLineAsync()) != null)
                    {
                        if (hashes.Add(line.GetHashCode()))
                            await output.WriteLineAsync(line);
                    }
                }
            }
        }
    }
}

You can get even more performance by reading the raw bytes and parsing the lines yourself as Span<byte>; I will leave that as an exercise for the reader, as it's quite complex.
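For the curious, a minimal sketch of that idea could look like the following. The assumptions are mine: UTF-8/ASCII logs with '\n' line endings, each file read fully into memory for brevity, .NET Core 2.1 or later for Stream.Write(ReadOnlySpan<byte>), and an FNV-1a hash of the raw line bytes standing in for GetHashCode, with the same collision caveat as above:

private static void CombineNoDuplicatesSpan(IEnumerable<string> paths, string outputPath)
{
    var seen = new HashSet<ulong>(1000);
    using (var output = new FileStream(outputPath, FileMode.Create, FileAccess.Write))
    {
        foreach (var path in paths)
        {
            ReadOnlySpan<byte> data = File.ReadAllBytes(path);
            while (!data.IsEmpty)
            {
                // Slice off the next line (trimming a trailing '\r' for CRLF files).
                int nl = data.IndexOf((byte)'\n');
                ReadOnlySpan<byte> line = nl >= 0 ? data.Slice(0, nl) : data;
                data = nl >= 0 ? data.Slice(nl + 1) : ReadOnlySpan<byte>.Empty;
                if (!line.IsEmpty && line[line.Length - 1] == (byte)'\r')
                    line = line.Slice(0, line.Length - 1);

                // FNV-1a over the raw bytes, so no string is ever allocated per line.
                ulong hash = 14695981039346656037UL;
                foreach (byte b in line)
                    hash = (hash ^ b) * 1099511628211UL;

                if (seen.Add(hash))
                {
                    output.Write(line);
                    output.WriteByte((byte)'\n');
                }
            }
        }
    }
}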

CodePudding user response:

Assuming your log files already contain lines that are sorted in chronological order [1], we can effectively treat them as the intermediate files of a multi-file sort and perform the merging and duplicate elimination in one pass.

It would be a new class, something like this:

using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;

internal class LogFileMerger : IEnumerable<string>
{
    private readonly List<IEnumerator<string>> _files;

    public LogFileMerger(HashSet<string> fileNames)
    {
        _files = fileNames.Select(fn => File.ReadLines(fn).GetEnumerator()).ToList();
        // Prime each enumerator so Current is valid, dropping any empty files up front.
        _files.RemoveAll(e => !e.MoveNext());
    }

    public IEnumerator<string> GetEnumerator()
    {
        while (_files.Count > 0)
        {
            // The smallest Current line across all open files is the next output line.
            var candidates = _files.Select(e => e.Current);
            var nextLine = candidates.OrderBy(c => c).First();

            // Advance every file past that line (this is what eliminates duplicates),
            // removing any file that runs out of lines.
            for (int i = _files.Count - 1; i >= 0; i--)
            {
                while (_files[i].Current == nextLine)
                {
                    if (!_files[i].MoveNext())
                    {
                        _files.RemoveAt(i);
                        break;
                    }
                }
            }

            yield return nextLine;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

You can create a LogFileMerger from the set of input log file names and pass it directly as the IEnumerable<string> to a method like File.WriteAllLines. Because it uses File.ReadLines, the memory used per input file is just a small read buffer; we never attempt to hold all of the data from any of the files at once.
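For example, reusing path_list and parent_path from the question (a hypothetical usage sketch, not part of the original answer):

// Merge each probe.N.log group into one de-duplicated, sorted output file.
for (int j = 0; j < path_list.Length; j++)
{
    var merger = new LogFileMerger(path_list[j]);
    string new_path = Path.Combine(parent_path, "new_data", "probe." + j + ".log");
    File.WriteAllLines(new_path, merger);
}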

(You may want to adjust the OrderBy and the equality comparison above if there are requirements around case insensitivity, but I don't see any in the question.)

(Note also that this class cannot be enumerated multiple times as designed. That could be changed by storing the paths instead of the open enumerators in the class field and making the list of open enumerators a local variable inside GetEnumerator.)


[1] If this is not the case, it may be more sensible to sort each file first so that the assumption is met, and then proceed with this plan.
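A minimal pre-sort sketch, assuming each individual file fits in memory on its own and using the same default string comparison as the OrderBy above (fileNames being the same set of paths you would pass to LogFileMerger):

// Rewrite every input file with its lines sorted so the sorted-input assumption holds.
foreach (var path in fileNames)
{
    var lines = File.ReadAllLines(path);
    Array.Sort(lines);                 // default comparer, matching the merger's OrderBy
    File.WriteAllLines(path, lines);
}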
