PowerShell: concatenating files and piping the result to Sort-Object -Unique yields poor performance


I am attempting to solve the following problem:

Given a number of similarly formatted text files (~800 MB worth of them, in my case), retrieve all lines in them and delete duplicates.

I attempted to solve this problem by running this command:

cat *.txt | Sort-Object -unique >output.txt

Then PowerShell quickly consumed all my available RAM (over 16 GB) and ran for over 20 minutes without writing anything to the output file. I then ran cat *.txt >output.log to rule out the possibility of the shell reading the file it was writing to, but that command still maxed out all RAM and produced almost no output.

Why did this happen? How can 800 MB of files on disk consume all my RAM when concatenated?

How can I solve this problem with PowerShell more efficiently?

Value of $PSVersionTable, if that helps:

Name                           Value
----                           -----
PSVersion                      5.1.19041.1682
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.19041.1682
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1

Thanks in advance.

CodePudding user response:

Why did this happen? How can 800 MB of files on disk consume all my RAM when concatenated?

There are a number of reasons the original bytes on disk can balloon once read into runtime memory: strings read from an ASCII-encoded file immediately take up twice as much space, because .NET internally uses 2 bytes (UTF-16) per character. Additionally, you need to account for the number of individual strings you instantiate; each one needs space for at least an 8-byte reference, plus the per-object header .NET attaches to every string.
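To get a feel for the expansion, here is a rough sketch (assuming an ASCII-encoded file, hypothetically named sample.txt) that compares the size on disk with just the raw character data once loaded, before any per-string overhead is counted:

# Compare on-disk size with in-memory character data (2 bytes per char in .NET)
$file  = Get-Item .\sample.txt
$lines = Get-Content $file
$chars = ($lines | Measure-Object -Property Length -Sum).Sum
'{0:N0} bytes on disk vs at least {1:N0} bytes of character data in memory' -f $file.Length, ($chars * 2)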

The major problem with your current approach, however, is that every single line read from the files must stay resident in memory until PowerShell can perform the sort operation and discard duplicates. That is simply how Sort-Object works: it has to collect all pipeline input before it can emit anything.
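You can observe this buffering behaviour with a small sketch: nothing reaches the console until the whole (deliberately slowed) input has been collected, at which point everything arrives at once:

# Without the trailing Sort-Object, each timestamp prints one second apart;
# with it, all three appear together after roughly 3 seconds.
1..3 | ForEach-Object { Start-Sleep -Seconds 1; Get-Date -Format 'HH:mm:ss' } | Sort-Object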

How can I solve this problem with PowerShell more efficiently?

To avoid this, use a data type optimized for storing only unique values: the [HashSet[T]] class!

# Collect unique lines; Add() silently ignores values already in the set
$uniqueStrings = [System.Collections.Generic.HashSet[string]]::new()
cat *.txt | ForEach-Object {
    [void]$uniqueStrings.Add($_)
}

# Write the de-duplicated lines to disk
$uniqueStrings | Set-Content output.txt

When you call Add(), the hash set checks whether it already contains an equal string. If it does, the new value is simply discarded, so your script never holds a reference to the duplicate strings and the runtime can clean them up.
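As a small illustration, Add() returns $true for a value the set hasn't seen yet and $false for a duplicate:

$set = [System.Collections.Generic.HashSet[string]]::new()
$set.Add('foo')   # True  - stored
$set.Add('foo')   # False - duplicate, discarded
$set.Count        # 1

Note that HashSet[string] compares strings ordinally and case-sensitively by default; if you want case-insensitive de-duplication (closer to Sort-Object -Unique's default behaviour), construct the set with [System.StringComparer]::OrdinalIgnoreCase.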

If having the output sorted is also important, pipe the values through Sort-Object as the last step before writing to disk:

$uniqueStrings | Sort-Object | Set-Content output.txt