PowerShell: filtering one list out of another list

Time:06-16

<Updated: added the information suggested by Santiago Squarzon>

I have two lists that I read from CSV files; each file has only one column.
Here is how I read the lists in my script:

$orginal_list = Get-Content -Path .\random-word-350k-wo-quotes.txt
$filter_words = Get-Content -Path .\no_go_words.txt

However, I will use a typed list for simplicity in the code example below.

In this example, the $original_list can have some words repeated. I want to filter out all of the words in $original_list that are in the $filter_words list.

Then add the filtered list to the variable $filtered_list.
In this example, $filtered_list would only have "dirt","turtle" in it.
I know that the line below, where I subtract one list from the other, won't work. It is only a placeholder, because I don't know what to use to get the result.

Of note, the CSV file that feeds $original_list could have 300,000 or more rows, and $filter_words could have hundreds of rows, so I want this to be as efficient as possible.
The filtering is case insensitive.

$orginal_list = "yellow","blue","yellow","dirt","blue","yellow","turtle","dirt"
$filter_words = "yellow","blue","green","harsh"

# Placeholder only: arrays cannot be subtracted like this.
$filtered_list = $orginal_list - $filter_words

$filtered_list

dirt
turtle

CodePudding user response:

Use System.Collections.Generic.HashSet`1 and its .ExceptWith() method:

# Note: if possible, declare the lists as [string[]] arrays to begin with.
#       Otherwise, use a [string[]] cast in the method calls below, which,
#       however, creates a duplicate array on the fly.
[string[]] $orginal_list = "yellow","blue","yellow","dirt","blue","yellow","turtle","dirt"
[string[]] $filter_words = "yellow","blue","green","harsh"

# Create a hash set based on the strings in $orginal_list,
# with case-insensitive lookups.
$hsOrig = [System.Collections.Generic.HashSet[string]]::new(
  $orginal_list,
  [System.StringComparer]::CurrentCultureIgnoreCase
)

# Reduce it to those strings not present in $filter_words, in-place.
$hsOrig.ExceptWith($filter_words)

# Convert the filtered hash set to an array.
[string[]] $filtered_list = [string[]]::new($hsOrig.Count)
$hsOrig.CopyTo($filtered_list)

# Output the result
$filtered_list

The above yields:

dirt
turtle
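
Note that a hash set holds each string only once, so the result above is de-duplicated, which matches the desired output in the question. If duplicates and the original order need to be kept instead, one possible variation (a sketch, not part of the approach above) is to build the hash set from $filter_words and test each word of $orginal_list against it. The name $hsFilter is made up for this illustration:

```powershell
# Sample data from the question, typed as [string[]] for the constructor call below.
[string[]] $orginal_list = "yellow","blue","yellow","dirt","blue","yellow","turtle","dirt"
[string[]] $filter_words = "yellow","blue","green","harsh"

# Build the lookup set from the (small) filter list,
# with case-insensitive lookups, as in the answer above.
$hsFilter = [System.Collections.Generic.HashSet[string]]::new(
  $filter_words,
  [System.StringComparer]::CurrentCultureIgnoreCase
)

# Keep every word that is not in the set; order and duplicates are preserved.
[string[]] $filtered_list = foreach ($word in $orginal_list) {
  if (-not $hsFilter.Contains($word)) { $word }
}

# Output the result
$filtered_list
```

With the sample data this should output dirt, turtle, dirt. Each .Contains() call is a hash lookup, so the large list is still traversed only once.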

To also speed up reading your input files, use the following:

# Note: [System.IO.File]::ReadAllLines() returns a [string[]] instance.
$orginal_list = [System.IO.File]::ReadAllLines((Convert-Path .\random-word-350k-wo-quotes.txt))
$filter_words = [System.IO.File]::ReadAllLines((Convert-Path .\no_go_words.txt))

Note:

  • .NET generally defaults to (BOM-less) UTF-8; pass a [System.Text.Encoding] instance as a second argument, if needed.

  • .NET's working directory usually differs from PowerShell's current location, so full paths are always advisable in .NET API calls; that is what the Convert-Path calls ensure.

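The two notes above can be sketched as follows. This is only an illustration: the temporary demo file, the variable names $demoFile and $fullPath, and the choice of UTF-16LE ([System.Text.Encoding]::Unicode) are assumptions, not something taken from the question.

```powershell
# Hypothetical demo file in the temp folder, saved as UTF-16LE
# ("Unicode" in PowerShell's terms).
$demoFile = Join-Path ([System.IO.Path]::GetTempPath()) 'no_go_words_demo.txt'
Set-Content -LiteralPath $demoFile -Value 'yellow', 'blue' -Encoding Unicode

# Convert-Path returns a full file-system path, resolved against
# PowerShell's current location; it reports an error if the path does not exist.
$fullPath = Convert-Path -LiteralPath $demoFile

# Pass the encoding explicitly as the second argument.
$filter_words = [System.IO.File]::ReadAllLines($fullPath, [System.Text.Encoding]::Unicode)
$filter_words

# The two working directories, for comparison; they can differ after Set-Location.
[System.Environment]::CurrentDirectory
$PWD.ProviderPath

# Clean up the demo file.
Remove-Item -LiteralPath $demoFile
```

For a real input file, pass whichever encoding the file was actually saved with.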