<updated, added Santiago Squarzon suggest information>
I have two lists, I pull them from csv but there is only one column in each of the two lists.
Here is how I pull in the lists in my script
$orginal_list = Get-Content -Path .\random-word-350k-wo-quotes.txt
$filter_words = Get-Content -Path .\no_go_words.txt
However, I will use a typed list for simplicity in the code example below.
In this example, the $original_list can have some words repeated. I want to filter out all of the words in $original_list that are in the $filter_words list.
Then add the filtered list to the variable $filtered_list.
In this example, $filtered_list would only have "dirt","turtle"
in it.
I know the line I have below where I subtract the two won't work, it's there as a placeholder as I don't know what to use to get the result.
Of note, the csv file that feeds $original_list could have 300,000 or more rows, and $filter_words could have hundreds of rows. So would want this to be as efficient as possible.
The filtering is case insensitive.
$orginal_list = "yellow","blue","yellow","dirt","blue","yellow","turtle","dirt"
$filter_words = "yellow","blue","green","harsh"
$filtered_list = $orginal_list - $filter_words
$filtered_list
dirt
turtle
CodePudding user response:
Use System.Collections.Generic.HashSet`1
and its .ExceptWith()
method:
# Note: if possible, declare the lists as [string[]] arrays to begin with.
# Otherwise, use a [string[]] cast im the method calls below, which,
# however, creates a duplicate array on the fly.
[string[]] $orginal_list = "yellow","blue","yellow","dirt","blue","yellow","turtle","dirt"
[string[]] $filter_words = "yellow","blue","green","harsh"
# Create a hash set based on the strings in $orginal_list,
# with case-insensitive lookups.
$hsOrig = [System.Collections.Generic.HashSet[string]]::new(
$orginal_list,
[System.StringComparer]::CurrentCultureIgnoreCase
)
# Reduce it to those strings not present in $filter_words, in-place.
$hsOrig.ExceptWith($filter_words)
# Convert the filtered hash set to an array.
[string[]] $filtered_list = [string[]]::new($hsOrig.Count)
$hsOrig.CopyTo($filtered_list)
# Output the result
$filtered_list
The above yields:
dirt
turtle
To also speed up reading your input files, use the following:
# Note: System.IO.File]::ReadAllLines() returns a [string[]] instance.
$orginal_list = [System.IO.File]::ReadAllLines((Convert-Path .\random-word-350k-wo-quotes.txt))
$filter_words = [System.IO.File]::ReadAllLines((Convert-Path .\no_go_words.txt))
Note:
.NET generally defaults to (BOM-less) UTF-8; pass a
[System.Text.Encoding]
instance as a second argument, if needed..NET's working dir. usually differs from PowerShell's, so the use of full paths is always advisable in .NET API calls, and that is what the
Convert-Path
calls ensure.