Powershell - Remove Duplicate lines in TXT based on ID

Time:10-23

I have a TXT file with thousands of lines. The number after the first slash is the image ID. I want to remove lines so that exactly one line remains for every ID. It doesn't matter which of the duplicate lines gets removed.

I tried to pipe the TXT into a CSV with PowerShell and use the -Unique parameter, but it didn't work. Any ideas how I can iterate through the TXT and remove lines so that only one line per unique ID remains? :/

Status Today

thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg
thumbnails/4000896042746/2021-08-17_4000896042746_smallX.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg
thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_tiny.jpg

After the script

thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg
thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg

CodePudding user response:

For a TXT file with thousands of lines, I would use the PowerShell pipeline: if set up correctly, it performs about the same but uses far less memory, because the lines are processed one at a time instead of being loaded all at once. Further performance gains can come from a HashTable (or a HashSet), which looks up keys by hash in roughly constant time on average and is therefore much faster than, for example, grouping.
(I am pleading to get an accelerated HashSet #16003 into PowerShell)

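# Tracks the IDs that have already been seen.
$Unique = [System.Collections.Generic.HashSet[string]]::new()
Get-Content .\InFile.txt | ForEach-Object {
    # The ID is the second-to-last path segment; Add() returns $true only for a new ID.
    if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content .\OutFile.txt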
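
To make the filtering step easier to follow, here is a minimal in-memory sketch of the same idea, using the sample lines from the question instead of a file. The variable names `$lines`, `$seen` and `$result` are only illustrative. It relies on `HashSet.Add()` returning `$true` the first time a value is inserted and `$false` afterwards, so the first line for each ID passes the filter and later ones are dropped.

```powershell
# Sample lines taken from the question.
$lines = @(
    'thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg',
    'thumbnails/4000896042746/2021-08-17_4000896042746_smallX.jpg',
    'thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg',
    'thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg',
    'thumbnails/4000896042333/2021-08-17_4000896042746_tiny.jpg'
)

# Holds every ID encountered so far.
$seen = [System.Collections.Generic.HashSet[string]]::new()

# Add() returns $true only for an ID that was not in the set yet,
# so Where-Object keeps the first line per ID, in input order.
$result = @($lines | Where-Object { $seen.Add(($_ -split '/')[1]) })

$result
# thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg
# thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg
# thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg
```

Because the lines are filtered as they stream through, the output keeps the order of the input, which matches the "After the script" listing in the question.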

CodePudding user response:

You can group by a custom property. If you know how to extract the ID, you just group by it and then take the first element of each group:

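# Read all lines, group them by ID (the segment after the first slash),
# and keep only the first line of each group.
$content = Get-Content "path_to_your_file"
$content = $content | Group-Object { ($_ -split "/")[1] } | ForEach-Object { $_.Group[0] }
$content | Out-File "path_to_your_result_file"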

CodePudding user response:

Here is a solution that uses calculated properties to create objects containing the ID and the FileName. I then group the result by ID, iterate over the groups, and select the first FileName from each:

$yourFileList = @(
    'thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg',
    'thumbnails/4000896042746/2021-08-17_4000896042746_smallX.jpg',
    'thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg',
    'thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg',
    'thumbnails/4000896042333/2021-08-17_4000896042746_tiny.jpg'
)

$yourFileList |
    Select-Object @{ Name = 'Id'; Expression = { ($_ -split '/')[1] } }, @{ Name = 'FileName'; Expression = { $_ } } |
    Group-Object -Property Id |
    ForEach-Object { $_.Group[0].FileName }
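
Whichever approach you pick, a quick sanity check on the result can confirm that no ID occurs twice. The following is only a sketch: `$deduplicated` is a hypothetical stand-in for the lines you would normally read back from your output file with `Get-Content`, and `$duplicateIds` is an illustrative name.

```powershell
# Hypothetical result; in practice this would come from Get-Content on the output file.
$deduplicated = @(
    'thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg',
    'thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg',
    'thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg'
)

# Group by ID (segment after the first slash) and collect IDs that still occur more than once.
$duplicateIds = @(
    $deduplicated |
        Group-Object { ($_ -split '/')[1] } |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Name }
)

if ($duplicateIds.Count -eq 0) {
    'OK: one line per ID'
} else {
    "Duplicate IDs: $($duplicateIds -join ', ')"
}
```

An empty `$duplicateIds` means every ID is represented by exactly one line.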