I have a txt file with over 100,000 lines, tab-separated with a header row, that contains duplicate entries. I would like to delete those duplicates with either a batch script or a PowerShell script whenever a VATRegistrationNumber is duplicated (i.e., it appears again for GroupID 2000 when an entry with GroupID 1000 already has it).
Input file:
Supplier GroupID Name1 City VATRegistrationNumber Region
9032512 1000 PENRITH PERFORMING AND VISUAL ARTS Penrith 97003605089 NSW
9032923 1000 TORONTO MULTI PURPOSE CENTRE INC Toronto 85659352794 NSW
9035991 1000 SYDNEY'S BEST MAINTENANCE AND CLEAN Panania 42616361365 NSW
9044080 1000 FIRST PEOPLES DISABILITY NETWORK (A Sydney 58169154330 NSW
9044292 1000 PUNCHY DIGITAL MEDIA PTY LTD South Melbourne 22166978669 VIC
9044691 1000 POLARON LANGUAGE SERVICES PTY LTD St Kilda 99120446362 VIC
9047594 1000 WORKSPACE COMMERCIAL FURNITURE PTY Rosebery 16619275213 NSW
9075591 1000 ENGADINE COMMUNITY SERVICES INC Engadine 39108386726 NSW
9076178 1000 CENTRAL COAST AGEING AND DISABILITY Newcastle West 60874001192 NSW
9076319 1000 YASS SOLDIERS' CLUB LTD Yass 44000988152 NSW
9076370 1000 ARTFUL FOOD CO. PTY LIMITED EMU PLAINS 52600861612 NSW
9078232 1000 WSG AUSTRALIA PTY LTD NORWOOD SOUTH 76604823430 SA
9078312 1000 CAZNER PTY LTD Manly 39648003570 NSW
9078463 1000 MILSONS POINT DIGITAL PTY LTD Milsons Point 88635600681 NSW
9079757 1000 RISK & SECURITY MANAGEMENT PTY LTD Brisbane 71613136048 QLD
9079793 1000 PETRUSIC, MILOJKA Seven Hills 86077460484 NSW
8487483 2000 FOTI, STEPHANIE BLACKTOWN 78620350307 NSW
7458490 2000 STAFF CHECK PTY LIMITED Marrickville 36227046572 NSW
75637 2000 ANDERSON, ALLAN GRAHAM Gladesville 49340951752 NSW
362789 2000 STAR TRACK EXPRESS PTY LIMITED CRONULLA 82104931562 NSW
952678 2000 CALCUTTA GROUP PTY LTD KATOOMBA 52699520223 NSW
2403417 2000 The trustee for Moore Family Trust NORWOOD SOUTH 76604823430 SA
94905 2000 MEYER PTY LTD Manly 39648003570 NSW
426783 2000 DAWSON DIGITAL PTY LTD Milsons Point 88635600681 NSW
8588214 2000 HOME MANAGEMENT PTY LTD Brisbane 71613136048 QLD
Expected Output file:
Supplier GroupID Name1 City VATRegistrationNumber Region
9032512 1000 PENRITH PERFORMING AND VISUAL ARTS Penrith 97003605089 NSW
9032923 1000 TORONTO MULTI PURPOSE CENTRE INC Toronto 85659352794 NSW
9035991 1000 SYDNEY'S BEST MAINTENANCE AND CLEAN Panania 42616361365 NSW
9044080 1000 FIRST PEOPLES DISABILITY NETWORK (A Sydney 58169154330 NSW
9044292 1000 PUNCHY DIGITAL MEDIA PTY LTD South Melbourne 22166978669 VIC
9044691 1000 POLARON LANGUAGE SERVICES PTY LTD St Kilda 99120446362 VIC
9047594 1000 WORKSPACE COMMERCIAL FURNITURE PTY Rosebery 16619275213 NSW
9075591 1000 ENGADINE COMMUNITY SERVICES INC Engadine 39108386726 NSW
9076178 1000 CENTRAL COAST AGEING AND DISABILITY Newcastle West 60874001192 NSW
9076319 1000 YASS SOLDIERS' CLUB LTD Yass 44000988152 NSW
9076370 1000 ARTFUL FOOD CO. PTY LIMITED EMU PLAINS 52600861612 NSW
9078232 1000 WSG AUSTRALIA PTY LTD NORWOOD SOUTH 76604823430 SA
9078312 1000 CAZNER PTY LTD Manly 39648003570 NSW
9078463 1000 MILSONS POINT DIGITAL PTY LTD Milsons Point 88635600681 NSW
9079757 1000 RISK & SECURITY MANAGEMENT PTY LTD Brisbane 71613136048 QLD
9079793 1000 PETRUSIC, MILOJKA Seven Hills 86077460484 NSW
8487483 2000 FOTI, STEPHANIE BLACKTOWN 78620350307 NSW
7458490 2000 STAFF CHECK PTY LIMITED Marrickville 36227046572 NSW
75637 2000 ANDERSON, ALLAN GRAHAM Gladesville 49340951752 NSW
362789 2000 STAR TRACK EXPRESS PTY LIMITED CRONULLA 82104931562 NSW
952678 2000 CALCUTTA GROUP PTY LTD KATOOMBA 52699520223 NSW
Here is what I have so far in PowerShell, but it doesn't produce the desired result: more lines are deleted than I expect.
$Unique = [System.Collections.Generic.HashSet[string]]::new()
Get-Content C:\Test\Input.txt | ForEach-Object {
    if ($Unique.Add(($_.Split('VATRegistrationNumber'))[-2])) { $_ }
} | Set-Content C:\Test\Output.txt
Any help would be greatly appreciated. Thank you
CodePudding user response:
This works for the sample given, though I'm not sure what the performance will be like for 100,000 entries. Give it a try and see if it helps.
Import-Csv 'input.txt' -Delimiter "`t" |
    Group-Object -Property VATRegistrationNumber |
    ForEach-Object {
        # keep only the first row seen for each VAT number
        $_.Group | Select-Object -First 1
    } |
    Export-Csv 'output.txt' -Delimiter "`t" -NoTypeInformation
CodePudding user response:
Knowing that Select-Object -Unique is unnecessarily slow and exhaustive (see PowerShell issue #11221), I would indeed use a HashSet for this, but instead of the Get-Content/Set-Content cmdlets I would simply use the Import-Csv/Export-Csv cmdlets, as they automatically deal with your properties (columns):
$Unique = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv .\Input.txt -Delimiter "`t" | ForEach-Object {
    if ($Unique.Add($_.VATRegistrationNumber)) { $_ }  # first occurrence wins
} | Export-Csv .\Output.txt -Delimiter "`t" -NoTypeInformation
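For what it's worth, if a PowerShell-only solution isn't a hard requirement, the same keep-first-occurrence logic is a few lines in Python using the standard csv module. This is just a sketch of the approach, not from the original post; the function name, the default key, and the UTF-8 encoding are my assumptions:

```python
import csv

def dedupe_by_vat(in_path, out_path, key="VATRegistrationNumber"):
    """Copy a tab-separated file, keeping only the first row seen per key.

    Because GroupID 1000 rows appear before GroupID 2000 rows in the input,
    'first occurrence wins' keeps the 1000 entry and drops the 2000 duplicate.
    """
    seen = set()
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        for row in reader:
            if row[key] not in seen:
                seen.add(row[key])
                writer.writerow(row)
```

Like the HashSet version, this streams the file row by row, so memory use stays proportional to the number of distinct VAT numbers rather than the file size.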