Delete duplicate lines from text file based on column

Time:06-15

I have a txt file with over 100,000 tab-separated lines plus a header row, and it contains duplicate entries. I would like to delete those duplicates using either a batch script or a PowerShell script: a line counts as a duplicate when its VATRegistrationNumber has already appeared earlier in the file (e.g. it appears again for GroupID 2000 when it already exists for an entry with GroupID 1000).

Input file:

Supplier    GroupID Name1   City    VATRegistrationNumber   Region
9032512 1000    PENRITH PERFORMING AND VISUAL ARTS  Penrith 97003605089 NSW
9032923 1000    TORONTO MULTI PURPOSE CENTRE INC    Toronto 85659352794 NSW
9035991 1000    SYDNEY'S BEST MAINTENANCE AND CLEAN Panania 42616361365 NSW
9044080 1000    FIRST PEOPLES DISABILITY NETWORK (A Sydney  58169154330 NSW
9044292 1000    PUNCHY DIGITAL MEDIA PTY LTD    South Melbourne 22166978669 VIC
9044691 1000    POLARON LANGUAGE SERVICES PTY LTD   St Kilda    99120446362 VIC
9047594 1000    WORKSPACE COMMERCIAL FURNITURE PTY  Rosebery    16619275213 NSW
9075591 1000    ENGADINE COMMUNITY SERVICES INC Engadine    39108386726 NSW
9076178 1000    CENTRAL COAST AGEING AND DISABILITY Newcastle West  60874001192 NSW
9076319 1000    YASS SOLDIERS' CLUB LTD Yass    44000988152 NSW
9076370 1000    ARTFUL FOOD CO. PTY LIMITED EMU PLAINS  52600861612 NSW
9078232 1000    WSG AUSTRALIA PTY LTD   NORWOOD SOUTH   76604823430 SA
9078312 1000    CAZNER PTY LTD  Manly   39648003570 NSW
9078463 1000    MILSONS POINT DIGITAL PTY LTD   Milsons Point   88635600681 NSW
9079757 1000    RISK & SECURITY MANAGEMENT PTY LTD  Brisbane    71613136048 QLD
9079793 1000    PETRUSIC, MILOJKA   Seven Hills 86077460484 NSW
8487483 2000    FOTI, STEPHANIE BLACKTOWN   78620350307 NSW
7458490 2000    STAFF CHECK PTY LIMITED Marrickville    36227046572 NSW
75637   2000    ANDERSON, ALLAN GRAHAM  Gladesville 49340951752 NSW
362789  2000    STAR TRACK EXPRESS PTY LIMITED  CRONULLA    82104931562 NSW
952678  2000    CALCUTTA GROUP PTY LTD  KATOOMBA    52699520223 NSW
2403417 2000    The trustee for Moore Family Trust  NORWOOD SOUTH   76604823430 SA
94905   2000    MEYER PTY LTD   Manly   39648003570 NSW
426783  2000    DAWSON DIGITAL PTY LTD  Milsons Point   88635600681 NSW
8588214 2000    HOME MANAGEMENT PTY LTD Brisbane    71613136048 QLD

Expected Output file:

Supplier    GroupID Name1   City    VATRegistrationNumber   Region
9032512 1000    PENRITH PERFORMING AND VISUAL ARTS  Penrith 97003605089 NSW
9032923 1000    TORONTO MULTI PURPOSE CENTRE INC    Toronto 85659352794 NSW
9035991 1000    SYDNEY'S BEST MAINTENANCE AND CLEAN Panania 42616361365 NSW
9044080 1000    FIRST PEOPLES DISABILITY NETWORK (A Sydney  58169154330 NSW
9044292 1000    PUNCHY DIGITAL MEDIA PTY LTD    South Melbourne 22166978669 VIC
9044691 1000    POLARON LANGUAGE SERVICES PTY LTD   St Kilda    99120446362 VIC
9047594 1000    WORKSPACE COMMERCIAL FURNITURE PTY  Rosebery    16619275213 NSW
9075591 1000    ENGADINE COMMUNITY SERVICES INC Engadine    39108386726 NSW
9076178 1000    CENTRAL COAST AGEING AND DISABILITY Newcastle West  60874001192 NSW
9076319 1000    YASS SOLDIERS' CLUB LTD Yass    44000988152 NSW
9076370 1000    ARTFUL FOOD CO. PTY LIMITED EMU PLAINS  52600861612 NSW
9078232 1000    WSG AUSTRALIA PTY LTD   NORWOOD SOUTH   76604823430 SA
9078312 1000    CAZNER PTY LTD  Manly   39648003570 NSW
9078463 1000    MILSONS POINT DIGITAL PTY LTD   Milsons Point   88635600681 NSW
9079757 1000    RISK & SECURITY MANAGEMENT PTY LTD  Brisbane    71613136048 QLD
9079793 1000    PETRUSIC, MILOJKA   Seven Hills 86077460484 NSW
8487483 2000    FOTI, STEPHANIE BLACKTOWN   78620350307 NSW
7458490 2000    STAFF CHECK PTY LIMITED Marrickville    36227046572 NSW
75637   2000    ANDERSON, ALLAN GRAHAM  Gladesville 49340951752 NSW
362789  2000    STAR TRACK EXPRESS PTY LIMITED  CRONULLA    82104931562 NSW
952678  2000    CALCUTTA GROUP PTY LTD  KATOOMBA    52699520223 NSW

Here is what I have so far in PowerShell, but it doesn't produce the desired result: more lines get deleted than I expect.

$Unique = [System.Collections.Generic.HashSet[string]]::new() 
Get-Content C:\Test\Input.txt |ForEach-Object {
    if ($Unique.Add(($_.Split('VATRegistrationNumber'))[-2])) { $_ }
} | Set-Content C:\Test\Output.txt

Any help would be greatly appreciated. Thank you

CodePudding user response:

This works for the sample given, though I'm not sure what the performance will be like for 100,000 entries. Give it a try and see if it helps.

Import-Csv 'input.txt' -Delimiter "`t" |
    Group-Object -Property VATRegistrationNumber |
    ForEach-Object {
        if ($_.Count -gt 1) {
            $_.Group | Select-Object -First 1
        }
        else {
            $_.Group
        }
    }
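The grouping approach above is not PowerShell-specific: collect all rows under their VATRegistrationNumber, then keep the first row of each group, which automatically favours the GroupID 1000 entries because they come first in the file. A minimal Python sketch of that logic (with a tiny illustrative stand-in for the real 100,000-line file):

```python
import csv
import io

# Tiny stand-in for the real tab-separated file with a header row.
data = (
    "Supplier\tGroupID\tVATRegistrationNumber\n"
    "9078232\t1000\t76604823430\n"
    "9078312\t1000\t39648003570\n"
    "2403417\t2000\t76604823430\n"  # duplicate VAT number, should be dropped
)

rows = list(csv.DictReader(io.StringIO(data), delimiter="\t"))

# Group rows by VATRegistrationNumber, preserving first-seen order
# (plain dicts keep insertion order in Python 3.7+).
groups = {}
for row in rows:
    groups.setdefault(row["VATRegistrationNumber"], []).append(row)

# Keep only the first member of each group.
deduped = [members[0] for members in groups.values()]
print([r["Supplier"] for r in deduped])  # ['9078232', '9078312']
```

The full grouping pass builds every group even though only the first member is kept, which is why a streaming approach (as in the next answer) tends to scale better.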

CodePudding user response:

Knowing that Select-Object -Unique is unnecessarily slow and exhaustive (see PowerShell issue #11221), I would indeed use a HashSet for this, but instead of the Get-Content/Set-Content cmdlets I would simply use the Import-Csv/Export-Csv cmdlets, as they automatically deal with your properties (columns):

$Unique = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv .\Input.txt -Delimiter "`t" | ForEach-Object {
    if ($Unique.Add($_.VATRegistrationNumber)) { $_ }
} | Export-Csv .\Output.txt -Delimiter "`t" -NoTypeInformation
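The core of this answer is the "first occurrence wins" trick: HashSet.Add returns $true only the first time a value is added, so each row is emitted at most once per VAT number while the file streams through. The same single-pass logic, sketched in Python with illustrative data:

```python
import csv
import io

# Illustrative stand-in for the real tab-separated input file.
data = (
    "Supplier\tGroupID\tVATRegistrationNumber\n"
    "9078312\t1000\t39648003570\n"
    "94905\t2000\t39648003570\n"    # repeated VAT number -> dropped
    "8588214\t2000\t71613136048\n"
)

seen = set()
kept = []
for row in csv.DictReader(io.StringIO(data), delimiter="\t"):
    vat = row["VATRegistrationNumber"]
    if vat not in seen:  # first occurrence wins, like HashSet.Add
        seen.add(vat)
        kept.append(row)

print(len(kept))  # 2
```

Because membership tests on a hash set are O(1) on average, this stays a single O(n) pass over the 100,000 rows, with memory proportional to the number of distinct VAT numbers.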