Unique elements separated by semicolon in bash

Time:11-08

I have a tab-separated file in which one of the columns contains semicolon-separated values, like this:

1   930215  930215  A   G   27734943
1   939111  939111  C   T   27734943;27734943
1   942143  942143  C   G   26204995;30276537
1   942995  942995  C   T   29738522;30276537;30276537

I want to remove the duplicate values within the 6th column of each line and keep the rest of the file unchanged:

1   930215  930215  A   G   27734943
1   939111  939111  C   T   27734943
1   942143  942143  C   G   26204995;30276537
1   942995  942995  C   T   29738522;30276537

I know a solution with awk may exist, but all my attempts have failed. How can I do this in bash?

CodePudding user response:

Look at the cut command. First cut on the tab character (cut's default delimiter) to retrieve the nth field, then cut again on the semicolon with -d ';'. Work with the resulting one or more values as you need.
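A rough sketch of that idea, for illustration only. The function name dedupe_col6 is made up here, and it assumes every line has exactly six tab-separated columns. It uses cut to pull the columns apart, then tr, awk and paste to drop repeated values from the sixth one while keeping the first-seen order:

```shell
# Hypothetical helper (not from the original answer): dedupe the
# semicolon-separated values in column 6 of a tab-separated file.
# Assumes each line has exactly six columns.
dedupe_col6() {
  while IFS= read -r line; do
    # Columns 1-5, with their tabs preserved (tab is the default delimiter of cut)
    prefix=$(printf '%s\n' "$line" | cut -f1-5)
    # Column 6, the semicolon-separated list
    col6=$(printf '%s\n' "$line" | cut -f6)
    # One value per line, keep only first occurrences, rejoin with ';'
    deduped=$(printf '%s\n' "$col6" | tr ';' '\n' | awk '!seen[$0]++' | paste -sd ';' -)
    printf '%s\t%s\n' "$prefix" "$deduped"
  done < "$1"
}
```

Call it as dedupe_col6 input. It starts several processes per line, so it is likely to be slow on large files; a single awk or perl pass would scale better.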

CodePudding user response:

These types of problems are a lot of fun in perl:

perl -lane '%seen = (); $F[5] = join ";", grep { !$seen{$_}++ } split ";", $F[5]; print join "\t", @F' input

The code is self-documenting, so does not require any explanation. :)

The -a flag instructs perl to split each line on whitespace into the array @F. The code empties the %seen hash for every line, splits $F[5] (the 6th column) on ; with split ";", $F[5], and then applies grep while incrementing the entries of %seen, so only the first occurrence of each value passes the filter. Those unique values are joined back together with ; and the fields, joined with \t, are printed.
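Since the question mentions awk, the same split / filter / join steps might look roughly like this there. This is a sketch, not a tested drop-in: the function name dedupe_col6_awk and the variable names are invented for the example, and it assumes the input really is tab-separated with the list in column 6.

```shell
# Hypothetical helper: awk take on the split / filter / join steps above.
dedupe_col6_awk() {
  awk 'BEGIN { FS = OFS = "\t" }     # read and write tab-separated fields
  {
    n = split($6, parts, ";")        # break column 6 into its values
    split("", seen)                  # portable way to empty the array for each line
    out = ""
    count = 0
    for (i = 1; i <= n; i++) {
      if (!seen[parts[i]]++) {       # true only the first time a value appears
        out = (count++ ? out ";" : "") parts[i]
      }
    }
    $6 = out                         # assigning a field rebuilds the line with OFS
    print
  }' "$1"
}
```

Run it as dedupe_col6_awk input. Unlike the default perl -a split, this splits strictly on tabs, so fields containing spaces stay intact.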
