Unique elements separated by semicolon in bash

I have a tab-separated file in which one of the columns holds semicolon-separated values, as follows:

1   930215  930215  A   G   27734943
1   939111  939111  C   T   27734943;27734943
1   942143  942143  C   G   26204995;30276537
1   942995  942995  C   T   29738522;30276537;30276537

I want to remove the duplicate values within each row of the 6th column and keep the rest of the file as it is:

1   930215  930215  A   G   27734943
1   939111  939111  C   T   27734943
1   942143  942143  C   G   26204995;30276537
1   942995  942995  C   T   29738522;30276537

I know there may be a solution with awk, but all my attempts have failed. How can I do this in bash?

CodePudding user response：

Look at the cut command. First cut on the tab character to retrieve the nth field (here, the 6th). Then cut that field again on the semicolon character, and work with the one or more resulting values as you need.
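As a rough sketch of how that could look in practice: cut only extracts the field, so the deduplication below is done by piping the values through tr, awk and paste. The function name dedup_col6 is made up for illustration, and the sketch assumes every line has exactly six tab-separated columns.

```shell
# dedup_col6 is a hypothetical helper name, not an existing command.
# Assumes exactly six tab-separated columns per line.
dedup_col6() {
  while IFS= read -r line || [ -n "$line" ]; do
    # Columns 1-5 are kept untouched (cut splits on tabs by default).
    head=$(printf '%s\n' "$line" | cut -f1-5)
    # Column 6: one value per line, keep first occurrences, re-join with ';'.
    vals=$(printf '%s\n' "$line" | cut -f6 | tr ';' '\n' | awk '!seen[$0]++' | paste -sd ';' -)
    printf '%s\t%s\n' "$head" "$vals"
  done
}
```

It reads from standard input, so something like dedup_col6 < input > output should work. It starts several processes per line, so it is likely to be slow on large files.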

CodePudding user response：

These types of problems are a lot of fun in perl:

perl -lane '%seen = (); $F[5] = join ";", grep { !$seen{$_}++ } split ";", $F[5]; print join "\t", @F' input

The code is self-documenting, so it does not require any explanation. :)

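The -a flag instructs perl to split each line on whitespace into the array @F. The one-liner splits $F[5] (index 5, i.e. the 6th column) on ; with split ";", $F[5] and then applies grep while incrementing a counter in the %seen hash, so that only the first occurrence of each value gets through. Those unique elements are then joined back together with ;, and the final result, joined with \t, is printed. Because %seen is reset for every line, duplicates are only removed within a row, not across rows.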
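Since the question mentions awk, here is a minimal sketch of the same split / filter / join idea in awk, wrapped in a shell function. The name dedup_col6_awk is hypothetical, and the sketch assumes the values to deduplicate are always in column 6 of a tab-separated file.

```shell
# dedup_col6_awk is a hypothetical helper name.
# Assumes tab-separated input with the semicolon list in column 6.
dedup_col6_awk() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = split($6, parts, ";")   # break column 6 into its values
    split("", seen)             # portable way to empty the seen array per line
    out = ""; sep = ""
    for (i = 1; i <= n; i++) {
      if (!seen[parts[i]]++) {  # true only the first time a value appears
        out = out sep parts[i]
        sep = ";"
      }
    }
    $6 = out                    # assigning a field rebuilds the line with OFS
    print
  }' "$@"
}
```

It can be called as dedup_col6_awk input, or used at the end of a pipeline. The order of first occurrences is preserved, as in the Perl version.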