I have a tab-separated file in which one of the columns holds semicolon-separated values, like this:
1 930215 930215 A G 27734943
1 939111 939111 C T 27734943;27734943
1 942143 942143 C G 26204995;30276537
1 942995 942995 C T 29738522;30276537;30276537
I want to remove the duplicate values within the 6th column of each line and keep the rest of the file unchanged:
1 930215 930215 A G 27734943
1 939111 939111 C T 27734943
1 942143 942143 C G 26204995;30276537
1 942995 942995 C T 29738522;30276537
I know there may be a solution with awk, but all my attempts have failed. How can I do this in bash?
CodePudding user response:
Look at the cut command. First cut on the tab character (cut's default delimiter) to retrieve the nth field, then cut again on the semicolon to split that field into its individual values. You can then work with the one or more resulting values as needed.
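One way to flesh out that idea is sketched below. It is only an illustration: the function name dedup_with_cut is made up for this example, and it assumes every line has exactly six tab-separated columns. It uses tr, awk and paste in place of a second cut, because the number of values in the 6th column varies from line to line.

```shell
# dedup_with_cut: hypothetical helper name, not part of any standard tool.
# Reads tab-separated lines on stdin and writes them back with duplicates
# removed from the 6th column (assumes exactly six columns per line).
dedup_with_cut() {
  while IFS= read -r line; do
    # Columns 1-5 are kept as they are (cut splits on tabs by default).
    head=$(printf '%s\n' "$line" | cut -f1-5)
    # Column 6: one value per line, keep first occurrences, rejoin with ';'.
    ids=$(printf '%s\n' "$line" | cut -f6 | tr ';' '\n' | awk '!seen[$0]++' | paste -sd ';' -)
    printf '%s\t%s\n' "$head" "$ids"
  done
}

# Example call with one line of the sample data.
printf '1\t942995\t942995\tC\tT\t29738522;30276537;30276537\n' | dedup_with_cut
```

This starts several processes per input line, so it is fine for small files but slow on large ones.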
CodePudding user response:
These types of problems are a lot of fun in perl:
perl -lane '%seen = (); $F[5] = join ";", grep { ! $seen{$_}++ } split ";", $F[5]; print join "\t", @F' input
The code is self-documenting, so does not require any explanation. :)
The -a flag instructs perl to split each line into the array @F. The code splits $F[5] (the 6th column) on ; with split ";", $F[5] and then applies grep while incrementing the counts in the %seen hash, so that only the first occurrence of each value is kept. Those unique values are then joined back together with ; and the final result, joined with \t, is printed.
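Since the question mentions awk, here is a minimal sketch of the same split, filter and join steps in awk. The function name dedup_col6 is made up for this example, and the input is assumed to be tab-separated with the values to deduplicate in column 6.

```shell
# dedup_col6: hypothetical helper name. Reads tab-separated lines on stdin
# and prints them with duplicate values removed from the 6th column.
dedup_col6() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = split($6, parts, ";")   # break column 6 into its values
    split("", seen)             # portable way to empty the lookup table
    out = ""
    k = 0
    for (i = 1; i <= n; i++) {
      if (!(parts[i] in seen)) {                 # first time this value appears
        seen[parts[i]] = 1
        out = (k++ ? out ";" : "") parts[i]      # rejoin with ";"
      }
    }
    $6 = out                    # assigning a field rebuilds the line with OFS
    print
  }'
}

# Example call with one line of the sample data.
printf '1\t942995\t942995\tC\tT\t29738522;30276537;30276537\n' | dedup_col6
```

Like the perl one-liner, this keeps the first occurrence of each value and preserves the original order. For a real file you would call it as dedup_col6 < input.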