I want to remove the duplicate values from the first column only, then arrange the second-column values horizontally, using the first-column value as the first field of each row, so that the columns line up with each other.
blunt-snouted_clingfish rdh14b
blunt-snouted_clingfish LOC114457682
blunt-snouted_clingfish rngtt
blunt-snouted_clingfish cnr1
blunt-snouted_clingfish akirin2
blunt-snouted_clingfish rars2
blunt-snouted_clingfish slc35a1
blunt-snouted_clingfish LOC114457715
blunt-snouted_clingfish rhag
Chinese_tongue_sole nt5c1bb
Chinese_tongue_sole si:dkey-174m14.3
Chinese_tongue_sole rdh14b
Chinese_tongue_sole LOC103381225
Chinese_tongue_sole rngtt
Chinese_tongue_sole cnr1
Chinese_tongue_sole akirin2
Chinese_tongue_sole rars2
Chinese_tongue_sole riox1
Chinese_tongue_sole ndufb1
Chinese_tongue_sole cpsf2
Helicophagus_hypophthalmus_Sauvage,_1878 myo6a
Helicophagus_hypophthalmus_Sauvage,_1878 LOC113528782
Helicophagus_hypophthalmus_Sauvage,_1878 mei4
Helicophagus_hypophthalmus_Sauvage,_1878 nt5e
Helicophagus_hypophthalmus_Sauvage,_1878 snx14
Helicophagus_hypophthalmus_Sauvage,_1878 cnr1
Helicophagus_hypophthalmus_Sauvage,_1878 rngtt
Helicophagus_hypophthalmus_Sauvage,_1878 pnrc1
Helicophagus_hypophthalmus_Sauvage,_1878 LOC113528790
Helicophagus_hypophthalmus_Sauvage,_1878 LOC113529170
Helicophagus_hypophthalmus_Sauvage,_1878 c30h8orf82
The output should be:
blunt-snouted_clingfish rdh14b LOC114457682 rngtt cnr1 akirin2 rars2 slc35a1 LOC114457715 rhag
Chinese_tongue_sole nt5c1bb si:dkey-174m14.3 rdh14b LOC103381225 rngtt cnr1 akirin2 rars2 riox1 ndufb1 cpsf2
Helicophagus_hypophthalmus_Sauvage,_1878 myo6a LOC113528782 mei4 nt5e snx14 cnr1 rngtt pnrc1 LOC113528790 LOC113529170 c30h8orf82
It would be better if the columns were aligned and there were a blank line between rows. It would help me visualise it better.
I tried using:
awk '{if (!seen[$1]++) {print $1, $2}}'
But it removes every line whose first-column value is a duplicate, so the second-column values on those lines are lost:
blunt-snouted_clingfish LOC114457607
Chinese_tongue_sole nt5c1bb
Helicophagus_hypophthalmus_Sauvage,_1878 myo6a
CodePudding user response:
Assuming you don't REALLY have a blank character in every line that otherwise appears to be empty in your input (if you do, just do sed 's/^ *$//' or similar to remove them), then using any awk:
$ awk -v RS= -v ORS='\n\n' '{str=$1; for (i=2; i<=NF; i+=2) str=str OFS $i; print str}' file
blunt-snouted_clingfish rdh14b LOC114457682 rngtt cnr1 akirin2 rars2 slc35a1 LOC114457715 rhag

Chinese_tongue_sole nt5c1bb si:dkey-174m14.3 rdh14b LOC103381225 rngtt cnr1 akirin2 rars2 riox1 ndufb1 cpsf2

Helicophagus_hypophthalmus_Sauvage,_1878 myo6a LOC113528782 mei4 nt5e snx14 cnr1 rngtt pnrc1 LOC113528790 LOC113529170 c30h8orf82

and piping to column to get the column alignment you asked for:
$ awk -v RS= '{str=$1; for (i=2; i<=NF; i+=2) str=str OFS $i; print str}' file | column -t | awk -v ORS='\n\n' '1'
blunt-snouted_clingfish                   rdh14b   LOC114457682      rngtt   cnr1          akirin2  rars2  slc35a1  LOC114457715  rhag

Chinese_tongue_sole                       nt5c1bb  si:dkey-174m14.3  rdh14b  LOC103381225  rngtt    cnr1   akirin2  rars2         riox1         ndufb1        cpsf2

Helicophagus_hypophthalmus_Sauvage,_1878  myo6a    LOC113528782      mei4    nt5e          snx14    cnr1   rngtt    pnrc1         LOC113528790  LOC113529170  c30h8orf82
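The commands above rely on awk's paragraph mode (RS=), so they need the groups in the input to be separated by empty lines. If your file has no such separators, a minimal sketch that groups on the first field instead might look like the following. The file name sample.txt and its contents are made-up placeholders, and the script assumes exactly two whitespace-separated fields per line.

```shell
# Hypothetical sample input with no blank lines between groups.
cat > sample.txt <<'EOF'
fishA gene1
fishA gene2
fishB gene3
fishB gene1
fishB gene4
EOF

# Build one output row per distinct first field, keeping first-seen order.
result=$(awk '
    NF < 2       { next }                              # skip empty or incomplete lines
    !($1 in row) { order[++n] = $1; row[$1] = $1 }     # new key: start its row
                 { row[$1] = row[$1] OFS $2 }          # append the second field
    END          { for (i = 1; i <= n; i++) print row[order[i]] }
' sample.txt)

printf '%s\n' "$result"
```

The result can be piped through column -t and awk -v ORS='\n\n' '1' in the same way as above to get the alignment and the blank line between rows.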