How can I remove the duplicate value from the first column only and then arrange the data in the second column horizontally?

I want to remove the duplicate values from the first column only, and then arrange the values from the second column horizontally, using the first-column value as the first field of each row, so that the columns align across rows.

blunt-snouted_clingfish rdh14b
blunt-snouted_clingfish LOC114457682
blunt-snouted_clingfish rngtt
blunt-snouted_clingfish cnr1
blunt-snouted_clingfish akirin2
blunt-snouted_clingfish rars2
blunt-snouted_clingfish slc35a1
blunt-snouted_clingfish LOC114457715
blunt-snouted_clingfish rhag
 
 
 
Chinese_tongue_sole nt5c1bb
Chinese_tongue_sole si:dkey-174m14.3
Chinese_tongue_sole rdh14b
Chinese_tongue_sole LOC103381225
Chinese_tongue_sole rngtt
Chinese_tongue_sole cnr1
Chinese_tongue_sole akirin2
Chinese_tongue_sole rars2
Chinese_tongue_sole riox1
Chinese_tongue_sole ndufb1
Chinese_tongue_sole cpsf2
 
 
 
Helicophagus_hypophthalmus_Sauvage,_1878 myo6a
Helicophagus_hypophthalmus_Sauvage,_1878 LOC113528782
Helicophagus_hypophthalmus_Sauvage,_1878 mei4
Helicophagus_hypophthalmus_Sauvage,_1878 nt5e
Helicophagus_hypophthalmus_Sauvage,_1878 snx14
Helicophagus_hypophthalmus_Sauvage,_1878 cnr1
Helicophagus_hypophthalmus_Sauvage,_1878 rngtt
Helicophagus_hypophthalmus_Sauvage,_1878 pnrc1
Helicophagus_hypophthalmus_Sauvage,_1878 LOC113528790
Helicophagus_hypophthalmus_Sauvage,_1878 LOC113529170
Helicophagus_hypophthalmus_Sauvage,_1878 c30h8orf82

The output should be:

blunt-snouted_clingfish rdh14b LOC114457682 rngtt cnr1 akirin2 rars2 slc35a1 LOC114457715 rhag

Chinese_tongue_sole nt5c1bb si:dkey-174m14.3 rdh14b LOC103381225 rngtt cnr1 akirin2 rars2 riox1 ndufb1 cpsf2

Helicophagus_hypophthalmus_Sauvage,_1878 myo6a LOC113528782 mei4 nt5e snx14 cnr1 rngtt pnrc1 LOC113528790 LOC113529170 c30h8orf82

It would be better if the columns were aligned and there was a blank line between rows. That would help me visualise it better.

I tried using:

awk '{if (!seen[$1]++) {print $1, $2}}'

But it only keeps the first line for each value in the first column and drops the remaining values from the second column:

blunt-snouted_clingfish LOC114457607
Chinese_tongue_sole nt5c1bb
Helicophagus_hypophthalmus_Sauvage,_1878 myo6a

CodePudding user response：

Assuming you don't REALLY have a blank character in every line that otherwise appears to be empty in your input (if you do, just do sed 's/^ *$//' or similar to remove them), then using any awk:

$ awk -v RS= -v ORS='\n\n' '{str=$1; for (i=2; i<=NF; i+=2) str=str OFS $i; print str}' file
blunt-snouted_clingfish rdh14b LOC114457682 rngtt cnr1 akirin2 rars2 slc35a1 LOC114457715 rhag

Chinese_tongue_sole nt5c1bb si:dkey-174m14.3 rdh14b LOC103381225 rngtt cnr1 akirin2 rars2 riox1 ndufb1 cpsf2

Helicophagus_hypophthalmus_Sauvage,_1878 myo6a LOC113528782 mei4 nt5e snx14 cnr1 rngtt pnrc1 LOC113528790 LOC113529170 c30h8orf82

and piping to column to get the column alignment you asked for:

$ awk -v RS= '{str=$1; for (i=2; i<=NF; i+=2) str=str OFS $i; print str}' file | column -t | awk -v ORS='\n\n' '1'
blunt-snouted_clingfish                   rdh14b   LOC114457682      rngtt   cnr1          akirin2  rars2  slc35a1  LOC114457715  rhag

Chinese_tongue_sole                       nt5c1bb  si:dkey-174m14.3  rdh14b  LOC103381225  rngtt    cnr1   akirin2  rars2         riox1         ndufb1        cpsf2

Helicophagus_hypophthalmus_Sauvage,_1878  myo6a    LOC113528782      mei4    nt5e          snx14    cnr1   rngtt    pnrc1         LOC113528790  LOC113529170  c30h8orf82
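
The paragraph-mode commands above rely on blank lines separating the groups. If your input has no such separators, or the same first-column value can show up again later in the file, something along these lines might work instead. It is only a sketch: it assumes every data line has exactly two whitespace-separated fields, and the function name group_by_first_column is made up for illustration.

```shell
# Hypothetical helper: collapse "key value" lines into one row per key,
# keeping keys in the order they first appear. Reads files given as
# arguments, or stdin if none are given.
group_by_first_column() {
    awk '
        # Skip empty, whitespace-only, or one-field lines.
        NF < 2 { next }
        # First time this key is seen: remember its position and start its row.
        !($1 in row) { order[++n] = $1; row[$1] = $1 }
        # Append the second field to the row for this key.
        { row[$1] = row[$1] OFS $2 }
        # Print one row per key, in first-seen order.
        END { for (i = 1; i <= n; i++) print row[order[i]] }
    ' "$@"
}
```

You could then run group_by_first_column file | column -t | awk -v ORS='\n\n' '1' to get the same aligned, blank-line-separated output. Unlike the paragraph-mode version, this holds every row in memory until the end of the input, and it ignores any fields beyond the second.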