Home > Mobile >  How can I arrange the second and third column horizontally taking the first column as a reference?
How can I arrange the second and third column horizontally taking the first column as a reference?

Time:01-21

I have a dataset with 3 columns and '#' occupying the empty lines.

Danio_rerio tmem129 transmembrane_protein_129,_E3_ubiquitin_protein_ligase
Danio_rerio si:zfos-2326c3.2 si:zfos-2326c3.2
Danio_rerio il1rapl2 interleukin_1_receptor_accessory_protein-like_2
Danio_rerio fam199x family_with_sequence_similarity_199,_X-linked
Danio_rerio commd5 COMM_domain_containing_5
Danio_rerio cysltr1 cysteinyl_leukotriene_receptor_1
Danio_rerio lpar4 lysophosphatidic_acid_receptor_4
Danio_rerio p2ry10 P2Y_receptor_family_member_10
Danio_rerio LOC101883727 uncharacterized_protein_K02A2.6-like
Danio_rerio gpr174 G_protein-coupled_receptor_174
Danio_rerio si:dkey-92i17.2 si:dkey-92i17.2
#  
#  
#  
Danio_rerio cysltr1 cysteinyl_leukotriene_receptor_1
Danio_rerio lpar4 lysophosphatidic_acid_receptor_4
Danio_rerio p2ry10 P2Y_receptor_family_member_10
Danio_rerio LOC101883727 uncharacterized_protein_K02A2.6-like
Danio_rerio gpr174 G_protein-coupled_receptor_174
Danio_rerio si:dkey-92i17.2 si:dkey-92i17.2
#  
#  
#  
#  
#  
#  
#  
Etheostoma_spectabile LOC116696506 integral_membrane_protein_2A-like
Etheostoma_spectabile gpr174 G_protein-coupled_receptor_174
Etheostoma_spectabile p2ry10 P2Y_receptor_family_member_10
Etheostoma_spectabile lpar4 lysophosphatidic_acid_receptor_4
Etheostoma_spectabile LOC116696505 uncharacterized_LOC116696505
Etheostoma_spectabile cysltr1 cysteinyl_leukotriene_receptor_1
Etheostoma_spectabile LOC116697110 endothelin_receptor_type_B-like
Etheostoma_spectabile polr1d RNA_polymerase_I_and_III_subunit_D
Etheostoma_spectabile sybl1 synaptobrevin-like_1
Etheostoma_spectabile mtus1a microtubule_associated_tumor_suppressor_1a
Etheostoma_spectabile pdgfrl platelet-derived_growth_factor_receptor-like
#  
#  
#  
#  
#  
#  
#  
#  
#  
Pimephales_promelas sulf1 sulfatase_1
Pimephales_promelas LOC120475969 solute_carrier_organic_anion_transporter_family_member_5A1
Pimephales_promelas si:ch211-51h4.2 uncharacterized_si:ch211-51h4.2
Pimephales_promelas tbl1xr1a TBL1X/Y_related_1a
Pimephales_promelas zgc:113516 uncharacterized_LOC120475971
Pimephales_promelas cysltr1 cysteinyl_leukotriene_receptor_1
Pimephales_promelas lpar4 lysophosphatidic_acid_receptor_4
Pimephales_promelas p2ry10 P2Y_receptor_family_member_10
Pimephales_promelas LOC120475975 probable_G-protein_coupled_receptor_174
Pimephales_promelas mettl15 methyltransferase_like_15
Pimephales_promelas kif18a kinesin_family_member_18A

The first column should be printed first followed by the second column arranged horizontally. In the next line the first column should be printed followed by the third column arranged horizontally. After that there must be an empty line separating the two lines for a particular paragraph and then the columns must be aligned to each other for easy visualisation.

I used the following command:

cat input | sed 's/# #/ /g' | sed 's/^  *$//' | awk -v RS= '{str=$1; for (i=2; i<=NF; i =2) str=str OFS $i; print str}' | column -t |  awk -v ORS='\n\n' '1' > output

which gave me the following output:

Danio_rerio               tmem129       si:zfos-2326c3.2  il1rapl2         fam199x            commd5            cysltr1          lpar4         p2ry10        LOC101883727      gpr174            si:dkey-92i17.2

Danio_rerio               cysltr1       lpar4             p2ry10           LOC101883727       gpr174            si:dkey-92i17.2

Etheostoma_spectabile     LOC116696506  gpr174            p2ry10           lpar4              LOC116696505      cysltr1          LOC116697110  polr1d        sybl1             mtus1a            pdgfrl

Pimephales_promelas       sulf1         LOC120475969      si:ch211-51h4.2  tbl1xr1a           zgc:113516        cysltr1          lpar4         p2ry10        LOC120475975      mettl15           kif18a

Also I tried:

cat input | sed 's/# #/ /g' | sed 's/^  *$//' | awk -v RS= '{str=$1; for (i=3; i<=NF; i =3) str=str OFS $i; print str}' | column -t |  awk -v ORS='\n\n' '1' > output

which gave:

Danio_rerio               transmembrane_protein_129,_E3_ubiquitin_protein_ligase                 si:zfos-2326c3.2                                                                          interleukin_1_receptor_accessory_protein-like_2                                  family_with_sequence_similarity_199,_X-linked                             COMM_domain_containing_5                                               cysteinyl_leukotriene_receptor_1                          lysophosphatidic_acid_receptor_4                                     P2Y_receptor_family_member_10                                             uncharacterized_protein_K02A2.6-like                                   G_protein-coupled_receptor_174                                             si:dkey-92i17.2

Danio_rerio               cysteinyl_leukotriene_receptor_1                                       lysophosphatidic_acid_receptor_4                                                          P2Y_receptor_family_member_10                                                    uncharacterized_protein_K02A2.6-like                                      G_protein-coupled_receptor_174                                         si:dkey-92i17.2

Etheostoma_spectabile     integral_membrane_protein_2A-like                                      G_protein-coupled_receptor_174                                                            P2Y_receptor_family_member_10                                                    lysophosphatidic_acid_receptor_4                                          uncharacterized_LOC116696505                                           cysteinyl_leukotriene_receptor_1                          endothelin_receptor_type_B-like                                      RNA_polymerase_I_and_III_subunit_D                                        synaptobrevin-like_1                                                   microtubule_associated_tumor_suppressor_1a                                 platelet-derived_growth_factor_receptor-like

Pimephales_promelas       sulfatase_1                                                            solute_carrier_organic_anion_transporter_family_member_5A1                                uncharacterized_si:ch211-51h4.2                                                  TBL1X/Y_related_1a                                                        uncharacterized_LOC120475971                                           cysteinyl_leukotriene_receptor_1                          lysophosphatidic_acid_receptor_4                                     P2Y_receptor_family_member_10                                             probable_G-protein_coupled_receptor_174                                methyltransferase_like_15                                                  kinesin_family_member_18A

However, I want these two to remain together in a single file such that it gives:

Danio_rerio               tmem129       si:zfos-2326c3.2  il1rapl2         fam199x            commd5            cysltr1          lpar4         p2ry10        LOC101883727      gpr174            si:dkey-92i17.2
Danio_rerio               transmembrane_protein_129,_E3_ubiquitin_protein_ligase                 si:zfos-2326c3.2                                                                          interleukin_1_receptor_accessory_protein-like_2                                  family_with_sequence_similarity_199,_X-linked                             COMM_domain_containing_5                                               cysteinyl_leukotriene_receptor_1                          lysophosphatidic_acid_receptor_4                                     P2Y_receptor_family_member_10                                             uncharacterized_protein_K02A2.6-like                                   G_protein-coupled_receptor_174                                             si:dkey-92i17.2

Danio_rerio               cysltr1       lpar4             p2ry10           LOC101883727       gpr174            si:dkey-92i17.2
Danio_rerio               cysteinyl_leukotriene_receptor_1                                       lysophosphatidic_acid_receptor_4                                                          P2Y_receptor_family_member_10                                                    uncharacterized_protein_K02A2.6-like                                      G_protein-coupled_receptor_174                                         si:dkey-92i17.2

Etheostoma_spectabile     LOC116696506  gpr174            p2ry10           lpar4              LOC116696505      cysltr1          LOC116697110  polr1d        sybl1             mtus1a            pdgfrl
Etheostoma_spectabile     integral_membrane_protein_2A-like                                      G_protein-coupled_receptor_174                                                            P2Y_receptor_family_member_10                                                    lysophosphatidic_acid_receptor_4                                          uncharacterized_LOC116696505                                           cysteinyl_leukotriene_receptor_1                          endothelin_receptor_type_B-like                                      RNA_polymerase_I_and_III_subunit_D                                        synaptobrevin-like_1                                                   microtubule_associated_tumor_suppressor_1a                                 platelet-derived_growth_factor_receptor-like

Pimephales_promelas       sulf1         LOC120475969      si:ch211-51h4.2  tbl1xr1a           zgc:113516        cysltr1          lpar4         p2ry10        LOC120475975      mettl15           kif18a
Pimephales_promelas       sulfatase_1                                                            solute_carrier_organic_anion_transporter_family_member_5A1                                uncharacterized_si:ch211-51h4.2                                                  TBL1X/Y_related_1a                                                        uncharacterized_LOC120475971                                           cysteinyl_leukotriene_receptor_1                          lysophosphatidic_acid_receptor_4                                     P2Y_receptor_family_member_10                                             probable_G-protein_coupled_receptor_174                                methyltransferase_like_15                                                  kinesin_family_member_18A

Let me know in the comments in case of any doubt. Thank you in advance!

CodePudding user response:

Here is an awk that treats the same key as two blocks if separated by runs of #:

awk  '
FNR==1{
    f1=$1 OFS $2
    f2=$1 OFS $3
    next
}

/^#/ && last==$1{next} 
/^[^#]/ && last=="#" {
    printf("%s\n%s\n\n",f1,f2)
    f1=f2=$1
}
{
    f1=f1 OFS $2
    f2=f2 OFS $3
    last=$1
}
END{printf("%s\n%s\n",f1,f2)}
' file 

Or with ruby it is easier to deal with runs of # as an additional data element to separate Danio_rerio into two separate blocks:

ruby -e 'blocks=$<.read.split(/^#[\s\S]*?(?=[^#\s])/m)
blocks.map{|s| s.split(/\r?\n/)}.
    each{|sl|   
                tag=sl[0][/^\S /]
                f1=sl.map{|ss| ss.split()[1]}.join("\t")
                f2=sl.map{|ss| ss.split()[2]}.join("\t")
                puts "#{tag}\t#{f1}\n#{tag}\t#{f2}\n\n" }
' file

Either prints:

Danio_rerio tmem129 si:zfos-2326c3.2    il1rapl2    fam199x commd5  cysltr1 lpar4   p2ry10  LOC101883727    gpr174  si:dkey-92i17.2
Danio_rerio transmembrane_protein_129,_E3_ubiquitin_protein_ligase  si:zfos-2326c3.2    interleukin_1_receptor_accessory_protein-like_2 family_with_sequence_similarity_199,_X-linked   COMM_domain_containing_5    cysteinyl_leukotriene_receptor_1    lysophosphatidic_acid_receptor_4    P2Y_receptor_family_member_10   uncharacterized_protein_K02A2.6-like    G_protein-coupled_receptor_174  si:dkey-92i17.2

Danio_rerio cysltr1 lpar4   p2ry10  LOC101883727    gpr174  si:dkey-92i17.2
Danio_rerio cysteinyl_leukotriene_receptor_1    lysophosphatidic_acid_receptor_4    P2Y_receptor_family_member_10   uncharacterized_protein_K02A2.6-like    G_protein-coupled_receptor_174  si:dkey-92i17.2

Etheostoma_spectabile   LOC116696506    gpr174  p2ry10  lpar4   LOC116696505    cysltr1 LOC116697110    polr1d  sybl1   mtus1a  pdgfrl
Etheostoma_spectabile   integral_membrane_protein_2A-like   G_protein-coupled_receptor_174  P2Y_receptor_family_member_10   lysophosphatidic_acid_receptor_4    uncharacterized_LOC116696505    cysteinyl_leukotriene_receptor_1    endothelin_receptor_type_B-like RNA_polymerase_I_and_III_subunit_D  synaptobrevin-like_1    microtubule_associated_tumor_suppressor_1a  platelet-derived_growth_factor_receptor-like

Pimephales_promelas sulf1   LOC120475969    si:ch211-51h4.2 tbl1xr1a    zgc:113516  cysltr1 lpar4   p2ry10  LOC120475975    mettl15 kif18a
Pimephales_promelas sulfatase_1 solute_carrier_organic_anion_transporter_family_member_5A1  uncharacterized_si:ch211-51h4.2 TBL1X/Y_related_1a  uncharacterized_LOC120475971    cysteinyl_leukotriene_receptor_1    lysophosphatidic_acid_receptor_4    P2Y_receptor_family_member_10   probable_G-protein_coupled_receptor_174 methyltransferase_like_15   kinesin_family_member_18A
  • Related