Home > Back-end >  AWK select columns in file2 based on partial header match in file1
AWK select columns in file2 based on partial header match in file1

Time:10-21

I have a file ("File1") with ~40-80k columns and ~10k rows. The column headers in File1 are comprised of a unique identifier (e.g. "4b_1.04:") followed by a description (e.g. "Colname_3"). File2 contains a list of unique identifiers (i.e. not an exact match). Is there a way to extract columns from File1 using a list of column headers in File2 based on a partial match?

For example:

"File1"

patient_ID,response,0_4: Number of Variants,0_6: Number of CDS Variants,3_2.83: Colname_1,3_8.5102: Colname_2,4b_1.04: Colname_3,4_1.0: Colname_4,4_7.7101: Colname_5
ID_237.vcf,Benefit,13008,4343,0.65,1.23,0.17,2.57,4.22
ID_841.vcf,Benefit,15127,2468,0.9,0.68,2.39,1.8,1.6
ID_767.vcf,Benefit,5190,3261,0.73,1.16,1.99,0.79,1.17
ID_263.vcf,Benefit,16888,9548,0.61,1.66,0.73,2.42,1.55
ID_179.vcf,Benefit,3545,842,0.22,0.67,0.48,3.9,3.95
ID_408.vcf,Benefit,1427,4583,0.92,0.76,0.17,0.8,1.27
ID_850.vcf,Benefit,13835,4682,0.8,1.21,0.05,1.74,4.61
ID_856.vcf,Benefit,8939,8435,0.31,0.99,2.5,1.36,0.74
ID_328.vcf,Benefit,14220,8481,0.23,0.22,0.79,0.14,1.08
ID_704.vcf,Benefit,18145,914,0.66,1.69,0.17,0.4,3.13
ID_828.vcf,No_Benefit,4798,8163,0.74,0.89,1.04,1.68,1.29
ID_16.vcf,No_Benefit,6472,528,0.47,1.5,1.74,0.19,3.54
ID_380.vcf,No_Benefit,9827,8359,0.86,1.59,2.41,0.11,3.71
ID_559.vcf,No_Benefit,10247,9150,0.68,0.78,1.02,0.69,1.31
ID_466.vcf,No_Benefit,11092,4078,0.16,0.03,0.4,1.51,2.86
ID_925.vcf,No_Benefit,4809,2908,0.01,1.49,2.32,2.35,4.58
ID_573.vcf,No_Benefit,4341,4307,0.87,0.14,2.63,1.35,3.54
ID_497.vcf,No_Benefit,18279,663,0.1,1.06,2.96,1.98,4.22
ID_830.vcf,No_Benefit,18505,456,0.31,0.25,1.96,3.01,4.6
ID_665.vcf,No_Benefit,15072,2962,0.43,1.35,0.76,0.68,1.47

"File2"

patient_ID
response
0_4:
0_6:
4b_1.04:
3_2.83:
3_8.5102:

NB. The identifiers in File2 are in a different order to the column headers in File1, and the delimiter in File1 is a tab, not a comma (do tabs get converted to spaces when copy-pasting from SO?).

My attempt:

awk 'NR==FNR{T[$1]=NR; next} FNR==1 {MX=NR-1; for (i=1; i<=NF; i  ) if ($i in T) C[T[$i]] = i } {for (j=1; j<=MX; j  ) printf "%s%s", $C[j], (j==MX)?RS:"\t" }' File2 <(tr -s "," "\t" < File1)

Unfortunately, this prints the 'partial' header - I want the full header - and appears to struggle with File2 being in a different order to File1.

Expected outcome (awk 'BEGIN{FS=","; OFS="\t"}{print $1, $2, $3, $4, $7, $5, $6}' File1):

patient_ID  response    0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3  3_2.83: Colname_1   3_8.5102: Colname_2
ID_237.vcf  Benefit 13008   4343    0.17    0.65    1.23
ID_841.vcf  Benefit 15127   2468    2.39    0.9 0.68
ID_767.vcf  Benefit 5190    3261    1.99    0.73    1.16
ID_263.vcf  Benefit 16888   9548    0.73    0.61    1.66
ID_179.vcf  Benefit 3545    842 0.48    0.22    0.67
ID_408.vcf  Benefit 1427    4583    0.17    0.92    0.76
ID_850.vcf  Benefit 13835   4682    0.05    0.8 1.21
ID_856.vcf  Benefit 8939    8435    2.5 0.31    0.99
ID_328.vcf  Benefit 14220   8481    0.79    0.23    0.22
ID_704.vcf  Benefit 18145   914 0.17    0.66    1.69
ID_828.vcf  No_Benefit  4798    8163    1.04    0.74    0.89
ID_16.vcf   No_Benefit  6472    528 1.74    0.47    1.5
ID_380.vcf  No_Benefit  9827    8359    2.41    0.86    1.59
ID_559.vcf  No_Benefit  10247   9150    1.02    0.68    0.78
ID_466.vcf  No_Benefit  11092   4078    0.4 0.16    0.03
ID_925.vcf  No_Benefit  4809    2908    2.32    0.01    1.49
ID_573.vcf  No_Benefit  4341    4307    2.63    0.87    0.14
ID_497.vcf  No_Benefit  18279   663 2.96    0.1 1.06
ID_830.vcf  No_Benefit  18505   456 1.96    0.31    0.25
ID_665.vcf  No_Benefit  15072   2962    0.76    0.43    1.35

CodePudding user response:

Would you please try the following:

awk -F"\t" '
    NR==FNR {                                   # handle File2
        partial[FNR] = $i                       # create a list of desired header (partial)
        len = FNR                               # array lenth of "partial"
        next
    }
    FNR==1 {                                    # handle header line of File1
        ofs = line = ""
        for (j = 1; j <= len; j  ) {
            for (i = 1; i <= NF; i  ) {
                if (index($i, partial[j]) == 1) {    # test the partial match
                    header[  n] = i             # if match, store the position
                    line = line ofs $i
                    ofs = "\t"
                }
            }
        }
        print line                              # print the desired header (full)
    }
    FNR>1 {                                     # handle body lines of File1
        ofs = line = ""
        for (i = 1; i <= n; i  ) {              # positions of desired columns
            line = line ofs $header[i]
            ofs = "\t"
        }
        print line
    }
' File2 File1

Output:

patient_ID  response    0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_33_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf  Benefit 13008   4343    0.17    0.65    1.23
ID_841.vcf  Benefit 15127   2468    2.39    0.9 0.68
ID_767.vcf  Benefit 5190    3261    1.99    0.73    1.16
ID_263.vcf  Benefit 16888   9548    0.73    0.61    1.66
ID_179.vcf  Benefit 3545    842 0.48    0.22    0.67
ID_408.vcf  Benefit 1427    4583    0.17    0.92    0.76
ID_850.vcf  Benefit 13835   4682    0.05    0.8 1.21
ID_856.vcf  Benefit 8939    8435    2.5 0.31    0.99
ID_328.vcf  Benefit 14220   8481    0.79    0.23    0.22
ID_704.vcf  Benefit 18145   914 0.17    0.66    1.69
ID_828.vcf  No_Benefit  4798    8163    1.04    0.74    0.89
ID_16.vcf   No_Benefit  6472    528 1.74    0.47    1.5
ID_380.vcf  No_Benefit  9827    8359    2.41    0.86    1.59
ID_559.vcf  No_Benefit  10247   9150    1.02    0.68    0.78
ID_466.vcf  No_Benefit  11092   4078    0.4 0.16    0.03
ID_925.vcf  No_Benefit  4809    2908    2.32    0.01    1.49
ID_573.vcf  No_Benefit  4341    4307    2.63    0.87    0.14
ID_497.vcf  No_Benefit  18279   663 2.96    0.1 1.06
ID_830.vcf  No_Benefit  18505   456 1.96    0.31    0.25
ID_665.vcf  No_Benefit  15072   2962    0.76    0.43    1.35
  • Related