How to merge two files based on 2 columns using awk?-CodePudding

I have two files with different row numbers. I would like to merge file1 with file2, similar to how I would do with left_join in R:

Denisovan_EAS_4 <- left_join(file1,file2,by = "SNP", "pval")%>% distinct(SNP, pval, .keep_all = TRUE)

I want to use awk as it is much faster and more efficient.

File 1:

SNP pval
rs9258594 1.52e-127
rs62056091 9.97e-95

File 2:

SNP CHROM POS beta se pval REF ALT af Category Description Details Ancestry phenotype pval_heterogeneity fdr
rs9258594 10 122943300 -1.629 0.4599 1.52e-127 T G 0.05 Metabolic NA NA NA NA 0.25 0.005
rs62056091 8 122546790 -1.466 0.5799 9.97e-95 C A 0.35 Neurological NA NA NA NA 0.75 0.0045
rs7680377 2 5687902 -1.136 0.5357 1.18e-34 T C 0.15 Neurological NA NA NA NA 0.21 0.25

All SNPs are present in file 2 which are present in file 1, file 1 has much less rows than file 2. I would like to print all headers of file 2 in file 3.

File3 (same than file 2, but only SNPs with pval that are present in file 1):

SNP CHROM POS beta se pval REF ALT af Category Description Details Ancestry phenotype pval_heterogeneity fdr
rs9258594 10 122943300 -1.629 0.4599 1.52e-127 T G 0.05 Metabolic NA NA NA NA 0.25 0.005
rs62056091 8 122546790 -1.466 0.5799 9.97e-95 C A 0.35 Neurological NA NA NA NA 0.75 0.0045

Tried awk code:

awk 'FNR==NR && FNR>1 {a[$2] = $6; next}
     FNR > 1 && ($21 in a) && $6 == "pval" {
         print $0
     }' file2 file1

The output is empty.

CodePudding user response：

Okay, you say "only SNPs with pval that are present in file 1)", and it appears that existence in file1 is sufficient to determine that something has a pval, yes? So the following may be sufficient:

$ awk 'NR==FNR{f1[$1];next} $1 in f1' file1 file2

This reads through both files. In first file you populate an array with keys, and in the second file, you print only the lines with existent keys.

Note that if you want a more complex condition that merely checking for existence -- for example, that the pval column of file1 matches a certain pattern, it's easy to add that:

'NR==FNR && $2 ~ /^[0-9] \.[0-9] e-?[0-9.] $/ {f1[$1; next}`

Also note that $1 in f1 is a full condition that checks for the existence of a key in the array. In awk, if a condition has no statement the statement defaults to {print}.

UPDATE:

The awk code you added to your question has some problems. The FNR==NR section, which only refers to the first file, has no $6 in file1, and $2 refers to neither SNP nor pval in file2. Also, your example data does not contain a $21; it has only 16 fields. Perhaps you provided only a subset of fields for sample data?

Per your comment, if what you need is a matching field-1 and field-6 rather than just field-1, then this might do:

awk 'NR==FNR{f1[$1 FS $2];next} $1 FS $6 in f1' file1 file2

This merely makes the key in the array a combination of the two fields instead of just the first field. It should be safe to separate fields using FS.