extract rows across multiple files using conditions and variables-CodePudding

I have a text file of 5,000 rows. Each row corresponds to a unique FILE all labelled gwas*:

CHR     BP          FILE
chr1    12345678    gwas1
chr2    87654321    gwas2
...

I have 5,000 gwas* files all with unique file names (as found in column 4 above - FILE), for example, gwas1 looks like this:

CHR     BP                  
chr1    12345678    
chr1    12345679        
chr1    12356777    
...

I want to extract all rows from each gwas* file where the BP value is within 500,000 (up and down) of the BP value in the text file. The CHR value must also match. For example:

gwas1 in the text file has the CHR value chr1 and the BP value 12,345,678.
I want to extract all rows from gwas1 which have a chr1 value in the CHR column and which have a BP value between 11,845,678 (this value is 500,000 down from the BP value in the text file) and 12,845,678 (this value is 500,000 down from the BP value in the text file).

I can do this manually for a single gwas file using the below code (this does not use the text file of 5,000 rows however):

export CHR="chr11"
export BP=107459522
export WINDOW=500000

awk -v CHR=$CHR -v BP_pos=$(($BP   $WINDOW)) -v BP_neg=$(($BP - $WINDOW)) 'BEGIN{FS=OFS="\t"}FNR==1 || ($1 == CHR && $2 < BP_pos && $2 > BP_neg )' gwas1 > gwas1_extract

I want to be able to do this for all 5,000 gwas files listed in my text file. The output for every gwas file should only include the rows in which (i) the column CHR contained the value for that file (e.g., for gwas1 this is chr1) and the BP column contained values that were within 500,000 of the value given in the text file for the BP column (for gwas1 this is 500,000 either side of 12,345,678). The output file would look like this:

CHR     BP           
chr1    12345678  
chr1    12345679  
chr1    12356777

awk --version = GNU Awk 4.0.2

Any help would be great!

CodePudding user response：

Any help would be great!

You might find GNU AWK's ARGC and ARGV useful in your case as

A program can alter ARGC and the elements of ARGV. Each time awk reaches the end of an input file, it uses the next element of ARGV as the name of the next input file. By storing a different string there, a program can change which files are read. Use "-" to represent the standard input. Storing additional elements and incrementing ARGC causes additional files to be read.

Thus allowing you to select files to process based on content of file.

CodePudding user response：

Sample inputs:

$ cat file.txt
CHR     BP      FILE
chr1    12345678        gwas1
chr2    87654321        gwas2
chr4    99999999        gwas4             # file gwas4 does not exist

$ cat gwas1
CHR     BP
chr1    12345678                          # match
chr1    12345679                          # match
chr1    12356777                          # match
chr1    99999999

$ cat gwas2
CHR     BP
chr1    12345678
chr2    87650000                          # match

$ cat gwas3                               # no matches since gwas3 not in file.txt
CHR     BP
chr3    2134234

NOTE: comments do not exist in actual files

A single GNU awk script to process all gwas* files:

awk -v diff=500000 '

function abs(x) { return (x < 0.0) ? -x : x }

BEGIN         { FS=OFS="\t" }

FNR==NR       { if (FNR>1)                      # 1st file; skip header and ...
                   bp_list[$3][$1]=$2           # save contents in our bp_list[FILE][CHR] array
                next
              }

FNR==1        { close(outfile)                  # close previous output file
                fn=FILENAME
                outfile=fn "_extract"
                if (fn in bp_list)              # if fn in 1st file then ...
                   print > outfile              # print header else ...
                else                            # skip to next input file; also addresses gwas* matching on gwas*_extract files, ie, these will be skipped, too
                   nextfile
                next
              }

fn in bp_list { if ($1 in bp_list[fn] && abs(bp_list[fn][$1] - $2) <= diff)
                   print > outfile
              }
' file.txt gwas*

NOTE: requires GNU awk for multidimensional arrays (aka array of arrays)

This generates:

$ head gwas*extract
==> gwas1_extract <==
CHR     BP
chr1    12345678
chr1    12345679
chr1    12356777

==> gwas2_extract <==
CHR     BP
chr2    87650000