I have a text file of 5,000 rows. Each row corresponds to a unique FILE
all labelled gwas*
:
CHR BP FILE
chr1 12345678 gwas1
chr2 87654321 gwas2
...
I have 5,000 gwas*
files all with unique file names (as found in column 4 above - FILE
), for example, gwas1
looks like this:
CHR BP
chr1 12345678
chr1 12345679
chr1 12356777
...
I want to extract all rows from each gwas*
file where the BP
value is within 500,000 (up and down) of the BP
value in the text file. The CHR
value must also match. For example:
gwas1
in the text file has theCHR
valuechr1
and theBP
value12,345,678
.- I want to extract all rows from
gwas1
which have achr1
value in theCHR
column and which have aBP
value between11,845,678
(this value is 500,000 down from theBP
value in the text file) and12,845,678
(this value is 500,000 down from theBP
value in the text file).
I can do this manually for a single gwas file using the below code (this does not use the text file of 5,000 rows however):
export CHR="chr11"
export BP=107459522
export WINDOW=500000
awk -v CHR=$CHR -v BP_pos=$(($BP $WINDOW)) -v BP_neg=$(($BP - $WINDOW)) 'BEGIN{FS=OFS="\t"}FNR==1 || ($1 == CHR && $2 < BP_pos && $2 > BP_neg )' gwas1 > gwas1_extract
I want to be able to do this for all 5,000 gwas files listed in my text file. The output for every gwas file should only include the rows in which (i) the column CHR
contained the value for that file (e.g., for gwas1
this is chr1
) and the BP
column contained values that were within 500,000 of the value given in the text file for the BP
column (for gwas1
this is 500,000 either side of 12,345,678
). The output file would look like this:
CHR BP
chr1 12345678
chr1 12345679
chr1 12356777
awk --version
= GNU Awk 4.0.2
Any help would be great!
CodePudding user response:
Any help would be great!
You might find GNU AWK
's ARGC and ARGV useful in your case as
A program can alter ARGC and the elements of ARGV. Each time awk reaches the end of an input file, it uses the next element of ARGV as the name of the next input file. By storing a different string there, a program can change which files are read. Use "-" to represent the standard input. Storing additional elements and incrementing ARGC causes additional files to be read.
Thus allowing you to select files to process based on content of file.
CodePudding user response:
Sample inputs:
$ cat file.txt
CHR BP FILE
chr1 12345678 gwas1
chr2 87654321 gwas2
chr4 99999999 gwas4 # file gwas4 does not exist
$ cat gwas1
CHR BP
chr1 12345678 # match
chr1 12345679 # match
chr1 12356777 # match
chr1 99999999
$ cat gwas2
CHR BP
chr1 12345678
chr2 87650000 # match
$ cat gwas3 # no matches since gwas3 not in file.txt
CHR BP
chr3 2134234
NOTE: comments do not exist in actual files
A single GNU awk
script to process all gwas*
files:
awk -v diff=500000 '
function abs(x) { return (x < 0.0) ? -x : x }
BEGIN { FS=OFS="\t" }
FNR==NR { if (FNR>1) # 1st file; skip header and ...
bp_list[$3][$1]=$2 # save contents in our bp_list[FILE][CHR] array
next
}
FNR==1 { close(outfile) # close previous output file
fn=FILENAME
outfile=fn "_extract"
if (fn in bp_list) # if fn in 1st file then ...
print > outfile # print header else ...
else # skip to next input file; also addresses gwas* matching on gwas*_extract files, ie, these will be skipped, too
nextfile
next
}
fn in bp_list { if ($1 in bp_list[fn] && abs(bp_list[fn][$1] - $2) <= diff)
print > outfile
}
' file.txt gwas*
NOTE: requires GNU awk
for multidimensional arrays (aka array of arrays)
This generates:
$ head gwas*extract
==> gwas1_extract <==
CHR BP
chr1 12345678
chr1 12345679
chr1 12356777
==> gwas2_extract <==
CHR BP
chr2 87650000