I modified the question based on the comments.
I would like to match two files: if $4 in file 1 is in the range of $3 and $4 in file 2, I would like to print file 1 with $6 in file 2. If there is no match, I would like to print NA in the output. If there are overlapping ranges, I would like to print the first match (sorting based on $4 of file 1).
File 1:
1 rs537182016 0 8674590 A C
1 rs575272151 0 69244805 G C
1 rs544419019 0 69244469 G C
1 rs354682 0 1268900 G C
File 2:
18 16 8674587 8784575 ABAT
10349 17 69148007 69244815 - ABCA10
23461 17 69244435 69327182 - ABCA5
Output:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
I tried the following based on previous answers, but it did not work. The output is an empty file.
awk 'FNR == NR { val[$1] = $4 }
FNR != NR { if ($1 in val && val[$1] >= $3 && val[$1] <= $4)
print $1, val[$1], $6
}' file1 file2 > file3
CodePudding user response:
Assumptions:
- in the case of multiple matches OP has stated we only use the
'first'
match; OP hasn't (yet) defined'first'
so I'm going to assume it means the order in which lines appear infile2
(aka the line number)
One awk
idea:
awk '
FNR==NR { min[ n]=$3 # save 1st file values in arrays; use line number as index
max[n]=$4
col6[n]=$6
next
}
{ for (i=1;i<=n;i ) # loop through 1st file entries
if (min[i] <= $4 && $4 <= max[i]) { # if we find a range match then ...
print $0, col6[i] # print current line plus column #6 from 1st file and then ...
next # skip to next line of input; this keeps us from matching on additional entries from 1st file
}
print $0, "NA" # if we got here there was no match so print current line plus "NA"
}
' file2 file1
NOTE: make note of the order of the input files; the first answer (below) was based on an input of file1 file2
; this answer requires the order of the input files to be flipped, ie, file2 file1
This generates:
1 rs537182016 0 8674590 A C ABAT
1 rs575272151 0 69244805 G C ABCA10
1 rs544419019 0 69244469 G C ABCA10
1 rs354682 0 1268900 G C NA
NOTE: following is based on OP's original question and expected output (revision #2); OP has since modified the expected output to such an extent that the following answer is no longer valid ...
Assumptions:
- in
file1
bothrs575272151 / 69244805
andrs544419019 / 69244469
match 2 different (overlapping) ranges fromfile2
but OP has only showed one set of matches in the expected output; from this I'm going to assume ... - once a match is found for an entry from
file1
, remove said entry from any additional matching; this will eliminate multiple matches forfile1
entries - once a match is found for a line from
file2
then stop looking for matches for that line (ie, go to the next intput line fromfile2
); this will eliminate multiple-matches forfile2
- OP has not provided any details on how to determine which mulit-match to keep so we'll use the first match we find
One awk
idea:
awk '
FNR==NR { val[$2]=$4; next }
{ for (i in val) # loop through list of entries from 1st file ...
if ($3 <= val[i] && val[i] <= $4) { # looking for a range match and if found ...
print $0,i # print current line plus 2nd field from 1st file and then ...
delete val[i] # remove 1st file entry from further matches and ...
next # skip to next line of input from 2nd file, ie, stop looking for additional matches for the current line
}
}
' file1 file2
This generates:
18 16 8674587 8784575 ABAT rs537182016
10349 17 69148007 69244815 - ABCA10 rs575272151
23461 17 69244435 69327182 - ABCA5 rs544419019
NOTES:
- the
for (i in val)
construct is not guaranteed to process the array entries in a consistent manner; net result is that in the instance where there are multiple matches we simply match on the 'first' array entry provided byawk
; if this 'random' nature of thefor (i in val)
is not acceptable then OP will need to update the question with additional details on how to handle multiple matches - for this particular case we actually generate the same output as expected by OP, but the assignments of
rs575272151
andrs544419019
could just as easily be reversed (due to the nature of thefor (i in val)
construct)