I'm trying to compare two files using awk, where I want to combine them based on three conditions:
- column 2 of file1 equals column 1 of file2
- column 3 of file1 is greater than or equal to column 2 of file2
- column 3 of file1 is less than or equal to column 3 of file2
The files look like this:
file1
snp1 14 6371334
snp2 14 7928189
snp3 14 31819743
snp4 14 62133529
snp5 14 62616434
snp6 14 17544926
snp7 14 31639444
file2
14 71159186 72228540 31
14 15732809 16677121 68
14 45003977 46299534 69
14 61965465 64286878 128
14 17378950 17833828 141
14 12877549 13217565 193
14 31369019 31785149 194
14 49883707 49905143 197
And the desired output would be:
snp1 14 6371334 0
snp2 14 7928189 0
snp3 14 31819743 0
snp4 14 62133529 128
snp5 14 62616434 128
snp6 14 17544926 141
snp7 14 31639444 194
I've tried this:
awk 'NR==FNR {a[$1]=$1;b[$2]=$2;c[$3]=$3;d[$4]=$4;next} {if($2 in a && $3 >= b[$2] && $3 <= c[$3]) print $1,$2,$3,d[$4]}' file2 file1
but it doesn't work like that.
Any help?
Thanks!
CodePudding user response:
It looks like you want to assign an interval to each SNP: that is, if a SNP falls within some interval, report the identifier associated with that interval.
Things I almost never like to see include the use of an NR==FNR pattern without a corresponding NR!=FNR pattern.
The idea of four separate arrays where each key is a duplicate of its value ... what could you possibly do with that?
No items in the same row are in any way correlated with one another, save by chance.
Not saying you should do it like this ...
but what you are likely thinking would be better served with a construct such as:
a[NR]=$1;b[NR]=$2 ....
where items related by being on the same row are recoverable as such
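A minimal sketch of that row-indexed idea, written as a shell function around plain awk (no GNU extensions). The function name annotate_by_row and the array names chr, lo, hi and id are made up for illustration. It assumes the interval file is passed first, is not empty, and fits in memory, and it reports the first matching interval, or 0 when there is none:

```shell
# Hypothetical helper: annotate_by_row INTERVALS SNPS
annotate_by_row() {
    awk '
    # First file (intervals): store every column under the same row number,
    # so values that came from one row can be recovered together.
    NR == FNR {
        chr[NR] = $1; lo[NR] = $2 + 0; hi[NR] = $3 + 0; id[NR] = $4
        n = NR
        next
    }

    # Second file (SNPs): scan the stored rows for the first matching interval.
    {
        val = 0
        for (i = 1; i <= n; i++) {
            if ($2 == chr[i] && $3 >= lo[i] && $3 <= hi[i]) {
                val = id[i]
                break
            }
        }
        print $0, val
    }
    ' "$1" "$2"
}
```

Called as annotate_by_row file2 file1 on the samples from the question, this should give snp4 and snp5 the value 128, snp6 141, snp7 194, and 0 for the rest. It rescans every stored interval for every SNP, so it is simple rather than fast.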
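The trailing ;next in the first block is only needed because the second block has no pattern of its own; once each block is guarded by its own pattern (NR==FNR or NR!=FNR), awk goes on to the next rule and the next record without being told.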
the second block has yet to embrace awk's essence ...
the condition goes in the implicit if before the block
something like
NR != FNR && $1 in a ... {print ...
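As a rough illustration of putting the condition in the pattern, here is a small sketch in which each block is guarded by its own pattern, so neither next nor an explicit if is needed. The function name print_known and the array name seen are hypothetical. It assumes the first file is not empty, and it only checks the key column, not the range:

```shell
# Hypothetical helper: print_known INTERVALS SNPS
print_known() {
    awk '
    # Runs only while reading the first file: remember each key in column 1.
    NR == FNR { seen[$1] = 1 }

    # Runs only while reading the second file: the pattern is the condition,
    # so the action needs no if statement.
    NR != FNR && ($2 in seen) { print $0 }
    ' "$1" "$2"
}
```

Because the two patterns are mutually exclusive, a record from the first file never reaches the second block, which is the job the ;next was doing in the original one-liner.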
You typically want to load the smaller file first, if possible, and then stream through the second, especially if the second is much larger.
note: your samples appear to have order that is not being exploited
an outline might look like:
- read file1 into array(s), maintaining order
- process the first item from file1 through file2 until it is found OR it is determined not to exist
- proceed to the next item from file1 (continuing from where you are in file2)
- rinse & repeat
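One possible reading of this outline, as a sketch only. It assumes things the question does not state: both files are already sorted numerically by position (the samples as posted are not, so they would need something like sort -k3,3n for file1 and sort -k2,2n for file2 first), all rows are on a single chromosome, and the intervals do not overlap. The function name sweep_sorted and the array names line and pos are made up for illustration:

```shell
# Hypothetical helper: sweep_sorted SORTED_SNPS SORTED_INTERVALS
sweep_sorted() {
    awk '
    BEGIN { i = 1 }   # i is the index of the next SNP without a value

    # First file: keep the SNP rows in their sorted order.
    NR == FNR { line[NR] = $0; pos[NR] = $3 + 0; n = NR; next }

    # Second file: each interval is read exactly once.
    {
        # SNPs located before this interval cannot match any later interval.
        while (i <= n && pos[i] < $2 + 0) print line[i++], 0
        # SNPs located inside this interval take its identifier.
        while (i <= n && pos[i] <= $3 + 0) print line[i++], $4
    }

    # SNPs located after the last interval match nothing.
    END { while (i <= n) print line[i++], 0 }
    ' "$1" "$2"
}
```

Each file is walked once, so the work grows with the sum of the two file sizes rather than their product. The output comes out in position order rather than in the original order of file1.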
I could do your work for you, but you will be better served by taking another run at it yourself, considering some of the points made.
If you get stuck again, please post your much closer approximation of something that could possibly work and I'll check back.
CodePudding user response:
With GNU awk for arrays of arrays and assuming a value can only be in 1 range for a given key:
$ cat tst.awk
NR==FNR {
    ranges2vals[$1][$2 FS $3] = $4
    next
}
{ val = 0 }
$2 in ranges2vals {
    for (range in ranges2vals[$2]) {
        split(range,r)
        if ( (r[1] <= $3) && ($3 <= r[2]) ) {
            val = ranges2vals[$2][range]
            break
        }
    }
}
{ print $0, val }
$ awk -f tst.awk file2 file1
snp1 14 6371334 0
snp2 14 7928189 0
snp3 14 31819743 0
snp4 14 62133529 128
snp5 14 62616434 128
snp6 14 17544926 141
snp7 14 31639444 194