awk hash applying conditions on two input files

I'm trying to compare two files using awk, combining them where all three of these conditions hold:

column 2 of file1 equals column 1 of file2
column 3 of file1 is greater than or equal to column 2 of file2
column 3 of file1 is less than or equal to column 3 of file2

The files look like this:

file1

snp1 14 6371334
snp2 14 7928189
snp3 14 31819743
snp4 14 62133529
snp5 14 62616434
snp6 14 17544926
snp7 14 31639444

file2

14 71159186 72228540 31
14 15732809 16677121 68
14 45003977 46299534 69
14 61965465 64286878 128
14 17378950 17833828 141
14 12877549 13217565 193
14 31369019 31785149 194
14 49883707 49905143 197

And the desired output would be (column 4 of file2 when a range matches, otherwise 0):

snp1 14 6371334 0
snp2 14 7928189 0
snp3 14 31819743 0
snp4 14 62133529 128
snp5 14 62616434 128
snp6 14 17544926 141
snp7 14 31639444 194

I've tried this:

awk 'NR==FNR {a[$1]=$1;b[$2]=$2;c[$3]=$3;d[$4]=$4;next} {if($2 in a && $3 >= b[$2] && $3 <= c[$3]) print $1,$2,$3,d[$4]}' file2 file1

but it doesn't work like that.

Any help?

Thanks!

CodePudding user response：

It looks like you want to assign an interval to each snp:
if a snp falls within some interval,
report the identifier associated with that interval.

One thing I almost never like to see is an NR==FNR pattern without a corresponding NR!=FNR pattern.

Then there is the idea of four separate arrays where each key is a duplicate of its value
... what could you possibly do with that?
No items in the same row are in any way correlated with one another, save by chance.

Not saying you should do it like this ...
but what you are likely thinking would be better served with a construct such as:

a[NR]=$1;b[NR]=$2 ....

where items related by being on the same row are recoverable as such
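
A minimal sketch of that row-indexed idea, in plain POSIX awk. The array names (chr, lo, hi, id) and the file name rows.awk are made up for illustration, and the sample rows are a small subset of the question's data. It checks every stored row for each snp, so it is only a starting point, not a tuned solution:

```shell
# Three snp rows and two interval rows copied from the question.
printf '%s\n' 'snp1 14 6371334' 'snp4 14 62133529' 'snp6 14 17544926' > file1
printf '%s\n' '14 61965465 64286878 128' '14 17378950 17833828 141' > file2

cat > rows.awk <<'EOF'
NR == FNR {                     # first file (file2): store each row under its row number
    chr[NR] = $1; lo[NR] = $2 + 0; hi[NR] = $3 + 0; id[NR] = $4
    n = NR                      # number of rows stored so far
    next
}
{                               # second file (file1): test the snp against every stored row
    val = 0
    for (i = 1; i <= n; i++)
        if ($2 == chr[i] && $3 >= lo[i] && $3 <= hi[i]) { val = id[i]; break }
    print $0, val
}
EOF

awk -f rows.awk file2 file1
# snp1 14 6371334 0
# snp4 14 62133529 128
# snp6 14 17544926 141
```

Because chr[i], lo[i], hi[i] and id[i] all share the same index, the fields of one file2 row stay together, which is what the four value-keyed arrays in the question lose.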

The trailing ;next in the first block is only needed because the second block has no pattern of its own.
Once the second block is guarded by NR != FNR, awk goes on to the next record without being told.

The second block has yet to embrace awk's essence ...
the condition goes in the pattern before the block, which acts as an implicit if.

something like

NR != FNR && $1 in a   ... {print ...
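
To make the pattern style concrete, here is a deliberately reduced sketch that tests only the first condition (chromosome membership). The array name seen, the file name pattern.awk and the chromosome-15 row are assumptions added for illustration; they are not from the question:

```shell
# One interval row from the question; the snpX row is made up so that one line fails the test.
printf '%s\n' '14 17378950 17833828 141' > file2
printf '%s\n' 'snp6 14 17544926' 'snpX 15 100' > file1

cat > pattern.awk <<'EOF'
NR == FNR { seen[$1] = 1 }                # file2 only: remember chromosomes that have intervals
NR != FNR && ($2 in seen) { print $1 }    # file1 only: the pattern does the job of if (...)
EOF

awk -f pattern.awk file2 file1
# snp6
```

Each rule carries its own guard, so no next is required and no if appears inside the action.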

You typically want to read the smaller file into memory first if possible, then stream through the second, especially if the second is much larger.

Note: your samples appear to have an order that is not being exploited.

An outline might look like:

read file1 into array(s), maintaining order

process the first item from file1 through file2 until it is
found OR it is determined not to exist

proceed to the next item from file1 (continuing from where you are in file2)
rinse & repeat
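
One way the outline could be sketched, under assumptions the question does not state: both files are first sorted by chromosome and then numerically by position, chromosomes are numeric, and intervals do not overlap. The file names sweep.awk, file1.sorted and file2.sorted are made up for illustration, and the rows are a subset of the question's data. The output comes out in sorted order rather than the original order:

```shell
# Four snp rows and three interval rows copied from the question.
printf '%s\n' 'snp3 14 31819743' 'snp4 14 62133529' 'snp6 14 17544926' 'snp7 14 31639444' > file1
printf '%s\n' '14 61965465 64286878 128' '14 17378950 17833828 141' '14 31369019 31785149 194' > file2

sort -k2,2n -k3,3n file1 > file1.sorted    # by chromosome, then position
sort -k1,1n -k2,2n file2 > file2.sorted    # by chromosome, then interval start

cat > sweep.awk <<'EOF'
NR == FNR {                     # first file: snps, kept in sorted order
    n++
    line[n] = $0; chr[n] = $2 + 0; pos[n] = $3 + 0
    next
}
{                               # second file: one interval per record
    c = $1 + 0; lo = $2 + 0; hi = $3 + 0
    # snps that lie before this interval can no longer match anything
    while (i < n && (chr[i+1] < c || (chr[i+1] == c && pos[i+1] < lo))) {
        i++; print line[i], 0
    }
    # snps inside this interval get its identifier
    while (i < n && chr[i+1] == c && pos[i+1] <= hi) {
        i++; print line[i], $4
    }
}
END {                           # snps beyond the last interval
    while (i < n) { i++; print line[i], 0 }
}
EOF

awk -f sweep.awk file1.sorted file2.sorted
# snp6 14 17544926 141
# snp7 14 31639444 194
# snp3 14 31819743 0
# snp4 14 62133529 128
```

The index i only ever moves forward, so each snp and each interval is visited once after sorting.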

I could do your work for you, but you will be better served by taking another run at it yourself, considering some of
the points made. If you get stuck again, please post your much closer approximation of something that could possibly work and I'll check back.

CodePudding user response：

With GNU awk for arrays of arrays and assuming a value can only be in 1 range for a given key:

$ cat tst.awk
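NR==FNR {
    # first file (file2): key by chromosome, then by "start end"; store the value
    ranges2vals[$1][$2 FS $3] = $4
    next
}
{ val = 0 }                     # default when no range matches
$2 in ranges2vals {
    for (range in ranges2vals[$2]) {
        split(range,r)          # r[1] = start, r[2] = end
        if ( (r[1] <= $3) && ($3 <= r[2]) ) {
            val = ranges2vals[$2][range]
            break               # at most one range can match, so stop looking
        }
    }
}
{ print $0, val }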

$ awk -f tst.awk file2 file1
snp1 14 6371334 0
snp2 14 7928189 0
snp3 14 31819743 0
snp4 14 62133529 128
snp5 14 62616434 128
snp6 14 17544926 141
snp7 14 31639444 194