Delete repeated rows keeping one closer to another file using awk

Time:11-05

I have two files

$ cat file1.txt
0105   20   20   95     50
0106   20   20   95     50
0110   20   20   88     60
0110   20   20   88     65
0115   20   20   82     70
0115   20   20   82     70
0115   20   20   82     75

As you can see, file1.txt has repeated values in column 1: 0110 and 0115.

For each repeated value I would like to keep only one row, chosen by its column-5 value: the row whose column 5 is closest to the corresponding value in a reference file (file2.txt). Here "closest" means equal to, or nearest to, the value in file2.txt. I don't want to change any value in file1.txt, just select one row.

$ cat file2.txt
0105   20   20   95     50
0106   20   20   95     50
0107   20   20   95     52
0110   20   20   88     65  34
0112   20   20   82     80  23
0113   20   20   82     85  32
0114   20   20   82     70  23
0115   20   20   82     72
0118   20   20   87     79
0120   20   20   83     79

So if we compare the two files, we must keep 0110 20 20 88 65, because its column-5 entry (65) in file1.txt is closest to the value in the reference file (65 in file2.txt), and delete the other repeated row. Similarly, we must keep 0115 20 20 82 70, because 70 is closer to 72 than 75 is, and delete the other two rows starting with 0115.
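
To make "closer" concrete, here is a minimal sketch that lists the absolute difference between column 5 and the reference value for the three 0115 rows. The rows are inlined rather than read from file1.txt, and the reference value 72 is passed in by hand through a hypothetical awk variable named ref; both are assumptions made only for illustration.

```shell
# Distances from the reference value 72 (column 5 of the 0115 row in file2.txt)
# to column 5 of each 0115 row in file1.txt. The rows are inlined for illustration.
dists=$(printf '%s\n' \
    '0115   20   20   82     70' \
    '0115   20   20   82     70' \
    '0115   20   20   82     75' |
  awk -v ref=72 '{ d = $5 - ref; if (d < 0) d = -d; print $5, d }')

# Each output line is: column-5 value, absolute distance to ref
printf '%s\n' "$dists"
```

The two rows with 70 are at distance 2 and the row with 75 is at distance 3, so one of the 70 rows is the one to keep.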

Desired output:

0105   20   20   95     50
0106   20   20   95     50
0110   20   20   88     65
0115   20   20   82     70

I am trying the following script, but I am not getting the desired result.

awk 'FNR==NR { a[$5]; next } $5 in a ' file1.txt file2.txt > test.txt
awk '{a[NR]=$1""$2} a[NR]!=a[NR-1]{print}' test.txt

My Fortran program's algorithm is:

# check each entry in column-1 of file1.txt against the next row to see whether they are the same
i.e. for i=1,i   do  # Here i is ith row
       for j=1,j   do
if a[i,j] != a[i+1,j]; then print the whole row as it is,
else
# find the row b[i,j] in file2.txt starting with a[i,j]
# and compare the 5th column i.e. b[i,j 5] with all a[i,j 5] starting with a[i,j] in file1.txt 
# and take the differences to find closest one
e.g. if we have 3 rows starting with the same entry, then 
we select the a[i,j] for which diff(b[i,j 5],a[i,j 5]) is minimum, i=1,2,3 

CodePudding user response:

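awk 'BEGIN {
    # Load the reference file, keyed by its first column
    while ((getline line < "file2.txt")>0) {
        split(line, f);
        file2[f[1]] = line;
    }
}
{
    # The first row seen for a key is the initial candidate
    if (!($1 in result)) result[$1] = $0;
    split(result[$1], a);      # a[5]: column 5 of the current candidate
    split(file2[$1], f);       # f[5]: column 5 of the reference row
    # Replace the candidate only if this row is strictly closer to the reference
    if (abs(f[5]-$5) < abs(f[5]-a[5])) result[$1] = $0;
}
END {
    # for-in order is unspecified, hence the sort at the end of the pipeline
    for (i in result) print result[i];
}
function abs(n) {
    return (n < 0 ? -n : n);
}' file1.txt | sort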
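
As a variant, here is a minimal sketch that reads both files as command-line arguments with the FNR==NR idiom and prints the kept rows in the order their keys first appear in file1.txt, so no sort is needed. The array names ref, best, dist and order are hypothetical, the sample files are recreated in a scratch directory so the block runs on its own, and keeping the first row for a key that is missing from file2.txt is an assumption the question does not address.

```shell
# Work in a scratch directory so the sample files do not overwrite anything.
cd "$(mktemp -d)" || exit 1

cat > file1.txt <<'EOF'
0105   20   20   95     50
0106   20   20   95     50
0110   20   20   88     60
0110   20   20   88     65
0115   20   20   82     70
0115   20   20   82     70
0115   20   20   82     75
EOF

cat > file2.txt <<'EOF'
0105   20   20   95     50
0106   20   20   95     50
0107   20   20   95     52
0110   20   20   88     65  34
0112   20   20   82     80  23
0113   20   20   82     85  32
0114   20   20   82     70  23
0115   20   20   82     72
0118   20   20   87     79
0120   20   20   83     79
EOF

result=$(awk '
function abs(n) { return n < 0 ? -n : n }

# First file on the command line (file2.txt): remember column 5 per key.
FNR == NR { ref[$1] = $5; next }

# Second file (file1.txt): keep, per key, the row nearest to the reference.
{
    # Assumption: a key with no reference row gets distance 0, so its first row wins.
    d = ($1 in ref) ? abs($5 - ref[$1]) : 0
    if (!($1 in best)) {            # first row seen for this key
        order[++n] = $1
        best[$1] = $0
        dist[$1] = d
    } else if (d < dist[$1]) {      # strictly closer row replaces the candidate
        best[$1] = $0
        dist[$1] = d
    }
}

# Print the kept rows in first-seen key order.
END { for (i = 1; i <= n; i++) print best[order[i]] }
' file2.txt file1.txt)

printf '%s\n' "$result"
```

Note that the reference file comes first on the command line here. Ties are resolved in favor of the earlier row, which matches the strict comparison in the answer above.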