I would like to check whether a column 2 value also appears in column 3 (or vice versa) anywhere in the file; if it does, I would like to remove that line from the output.
sample.txt
3 abc def
5 ghk lmn
8 opq abc
10 lmn rst
15 uvw xyz
4 bcd abc
89 ntz uhg
so far I have
awk '{
  x[$2] = $2
  y[$3] = $3
  if (!(($2 in y) || ($3 in x))) {
    print $1, $2, $3
  }
}' sample.txt
I would like to have the output as follows.
15 uvw xyz
89 ntz uhg
I understand that awk reads the file line by line, so my code falls short: it cannot check against array entries that have not been seen yet, and therefore lets the first occurrence of a value through (with the sample above it also prints 3 abc def and 5 ghk lmn, because abc and lmn have not yet appeared in the other column when those lines are read). Is there a simple way to do this in awk? My real data set is very large (up to 5 million lines, 400-500 MBytes). Thanks!
CodePudding user response:
One awk idea, using two passes of the input file:
awk '
# 1st pass:
FNR==NR { seen[$2]++               # increment our seen counter for $2
          if ($2 != $3)            # avoid double-counting when $2==$3
             seen[$3]++            # increment our seen counter for $3
          next
        }

# 2nd pass:
seen[$2] <= 1 &&                   # if seen[] counts are <= 1 for both
seen[$3] <= 1                      # $2 and $3 then print current line
' sample.txt sample.txt
This generates:
15 uvw xyz
89 ntz uhg
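To see why this works, you can dump the seen[] counts after the 1st pass; a minimal sketch using the same sample.txt (awk's for-in loop returns keys in no particular order):
awk '
    { seen[$2]++                   # same counting as the 1st pass above
      if ($2 != $3)
         seen[$3]++
    }
END { for (v in seen)              # dump each value and how often it was seen
         print v, seen[v]
    }
' sample.txt
For the sample data this shows abc with a count of 3 and lmn with a count of 2, so every line containing either value is dropped; uvw, xyz, ntz and uhg all have a count of 1, which is why the 15 and 89 lines survive.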
After copying the first 4 lines over and over until sample.txt contained ~4 million lines, running this awk script generated the same 2 lines of output and took ~3 seconds on my system (a VM running on a low-end 9xxx i7).
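For reference, one possible way to build a similar test file (just a sketch; these are not the exact commands used for the timing above, and big.txt / pad.txt are illustrative names):
cp sample.txt big.txt                       # keep the original 7 lines once
head -4 sample.txt > pad.txt                # padding made from the first 4 lines only
while [ $(wc -l < pad.txt) -lt 4000000 ]
do
    cat pad.txt pad.txt > pad.tmp && mv pad.tmp pad.txt   # double the padding each round
done
cat pad.txt >> big.txt                      # big.txt now has ~4 million lines
Running the 2-pass script against big.txt big.txt should still print only the 15 and 89 lines, since the padding merely repeats values that are already filtered out.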
Another awk idea that uses some extra memory but only requires a single pass through the input file:
awk '
    { seen[$2]++                   # same counting as the 2-pass version
      if ($2 != $3)
         seen[$3]++
      if (seen[$2] <= 1 && seen[$3] <= 1)
         lines[++c]=$0             # save line as a candidate for printing
    }
END { for (i=1;i<=c;i++) {
          split(lines[i],arr)
          if (seen[arr[2]] <= 1 && seen[arr[3]] <= 1)   # re-check the $2/$3 counts
             print lines[i]
      }
    }
' sample.txt
This also generates:
15 uvw xyz
89 ntz uhg
Performance of this one is going to depend on the number of unique $2/$3 values and thus the amount of memory that has to be allocated/processed. For my 4-million-row sample.txt (where the added rows are all duplicates, so little additional memory is used) the run time comes in at ~1.7 seconds ... a tad better than the 2-pass solution (~3 secs), but for real-world data (with a large volume of unique $2/$3 values) I'm guessing the times will be a bit closer.
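If you want to compare the two approaches on your own data, a minimal harness sketch (two_pass.awk and one_pass.awk are assumed to hold the scripts above, and big.txt is your large test file):
time awk -f two_pass.awk big.txt big.txt > /dev/null      # 2-pass: reads the file twice, stores only seen[] counts
time awk -f one_pass.awk big.txt         > /dev/null      # 1-pass: reads once, also buffers candidate lines
/usr/bin/time -v awk -f one_pass.awk big.txt > /dev/null  # if GNU time is installed, -v also reports peak memory (max RSS)
That should confirm whether the extra memory of the 1-pass version is acceptable for your 400-500 MByte files.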