if a value from one column is present in another column then ignore/delete the line in output using awk


I would like to check whether a column 2 value is present anywhere in column 3, and vice versa; if it is, I would like to remove that line from the output.

sample.txt

3   abc def
5   ghk lmn
8   opq abc
10  lmn rst
15  uvw xyz
4   bcd abc
89  ntz uhg

so far I have

awk '{
    x[$2] = $2
    y[$3] = $3
    if (!(($2 in y) || ($3 in x)))
        print $1, $2, $3
}' sample.txt

I would like to have the output as follows.

15  uvw xyz
89  ntz uhg

I understand that awk reads the file line by line, and the code I have is inadequate, as it cannot check against array indexes that have not been seen yet; it therefore reports the first occurrence of each duplicated value. I would like to see if this can be done in a simpler way in awk, as my real data set is very large (up to 5 million lines, and 400-500 MBytes). Thanks!
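
For illustration (this output is not in the original post), the attempt above produces the following on sample.txt, keeping the first occurrence of each later-duplicated value instead of dropping it:

3   abc def
5   ghk lmn
15  uvw xyz
89  ntz uhg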

CodePudding user response:

One awk idea using two passes of the input file:

awk '

# 1st pass:

FNR==NR { seen[$2]++           # increment our seen counter for $2
          if ($2 != $3)        # do not increment seen[] if $2==$3
             seen[$3]++        # increment our seen counter for $3
          next
        }

# 2nd pass:

seen[$2] <= 1 &&               # if seen[] counts are <= 1 for both
seen[$3] <= 1                  # $2 and $3 then print current line
' sample.txt sample.txt

This generates:

15  uvw xyz
89  ntz uhg
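
To make the counting concrete (a worked trace, not part of the original answer), the seen[] counts after the 1st pass over sample.txt are:

seen[abc]=3  seen[def]=1  seen[ghk]=1  seen[lmn]=2  seen[opq]=1  seen[rst]=1
seen[uvw]=1  seen[xyz]=1  seen[bcd]=1  seen[ntz]=1  seen[uhg]=1

Only "15 uvw xyz" and "89 ntz uhg" have both fields at a count of 1. "5 ghk lmn", for example, is dropped in the 2nd pass because seen[lmn] is 2 (lmn also appears in $2 of "10 lmn rst").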

Copying the first 4 lines over and over again until sample.txt contained ~4 million lines, then running this awk script, generated the same 2 lines of output and took ~3 seconds on my system (a VM running on a low-end 9xxx i7).
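
One way to build such a test file (a sketch with hypothetical file names; the answer does not show the exact commands used):

head -4 sample.txt > big.txt
for i in $(seq 20); do
    cat big.txt big.txt > tmp && mv tmp big.txt   # doubles each pass: 4 * 2^20 = ~4.2M lines
done

Then run the 2-pass script as awk '...' big.txt big.txt.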


Another awk idea that uses some extra memory but only requires a single pass through the input file:

awk '
    { seen[$2]++                            # increment seen counter for $2
      if ($2 != $3)                         # do not increment seen[] if $2==$3
         seen[$3]++                         # increment seen counter for $3
      if (seen[$2] <= 1 && seen[$3] <= 1)   # if no duplicates seen so far ...
         lines[++c]=$0                      # ... save line as a candidate
    }
END { for (i=1;i<=c;i++) {
          split(lines[i],arr)               # arr[2]==$2, arr[3]==$3
          if (seen[arr[2]] <= 1 && seen[arr[3]] <= 1)
             print lines[i]                 # print only if still unique
      }
    }
' sample.txt

This also generates:

15  uvw xyz
89  ntz uhg
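
To spell out why the END block re-checks the saved lines (an illustrative note): "5 ghk lmn" is stored in lines[] during the main pass because, at that point, seen[ghk] and seen[lmn] are both 1; the very next line ("10 lmn rst") pushes seen[lmn] to 2, so every saved candidate must be re-tested against the final counts before printing.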

Performance on this one is going to depend on the number of unique $2/$3 values, and thus the amount of memory that has to be allocated/processed. For my 4-million-row sample.txt (where almost all rows are duplicates, so little additional memory is used) the run time comes in at ~1.7 seconds ... a tad better than the 2-pass solution (~3 secs), but for real-world data (with a large volume of unique $2/$3 values) I'm guessing the times will be a bit closer.
