(awk) Remove lines containing two or more specific words

Time:09-01

I have this:

awk -v p="WORD1" 'FNR==1{x=0}{x+=gsub(p,p);if(x>1){print FILENAME;nextfile}}' *

This finds files that contain the word WORD1 two or more times. Could you please tell me how to remove these lines (which contain the word WORD1 two or more times) from the file?

Thank you!

CodePudding user response:

"Could you please tell me how to remove these lines (which contain the word WORD1 two or more times) from the file?"

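If you have GNU awk, you can use this solution. It splits each line on the whole word WORD1 and keeps only lines with at most two fields, that is, at most one occurrence of WORD1: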

awk -i inplace -F '\\<WORD1\\>' 'NF <= 2' *
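To see why the NF test works: a separator that occurs n times in a line splits it into n+1 fields, so NF-1 is the number of occurrences. Here is a minimal sketch of that counting. As an assumption for portability it uses the plain string WORD1 as the separator instead of the GNU word-boundary regex, so unlike the command above it would also count WORD1 inside longer words:

```shell
# Print the number of times the separator occurs on each line (NF - 1).
# Assumption: plain 'WORD1' separator (no \< \>), so this runs in any POSIX awk.
out=$(printf '%s\n' \
  'this line has 0 entries' \
  'this line has 1 WORD1 entry' \
  'this line has 2 WORD1 WORD1 entries' |
  awk -F 'WORD1' '{ print NF - 1 }')
printf '%s\n' "$out"
```

This prints 0, 1 and 2, one per line, so NF <= 2 keeps only the first two lines.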

CodePudding user response:

This might work for you (GNU sed):

sed -E '/(\<\S+\>).*\<\1\>/d' file

If any word occurs two or more times in a line, delete the line. Note that this applies to every repeated word, not only WORD1.

CodePudding user response:

Sample input:

$ cat sample.dat
this line has 0 entries
this line has 1 WORD1 entry
this line has 2 WORD1 WORD1 entries
this line has 3 WORD1 WORD1 WORD1 entries

One awk idea:

$ awk -v p="WORD1" 'gsub(p,p)<=1' sample.dat
this line has 0 entries
this line has 1 WORD1 entry

If:

  • results are correct and ...
  • OP wants to overwrite the input file and ...
  • OP has GNU awk

Then OP can add -i inplace, e.g.:

$ awk -v p="WORD1" -i inplace 'gsub(p,p)<=1' sample.dat

If OP's awk doesn't support -i inplace then the output can be saved to a temp file and then mv tmpfile sample.dat.
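A minimal sketch of that temp-file approach, for an awk without -i inplace. The directory and file here are created only for the demo (hypothetical names), and mktemp is assumed to be available:

```shell
# Demo setup: a throwaway directory holding a copy of the sample data.
dir=$(mktemp -d)
f="$dir/sample.dat"
printf '%s\n' \
  'this line has 0 entries' \
  'this line has 1 WORD1 entry' \
  'this line has 2 WORD1 WORD1 entries' > "$f"

# Write the filtered output to a temp file in the same directory,
# and replace the original only if awk succeeded.
tmp=$(mktemp "$dir/tmp.XXXXXX")
awk -v p="WORD1" 'gsub(p,p)<=1' "$f" > "$tmp" && mv "$tmp" "$f"

result=$(cat "$f")
printf '%s\n' "$result"
```

Because mv swaps in a new file, the original's permissions and any hard links are not carried over; if that matters, cat "$tmp" > "$f" followed by rm "$tmp" is one alternative.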


One potential issue with gsub(p,p): it matches substrings, so WORD1 also matches inside BADWORD12. If OP only wants a 'whole word' match then we need to add a bit more code.
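A quick way to see the over-count, as a small sketch on a single line (the variable names are just for illustration):

```shell
# gsub(p,p) returns the number of substring matches, so the WORD1 inside
# BADWORD12 is counted along with the standalone WORD1.
line='this line has 1 WORD1 BADWORD12 entry'
count=$(printf '%s\n' "$line" | awk -v p="WORD1" '{ print gsub(p,p) }')
echo "$count"
```

This prints 2, so a gsub(p,p)<=1 filter would drop the line even though WORD1 appears only once as a whole word.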

Updating our sample input:

$ cat sample.dat
this line has 0 entries
this line has 1 WORD1 entry
this line has 2 WORD1 WORD1 entries
this line has 3 WORD1 WORD1 WORD1 entries
this line has 1 WORD1 BADWORD12 entry
WORD1 this line has 2 matching entries WORD1

One awk idea:

awk -v p="WORD1" '

function regex_cnt() {
    c=0
    x=$0
    while (match(x,regex)) {
          c++                             # count this match
          x=substr(x,RSTART+RLENGTH-1)    # resume at the last matched character so it can delimit the next match
    }
    return c
}

BEGIN { regex="(^|[^[:alnum:]])" p "([^[:alnum:]]|$)" }   # add checks for beginning-of-line, non-alphanumeric and end-of-line to our regex

regex_cnt()<=1
' sample.dat

This generates:

this line has 0 entries
this line has 1 WORD1 entry
this line has 1 WORD1 BADWORD12 entry