I have this:
awk -v p="WORD1" 'FNR==1{x=0}{x+=gsub(p,p);if(x>1){print FILENAME;nextfile}}' *
This finds files that contain the word WORD1 two or more times. Could you please tell me how to remove these lines (which contain the word WORD1 two or more times) from the file?
Thank you!
CodePudding user response:
Could you please tell me how to remove these lines (which contain the word WORD1 two or more times) from the file
If you have GNU awk, you can use this solution:
awk -i inplace -F '\\<WORD1\\>' 'NF <= 2' *
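To see why the NF <= 2 test works, here is a minimal sketch that prints the per-line count instead of filtering. It assumes a throwaway file created with mktemp and, to stay portable, splits on the plain string WORD1 without the GNU-only \< and \> word boundaries, so it also counts substring matches.

```shell
# Hypothetical demo file; its name and contents are assumptions for illustration.
demo=$(mktemp)
printf '%s\n' 'no match here' 'one WORD1 here' 'two WORD1 WORD1 here' > "$demo"

# Splitting a line on the word yields (occurrences + 1) fields,
# so NF - 1 is the number of occurrences on that line.
counts=$(awk -F 'WORD1' '{ print NF - 1 }' "$demo")
echo "$counts"    # prints 0, 1 and 2 on separate lines

rm -f "$demo"
```

A line is therefore kept by NF <= 2 exactly when the separator occurs at most once.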
CodePudding user response:
This might work for you (GNU sed):
sed -E '/(\<\S+\>).*\<\1\>/d' file
If any word appears more than once on a line, the line is deleted. Note that this applies to every repeated word, not only WORD1.
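If only WORD1 should trigger the deletion, a variant along these lines might do. This is a sketch that assumes GNU sed (for \< and \>) and a throwaway demo file; it is not part of the answer above.

```shell
# Hypothetical demo file; its name and contents are assumptions for illustration.
demo=$(mktemp)
printf '%s\n' 'one WORD1 here' 'two WORD1 WORD1 here' 'the cat saw the dog' > "$demo"

# Delete only lines where the whole word WORD1 occurs at least twice.
# \< and \> are GNU sed word-boundary extensions.
kept=$(sed '/\<WORD1\>.*\<WORD1\>/d' "$demo")
echo "$kept"

rm -f "$demo"
```

Here the line that repeats "the" survives, because only WORD1 is checked.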
CodePudding user response:
Sample input:
$ cat sample.dat
this line has 0 entries
this line has 1 WORD1 entry
this line has 2 WORD1 WORD1 entries
this line has 3 WORD1 WORD1 WORD1 entries
One awk idea:
$ awk -v p="WORD1" 'gsub(p,p)<=1' sample.dat
this line has 0 entries
this line has 1 WORD1 entry
If:
- results are correct and ...
- OP wants to overwrite the input file and ...
- OP has GNU awk
Then OP can add -i inplace, e.g.:
$ awk -v p="WORD1" -i inplace 'gsub(p,p)<=1' sample.dat
If OP's awk doesn't support -i inplace then the output can be saved to a temp file and then mv tmpfile sample.dat.
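The temp-file route could look roughly like this. It is a sketch: both file paths come from mktemp and stand in for sample.dat and tmpfile.

```shell
# Hypothetical stand-in for sample.dat; contents copied from the sample input.
file=$(mktemp)
printf '%s\n' \
    'this line has 0 entries' \
    'this line has 1 WORD1 entry' \
    'this line has 2 WORD1 WORD1 entries' > "$file"

tmpfile=$(mktemp)
# Write the filtered lines to the temp file, and replace the original
# only if awk exited successfully.
awk -v p="WORD1" 'gsub(p,p)<=1' "$file" > "$tmpfile" && mv "$tmpfile" "$file"

result=$(cat "$file")
echo "$result"

rm -f "$file"
```

Chaining with && means a failed awk run leaves the original file untouched.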
One potential issue with gsub(p,p) ... this will cause WORD1 to match on BADWORD123; if OP only wants a 'whole word' match then we need to add a bit more code.
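A quick way to see the substring problem (the test line here is made up for illustration):

```shell
# Hypothetical input line containing one whole-word WORD1 and one BADWORD123.
line='this line has 1 WORD1 BADWORD123 entry'

# gsub() returns the number of substitutions, which includes the
# WORD1 embedded inside BADWORD123.
count=$(printf '%s\n' "$line" | awk -v p="WORD1" '{ print gsub(p,p) }')
echo "$count"    # prints 2
```

So a line with a single real WORD1 would be treated as having two and get dropped.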
Updating our sample input:
$ cat sample.dat
this line has 0 entries
this line has 1 WORD1 entry
this line has 2 WORD1 WORD1 entries
this line has 3 WORD1 WORD1 WORD1 entries
this line has 1 WORD1 BADWORD12 entry
WORD1 this line has 2 matching entries WORD1
One awk idea:
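awk -v p="WORD1" '
function regex_cnt(   c, x) {            # c and x are local to the function
    c = 0
    x = $0
    while (match(x, regex)) {
        c++                              # count this whole-word match
        x = substr(x, RSTART+RLENGTH-1)  # resume at the last matched character so a shared delimiter can start the next match
    }
    return c
}
BEGIN { regex = "(^|[^[:alnum:]])" p "([^[:alnum:]]|$)" }   # add checks for beginning-of-line, non-alphanumeric and end-of-line to our regex
regex_cnt() <= 1
' sample.dat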
This generates:
this line has 0 entries
this line has 1 WORD1 entry
this line has 1 WORD1 BADWORD12 entry