awk remove endings of words ending with patterns

Time:11-10

I have a large dataset and am trying to lemmatize a column ($14) with awk; that is, I need to remove 'ing', 'ed', or 's' from any word that ends with one of those patterns. So 'asked', 'asks', and 'asking' would all become just 'ask'.

Let's say I have this dataset (the column I want to modify is $2):

onething      This is a string that is tested multiple times.
twoed         I wanted to remove words ending with many patterns.
threes        Reading books is good thing.

With that, the expected output is:

onething      Thi i a str that i test multiple time.
twoed         I want to remove word end with many pattern.
threes        Read book i good th.

I tried the following regex with awk, but it didn't work.

awk -F'\t' '{gsub(/\(ing|ed|s\)\b/," ",$2); print}' file.txt  

# This replaces some of the words ending in 'ing' and 'ed', but not all; words ending in 's' stay the same (which I don't want).

Please help, I'm new to awk and still exploring it.

CodePudding user response:

Using GNU awk for gensub() and \> to match the end of a word:

$ awk 'BEGIN{FS=OFS="\t"} {$2=gensub(/(ing|ed|s)\>/,"","g",$2)} 1' file
onething        Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
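
For context on why the attempt in the question fell short, here is a small sketch that reruns its regex on one sample line. In awk's extended regular expressions, plain ( ) already group, so \( and \) match literal parentheses, and \b is a backspace character rather than a word boundary. The regex therefore has three alternatives: a literal "(ing", a bare "ed" anywhere in the text, and "s)" followed by a backspace. The function name try_original and the "#" replacement (used instead of a space so the change is visible) are hypothetical choices for this illustration.

```shell
# Hypothetical wrapper around the regex from the question.
# "#" is used as the replacement so the substitution is easy to see.
try_original() {
  awk -F'\t' '{ gsub(/\(ing|ed|s\)\b/, "#", $2); print }'
}

# Only the bare "ed" alternative can match ordinary text here.
printf 'x\tasked asks asking\n' | try_original
# prints: x ask# asks asking
```

Note that the tab between the columns comes out as a single space: assigning to $2 makes awk rebuild the record with OFS, which defaults to a space. That is why the answers set OFS to a tab as well.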

CodePudding user response:

Using any awk with gsub(), you could do:

awk -F'\t' -v OFS="\t" '
    { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
      match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
    }1
' file

Example Input File

$ cat file
onething        This is a string that is tested multiple times.
twoed   I wanted to remove words ending with many patterns.
threes  Reading books is good thing.
four    Just a normal sentence.

Example Use/Output

$ awk -F'\t' -v OFS="\t" '
>     { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
>       match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
>     }1
> ' file
onething        Thi i a str that i test multiple time.
twoed   I want to remove word end with many pattern.
threes  Read book i good th.
four    Just a normal sentence.

(Note: the last line was added as an example of a sentence that is left unchanged.)
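
The gsub() approach above strips an ending only when it is followed by a period or a blank, so a word followed by a comma or other punctuation would be left alone. As a rough alternative, here is a sketch that works word by word in any POSIX awk: it splits the second column on whitespace, sets aside trailing punctuation, strips one trailing ing/ed/s, and reassembles the column. The function name lemmatize and the punctuation set [.,;:!?] are assumptions made for this illustration, not part of either answer.

```shell
# Hypothetical helper: strip one trailing ing/ed/s from each word of column 2.
# Reads tab-separated input from the named files, or from stdin if none given.
lemmatize() {
  awk 'BEGIN { FS = OFS = "\t" }
  NF >= 2 {
    n = split($2, words, " ")          # split column 2 on runs of whitespace
    out = ""
    for (i = 1; i <= n; i++) {
      w = words[i]
      punct = ""
      # Set trailing punctuation aside so the suffix sits at the end of w.
      if (match(w, /[.,;:!?]+$/)) {
        punct = substr(w, RSTART)
        w = substr(w, 1, RSTART - 1)
      }
      sub(/(ing|ed|s)$/, "", w)        # remove a single matching suffix
      out = out (i > 1 ? " " : "") w punct
    }
    $2 = out                           # record is rebuilt using OFS (tab)
  }
  { print }' "$@"
}

lemmatize file
```

Because the column is rebuilt from the split words, runs of spaces inside it collapse to a single space. Like the answers above, this is purely mechanical suffix stripping, so words such as "This" and "string" are shortened too, which matches the expected output in the question.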
