I have a large dataset and am trying to lemmatize a column ($14) with awk; that is, I need to remove 'ing', 'ed', or 's' from any word that ends with one of those patterns. So asked, asks, and asking would all become 'ask'.
Let's say I have this dataset (the column I want to modify is $2):
onething This is a string that is tested multiple times.
twoed I wanted to remove words ending with many patterns.
threes Reading books is good thing.
With that, the expected output is:
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
I have tried the following regex with awk, but it didn't work.
awk -F'\t' '{gsub(/\(ing|ed|s\)\b/," ",$2); print}' file.txt
# This replaces some of the words ending with ing and ed, but not all; words ending with s stay the same (which I don't want).
Please help, I'm new to awk and still exploring it.
CodePudding user response:
Using GNU awk for gensub() and \> for word boundaries:
$ awk 'BEGIN{FS=OFS="\t"} {$2=gensub(/(ing|ed|s)\>/,"","g",$2)} 1' file
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
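As for why the original attempt misbehaved, a likely cause is the escaping. awk uses extended regular expressions, where \( and \) match literal parentheses instead of forming a group, and \b is a backspace character rather than a word boundary (GNU awk spells the boundary \y, \< or \>). Below is a small sketch of the first point only; the input line is made up for illustration:

```shell
# Escaped parentheses are literal characters in awk regexes, so this
# pattern means: the text "(ing", or the text "ed)". The plain word
# "tested" is left alone because its "ed" is not followed by ")".
printf 'tested (ing) (ed)\n' | awk '{ gsub(/\(ing|ed\)/, "_"); print }'
# prints: tested _) (_
```

Dropping the backslashes, as in the gensub() call above, turns the parentheses back into a group.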
CodePudding user response:
Using any awk with gsub, you could do:
awk -F'\t' -v OFS="\t" '
{ gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
}1
' file
Example Input File
$ cat file
onething This is a string that is tested multiple times.
twoed I wanted to remove words ending with many patterns.
threes Reading books is good thing.
four Just a normal sentence.
Example Use/Output
$ awk -F'\t' -v OFS="\t" '
> { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
> match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
> }1
> ' file
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
four Just a normal sentence.
(Note: the last line was added as an example of a sentence that is left unchanged.)
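If sentences may end in punctuation other than a period, or a suffix may sit right before a comma, one possible variation is to handle the field word by word. The following is only a sketch under some assumptions: strip_suffixes is a hypothetical helper name, the set of trailing punctuation characters is a guess, and runs of blanks inside $2 are collapsed to single spaces.

```shell
# strip_suffixes is a hypothetical name; it reads tab-separated input
# from the named files or from stdin and rewrites column 2.
strip_suffixes() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # Split column 2 on whitespace and rebuild it word by word.
        n = split($2, words, " ")
        out = ""
        for (i = 1; i <= n; i++) {
            w = words[i]
            punct = ""
            # Set trailing punctuation aside (assumed set: . , ; : ! ?).
            if (match(w, /[.,;:!?]+$/)) {
                punct = substr(w, RSTART)
                w = substr(w, 1, RSTART - 1)
            }
            # Remove one suffix at the end of the bare word.
            sub(/(ing|ed|s)$/, "", w)
            out = out (i > 1 ? " " : "") w punct
        }
        $2 = out
        print
    }' "$@"
}

printf 'onething\tThis is a string that is tested multiple times.\n' | strip_suffixes
```

Like the versions above, this strips the suffix blindly, so short words such as "is" or "red" are cut as well.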