Stemming a text file to remove suffixes given linewise in another file using sed

Time:09-21

I have a file, suffix.txt, that contains one string per line, for example:

ing
ness
es
ed
tion

I also have a text file, text.txt, which is guaranteed to contain only lowercase letters and no punctuation, for example:

the raining cloud answered the man all his interrogation and with all
questioned mind the princess responded
harness all goodness without getting irritated

I want to remove each suffix from the words in text.txt, but only where the original word ends in that suffix, so that a word is stripped at most once. I expect the following output:

the rain cloud answer the man all his interroga and with all
question mind the princess respond
har all good without gett irritat

Note that tion was not removed from questioned, since the original word does not end in tion. An answer using sed would be really helpful. I was using this naive script, which doesn't do the job:

#!/bin/bash

while read p; do
  sed -i "s/$p / /g" text.txt;
  sed -i "s/$p$//g" text.txt;
done <suffix.txt
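
A likely reason the loop misbehaves is that each pass edits text.txt in place, so later suffixes are matched against already-stripped words rather than the originals. The sketch below applies the same two substitutions to standard input instead of a file, which makes the effect easy to observe. The function name naive_stem is made up for illustration.

```shell
# naive_stem SUFFIX_FILE
# Hypothetical helper: applies the same two substitutions as the loop above,
# one suffix at a time, but to stdin instead of editing text.txt in place.
naive_stem() {
    text=$(cat)
    while IFS= read -r p; do
        # Each pass sees the result of the previous pass, not the original word.
        text=$(printf '%s\n' "$text" | sed "s/$p / /g; s/$p\$//g")
    done <"$1"
    printf '%s\n' "$text"
}
```

With the suffix list from the question, `printf 'questioned mind\n' | naive_stem suffix.txt` turns questioned into question on the ed pass and then into ques on the tion pass, which is exactly the removal the expected output rules out.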

CodePudding user response:

An awk solution:

$ awk '
NR==FNR {                   # generate a regex of suffixes
    s=s (s==""?"(":"|") $0  # (ing|ness|es|ed|tion)$
    next
}
FNR==1 {
    s=s ")$"                # well, above )$ is inserted here
}
{
    for(i=1;i<=NF;i++)      # iterate all the words and
        sub(s,"",$i)        # apply regex to each of them
}1' suffix.txt text.txt     # output

Output:

the rain cloud answer the man all his interroga and with all
question mind the princess respond
har all good without gett irritat

CodePudding user response:

Kinda hairy but sed and unix tools only:

sed -E -f <(tr '\n' '|' <suffix.txt | sed 's/|$//; s/|/\\\\b|/g; s/$/\\\\b/' | xargs printf 's/%s//g') text.txt

The inner pipeline

tr '\n' '|' <suffix.txt | sed 's/|$//; s/|/\\\\b|/g; s/$/\\\\b/' | xargs printf 's/%s//g'

generates the substitution script of

s/ing\b|ness\b|es\b|ed\b|tion\b//g

This requires GNU sed for \b.
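
If GNU sed is not available, the word-boundary test can be approximated without \b. Because text.txt is stated to contain only lowercase letters, every word ends either at a space or at the end of the line, so the suffix can be required to sit directly before one of those. A minimal sketch, assuming the suffixes contain no regex metacharacters or slashes (the function name stem_portable is illustrative):

```shell
# stem_portable SUFFIX_FILE
# Hypothetical helper: reads text on stdin and strips one listed suffix
# from the end of each word, without relying on GNU-only \b.
stem_portable() {
    # Join the suffix lines with "|" into an ERE alternation,
    # e.g. ing|ness|es|ed|tion
    alt=$(paste -sd'|' "$1")
    # Match a suffix only when followed by a space or end of line,
    # and put that space (or nothing) back via \2.
    sed -E "s/($alt)( |\$)/\\2/g"
}
```

It would be called as `stem_portable suffix.txt <text.txt`. Like the \b version, it also deletes a word that consists solely of a suffix, such as a standalone ed.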

It would be easier with perl, ruby, awk, etc.

Here is a GNU awk version:

gawk -i join 'FNR==NR {arr[FNR]=$1; next}
FNR==1{re=join(arr,1,length(arr),"\\>|"); re=re "\\>"}
{gsub(re,"")}
1
' suffix.txt text.txt

Both produce:

the rain cloud answer the man all his interroga and with all
question mind the princess respond
har all good without gett irritat

CodePudding user response:

This might work for you (GNU sed):

sed -z 'y/\n/|/;s/|$//;s#.*#s/\\B(&)\\b//g#' suffixFile | sed -Ef - textFile

The first sed converts suffixFile into a single sed substitution command, which is piped to a second sed invocation (reading its script from stdin via -f -) that applies it to textFile.

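N.B. The generated command uses \B before and \b after the alternation, so a suffix matches only at the end of a word and only when at least one word character precedes it.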
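
The effect of the leading \B is easiest to see on a word that is itself a suffix. The following sketch (GNU sed assumed, as in the answer; the two function names are illustrative) compares the guarded and unguarded forms using the hardcoded suffix ed:

```shell
# Guarded: \B requires a word character before the suffix,
# so a standalone "ed" is left alone.
strip_guarded()   { sed -E 's/\B(ed)\b//g'; }

# Unguarded: any "ed" at the end of a word is removed,
# including a word that is nothing but "ed".
strip_unguarded() { sed -E 's/(ed)\b//g'; }

printf 'ed edited\n' | strip_guarded     # prints "ed edit"
printf 'ed edited\n' | strip_unguarded   # prints " edit"
```

Neither form touches the ed at the start of edited, because it is followed by another letter rather than a word boundary.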

CodePudding user response:

You can try this sed approach.

You will first need to create an array from suffix.txt:

suffix=($(cat suffix.txt))

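You can then use it for substitution within the main sed command. Note that the /question/ guards are specific to this sample input and the patterns are not anchored to the end of a word: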

sed "s/${suffix[0]}//;s/${suffix[1]}//g;/question/! {s/${suffix[2]}//};s/${suffix[3]}//g;/question/! {s/${suffix[4]}//}" text.txt

Output

the rain cloud answer the man all his interroga and with all
question mind the princess respond
har all good without gett irritat