Remove tuple based on character count

I have a dataset in which each row holds a tuple of words. I want to remove the words that contain fewer than 4 characters, but I could not figure out how to make my code iterate over them.

Here is a sample of my data:

                   content                clean4Char
0         [yes, no, never]                   [never]
1    [to, every, contacts]         [every, contacts]
2 [words, tried, describe]  [words, tried, describe]
3          [word, you, go]                    [word]

Here is the code I'm working with (it keeps raising an error):

def remove_single_char(text):
    text = [word for word in text]
    return re.sub(r"\b\w{1,3}\b"," ", word)

df['clean4Char'] = df['content'].apply(lambda x: remove_single_char(x))
df.head(3)

CodePudding user response：

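The problem is with your remove_single_char function: re.sub expects a string, but each cell holds a list of words, and the variable word does not exist outside the list comprehension. Filtering the list by word length does the job. There is also no need for a lambda, since you can pass the function directly to apply: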

def remove(words):
    # keep only the words that have at least 4 characters
    return [word for word in words if len(word) >= 4]

df['clean4Char'] = df['content'].apply(remove)
df.head(3)
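To check the filtering logic without a DataFrame, here is a minimal sketch in plain Python. The rows are copied from the sample in the question; the names keep_long_words and min_len are only illustrative and do not come from the original post.

```python
# Rows copied from the sample data in the question.
rows = [
    ["yes", "no", "never"],
    ["to", "every", "contacts"],
    ["words", "tried", "describe"],
    ["word", "you", "go"],
]

def keep_long_words(words, min_len=4):
    """Return only the words that have at least min_len characters."""
    return [word for word in words if len(word) >= min_len]

# Apply the filter to every row, the same way Series.apply would.
cleaned = [keep_long_words(row) for row in rows]

for before, after in zip(rows, cleaned):
    print(before, "->", after)
```

The same function can be passed to df['content'].apply in place of remove. Note the comparison is >= 4, so a four-letter word such as "word" is kept, which matches the expected output in the question.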

CodePudding user response：

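We can use str.replace here for a Pandas option. This assumes each cell holds a string such as "yes, no, never"; if the cells are Python lists, the text-based .str methods will not process them and the list filter from the other answer is the better fit: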

df["clean4Char"] = df["content"].str.replace(r'\b\w{1,3}\b,?\s*', '', regex=True)

The regex used here says to match:

\b a word boundary (only match entire words)
\w{1,3} a word with no more than 3 characters
\b closing word boundary
,? optional comma
\s* optional whitespace

We then replace each match with an empty string, which removes every word of three characters or fewer along with its optional trailing comma and whitespace.
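The pattern can be tried on plain strings with Python's re module. This is a small sketch that assumes the words arrive as comma-separated strings; the sample strings and the helper name drop_short_words are made up for illustration.

```python
import re

# Same pattern as above: a 1-3 character word, an optional comma, optional whitespace.
SHORT_WORD = re.compile(r"\b\w{1,3}\b,?\s*")

def drop_short_words(text):
    """Remove words of one to three characters from a comma-separated string."""
    return SHORT_WORD.sub("", text)

samples = [
    "yes, no, never",
    "to, every, contacts",
    "words, tried, describe",
    "word, you, go",
]

cleaned = [drop_short_words(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

One caveat: when the short words come last, as in "word, you, go", the comma that follows the preceding long word is left behind, so the result is "word, " and may need an extra cleanup step such as rstrip(", ").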
