I have a dataset consisting of tuple of words. I want to remove words that contain less than 4 characters, but I could not figure out a way to iterate my codes.
Here is a sample of my data:
content clean4Char
0 [yes, no, never] [never]
1 [to, every, contacts] [every, contacts]
2 [words, tried, describe] [words, tried, describe]
3 [word, you, go] [word]
Here is the code that I'm working with (it keeps showing me error warning).
def remove_single_char(text):
text = [word for word in text]
return re.sub(r"\b\w{1,3}\b"," ", word)
df['clean4Char'] = df['content'].apply(lambda x: remove_single_char(x))
df.head(3)
CodePudding user response:
the problem is with your remove_single_char
function. This will do the job:
Also there is no need to use lambda since you already are passing a function to applay
def remove(input):
return list(filter(lambda x: len(x) > 4, input))
df['clean4Char'] = df['content'].apply(remove)
df.head(3)
CodePudding user response:
We can use str.replace
here for a Pandas option:
df["clean4Char"] = df["content"].str.replace(r'\b\w{1,3}\b,?\s*', '', regex=True)
The regex used here says to match:
\b
a word boundary (only match entire words)\w{1,3}
a word with no more than 3 characters\b
closing word boundary,?
optional comma\s*
optional whitespace
We then replace with empty string to effectively remove the 3 letter or less matching words along with optional trailing whitespace and comma.
Here is a regex demo showing that the replacement logic is working.