I have a pandas DataFrame with a column named "content" that contains text. I want to remove some words from each text in this column. I thought of replacing each word with an empty string, but when I print the result of my function I see that the words have not been removed. My code is below:
def replace_words(t):
    words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article' ]
    for i in t:
        if i in words:
            t.replace (i, '')
        else:
            continue
    print(t)

st = 'this is Livre and Chapitre and Titre and Chapter and Article'
replace_words(st)
An example of desired result is: 'this is and and and and '
With the code below I want to apply the function above to each text in the column "content":
df['content'].apply(lambda x: replace_words(x))
Can someone help me to create a function that removes all the words I need and then apply this function to all the texts within my df column?
CodePudding user response:
You can use str.replace.
Input:
import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID' : np.arange(4),
    'words' : ['this is Livre and Chapitre and Titre and Chapter and Article',
               'this is car and Chapitre and bus and Chapter and Article',
               'this is Livre and Chapitre',
               'nothing to replace']
})
words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']
pat = '|'.join(map(re.escape, words))
print(pat)
Livre|Chapitre|Titre|Chapter|Article
df['words'] = df['words'].str.replace(pat, '', regex=True)
print(df)
ID words
0 0 this is and and and and
1 1 this is car and and bus and and
2 2 this is and
3 3 nothing to replace
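If only whole words should be removed (so that, say, 'Chapters' is left alone), one possible variation is to wrap the alternation in \b word boundaries. This is a sketch under that assumption, not part of the answer above; the sample rows and the names pat_whole and cleaned are made up for illustration.

```python
import re

import pandas as pd

words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']

# \b anchors each match at word boundaries, so 'Chapters' is not touched.
# re.escape protects any regex metacharacters that a word might contain.
pat_whole = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'

# Hypothetical sample data, only for this illustration.
df = pd.DataFrame({'words': ['this is Livre and Chapters', 'nothing to replace']})

# str.replace returns a new Series, so the result is stored in a new column.
df['cleaned'] = df['words'].str.replace(pat_whole, '', regex=True)
print(df['cleaned'].tolist())
```

The removed words leave their surrounding spaces behind, so the cleaned text can contain double spaces.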
CodePudding user response:
Two problems:
- If you iterate using for i in t:, each i is a single character, not a word.
- t.replace does not work in place: it returns a new string, so the result must be assigned back to t.
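Both points can be checked in a few lines of plain Python. This is only an illustration; the sample string and the name result are made up here.

```python
t = 'this is Livre'

# Iterating over a string yields single characters, not words.
print(list(t)[:4])

# Splitting on whitespace yields the words instead.
print(t.split())

# str.replace returns a new string; the original is left unchanged.
result = t.replace('Livre', '')
print(repr(t))
print(repr(result))
```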
Use this:
def replace_words(t):
    words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']
    for i in t.split(' '):
        # print(i)  # uncomment to inspect each item (problem 1)
        if i in words:
            t = t.replace(i, '')
    return t
Edit: You can call df['col'].apply(replace_words) directly.
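Note that apply returns a new Series, so the result has to be assigned back to the column for the change to stick. A minimal sketch, assuming a column named content as in the question; the sample data is invented, and the function here loops over the word list directly, which is a slight simplification rather than the exact code from the answer.

```python
import pandas as pd


def replace_words(t):
    words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']
    for w in words:
        # Reassign on every pass, because str.replace returns a new string.
        t = t.replace(w, '')
    return t


# Hypothetical sample data, only for this illustration.
df = pd.DataFrame({'content': ['this is Livre and Chapitre', 'nothing to replace']})

# Assign the result back; apply does not modify the column in place.
df['content'] = df['content'].apply(replace_words)
print(df['content'].tolist())
```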