Replacing strings within a pandas data frame column

I have a pandas data frame with a column named "content" that contains text. I want to remove certain words from each text in this column. I tried replacing each of these words with an empty string, but when I print the result of my function, the words have not been removed. My code is below:

def replace_words(t):
  words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article' ]
  for i in t:
    if i in words:
      t.replace (i, '')
    else:
      continue
  print(t)


st = 'this is Livre and Chapitre and Titre and Chapter and Article'

replace_words(st)

An example of the desired result is: 'this is and and and and '

With the code below I want to apply the function above to each text in the column "content":

df['content'].apply(lambda x: replace_words(x))

Can someone help me write a function that removes all the words I need, and then apply it to every text in my df column?

CodePudding user response：

You can use str.replace with a regular expression that joins the words with "|". Input:

import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID' : np.arange(4),
    'words' : ['this is Livre and Chapitre and Titre and Chapter and Article',
               'this is car and Chapitre and bus and Chapter and Article',
               'this is Livre and Chapitre',
               'nothing to replace']
})

words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']
pat = '|'.join(map(re.escape, words))
print(pat)
# Livre|Chapitre|Titre|Chapter|Article

df['words'] = df['words'].str.replace(pat, '', regex=True)
print(df)

   ID                               words
0   0        this is  and  and  and  and 
1   1  this is car and  and bus and  and 
2   2                       this is  and 
3   3                  nothing to replace
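
The pattern above matches the words anywhere, including inside longer words, and leaves doubled spaces behind. If that matters, one possible variation (a sketch, not part of the original answer) wraps the alternation in \b word boundaries and collapses the leftover whitespace. The Series s and the name cleaned are made up for illustration:

```python
import re
import pandas as pd

words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']

# \b on both sides restricts the match to whole words,
# so 'Chapters' is left alone while 'Chapter' is removed.
pat = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'

# Hypothetical sample data for illustration only.
s = pd.Series(['this is Livre and Chapitre',
               'Chapters are kept',
               'nothing to replace'])

cleaned = (
    s.str.replace(pat, '', regex=True)     # drop the listed words
     .str.replace(r'\s+', ' ', regex=True) # collapse runs of whitespace
     .str.strip()                          # trim leading/trailing spaces
)
print(cleaned.tolist())
# ['this is and', 'Chapters are kept', 'nothing to replace']
```

Whether the extra whitespace cleanup is wanted depends on the data; the original answer keeps the spaces as they are.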

CodePudding user response：

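Two problems:

Iterating with for i in t: yields one character at a time, so each i is a letter, not a word.
t.replace does not work in place: strings are immutable, so it returns a new string that you have to assign back to t.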
Use this:

def replace_words(t):
    words = ['Livre', 'Chapitre', 'Titre', 'Chapter', 'Article']
    for i in t.split(' '):
        # print(i)  # uncomment to check that each i is now a whole word
        if i in words:
            # str.replace returns a new string, so assign it back to t
            t = t.replace(i, '')
    return t

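Edit: You can pass the function directly, without a lambda: df['col'].apply(replace_words). Since apply returns a new Series, assign the result back to the column to keep it.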
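
For completeness, here is a rough end-to-end sketch of applying a function to the "content" column from the question. Note that it uses a slightly different technique than the loop above: it filters the split tokens and joins them again, which drops whole words only and avoids doubled spaces. The sample data and the WORDS constant are assumptions for illustration:

```python
import pandas as pd

# Set of words to drop; a set gives fast membership tests.
WORDS = {'Livre', 'Chapitre', 'Titre', 'Chapter', 'Article'}

def replace_words(t):
    # Keep only the tokens that are not in WORDS, then rebuild the text.
    return ' '.join(w for w in t.split(' ') if w not in WORDS)

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    'content': ['this is Livre and Chapitre and Titre and Chapter and Article',
                'nothing to replace']
})

# apply returns a new Series, so assign it back to the column.
df['content'] = df['content'].apply(replace_words)
print(df['content'].tolist())
# ['this is and and and and', 'nothing to replace']
```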