Pandas - Remove part of string in column that is already in another column

Time:05-19

I have this DataFrame:

import pandas as pd

dfA = pd.DataFrame({
    'A': ['abc', 'ghi', 'mno', 'stu'],
    'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu']
})
dfA

And I want to get this DataFrame:

dfB = pd.DataFrame({
            'A': ['abc','ghi','mno', 'stu'],
            'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu'],
            'C': ['def', 'jkl', 'pqr', 'vwx'],
         })
dfB

Column 'C' must contain the part of each string in column 'B' that is not in the corresponding string in column 'A'.

I tried copying column 'B' into 'C' and then calling replace() on it as shown below, but it doesn't work:

dfA = pd.DataFrame({
            'A': ['abc','ghi','mno', 'stu'],
            'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu']
         })
dfA.loc[:,'C'] = dfA['B']

dfA['C'].replace(dfA['B'], '', regex=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_16112\1271772080.py in <cell line: 7>()
      5 dfA.loc[:,'C'] = dfA['B']
      6 
----> 7 dfA['C'].replace(dfA['B'], '', regex=True)

~\Anaconda3\envs\py310\lib\site-packages\pandas\core\series.py in replace(self, to_replace, value, inplace, limit, regex, method)
   4958         method: str | lib.NoDefault = lib.no_default,
   4959     ):
-> 4960         return super().replace(
   4961             to_replace=to_replace,
   4962             value=value,

~\Anaconda3\envs\py310\lib\site-packages\pandas\core\generic.py in replace(self, to_replace, value, inplace, limit, regex, method)
   6677                     # Operate column-wise
   6678                     if self.ndim == 1:
-> 6679                         raise ValueError(
   6680                             "Series.replace cannot use dict-like to_replace "
   6681                             "and non-None value"

ValueError: Series.replace cannot use dict-like to_replace and non-None value

Also, each string in 'A' is always a prefix or suffix of the corresponding string in 'B', so 'C' will be the remaining suffix or prefix. In other words, 'B' = 'A' + 'C' or 'B' = 'C' + 'A'. I also tried using - as a "de-concatenation" operator, but that doesn't work either.

Do you have any idea how I should do this instead?

Answer:

The substring to remove is different for every row, so you need to loop over the two columns here.

You can use re.sub (with re.escape, so that any special regex characters in 'A' are matched literally):

import re
dfA['C'] = [re.sub(re.escape(a), '', b) for a, b in zip(dfA['A'], dfA['B'])]

or str.replace:

dfA['C'] = [b.replace(a, '') for a,b in zip(dfA['A'], dfA['B'])]

output:

     A       B    C
0  abc  abcdef  def
1  ghi  jklghi  jkl
2  mno  mnopqr  pqr
3  stu  vwxstu  vwx
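
Note that both re.sub and str.replace remove every occurrence of the 'A' string, including ones in the middle of 'B'. Since the question says 'A' is always a prefix or suffix of 'B', a stricter variant is sketched below. The helper name strip_affix is hypothetical (it is not part of pandas or the original answer), and leaving 'B' unchanged when 'A' is neither a prefix nor a suffix is an assumption about the desired behavior:

```python
import pandas as pd


def strip_affix(a: str, b: str) -> str:
    """Remove `a` from `b` only where it is a prefix or a suffix.

    Hypothetical helper: if `a` is neither, `b` is returned unchanged.
    """
    if b.startswith(a):
        return b[len(a):]
    if b.endswith(a):
        return b[:len(b) - len(a)]
    return b


dfA = pd.DataFrame({
    'A': ['abc', 'ghi', 'mno', 'stu'],
    'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu'],
})

# Same row-wise loop as above, but only one leading or trailing match is removed.
dfA['C'] = [strip_affix(a, b) for a, b in zip(dfA['A'], dfA['B'])]
```

For the sample data this gives the same 'C' column as the output above. The difference only shows up when the 'A' string appears more than once in 'B': strip_affix('ab', 'abxab') keeps the trailing 'ab', whereas 'abxab'.replace('ab', '') removes both.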