I have this dataframe :
dfA = pd.DataFrame({
'A': ['abc','ghi','mno', 'stu'],
'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu']
})
dfA
And I want to get this dataframe :
dfB = pd.DataFrame({
'A': ['abc','ghi','mno', 'stu'],
'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu'],
'C': ['def', 'jkl', 'pqr', 'vwx'],
})
dfB
The column 'C' must contains the substrings of the column 'B' that is not in the strings in column 'A'.
I tried to copy column 'B' into 'C' and then use df.replace()
as shown below, but it doesn't work :
dfA = pd.DataFrame({
'A': ['abc','ghi','mno', 'stu'],
'B': ['abcdef', 'jklghi', 'mnopqr', 'vwxstu']
})
dfA.loc[:,'C'] = dfA['B']
dfA['C'].replace(dfA['B'], '', regex=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_16112\1271772080.py in <cell line: 7>()
5 dfA.loc[:,'C'] = dfA['B']
6
----> 7 dfA['C'].replace(dfA['B'], '', regex=True)
~\Anaconda3\envs\py310\lib\site-packages\pandas\core\series.py in replace(self, to_replace, value, inplace, limit, regex, method)
4958 method: str | lib.NoDefault = lib.no_default,
4959 ):
-> 4960 return super().replace(
4961 to_replace=to_replace,
4962 value=value,
~\Anaconda3\envs\py310\lib\site-packages\pandas\core\generic.py in replace(self, to_replace, value, inplace, limit, regex, method)
6677 # Operate column-wise
6678 if self.ndim == 1:
-> 6679 raise ValueError(
6680 "Series.replace cannot use dict-like to_replace "
6681 "and non-None value"
ValueError: Series.replace cannot use dict-like to_replace and non-None value
Also, the strings in 'A' must be the pre/sufix of the 'B' column, so the 'C' column will be the su/prefix of 'B' strings. So, 'B' = 'A' 'C' | 'C' 'A'
, I also tried to use -
as a "decatenation" operator, but it doesn't work.
Do you have any idea of how I should do this instead ?
CodePudding user response:
You need to loop here.
You can use re.sub
:
import re
dfA['C'] = [re.sub(a, '', b) for a,b in zip(dfA['A'], dfA['B'])]
or str.replace
:
dfA['C'] = [b.replace(a, '') for a,b in zip(dfA['A'], dfA['B'])]
output:
A B C
0 abc abcdef def
1 ghi jklghi jkl
2 mno mnopqr pqr
3 stu vwxstu vwx