Replace column value in pandas for all possible combinations of conditions-CodePudding

I have a dataframe (df) that I want to manipulate as follows

date,string
2002-01-01,ABAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,BBAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,BABA
2002-01-02,ABBB
2002-01-02,DCDC
2002-01-02,AABB

Check for all cases where the string contains 'D' and 'C' in whatever order and change to 'DDDD'
Also check for all occurrences where the string contains 'A' and 'B' and change to 'AAAA'

As a simple example, the below code will produce the expected output below.

import pandas as pd 
import numpy as np 
from datetime import datetime

df = pd.read_csv('df.csv')

df["string"].mask(df["string"] == 'DCDC', 'DDDD', inplace=True)
df["string"].mask(df["string"] == 'ABAA', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'BBAA', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'BABA', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'ABBB', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'AABB', 'AAAA', inplace=True)
print(df)

Expected Output:

date,string
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,AAAA
2002-01-02,AAAA
2002-01-02,DDDD
2002-01-02,AAAA

However, the above is too hard coded. I am thinking, I will first something like below to first extract all instances that need to be replaced:

letters_dc = ['D','C']
letters_ab = ['A','B']
contains_dc = [df['symbol'].str.contains(i) for i in letters_dc]
contains_ab = [df['symbol'].str.contains(i) for i in letters_ab]
resul = df[np.all(contains_dc, axis=0) | np.all(contains_ab, axis=0)]

How can I proceed from here or is there a better way to approach this.

CodePudding user response：

You could use this:

ab = df['string'].str.match(r'^[AB] $')
cd = df['string'].str.match(r'^[CD] $')

newdf = df.assign(string=df['string'].where(~ab, 'AAAA').where(~cd, 'DDDD'))

>>> newdf
         date string
0  2002-01-01   AAAA
1  2002-01-01   AAAA
2  2002-01-01   DDDD
3  2002-01-01   AAAA
4  2002-01-01   AAAA
5  2002-01-02   AAAA
6  2002-01-02   AAAA
7  2002-01-02   AAAA
8  2002-01-02   DDDD
9  2002-01-02   AAAA

Any string that matches (fully) any combination of 'C' and 'D' is replaced by 'DDDD'. Similarly, any 'AB' combination becomes 'AAAA'. All other values are left unmodified.

CodePudding user response：

You can use numpy.logical_and.reduce:

import numpy as np

letters = [['D','C'], ['A','B']]
for l in letters:
    df['string'] = (df['string']
                    .mask(np.logical_and.reduce(
                          [df['string'].str.contains(x)
                           for x in l]), l[0]*4)
                   )

output:

         date string
0  2002-01-01   AAAA
1  2002-01-01   AAAA
2  2002-01-01   CCCC
3  2002-01-01   AAAA
4  2002-01-01   AAAA
5  2002-01-02   BBBB
6  2002-01-02   AAAA
7  2002-01-02   AAAA
8  2002-01-02   DDDD
9  2002-01-02   AAAA