I have a dataframe (df) that I want to manipulate as follows
date,string
2002-01-01,ABAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,BBAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,BABA
2002-01-02,ABBB
2002-01-02,DCDC
2002-01-02,AABB
- Check for all cases where the string contains 'D' and 'C' in whatever order and change to 'DDDD'
- Also check for all occurrences where the string contains 'A' and 'B' and change to 'AAAA'
As a simple example, the below code will produce the expected output below.
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.read_csv('df.csv')
df["string"].mask(df["string"] == 'DCDC', 'DDDD', inplace=True)
df["string"].mask(df["string"] == 'ABAA', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'BBAA', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'BABA', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'ABBB', 'AAAA', inplace=True)
df["string"].mask(df["string"] == 'AABB', 'AAAA', inplace=True)
print(df)
Expected Output:
date,string
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-01,CCCC
2002-01-01,AAAA
2002-01-01,AAAA
2002-01-02,BBBB
2002-01-02,AAAA
2002-01-02,AAAA
2002-01-02,DDDD
2002-01-02,AAAA
However, the above is too hard coded. I am thinking, I will first something like below to first extract all instances that need to be replaced:
letters_dc = ['D','C']
letters_ab = ['A','B']
contains_dc = [df['symbol'].str.contains(i) for i in letters_dc]
contains_ab = [df['symbol'].str.contains(i) for i in letters_ab]
resul = df[np.all(contains_dc, axis=0) | np.all(contains_ab, axis=0)]
How can I proceed from here or is there a better way to approach this.
CodePudding user response:
You could use this:
ab = df['string'].str.match(r'^[AB] $')
cd = df['string'].str.match(r'^[CD] $')
newdf = df.assign(string=df['string'].where(~ab, 'AAAA').where(~cd, 'DDDD'))
>>> newdf
date string
0 2002-01-01 AAAA
1 2002-01-01 AAAA
2 2002-01-01 DDDD
3 2002-01-01 AAAA
4 2002-01-01 AAAA
5 2002-01-02 AAAA
6 2002-01-02 AAAA
7 2002-01-02 AAAA
8 2002-01-02 DDDD
9 2002-01-02 AAAA
Any string
that matches (fully) any combination of 'C' and 'D' is replaced by 'DDDD'
. Similarly, any 'AB'
combination becomes 'AAAA'
. All other values are left unmodified.
CodePudding user response:
You can use numpy.logical_and.reduce
:
import numpy as np
letters = [['D','C'], ['A','B']]
for l in letters:
df['string'] = (df['string']
.mask(np.logical_and.reduce(
[df['string'].str.contains(x)
for x in l]), l[0]*4)
)
output:
date string
0 2002-01-01 AAAA
1 2002-01-01 AAAA
2 2002-01-01 CCCC
3 2002-01-01 AAAA
4 2002-01-01 AAAA
5 2002-01-02 BBBB
6 2002-01-02 AAAA
7 2002-01-02 AAAA
8 2002-01-02 DDDD
9 2002-01-02 AAAA