I have two dataframes, `x` and `y`. DF `x` contains two grouping variables, `S` and `A`, and a value variable `V`. From this DF I want to delete rows for the group pairs (S, A) defined in `y`. DF `y` contains the variable `S` and a variable `A_D`, which together define which (S, A) pairs need to be deleted from `x`.
However, each element in `y['A_D']` may contain a string of several elements from `A`, separated by commas. Each of these separate elements should be deleted from `x` (for the particular element in `S`). Moreover, if `y['A_D']` contains the element `all` for a particular element in `S`, the entire S-group should be deleted from `x`.
I found a solution which does what I want, but my question is: is there a simpler or more Pythonic way to solve this problem?
import pandas as pd
import numpy as np

# Define x.
x = pd.DataFrame({'S': np.repeat(['s1', 's2', 's3'], 5),
                  'A': [j for i in range(3) for j in ['a', 'b', 'c', 'd', 'e']],
                  'V': np.random.uniform(size=15)})

# Define y: which (S, A) pairs should be deleted from x. For 's1' all rows should be
# deleted, for 's2' the 'a' and 'd' rows, and for 's3' the 'c' row.
y = pd.DataFrame({'S': ['s1', 's2', 's3'],
                  'A_D': ['all', 'a, d', 'c']})

# My solution:
# Expand y into a new DF z: comma-separated elements in 'A_D' become separate rows,
# with whitespace stripped.
z = []
for i, r in y.iterrows():
    z.append(pd.DataFrame({'S': r['S'],
                           'A_D': [u.strip() for u in str(r['A_D']).split(',')]}))
z = pd.concat(z)

# First delete the S-groups defined by `all`.
x_d = x.merge(z[z['A_D'] == 'all'], how='left')
x_d = x_d[x_d['A_D'] != 'all'].drop(columns='A_D')

# Then drop the individual (S, A) pairs.
x_d = x_d.merge(z[z['A_D'] != 'all'], how='left', left_on=['S', 'A'], right_on=['S', 'A_D'])
x_d = x_d[pd.isna(x_d['A_D'])].drop(columns='A_D').reset_index(drop=True)

# The required result:
print(x_d)
For clarity, the objects look like this:
x
Out[1]:
S A V
0 s1 a 0.758516
1 s1 b 0.522200
2 s1 c 0.190511
3 s1 d 0.544617
4 s1 e 0.480378
5 s2 a 0.191016
6 s2 b 0.714625
7 s2 c 0.852788
8 s2 d 0.142410
9 s2 e 0.909382
10 s3 a 0.895031
11 s3 b 0.153444
12 s3 c 0.751675
13 s3 d 0.227501
14 s3 e 0.586467
y
Out[2]:
S A_D
0 s1 all
1 s2 a, d
2 s3 c
z
Out[3]:
S A_D
0 s1 all
0 s2 a
1 s2 d
0 s3 c
x_d
Out[4]:
S A V
0 s2 b 0.714625
1 s2 c 0.852788
2 s2 e 0.909382
3 s3 a 0.895031
4 s3 b 0.153444
5 s3 d 0.227501
6 s3 e 0.586467
CodePudding user response:
x
###
S A V
0 s1 a 0.490194
1 s1 b 0.875381
2 s1 c 0.384808
3 s1 d 0.063960
4 s1 e 0.003159
5 s2 a 0.188624
6 s2 b 0.400527
7 s2 c 0.137458
8 s2 d 0.162291
9 s2 e 0.337899
10 s3 a 0.101296
11 s3 b 0.464031
12 s3 c 0.407629
13 s3 d 0.222498
14 s3 e 0.802472
The values can be separated with whatever spacing: `', '`, `' ,'`, or `','`.
y
###
S A_D
0 s1 all
1 s2 a, d ,c
2 s3 c,d
# Replace 'all' with a comma-separated list of every value of A.
y['A_D'] = y['A_D'].replace('all', ', '.join(x['A'].unique()))
# Split on commas and explode into one row per (S, A_D) pair, then strip whitespace.
y = y.assign(A_D=y['A_D'].str.split(',')).explode('A_D')
y['A_D'] = y['A_D'].str.strip()
# Anti-join: keep only the (S, A) pairs of x that do not appear in y.
output = x[~x.set_index(['S','A']).index.isin(y.set_index(['S','A_D']).index)].reset_index(drop=True)
output
###
S A V
0 s2 b 0.400527
1 s2 e 0.337899
2 s3 a 0.101296
3 s3 b 0.464031
4 s3 e 0.802472
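For comparison, the same anti-join can also be written with `merge` and `indicator=True` instead of the `MultiIndex.isin` check. A minimal sketch, assuming `y` has already been expanded and stripped as above:

# Left-merge x against the expanded y; rows of x with no match are tagged 'left_only'.
merged = x.merge(y, left_on=['S', 'A'], right_on=['S', 'A_D'],
                 how='left', indicator=True)
output = (merged[merged['_merge'] == 'left_only']
          .drop(columns=['A_D', '_merge'])
          .reset_index(drop=True))

Both variants keep exactly the rows of `x` whose (S, A) pair has no counterpart in the expanded `y`.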
CodePudding user response:
Here's my solution; it's at least shorter :)
def filter_group(group, filter_rule):
    # Returning None drops the whole group; otherwise keep the rows of the
    # group whose A is not listed in the comma-separated filter rule.
    return (None if filter_rule == 'all'
            else group[~group["A"].isin(filter_rule.replace(' ', '').split(','))])

x_d = pd.concat(filter_group(x.groupby('S').get_group(grp), filter_rule)
                for grp, filter_rule in dict(zip(y["S"], y["A_D"])).items())
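Note that this relies on `pd.concat` silently dropping the `None` entries returned for the 'all' groups (it only raises if every entry is `None`). The result also keeps the original row index of `x`, so you may want to finish with `x_d.reset_index(drop=True)`.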