I have two dataframes, `x` and `y`. DF `x` contains two grouping variables, `S` and `A`, and a value variable `V`. From this DF I want to delete rows for the group pairs (S, A) defined in `y`. DF `y` contains the variable `S` and a variable `A_D`, which together define which (S, A) pairs need to be deleted from `x`.
However, each element in `y['A_D']` may contain a string of several elements from `A`, separated by commas. Each of these separate elements should be deleted from `x` (for the particular element in `S`). Moreover, if `y['A_D']` contains the element `all` for a particular element in `S`, the entire S-group should be deleted from `x`.
I found a solution which does what I want, but my question is: is there a simpler or more Pythonic way to solve this problem?
import pandas as pd
import numpy as np

# Define x.
x = pd.DataFrame({'S': np.repeat(['s1', 's2', 's3'], 5),
                  'A': [j for i in range(3) for j in ['a', 'b', 'c', 'd', 'e']],
                  'V': np.random.uniform(size=15)})

# Define y: which (S, A) pairs should be deleted from x. For 's1' all rows should be
# deleted, for 's2' the 'a' and 'd' rows, and for 's3' the 'c' row.
y = pd.DataFrame({'S': ['s1', 's2', 's3'],
                  'A_D': ['all', 'a, d', 'c']})

# My solution:
# Expand y into a new DF z: comma-separated elements in 'A_D' become separate rows,
# with whitespace stripped.
z = []
for i, r in y.iterrows():
    z.append(pd.DataFrame({'S': r['S'],
                           'A_D': [u.strip() for u in str(r['A_D']).split(',')]}))
z = pd.concat(z)

# First delete the S-groups defined by `all`.
x_d = x.merge(z[z['A_D'] == 'all'], how='left')
x_d = x_d[x_d['A_D'] != 'all'].drop(columns='A_D')

# Then drop the individual (S, A) pairs.
x_d = x_d.merge(z[z['A_D'] != 'all'], how='left', left_on=['S', 'A'], right_on=['S', 'A_D'])
x_d = x_d[pd.isna(x_d['A_D'])].drop(columns='A_D').reset_index(drop=True)

# The required result:
print(x_d)
For clarity, the objects look like this:
x
Out[1]:
S A V
0 s1 a 0.758516
1 s1 b 0.522200
2 s1 c 0.190511
3 s1 d 0.544617
4 s1 e 0.480378
5 s2 a 0.191016
6 s2 b 0.714625
7 s2 c 0.852788
8 s2 d 0.142410
9 s2 e 0.909382
10 s3 a 0.895031
11 s3 b 0.153444
12 s3 c 0.751675
13 s3 d 0.227501
14 s3 e 0.586467
y
Out[2]:
S A_D
0 s1 all
1 s2 a, d
2 s3 c
z
Out[3]:
S A_D
0 s1 all
0 s2 a
1 s2 d
0 s3 c
x_d
Out[4]:
S A V
0 s2 b 0.714625
1 s2 c 0.852788
2 s2 e 0.909382
3 s3 a 0.895031
4 s3 b 0.153444
5 s3 d 0.227501
6 s3 e 0.586467
CodePudding user response:
x
###
S A V
0 s1 a 0.490194
1 s1 b 0.875381
2 s1 c 0.384808
3 s1 d 0.063960
4 s1 e 0.003159
5 s2 a 0.188624
6 s2 b 0.400527
7 s2 c 0.137458
8 s2 d 0.162291
9 s2 e 0.337899
10 s3 a 0.101296
11 s3 b 0.464031
12 s3 c 0.407629
13 s3 d 0.222498
14 s3 e 0.802472
The values can be separated with whatever spacing: `', '`, `' ,'`, or `','`.
y
###
S A_D
0 s1 all
1 s2 a, d ,c
2 s3 c,d
# Replace 'all' with a comma-separated list of every value of A.
y['A_D'] = y['A_D'].replace('all', ', '.join(x['A'].unique()))
# Split on commas and explode into one row per (S, A_D) pair, then strip whitespace.
y = y.assign(A_D=y['A_D'].str.split(',')).explode('A_D')
y['A_D'] = y['A_D'].str.strip()
# Anti-join: keep only the (S, A) pairs of x that do not appear in y.
output = x[~x.set_index(['S','A']).index.isin(y.set_index(['S','A_D']).index)].reset_index(drop=True)
output
###
S A V
0 s2 b 0.400527
1 s2 e 0.337899
2 s3 a 0.101296
3 s3 b 0.464031
4 s3 e 0.802472
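For comparison, the same anti-join can also be written with `merge` and `indicator=True` instead of the `MultiIndex.isin` check. A minimal sketch, assuming `y` has already been expanded and stripped as above:

# Left-merge x against the expanded y; rows of x with no match are tagged 'left_only'.
merged = x.merge(y, left_on=['S', 'A'], right_on=['S', 'A_D'],
                 how='left', indicator=True)
output = (merged[merged['_merge'] == 'left_only']
          .drop(columns=['A_D', '_merge'])
          .reset_index(drop=True))

Both variants keep exactly the rows of `x` whose (S, A) pair has no counterpart in the expanded `y`.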
CodePudding user response:
Here's my solution; it's at least shorter :)
def filter_group(group, filter_rule):
    # Returning None drops the whole group; otherwise keep the rows of the
    # group whose A is not listed in the comma-separated filter rule.
    return (None if filter_rule == 'all'
            else group[~group["A"].isin(filter_rule.replace(' ', '').split(','))])

x_d = pd.concat(filter_group(x.groupby('S').get_group(grp), filter_rule)
                for grp, filter_rule in dict(zip(y["S"], y["A_D"])).items())
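Note that this relies on `pd.concat` silently dropping the `None` entries returned for the 'all' groups (it only raises if every entry is `None`). The result also keeps the original row index of `x`, so you may want to finish with `x_d.reset_index(drop=True)`.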