drop rows from a pandas dataframe based on list of points in another dataframe


I have two dataframes:

import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})
>>> data.head(15)
      A    B   C
0   1.0  0.0   0
1   1.0  0.0   1
2   1.0  0.0   2
3   1.0  0.0   3
4   1.0  0.0   4
5   1.0  0.0   5
6   1.0  0.0   6
7   1.0  0.0   7
8   1.0  0.0   8
9   1.0  0.0   9
10  1.0  1.0  10
11  1.0  1.0  11
12  1.0  1.0  12
13  1.0  1.0  13
14  1.0  1.0  14
>>> bad_data
     A    B  points
0  1.0  0.0  [0, 1]
1  2.0  3.0     [0]
2  7.0  0.0     [1]
3  9.0  2.0  [0, 1]

For each row of bad_data, I want to drop the rows of data that have the same A and B, and whose position within that (A, B) group appears in points. For example, the first row of bad_data tells me that I need to drop the first two rows of data:

      A    B   C
0   1.0  0.0   0
1   1.0  0.0   1

How can I do that? I was able to cook up this horror, but it's quite ugly to read. Can you help me write a more Pythonic/readable solution?

rows_to_remove = []
for A, B in zip(bad_data['A'], bad_data['B']):
    rows_in_data = (data['A'] == A) & (data['B'] == B)
    rows_in_bad_data = (bad_data['A'] == A) & (bad_data['B'] == B)
    bad_points = bad_data.loc[rows_in_bad_data, 'points'].values[0]
    indices = data[rows_in_data].index.values[bad_points]
    rows_to_remove.extend(indices)
    print(rows_to_remove)
data.drop(data.index[rows_to_remove], inplace=True)
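For what it's worth, the loop above does produce the right answer; here it is as a self-contained script (the expected labels in the comments were worked out by hand from the group layout: each (A, B) combination covers ten consecutive rows):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

rows_to_remove = []
for A, B in zip(bad_data['A'], bad_data['B']):
    # boolean mask of the (A, B) group in data, and the matching bad_data row
    rows_in_data = (data['A'] == A) & (data['B'] == B)
    rows_in_bad_data = (bad_data['A'] == A) & (bad_data['B'] == B)
    bad_points = bad_data.loc[rows_in_bad_data, 'points'].values[0]
    # translate per-group positions to index labels
    rows_to_remove.extend(data[rows_in_data].index.values[bad_points])

data.drop(data.index[rows_to_remove], inplace=True)

removed = sorted(int(i) for i in rows_to_remove)
print(removed)     # [0, 1, 80, 301, 420, 421]
print(len(data))   # 494
```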

CodePudding user response:

IIUC, you could perform a reverse merge on the exploded bad_data:

(data
 .assign(points=data.groupby(['A', 'B']).cumcount())  # position within each (A, B) group (= points)
 .merge(bad_data.explode('points').astype({'points': int}),  # explode leaves object dtype; cast to match
        on=['A', 'B', 'points'], how='outer', indicator=True)
 .loc[lambda d: d['_merge'].eq('left_only')]  # keep the rows unique to the left
 .drop(columns=['points', '_merge'])          # remove helper columns
)
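A quick self-contained check of the merge approach (the astype on the exploded points column matters: explode leaves it as object dtype, and recent pandas refuses to merge an object column against cumcount's int64):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

out = (data
       .assign(points=data.groupby(['A', 'B']).cumcount())
       .merge(bad_data.explode('points').astype({'points': int}),
              on=['A', 'B', 'points'], how='outer', indicator=True)
       .loc[lambda d: d['_merge'].eq('left_only')]
       .drop(columns=['points', '_merge']))

print(out.shape)  # (494, 3)
# C equals the original row position, so this recovers the dropped rows
print(sorted(set(range(500)) - set(out['C'].astype(int))))  # [0, 1, 80, 301, 420, 421]
```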

Another option is to use GroupBy.apply:

# craft a Series of lists of points indexed by A/B
s = bad_data.set_index(['A', 'B'])['points']

# group by A/B
(data.groupby(['A', 'B'], as_index=False, group_keys=False)
     # translate the positions in "points" to index labels and drop them
     # if the group key is present in s, else leave the group unchanged
     .apply(lambda g: g.drop(g.index[s.loc[g.name]]) if g.name in s else g)
)

Both options yield the same rows as your custom code (note that the merge version resets the index).
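The groupby/apply option can be checked the same way; asserting only on the surviving index labels keeps the check stable across pandas versions (with group_keys=False the original index is preserved):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

# Series of lists of positions, keyed by the (A, B) tuple
s = bad_data.set_index(['A', 'B'])['points']

out = (data.groupby(['A', 'B'], group_keys=False)
           .apply(lambda g: g.drop(g.index[s.loc[g.name]]) if g.name in s else g))

print(len(out))  # 494
print(sorted(int(i) for i in set(data.index) - set(out.index)))  # [0, 1, 80, 301, 420, 421]
```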

CodePudding user response:

Not 100% sure if I understood this correctly or whether my attempt is the most elegant way, so let me know if this works for you:

bad_indexes = []
labels = ['A', 'B']

for _, s in bad_data.iterrows():
    # rows of data belonging to this (A, B) combination
    group = data[data[labels].eq(s[labels]).all(axis=1)]
    # translate the per-group positions in "points" to index labels
    bad_indexes.extend(group.index[s['points']])

result = data.loc[data.index.difference(bad_indexes)]

I assumed that the index of data has unique values.
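Checked end to end on the example data, reading each entry of points as a position within its (A, B) group (the same interpretation as the question's loop):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})

bad_indexes = []
for _, s in bad_data.iterrows():
    # rows of data whose A and B match this bad_data row
    group = data[data[['A', 'B']].eq(s[['A', 'B']]).all(axis=1)]
    # positional lookup within the group, yielding index labels
    bad_indexes.extend(group.index[s['points']])

result = data.loc[data.index.difference(bad_indexes)]
print(len(result))                          # 494
print(sorted(int(i) for i in bad_indexes))  # [0, 1, 80, 301, 420, 421]
```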
