I have two dataframes:
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat(np.arange(1., 11.), 50),
                     "B": np.tile(np.repeat(np.arange(0., 5.), 10), 10),
                     "C": np.arange(500)})
bad_data = pd.DataFrame({"A": [1., 2., 7., 9.],
                         "B": [0., 3., 0., 2.],
                         "points": [[0, 1], [0], [1], [0, 1]]})
>>> data.head(15)
      A    B   C
0   1.0  0.0   0
1   1.0  0.0   1
2   1.0  0.0   2
3   1.0  0.0   3
4   1.0  0.0   4
5   1.0  0.0   5
6   1.0  0.0   6
7   1.0  0.0   7
8   1.0  0.0   8
9   1.0  0.0   9
10  1.0  1.0  10
11  1.0  1.0  11
12  1.0  1.0  12
13  1.0  1.0  13
14  1.0  1.0  14
>>> bad_data
     A    B  points
0  1.0  0.0  [0, 1]
1  2.0  3.0     [0]
2  7.0  0.0     [1]
3  9.0  2.0  [0, 1]
For each row of bad_data, I want to drop the rows of data that have the same A and B, at the positions (within that A/B group) given by points. For example, the first row of bad_data tells me that I need to drop the first two rows of data:
     A    B  C
0  1.0  0.0  0
1  1.0  0.0  1
How can I do that? I was able to cook up this horror, but it's quite ugly to read. Can you help me write a more Pythonic/readable solution?
rows_to_remove = []
for A, B in zip(bad_data['A'], bad_data['B']):
    rows_in_data = (data['A'] == A) & (data['B'] == B)
    rows_in_bad_data = (bad_data['A'] == A) & (bad_data['B'] == B)
    bad_points = bad_data.loc[rows_in_bad_data, 'points'].values[0]
    indices = data[rows_in_data].index.values[bad_points]
    rows_to_remove.extend(indices)
print(rows_to_remove)
data.drop(data.index[rows_to_remove], inplace=True)
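For reference, with the sample data above the print shows the six labels that end up in rows_to_remove (two from group (1, 0), one each from (2, 3) and (7, 0), and two from (9, 2)):

[0, 1, 80, 301, 420, 421]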
CodePudding user response:
IIUC, you could perform a reverse merge on the exploded bad_data:
(data
 .assign(points=data.groupby(['A', 'B']).cumcount())        # position within each (A, B) group (= points)
 .merge(bad_data.explode('points').astype({'points': int}), # cast object -> int so the merge keys share a dtype
        on=['A', 'B', 'points'], indicator=True, how='outer')
 .loc[lambda d: d['_merge'].eq('left_only')]                # keep the rows unique to the left
 .drop(columns=['points', '_merge'])                        # remove helper columns
)
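For intuition, bad_data.explode('points') gives each list element its own row (repeating the original index), so the (A, B, points) triples line up with the cumcount helper column:

>>> bad_data.explode('points')
     A    B points
0  1.0  0.0      0
0  1.0  0.0      1
1  2.0  3.0      0
2  7.0  0.0      1
3  9.0  2.0      0
3  9.0  2.0      1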
Another option is to use GroupBy.apply:

# craft a Series of lists of points, indexed by (A, B)
s = bad_data.set_index(['A', 'B'])['points']

# group by A/B
(data.groupby(['A', 'B'], as_index=False, group_keys=False)
     # if the group's (A, B) key is present in s, drop the rows at those
     # positions; else leave the group unchanged
     .apply(lambda g: g.drop(g.index[s.loc[g.name]]) if g.name in s else g)
)
This yields the same dataframe as your custom code.
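As a quick sanity check (a sketch: it assumes rows_to_remove from the question was built without the inplace drop, so data is still intact):

# the loop-based result, computed without mutating `data`
expected = data.drop(data.index[rows_to_remove])

result = (data.groupby(['A', 'B'], as_index=False, group_keys=False)
              .apply(lambda g: g.drop(g.index[s.loc[g.name]]) if g.name in s else g))

# same rows, same index, same dtypes
pd.testing.assert_frame_equal(result, expected)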
CodePudding user response:
Not 100% sure whether I understood this correctly or whether my attempt is the most elegant way, so let me know if this works for you:
bad_indexes = []
labels = ['A', 'B']
for _, s in bad_data.iterrows():
    # look up the rows of `data` whose index labels appear in `points`
    p = data.loc[s['points']]
    # keep only those whose A/B values match this bad_data row
    p = p[p[labels].eq(s[labels]).all(1)]
    bad_indexes.extend(p.index)
result = data.loc[data.index.difference(bad_indexes)]
I assumed that the index of data has unique values.
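Note that this reads points as labels of data's index (cross-checked against the A/B columns) rather than as positions within each A/B group. With the sample data that means only the first bad_data row survives the A/B check:

>>> bad_indexes
[0, 1]

so result keeps 498 rows, whereas the group-relative reading used above removes six.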