I have a DataFrame df with columns 'Batch' and 'Status'. 'Status' can take the values 'keep' or 'remove'. I want to create a new column 'Status updated' that is 'keep' only if every row with the same Batch number has Status 'keep', and 'remove' otherwise.
    Batch  Status
0       1    keep
1       1    keep
2       1  remove
3       2    keep
4       2    keep
5       2    keep
6       3  remove
7       3  remove
8       3    keep
9       4  remove
10      4  remove
11      4  remove
Desired output looks like:
    Batch  Status Status updated
0       1    keep         remove
1       1    keep         remove
2       1  remove         remove
3       2    keep           keep
4       2    keep           keep
5       2    keep           keep
6       3  remove         remove
7       3  remove         remove
8       3    keep         remove
9       4  remove         remove
10      4  remove         remove
11      4  remove         remove
CodePudding user response:
You can simply use:

df['updated'] = df.groupby('Batch')['Status'].transform('max')

This takes advantage of the fact that 'remove' sorts lexicographically after 'keep'. If the ordering were reversed, you would use 'min' instead.
Output:

    Batch  Status updated
0       1    keep  remove
1       1    keep  remove
2       1  remove  remove
3       2    keep    keep
4       2    keep    keep
5       2    keep    keep
6       3  remove  remove
7       3  remove  remove
8       3    keep  remove
9       4  remove  remove
10      4  remove  remove
11      4  remove  remove
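As a side note, if you would rather not depend on the lexicographic ordering of the two labels, a sketch of an order-independent variant (not part of the original answer) groups a boolean mask and checks that every row in the batch is 'keep':

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "Batch": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "Status": ["keep", "keep", "remove", "keep", "keep", "keep",
               "remove", "remove", "keep", "remove", "remove", "remove"],
})

# A batch maps to 'keep' only when every row in its group is 'keep';
# this works regardless of how the label strings sort.
all_keep = df["Status"].eq("keep").groupby(df["Batch"]).transform("all")
df["Status updated"] = np.where(all_keep, "keep", "remove")
print(df)
```

This avoids any surprise if the labels are later renamed to strings that sort differently.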
CodePudding user response:
Solution: test whether 'remove' exists among the values, get the matching Batch values, and create the new column with numpy.where:

import numpy as np

mask = df['Batch'].isin(df.loc[df['Status'].eq('remove'), 'Batch'])
df['Status updated'] = np.where(mask, 'remove', 'keep')
print(df)
    Batch  Status Status updated
0       1    keep         remove
1       1    keep         remove
2       1  remove         remove
3       2    keep           keep
4       2    keep           keep
5       2    keep           keep
6       3  remove         remove
7       3  remove         remove
8       3    keep         remove
9       4  remove         remove
10      4  remove         remove
11      4  remove         remove
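For reference, here is the same isin/np.where approach run end to end; the frame construction below simply reproduces the question's example data:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "Batch": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "Status": ["keep", "keep", "remove", "keep", "keep", "keep",
               "remove", "remove", "keep", "remove", "remove", "remove"],
})

# Batches containing at least one 'remove'
bad = df.loc[df["Status"].eq("remove"), "Batch"]
# Rows in those batches get 'remove', everything else 'keep'
df["Status updated"] = np.where(df["Batch"].isin(bad), "remove", "keep")
print(df)
```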
Performance: I think the bottleneck is using groupby, and perhaps max on a non-numeric column.
np.random.seed(2022)
N = 1000
df = pd.DataFrame({'Batch': np.random.randint(N // 10, size=N),
                   'Status': np.random.choice(['remove', 'keep'], size=N)})
df = df.sort_values('Batch', ignore_index=True)
print(df)
In [91]: %timeit df['updated'] = df.groupby('Batch')['Status'].transform('max')
11.5 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [92]: %timeit df['Status updated'] = np.where(df['Batch'].isin(df.loc[df['Status'].eq('remove'), 'Batch']), 'remove','keep')
798 µs ± 7.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(2022)
N = 100000
df = pd.DataFrame({'Batch': np.random.randint(N // 100, size=N),
                   'Status': np.random.choice(['remove', 'keep'], size=N)})
df = df.sort_values('Batch', ignore_index=True)
print(df)
In [95]: %timeit df['updated'] = df.groupby('Batch')['Status'].transform('max')
127 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [96]: %timeit df['Status updated'] = np.where(df['Batch'].isin(df.loc[df['Status'].eq('remove'), 'Batch']), 'remove','keep')
27.4 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)