Python Pandas - change column within group if condition is met


I have a dataframe df with columns 'Batch' and 'Status'. 'Status' has the possible values 'keep' or 'remove'. I want to make a new column 'Status updated', which is 'keep' only if all elements of 'Status' for the same Batch number are 'keep', and 'remove' otherwise.

    Batch  Status
0       1    keep
1       1    keep
2       1  remove
3       2    keep
4       2    keep
5       2    keep
6       3  remove
7       3  remove
8       3    keep
9       4  remove
10      4  remove
11      4  remove
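
For reference, the sample frame above can be rebuilt with:

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'Batch':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'Status': ['keep', 'keep', 'remove',
               'keep', 'keep', 'keep',
               'remove', 'remove', 'keep',
               'remove', 'remove', 'remove'],
})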

Desired output looks like:

Batch  Status Status updated
0       1    keep         remove
1       1    keep         remove
2       1  remove         remove
3       2    keep           keep
4       2    keep           keep
5       2    keep           keep
6       3  remove         remove
7       3  remove         remove
8       3    keep         remove
9       4  remove         remove
10      4  remove         remove
11      4  remove         remove

CodePudding user response:

You can simply use:

df['updated'] = df.groupby('Batch')['Status'].transform('max')

This takes advantage of the fact that "remove" sorts lexicographically after "keep", so the per-batch max is "remove" as soon as any row in the batch is "remove". If the labels sorted the other way around, you would use min instead.

output:

    Batch  Status updated
0       1    keep  remove
1       1    keep  remove
2       1  remove  remove
3       2    keep    keep
4       2    keep    keep
5       2    keep    keep
6       3  remove  remove
7       3  remove  remove
8       3    keep  remove
9       4  remove  remove
10      4  remove  remove
11      4  remove  remove
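
If you would rather not rely on how the labels happen to sort, an equivalent boolean-based variant (a sketch, not part of the original answer) works regardless of the string ordering:

import numpy as np

# True for every row whose whole batch is 'keep'
all_keep = df['Status'].eq('keep').groupby(df['Batch']).transform('all')
df['Status updated'] = np.where(all_keep, 'keep', 'remove')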

CodePudding user response:

Solution that tests whether 'remove' exists among the values: collect the Batch values that contain 'remove' and build the new column with numpy.where:

import numpy as np

# True for rows whose Batch contains at least one 'remove'
mask = df['Batch'].isin(df.loc[df['Status'].eq('remove'), 'Batch'])
df['Status updated'] = np.where(mask, 'remove', 'keep')

print (df)
    Batch  Status Status updated
0       1    keep         remove
1       1    keep         remove
2       1  remove         remove
3       2    keep           keep
4       2    keep           keep
5       2    keep           keep
6       3  remove         remove
7       3  remove         remove
8       3    keep         remove
9       4  remove         remove
10      4  remove         remove
11      4  remove         remove

Performance: I think the bottleneck is groupby, and possibly max on a non-numeric column.

import pandas as pd

np.random.seed(2022)

N = 1000
df = pd.DataFrame({'Batch':np.random.randint(N // 10, size=N),
                   'Status':np.random.choice(['remove','keep'], size=N)})
df = df.sort_values('Batch', ignore_index=True)
print (df)


In [91]: %timeit df['updated'] = df.groupby('Batch')['Status'].transform('max')
11.5 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [92]: %timeit df['Status updated'] = np.where(df['Batch'].isin(df.loc[df['Status'].eq('remove'), 'Batch']),  'remove','keep')
798 µs ± 7.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.random.seed(2022)

N = 100000
df = pd.DataFrame({'Batch':np.random.randint(N // 100, size=N),
                   'Status':np.random.choice(['remove','keep'], size=N)})
df = df.sort_values('Batch', ignore_index=True)
print (df)



In [95]: %timeit df['updated'] = df.groupby('Batch')['Status'].transform('max')
127 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [96]: %timeit df['Status updated'] = np.where(df['Batch'].isin(df.loc[df['Status'].eq('remove'), 'Batch']),  'remove','keep')
27.4 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
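
As a quick sanity check (not part of the original timings), the two computed columns can be compared after running the snippets above:

# both approaches should assign the same label to every row
print(df['updated'].eq(df['Status updated']).all())   # expected: True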