How to label groups conditionally?

Time:11-18

I'm new to pandas and would like to know how to do the following: Given specific conditions, I would like to mark the whole group with a specific label rather than just the rows that meet the conditions. For example, if I have a DataFrame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                   "process": ["pending", "finished", "finished", "finished",
                               "finished", "finished", "finished", "pending"],
                   "working_group": ["a", "a", "c", "d", "d", "f", "g", "g"],
                   "size": [2, 2, 1, 2, 2, 1, 2, 2]})

conditions = [(df['size'] >= 2) & (df['process'].isin(["pending"]))]

choices = ["not_done"]

df['state'] = np.select(conditions, choices, default="something_else")

df:

   id   process working_group  size           state
0   1   pending             a     2        not_done
1   2  finished             a     2  something_else
2   3  finished             c     1  something_else
3   4  finished             d     2  something_else
4   5  finished             d     2  something_else
5   6  finished             f     1  something_else
6   7  finished             g     2  something_else
7   8   pending             g     2        not_done

However, I would like the whole working_group to be marked as not_done when any individual task in it is pending. For instance, groups a and g should be marked as not_done:

   id   process working_group  size           state
0   1   pending             a     2        not_done
1   2  finished             a     2        not_done
2   3  finished             c     1  something_else
3   4  finished             d     2  something_else
4   5  finished             d     2  something_else
5   6  finished             f     1  something_else
6   7  finished             g     2        not_done
7   8   pending             g     2        not_done

CodePudding user response:

You can use:

condition = df['size'].ge(2) & df['process'].isin(["pending"])

df['state'] = np.where(condition.groupby(df['working_group']).transform('any'), 'not_done', 'something_else')

Or:

condition = df['size'].ge(2) & df['process'].isin(["pending"])

df['state'] = np.where(df['working_group'].isin(df.loc[condition, 'working_group']), 'not_done', 'something_else')

Output:

   id   process working_group  size           state
0   1   pending             a     2        not_done
1   2  finished             a     2        not_done
2   3  finished             c     1  something_else
3   4  finished             d     2  something_else
4   5  finished             d     2  something_else
5   6  finished             f     1  something_else
6   7  finished             g     2        not_done
7   8   pending             g     2        not_done
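
Both variants rely on the same idea: compute the condition per row, then spread it to every row of any group that contains at least one match. A minimal self-contained sketch using the question's data follows; the names group_flag, flagged_groups and state_isin are illustrative only and do not appear in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "process": ["pending", "finished", "finished", "finished",
                "finished", "finished", "finished", "pending"],
    "working_group": ["a", "a", "c", "d", "d", "f", "g", "g"],
    "size": [2, 2, 1, 2, 2, 1, 2, 2],
})

# Row-level condition: True only on the rows that are themselves pending.
condition = df["size"].ge(2) & df["process"].isin(["pending"])

# Variant 1: transform("any") returns one boolean per row, True for every
# row whose working_group contains at least one True in `condition`.
group_flag = condition.groupby(df["working_group"]).transform("any")
df["state"] = np.where(group_flag, "not_done", "something_else")

# Variant 2: collect the groups that contain a matching row, then test
# membership of each row's group in that collection.
flagged_groups = df.loc[condition, "working_group"]
state_isin = np.where(df["working_group"].isin(flagged_groups),
                      "not_done", "something_else")

print(df)
```

On this data both variants yield the same labels. The groupby version keeps everything as a boolean mask, while the isin version avoids a groupby altogether; which one reads better is largely a matter of taste.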

CodePudding user response:

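A simple alternative: after creating the 'state' column, backward fill and forward fill it within each group. Note that bfill/ffill only fill missing values, so this works only if the rows that do not meet the condition are left as missing rather than set to "something_else". Groups without any match stay NaN and can be filled afterwards, for example with fillna.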

df['state'] = df.groupby(['working_group'])['state'].transform(lambda x: x.bfill().ffill())

   id   process working_group  size     state
0   1   pending             a     2  not_done
1   2  finished             a     2  not_done
2   3  finished             c     1       NaN
3   4  finished             d     2       NaN
4   5  finished             d     2       NaN
5   6  finished             f     1       NaN
6   7  finished             g     2  not_done
7   8   pending             g     2  not_done
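
For completeness, here is a hedged end-to-end sketch of the fill approach. It assumes the row-level label is built with Series.where, so that non-matching rows become NaN, instead of the np.select call from the question; the names row_state and group_state are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "process": ["pending", "finished", "finished", "finished",
                "finished", "finished", "finished", "pending"],
    "working_group": ["a", "a", "c", "d", "d", "f", "g", "g"],
    "size": [2, 2, 1, 2, 2, 1, 2, 2],
})

condition = (df["size"] >= 2) & (df["process"] == "pending")

# Label only the matching rows; every other row becomes NaN (assumption:
# this replaces the np.select step so there is something to fill).
row_state = pd.Series("not_done", index=df.index).where(condition)

# Spread the label to the rest of each group in both directions.
group_state = row_state.groupby(df["working_group"]).transform(
    lambda s: s.bfill().ffill()
)

# Groups with no pending task are still NaN; give them the default label.
df["state"] = group_state.fillna("something_else")

print(df)
```

Depending on the pandas version, the fill step may emit a deprecation warning for all-NaN groups; the resulting labels are unaffected.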