After groupby, create a Series from the smallest values in one column corresponding to a value in an

Time:02-12

I have some data like this:

import pandas as pd

df = pd.DataFrame({'x':[1,2,3,1,1,2,3,3,2],
                   'y':['n', 'n', 'p', 'p', 'n', 'n', 'n', 'p', 'n'],
                   'z':[52,75,77,68,92,32,62,70,34]})

I'd like to group it by x, check whether p appears in any row of each group, and then add a column to the original dataframe (or to the grouped one, and then somehow flatten it back out?). The new column should hold None if the group has no p's, or otherwise the smallest z value among that group's p rows.

So here it'd be:

   x  y   z  t
0  1  n  52 68
3  1  p  68 68
4  1  n  92 68

   x  y   z  t
1  2  n  75 None
5  2  n  32 None
8  2  n  34 None

   x  y   z  t
2  3  p  77 70
6  3  n  62 70
7  3  p  70 70

or flattened:

   x  y   z  t
0  1  n  52 68
3  1  p  68 68
4  1  n  92 68
1  2  n  75 None
5  2  n  32 None
8  2  n  34 None
2  3  p  77 70
6  3  n  62 70
7  3  p  70 70

So first we'd do

g = df.groupby('x')

But then I'm not sure how to proceed. I'm having a hard time wrapping my head around it and keep running into all sorts of pandas errors.

CodePudding user response:

One option is to filter only the rows in the DataFrame where y is p. Then use groupby min to get the minimal z value per group (of remaining rows). Then join back to the DataFrame on x. NaN will automatically be added for any missing values (groups which did not have any values equal to p).

df = df.join(
    df[df['y'].eq('p')].groupby('x')['z'].min().rename('t'),
    on='x'
)

   x  y   z     t
0  1  n  52  68.0
1  2  n  75   NaN
2  3  p  77  70.0
3  1  p  68  68.0
4  1  n  92  68.0
5  2  n  32   NaN
6  3  n  62  70.0
7  3  p  70  70.0
8  2  n  34   NaN

*rename is used here to give the Series the desired column name (t) before joining it back.
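To make the filter, groupby min, and join steps easier to follow, here is a minimal sketch that splits the one-liner into named intermediates. The names only_p, min_z and result are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 1, 1, 2, 3, 3, 2],
                   'y': ['n', 'n', 'p', 'p', 'n', 'n', 'n', 'p', 'n'],
                   'z': [52, 75, 77, 68, 92, 32, 62, 70, 34]})

# Step 1: keep only the rows where y is 'p'
only_p = df[df['y'].eq('p')]

# Step 2: smallest z per x among those rows.
# x=2 has no 'p' rows, so it does not appear in this Series at all.
min_z = only_p.groupby('x')['z'].min().rename('t')

# Step 3: look up each row's x in the index of min_z.
# Rows whose x is missing from min_z (here x=2) get NaN.
result = df.join(min_z, on='x')

print(min_z)
print(result)
```

Printing min_z shows why the NaN values appear: the join only has entries for the groups that contained at least one p.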


We can also sort by x with sort_values if the rows for each x value need to be grouped together:

df = df.sort_values('x', ignore_index=True).join(
    df[df['y'].eq('p')].groupby('x')['z'].min().rename('t'), 
    on='x'
)

   x  y   z     t
0  1  n  52  68.0
1  1  p  68  68.0
2  1  n  92  68.0
3  2  n  75   NaN
4  2  n  32   NaN
5  2  n  34   NaN
6  3  p  77  70.0
7  3  n  62  70.0
8  3  p  70  70.0

Depending on the size of the DataFrame it may be more efficient to select only the z column initially with loc:

df = df.sort_values('x', ignore_index=True).join(
    df.loc[df['y'].eq('p'), 'z'].groupby(df['x']).min().rename('t'),
    on='x'
)

   x  y   z     t
0  1  n  52  68.0
1  1  p  68  68.0
2  1  n  92  68.0
3  2  n  75   NaN
4  2  n  32   NaN
5  2  n  34   NaN
6  3  p  77  70.0
7  3  n  62  70.0
8  3  p  70  70.0
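The outputs above show t as floats (68.0, 70.0) because NaN forces the column to float64, whereas the question asked for whole numbers or None. One possible way to get closer to that, sketched here as an optional extra step rather than as part of the original answer, is to cast the column to pandas' nullable Int64 dtype, which displays missing entries as <NA>. The names min_z and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 1, 1, 2, 3, 3, 2],
                   'y': ['n', 'n', 'p', 'p', 'n', 'n', 'n', 'p', 'n'],
                   'z': [52, 75, 77, 68, 92, 32, 62, 70, 34]})

# Same approach as above: minimum z per x among the 'p' rows, joined back on x
min_z = df.loc[df['y'].eq('p'), 'z'].groupby(df['x']).min().rename('t')
out = df.sort_values('x', ignore_index=True).join(min_z, on='x')

# NaN made t a float64 column; the nullable integer dtype keeps
# whole numbers and marks the missing entries as <NA> instead
out['t'] = out['t'].astype('Int64')

print(out)
```

If a plain Python None is really required, the column would have to be stored with object dtype, which gives up most of the benefits of a numeric column.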

CodePudding user response:

@HenryEcker covered all the nice intuitive solutions. This one's just for fun.

The basic idea is to filter the rows where "y" is 'p' and, among these rows, find the minimum value of "z" for each "x". Then map it back to "x":

df['t'] = df['x'].map(df[df['y'].eq('p')].groupby('x')['z'].min())
df = df.sort_values(by='x')

An alternative method uses eq and where. The basic idea is to mask the "z" values in rows where "y" is not 'p', then groupby "x" and transform with the minimum "z":

df['t'] = df['z'].where(df['y'].eq('p')).groupby(df['x']).transform('min')

Output:

   x  y   z     t
0  1  n  52  68.0
3  1  p  68  68.0
4  1  n  92  68.0
1  2  n  75   NaN
5  2  n  32   NaN
8  2  n  34   NaN
2  3  p  77  70.0
6  3  n  62  70.0
7  3  p  70  70.0
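As a rough illustration of the where and transform idea, the sketch below prints the masked intermediate and checks that it leads to the same values as the map approach. The names masked, t_transform and t_map are illustrative, and numpy is imported only for the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 1, 1, 2, 3, 3, 2],
                   'y': ['n', 'n', 'p', 'p', 'n', 'n', 'n', 'p', 'n'],
                   'z': [52, 75, 77, 68, 92, 32, 62, 70, 34]})

# Keep z where y is 'p'; every other row becomes NaN
masked = df['z'].where(df['y'].eq('p'))

# min skips NaN, so each group gets its smallest 'p' value broadcast
# to all of its rows; a group that is entirely NaN (x=2) stays NaN
t_transform = masked.groupby(df['x']).transform('min')

# The map-based version from earlier in this answer, for comparison
t_map = df['x'].map(df[df['y'].eq('p')].groupby('x')['z'].min())

print(masked.tolist())
print(np.allclose(t_transform, t_map, equal_nan=True))
```

Because transform returns a result aligned to the original index, it can be assigned directly to df['t'] without a join or a map.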