Get min and max values of categorical variable in a dataframe


I have a dataframe that looks like this:

D   X   Y   Z
A  22  16  23
A  21  16  22
A  20  17  21
B  33  50  11
B  34  53  12
B  34  55  13
C  44  34  11
C  45  33  11
C  45  33  10
D  55  35  60
D  57  34  61
E  66  36  13
E  67  38  14 
E  67  37  13

I want to get the minimum and maximum of each numeric column for every level of the categorical variable D, so the output dataframe should look something like this:

D Xmin Xmax Ymin Ymax Zmin Zmax
A 20    22   16   17   21   23 
B 33    34   50   55   11   13    
C 44    45   33   34   10   11  
D 55    57   34   35   60   61
E 66    67   36   38   13   14

I have tried this, but no luck:

min_max_df = dfObj.groupby('D').agg({'X': [dfObj.min(axis=0), dfObj.max(axis=0)]})

CodePudding user response:

from itertools import product

# Build one named aggregation per column/function pair, e.g. Xmin=('X', 'min').
aggs = {f"{col}{fn}": (col, fn) for col, fn in product(['X', 'Y', 'Z'], ['min', 'max'])}

>>> df.groupby('D').agg(**aggs)
   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
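
For readers less familiar with named aggregation, the dict built with product above simply expands into keyword arguments. A hand-written equivalent (my sketch, not part of the original answer):

# Each keyword names an output column, built from an (input column, function) pair.
df.groupby('D').agg(
    Xmin=('X', 'min'), Xmax=('X', 'max'),
    Ymin=('Y', 'min'), Ymax=('Y', 'max'),
    Zmin=('Z', 'min'), Zmax=('Z', 'max'),
)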

CodePudding user response:

I believe this is a nice way of doing it in a single line of code. It makes use of join, performing the operation by index, with lsuffix and rsuffix to differentiate min and max. Note that join places all of the left frame's columns before the right frame's, so the minima come first, then the maxima.

output = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

Outputs:

   Xmin  Ymin  Zmin  Xmax  Ymax  Zmax
D
A    20    16    21    22    17    23
B    33    50    11    34    55    13
C    44    33    10    45    34    11
D    55    34    60    57    35    61
E    66    36    13    67    38    14
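
If you prefer the interleaved Xmin/Xmax ordering from the question, the columns can be reordered afterwards. A minimal sketch building on the output above (the cols list is my addition, not part of the original answer):

cols = [f"{c}{s}" for c in ['X', 'Y', 'Z'] for s in ('min', 'max')]  # Xmin, Xmax, Ymin, ...
output = output[cols]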

CodePudding user response:

df = df.groupby('D').agg(['min', 'max'])

Output:

>>> df
    X       Y       Z    
  min max min max min max
D                        
A  20  22  16  17  21  23
B  33  34  50  55  11  13
C  44  45  33  34  10  11
D  55  57  34  35  60  61
E  66  67  36  38  13  14

>>> df['X']['min']
D
A    20
B    33
C    44
D    55
E    66
Name: min, dtype: int64
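
As a small aside (my note, not from the original answer), the same Series can be selected in one step with a tuple key, avoiding chained indexing on the MultiIndex columns:

>>> df[('X', 'min')]  # single lookup, equivalent to df['X']['min']
D
A    20
B    33
C    44
D    55
E    66
Name: (X, min), dtype: int64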

You can flatten the columns as well:

df.columns = df.columns.map(''.join)  # join the two levels, e.g. ('X', 'min') -> 'Xmin'
df = df.rename_axis(None)             # rename_axis returns a new frame, so assign it back

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14
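
And if you want D back as an ordinary column, matching the asker's desired output, reset_index does that. A short sketch assuming a fresh groupby result:

out = df.groupby('D').agg(['min', 'max'])
out.columns = out.columns.map(''.join)  # flatten ('X', 'min') -> 'Xmin'
out = out.reset_index()                 # D becomes a regular column again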

CodePudding user response:

df = df.groupby('D').agg(['min', 'max'])
df.columns = df.columns.map("".join)  # flatten multi-level columns

Output

>>> df

   Xmin  Xmax  Ymin  Ymax  Zmin  Zmax
D                                    
A    20    22    16    17    21    23
B    33    34    50    55    11    13
C    44    45    33    34    10    11
D    55    57    34    35    60    61
E    66    67    36    38    13    14

Speed Test

I ran some simple timing tests comparing Celius's, Ashish's, and user17242583's/my solutions.

Setup

import numpy as np 
import pandas as pd 
from itertools import product

n = 1_000_000
rng = np.random.default_rng() 

df = pd.DataFrame({
    'D': rng.choice(list("ABCDEFGH"), size=n),
    'X': rng.integers(100, size=n),
    'Y': rng.integers(100, size=n),
    'Z': rng.integers(100, size=n),
})

>>> df.shape
(1000000, 4)

Results

Here are the results.

# Ashish's 
>>> %%timeit -n 50
... aggs = {f"{col}{fn}": (col, fn) for col,fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
... df1 = df.groupby('D').agg(**aggs) 

116 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

# user17242583's / mine 
>>> %%timeit -n 50
... df1 = df.groupby('D').agg(['min', 'max'])
... df1.columns = df1.columns.map("".join)  # flatten multi-level columns

120 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

# Celius's
>>> %%timeit -n 50
... df1 = df.groupby('D').min().join(df.groupby('D').max(),lsuffix='min',rsuffix='max')

178 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

Comparing the first two with the last one, we can conclude that DataFrameGroupBy.agg is slightly faster than the double groupby-and-join approach for large DataFrames.
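
For anyone reproducing these numbers outside IPython, where %%timeit is unavailable, the standard timeit module gives a rough equivalent. A minimal sketch using the df from the setup above (timings will vary by machine):

import timeit

def agg_multi():
    out = df.groupby('D').agg(['min', 'max'])
    out.columns = out.columns.map(''.join)
    return out

def join_minmax():
    return df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')

for name, fn in [('agg', agg_multi), ('join', join_minmax)]:
    secs = timeit.timeit(fn, number=10)            # 10 runs each
    print(f"{name}: {secs / 10 * 1000:.1f} ms per loop")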
