I have a dataframe that looks like this:
D X Y Z
A 22 16 23
A 21 16 22
A 20 17 21
B 33 50 11
B 34 53 12
B 34 55 13
C 44 34 11
C 45 33 11
C 45 33 10
D 55 35 60
D 57 34 61
E 66 36 13
E 67 38 14
E 67 37 13
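For reproducibility, the frame can be constructed like this (a sketch of the data shown above, using the dfObj name from my attempt below):
import pandas as pd
dfObj = pd.DataFrame({
    'D': list('AAABBBCCCDDEEE'),
    'X': [22, 21, 20, 33, 34, 34, 44, 45, 45, 55, 57, 66, 67, 67],
    'Y': [16, 16, 17, 50, 53, 55, 34, 33, 33, 35, 34, 36, 38, 37],
    'Z': [23, 22, 21, 11, 12, 13, 11, 11, 10, 60, 61, 13, 14, 13],
})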
I want to get the minimum and maximum of each of the columns X, Y and Z for every level of the categorical variable D, so the output dataframe should look something like this:
D Xmin Xmax Ymin Ymax Zmin Zmax
A 20 22 16 17 21 23
B 33 34 50 55 11 13
C 44 45 33 34 10 11
D 55 57 34 35 60 61
E 66 67 36 38 13 14
I have tried this, but no luck:
min_max_df = dfObj.groupby('D').agg({'X': [dfObj.min(axis=0), dfObj.max(axis=0)]})
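The attempt fails because agg expects function names or callables per column, whereas dfObj.min(axis=0) is evaluated immediately and yields a Series. A minimal fix of the same dict-based call (a sketch; note it produces two-level columns like ('X', 'min') rather than the flat Xmin names) would be:
dfObj.groupby('D').agg({'X': ['min', 'max'], 'Y': ['min', 'max'], 'Z': ['min', 'max']})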
CodePudding user response:
from itertools import product
# Build named aggregations: Xmin=('X', 'min'), Xmax=('X', 'max'), ..., Zmax=('Z', 'max')
aggs = {f"{col}{fn}": (col, fn) for col, fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
df.groupby('D').agg(**aggs)
>>>
Xmin Xmax Ymin Ymax Zmin Zmax
D
A 20 22 16 17 21 23
B 33 34 50 55 11 13
C 44 45 33 34 10 11
D 55 57 34 35 60 61
E 66 67 36 38 13 14
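The aggs dict simply expands to ordinary pandas named aggregations; written out by hand, the call above is equivalent to:
# Explicit form of the generated named aggregations
df.groupby('D').agg(
    Xmin=('X', 'min'), Xmax=('X', 'max'),
    Ymin=('Y', 'min'), Ymax=('Y', 'max'),
    Zmin=('Z', 'min'), Zmax=('Z', 'max'),
)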
CodePudding user response:
I believe this is a nice way of doing it in a single line of code, making use of join, which aligns the two frames on their shared index, with lsuffix and rsuffix to differentiate the min and max columns:
output = df.groupby('D').min().join(df.groupby('D').max(), lsuffix='min', rsuffix='max')
Outputs (note the column order: join places all the min columns first, then all the max columns):
Xmin Ymin Zmin Xmax Ymax Zmax
D
A 20 16 21 22 17 23
B 33 50 11 34 55 13
C 44 33 10 45 34 11
D 55 34 60 57 35 61
E 66 36 13 67 38 14
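If you want the interleaved Xmin, Xmax, ... order from the question, one option (a sketch, assuming the output frame built above) is to reindex the columns:
# Reorder to interleave min/max per original column
output = output[[f"{col}{fn}" for col in ['X', 'Y', 'Z'] for fn in ['min', 'max']]]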
CodePudding user response:
df = df.groupby('D').agg(['min', 'max'])
Output:
>>> df
X Y Z
min max min max min max
D
A 20 22 16 17 21 23
B 33 34 50 55 11 13
C 44 45 33 34 10 11
D 55 57 34 35 60 61
E 66 67 36 38 13 14
>>> df['X']['min']
D
A 20
B 33
C 44
D 55
E 66
Name: min, dtype: int64
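As an aside, with MultiIndex columns the same Series can be selected in a single step by passing the full column tuple, which avoids chained indexing:
df[('X', 'min')]  # single-step equivalent of df['X']['min']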
You can flatten the columns as well:
df.columns = df.columns.map(''.join)  # join each ('X', 'min') tuple into 'Xmin'
df.rename_axis(None)                  # returns a copy with the index name 'D' removed
Xmin Xmax Ymin Ymax Zmin Zmax
A 20 22 16 17 21 23
B 33 34 50 55 11 13
C 44 45 33 34 10 11
D 55 57 34 35 60 61
E 66 67 36 38 13 14
CodePudding user response:
df = df.groupby('D').agg(['min', 'max'])
df.columns = df.columns.map("".join) # flatten multi-level columns
Output:
>>> df
Xmin Xmax Ymin Ymax Zmin Zmax
D
A 20 22 16 17 21 23
B 33 34 50 55 11 13
C 44 45 33 34 10 11
D 55 57 34 35 60 61
E 66 67 36 38 13 14
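As an alternative to the map("".join) flattening, you could use a visible separator; a sketch (iterating the pre-flattening MultiIndex yields its label tuples):
df = df.groupby('D').agg(['min', 'max'])
df.columns = ["_".join(tup) for tup in df.columns]  # -> 'X_min', 'X_max', ...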
Speed Test
I ran some simple timings comparing Celius's, Ashish's, and user17242583's / my solutions.
Setup
import numpy as np
import pandas as pd
from itertools import product
n = 1_000_000
rng = np.random.default_rng()
df = pd.DataFrame({
'D': rng.choice(list("ABCDEFGH"), size=n),
'X': rng.integers(100, size=n),
'Y': rng.integers(100, size=n),
'Z': rng.integers(100, size=n),
})
>>> df.shape
(1000000, 4)
Results
Here are the results.
# Ashish's
>>> %%timeit -n 50
... aggs = {f"{col}{fn}": (col, fn) for col,fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
... df1 = df.groupby('D').agg(**aggs)
116 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)
# user17242583's / mine
>>> %%timeit -n 50
... df1 = df.groupby('D').agg(['min', 'max'])
... df1.columns = df1.columns.map("".join) # flatten multi-level columns
120 ms ± 4.69 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)
# Celius's
>>> %%timeit -n 50
... df1 = df.groupby('D').min().join(df.groupby('D').max(),lsuffix='min',rsuffix='max')
178 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)
Comparing the first two with the last, we can conclude that a single DataFrameGroupBy.agg call is a little faster than two separate groupbys joined together for large DataFrames.
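As a sanity check (a sketch; the variable names are just for illustration), the two agg-based variants should produce identical frames:
from itertools import product
aggs = {f"{col}{fn}": (col, fn) for col, fn in product(['X', 'Y', 'Z'], ['min', 'max'])}
a = df.groupby('D').agg(**aggs)
b = df.groupby('D').agg(['min', 'max'])
b.columns = b.columns.map("".join)
pd.testing.assert_frame_equal(a, b)  # should pass: same labels, values, and dtypes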