I'm struggling to create a new variable in Python based on a condition combined with a groupby. I know there is a new datatable package for Python, but I would like to explore the options below.
In R, I did the following:
require(data.table)
id <- c(1,2,3,4,1,5)
units <- c(34, 31, 16, 32, NA, 35)
unitsnew <- c(23, 11, 21, 22, 27, 11)
df = data.table(id, units, unitsnew)
df[, col:= ifelse(is.na(units), min(unitsnew, na.rm = T), units), by = id]
id units unitsnew col
 1    34       23  34
 2    31       11  31
 3    16       21  16
 4    32       22  32
 1    NA       27  23
 5    35       11  35
In Python,
import pandas as pd
import numpy as np
matrix = [(1, 34, 23),
(2, 31, 11),
(3, 16, 21),
(4, 32, 22),
(1, np.nan, 27),
(5, 35, 11)]
df = pd.DataFrame(matrix, columns = ['id', 'units', 'unitsnew'])
df.assign(col = np.where(np.isnan(df.units), np.nanmin(df.unitsnew), df.units))
id units unitsnew col
 1    34       23  34
 2    31       11  31
 3    16       21  16
 4    32       22  32
 1   NaN       27  11
 5    35       11  35
I'm not sure where to fit the groupby statement so that the minimum value per id is taken. The desired output is the one generated by the R code. In Python, the NaN value is filled with 11 because I failed to add the groupby statement. How can I solve this? Thanks,
CodePudding user response:
First, let's look at how to get the minimum value per group using groupby followed by transform('min'):
df.groupby('id')['unitsnew'].transform('min')
output:
0 23
1 11
2 21
3 22
4 23
5 11
Name: unitsnew, dtype: int64
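As background, transform returns a Series aligned with the original rows (one value per row), while a plain groupby aggregation returns one value per group; that alignment is what lets us combine the result with df['units'] in the next step. A minimal, self-contained sketch of the difference, re-creating the question's DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 3, 4, 1, 5],
                   'units': [34, 31, 16, 32, np.nan, 35],
                   'unitsnew': [23, 11, 21, 22, 27, 11]})

# one value per group, indexed by the group key
per_group = df.groupby('id')['unitsnew'].min()

# one value per row, index aligned with df
per_row = df.groupby('id')['unitsnew'].transform('min')

print(per_group)
print(per_row)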
Now we can use where to assign the above group minimum wherever df['units'].isna():
df['col'] = df['units'].where(df['units'].notna(),
df.groupby('id')['unitsnew'].transform('min'))
output:
id units unitsnew col
0 1 34.0 23 34.0
1 2 31.0 11 31.0
2 3 16.0 21 16.0
3 4 32.0 22 32.0
4 1 NaN 27 23.0
5 5 35.0 11 35.0
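For completeness, two equivalent ways to get the same column, continuing from the df and imports defined in the question; these are sketches of alternatives rather than part of the approach above:
group_min = df.groupby('id')['unitsnew'].transform('min')

# the question's np.where pattern, with the global np.nanmin swapped
# for the per-id group minimum
df['col'] = np.where(df['units'].isna(), group_min, df['units'])

# or fillna, which only replaces the NaN positions
df['col'] = df['units'].fillna(group_min)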