I have a dataset named customer_base containing over 800K rows like below:
ID | AGE | GENDER | OCCUPATION |
---|---|---|---|
1 | 64 | 101 | "occ1" |
2 | 64 | 100 | "occ2" |
2 | 66 | 100 | NaN |
2 | NaN | 100 | "occ2" |
3 | NaN | 101 | "occ3" |
3 | NaN | NaN | NaN |
3 | 32 | NaN | NaN |
. | . | . | . |
After a grouping operation, the desired version of it should look like below:
ID | AGE | GENDER | OCCUPATION |
---|---|---|---|
1 | 64 | 101 | "occ1" |
2 | 66 | 100 | "occ2" |
3 | 32 | 101 | "occ3" |
. | . | . | . |
Previously I tried the code below to get a table as clean as possible, but it took too much time. Now I need a faster way to get any of the available values of the OCCUPATION column.
import numpy as np

# max of AGE/GENDER per ID; for OCCUPATION, the first non-NaN value if one exists
customer_base.groupby("ID", as_index=False).agg(
    {"GENDER": "max",
     "AGE": "max",
     "OCCUPATION": lambda x: np.nan if len(x[x.notna()]) == 0 else x[x.notna()].values[0]})
Thanks in advance for your optimization ideas, and sorry for possible question duplication.
CodePudding user response:
Use GroupBy.first for the first non-NaN value per group:
# GroupBy.first skips NaN, so it picks the first available OCCUPATION per ID
df = customer_base.groupby("ID", as_index=False).agg({"AGE": "max",
                                                      "GENDER": "max",
                                                      "OCCUPATION": "first"})
print(df)
ID AGE GENDER OCCUPATION
0 1 64.0 101.0 "occ1"
1 2 66.0 100.0 "occ2"
2 3 32.0 101.0 "occ3"
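
For reference, here is a minimal, self-contained sketch that rebuilds the sample frame from the question's table and verifies the result (the numeric dtypes are an assumption):

import numpy as np
import pandas as pd

# rebuild the sample data shown in the question (NaN for the missing cells)
customer_base = pd.DataFrame({
    "ID": [1, 2, 2, 2, 3, 3, 3],
    "AGE": [64, 64, 66, np.nan, np.nan, np.nan, 32],
    "GENDER": [101, 100, 100, 100, 101, np.nan, np.nan],
    "OCCUPATION": ["occ1", "occ2", np.nan, "occ2", "occ3", np.nan, np.nan],
})

# "first" takes the first non-NaN value per group, so no Python-level lambda
# is needed and the whole aggregation stays in pandas' optimized groupby path
df = customer_base.groupby("ID", as_index=False).agg(
    {"AGE": "max", "GENDER": "max", "OCCUPATION": "first"})
print(df)

The speedup over the lambda version comes from replacing the per-group Python call with a built-in aggregation that pandas can run without invoking Python for every group.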