I have a dataset named customer_base containing over 800K rows like below:
ID | AGE | GENDER | OCCUPATION |
---|---|---|---|
1 | 64 | 101 | "occ1" |
2 | 64 | 100 | "occ2" |
2 | 66 | 100 | NaN |
2 | NaN | 100 | "occ2" |
3 | NaN | 101 | "occ3" |
3 | NaN | NaN | NaN |
3 | 32 | NaN | NaN |
. | . | . | . |
After a grouping operation, the desired version of it should look like below:
ID | AGE | GENDER | OCCUPATION |
---|---|---|---|
1 | 64 | 101 | "occ1" |
2 | 66 | 100 | "occ2" |
3 | 32 | 101 | "occ3" |
. | . | . | . |
Previously I tried the code below to get a table as clean as possible, but it took too much time. Now I need a faster way to get any of the available values of the OCCUPATION column.
import numpy as np

# max of AGE/GENDER per ID; for OCCUPATION, the first non-NaN value if one exists
customer_base.groupby("ID", as_index=False).agg(
    {"GENDER": "max",
     "AGE": "max",
     "OCCUPATION": lambda x: np.nan if len(x[x.notna()]) == 0 else x[x.notna()].values[0]})
Thanks in advance for your optimization ideas, and sorry for possible question duplication.
CodePudding user response:
Use GroupBy.first for the first non-NaN value per group:
# GroupBy.first skips NaN, so it picks the first available OCCUPATION per ID
df = customer_base.groupby("ID", as_index=False).agg({"AGE": "max",
                                                      "GENDER": "max",
                                                      "OCCUPATION": "first"})
print(df)
ID AGE GENDER OCCUPATION
0 1 64.0 101.0 "occ1"
1 2 66.0 100.0 "occ2"
2 3 32.0 101.0 "occ3"
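
For reference, here is a minimal, self-contained sketch that rebuilds the sample frame from the question's table and verifies the result (the numeric dtypes are an assumption):

import numpy as np
import pandas as pd

# rebuild the sample data shown in the question (NaN for the missing cells)
customer_base = pd.DataFrame({
    "ID": [1, 2, 2, 2, 3, 3, 3],
    "AGE": [64, 64, 66, np.nan, np.nan, np.nan, 32],
    "GENDER": [101, 100, 100, 100, 101, np.nan, np.nan],
    "OCCUPATION": ["occ1", "occ2", np.nan, "occ2", "occ3", np.nan, np.nan],
})

# "first" takes the first non-NaN value per group, so no Python-level lambda
# is needed and the whole aggregation stays in pandas' optimized groupby path
df = customer_base.groupby("ID", as_index=False).agg(
    {"AGE": "max", "GENDER": "max", "OCCUPATION": "first"})
print(df)

The speedup over the lambda version comes from replacing the per-group Python call with a built-in aggregation that pandas can run without invoking Python for every group.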