What I've got right now is a DataFrame like this:
id ts site type
0 111 2022-07-25 19:07:00.938365 A NaN
1 111 2022-07-25 19:07:00.938371 NaN 1.0
2 222 2022-07-25 19:07:00.938372 NaN NaN
3 222 2022-07-25 19:07:00.938373 NaN 2.0
4 222 2022-07-25 19:07:00.938374 C 1.0
What I'm trying to do is get the first non-null values of site
and type
for each id
, based on the descending order of ts
.
So my expected output is something like:
id site type
0 111 A 1.0
1 222 C 1.0
I've tried to do this:
df_grouped = df.sort_values(by="ts", ascending=False).groupby("id").ffill().first()
> TypeError: first() missing 1 required positional argument: 'offset'
I've also tried this:
df_grouped[["site", "type"]].apply(lambda x: x.first_valid_index()).reset_index()
index 0
0 site 0
1 screen_type 0
CodePudding user response:
You can do like this:
df = df.sort_values('ts', ascending=False)
df.groupby('id', as_index=False)[['site', 'type']].agg(lambda x: x.dropna().iloc[0])
or using first_valid_index
:
df.groupby('id', as_index=False)[['site', 'type']].agg(lambda x: x[x.first_valid_index()])
output:
id site type
0 111 A 1.0
1 222 C 1.0
Note: If you have all NaNs in either 'site' or 'type' columns it won't work. Then you don't even have to do this probably.
CodePudding user response:
(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
.agg(lambda x:x.bfill().head(1)).reset_index())
id site type
0 111 A 1.0
1 222 C 1.0
Note that if YOU ARE SURE there is ATLEAST 1 NON-NAN per id then you can do:
(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
.first().reset_index())
id site type
0 111 A 1.0
1 222 C 1.0