How do I get the first non-null value from multiple columns based on another datetime column order a-CodePudding

What I've got right now is a DataFrame like this:

    id  ts                          site   type
0   111 2022-07-25 19:07:00.938365  A      NaN
1   111 2022-07-25 19:07:00.938371  NaN    1.0
2   222 2022-07-25 19:07:00.938372  NaN    NaN
3   222 2022-07-25 19:07:00.938373  NaN    2.0
4   222 2022-07-25 19:07:00.938374  C      1.0

What I'm trying to do is get the first non-null values of site and type for each id, based on the descending order of ts.

So my expected output is something like:

    id  site   type
0   111 A      1.0
1   222 C      1.0

I've tried to do this:

df_grouped = df.sort_values(by="ts", ascending=False).groupby("id").ffill().first()


> TypeError: first() missing 1 required positional argument: 'offset'

I've also tried this:

df_grouped[["site", "type"]].apply(lambda x: x.first_valid_index()).reset_index()



    index       0
0   site        0
1   screen_type 0

CodePudding user response：

You can do like this:

df = df.sort_values('ts', ascending=False)

df.groupby('id', as_index=False)[['site', 'type']].agg(lambda x: x.dropna().iloc[0])

or using first_valid_index:

df.groupby('id', as_index=False)[['site', 'type']].agg(lambda x: x[x.first_valid_index()])

output:

    id site  type
0  111    A   1.0
1  222    C   1.0

Note: If you have all NaNs in either 'site' or 'type' columns it won't work. Then you don't even have to do this probably.

CodePudding user response：

(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
   .agg(lambda x:x.bfill().head(1)).reset_index())

    id site  type
0  111    A   1.0
1  222    C   1.0

Note that if YOU ARE SURE there is ATLEAST 1 NON-NAN per id then you can do:

(df.sort_values('ts', ascending=False).bfill().groupby('id')[['site', 'type']]
   .first().reset_index())

    id site  type
0  111    A   1.0
1  222    C   1.0