Home > Net >  How do I get the latest entries in a DataFrame up to a certain time, for a given list of column valu
How do I get the latest entries in a DataFrame up to a certain time, for a given list of column valu

Time:09-08

Say I have the following DataFrame df:

time    person    attributes
----------------------------
 1        1           a
 2        2           b
 3        1           c
 4        3           d
 5        2           e
 6        1           f
 7        3           g
...      ...         ...

I want to write a function get_latest() that, when given a request_time and a list of persons ids, it will return a DataFrame containing the latest entry (row) for each person, up to the request_time.

So for instance, if I called get_latest(request_time = 4.5, ids = [1, 2]), then I want it to return

time    person    attributes
----------------------------
 2        2           b
 3        1           c

since those are the latest entries for persons 1 and 2 up to the time 4.5.

I've thought about doing a truncation of the DataFrame and then doing search from there by going up the DataFrame, but that's an okay efficiency of O(n), and I was wondering if there are functions or logic that make this a faster computation.

EDIT: I made this example DataFrame on the fly but it is perhaps important that I point out that the times are Python datetimes.

CodePudding user response:

How about pd.DataFrame.query

def latest_entries(request_time: int or float, ids: list) -> pd.DataFrame:
    return (
        df
        .query("time <= @request_time & person in @ids")
        .sort_values(["time"], ascending=False)
        .drop_duplicates(subset=["person"], keep="first")
        .reset_index(drop=True)
    )


print(latest_entries(4.5, [1, 2]))

   time  person attributes
0     3       1          c
1     2       2          b

CodePudding user response:

def get_latest(tme, ids):
    df2= ( df[(df['time']<=tme) & 
              (df['person'].isin(ids))])
    return df2[~df2.duplicated(subset=['person'], keep='last')]

get_latest(4.5, [1,2])
    time    person  attributes
1      2         2           b
2      3         1           c
  • Related