Say I have the following DataFrame df
:
time person attributes
----------------------------
1 1 a
2 2 b
3 1 c
4 3 d
5 2 e
6 1 f
7 3 g
... ... ...
I want to write a function get_latest()
that, when given a request_time
and a list of persons ids
, it will return a DataFrame containing the latest entry (row) for each person, up to the request_time
.
So for instance, if I called get_latest(request_time = 4.5, ids = [1, 2])
, then I want it to return
time person attributes
----------------------------
2 2 b
3 1 c
since those are the latest entries for persons 1
and 2
up to the time 4.5
.
I've thought about doing a truncation of the DataFrame and then doing search from there by going up the DataFrame, but that's an okay efficiency of O(n), and I was wondering if there are functions or logic that make this a faster computation.
EDIT: I made this example DataFrame on the fly but it is perhaps important that I point out that the times are Python datetimes.
CodePudding user response:
How about pd.DataFrame.query
def latest_entries(request_time: int or float, ids: list) -> pd.DataFrame:
return (
df
.query("time <= @request_time & person in @ids")
.sort_values(["time"], ascending=False)
.drop_duplicates(subset=["person"], keep="first")
.reset_index(drop=True)
)
print(latest_entries(4.5, [1, 2]))
time person attributes
0 3 1 c
1 2 2 b
CodePudding user response:
def get_latest(tme, ids):
df2= ( df[(df['time']<=tme) &
(df['person'].isin(ids))])
return df2[~df2.duplicated(subset=['person'], keep='last')]
get_latest(4.5, [1,2])
time person attributes
1 2 2 b
2 3 1 c