Home > other >  get the first closest date in a groupby
get the first closest date in a groupby

Time:04-13

I am trying to retrieve the first next Funded date for each row in a group. So for each row in a group, find the first closest funded date as long as its on or after the date.

Example dataset

import uuid
import pandas as pd

df = (
    pd.DataFrame(
        {
            "id": pd.Series(
                ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C","B"]
            ),
            "status": pd.Series(
                [
                    "Funded",
                    "Rejected",
                    "Rejected",
                    "Funded",
                    "Funded",
                    "Rejected",
                    "Rejected",
                    "Funded",
                    "Rejected",
                    "Rejected",
                    "Funded",
                    "Funded",
                    "Rejected"
                ]
            ),
            "date": pd.Series(
                [
                    "2021-01-31",
                    "2021-02-28",
                    "2021-03-31",
                    "2021-04-30",
                    "2021-05-31",
                    "2021-06-30",
                    "2021-07-31",
                    "2021-08-31",
                    "2021-09-30",
                    "2021-10-31",
                    "2021-11-30",
                    "2021-12-31",
                    "2021-07-15"
                ]
            ),
        }
    )
    .sort_values(by=["id", "date"])
    .reset_index(drop=True)
)
df["uuid"] = [uuid.uuid4() for x in range(len(df))]

expected output

enter image description here

CodePudding user response:

Use pd.merge_asof:

# You should always convert datetime data to Timestamp type
df["date"] = pd.to_datetime(df["date"])

# For the left side of the merge: merge_asof requires the
# merge key to be sorted
left = df.sort_values("date")

# For the right side of the merge: it's a copy of the left
# side, but only for rows with status == Funded. We also
# rename the `date` column to `next_funded_date`
right = (
    left.query("status == 'Funded'")
    .rename(columns={"date": "next_funded_date"})
    [["id", "next_funded_date"]]
)

# The merge: left and right side must match exactly on the
# `by` column, but can match "next nearest" on `left_on` and
# `right_on`
# direction = forward meaning left_on <= right_on
# allow_exact_matches=False disallows a row to match itself
pd.merge_asof(
    left, right, by="id",
    left_on="date", right_on="next_funded_date",
    direction="forward", allow_exact_matches=False
).sort_values(["id", "date"])
  • Related