Alter dataframe based on values in other rows

Time:06-10

I'm trying to alter my dataframe to create a Sankey diagram.

I've 3 million rows like this:

client_id  | start_date  | end_date    | position
1234       | 16-07-2019  | 27-03-2021  | 3
1234       | 18-07-2021  | 09-10-2021  | 1
1234       | 28-03-2021  | 17-07-2021  | 2
1234       | 10-10-2021  | 20-11-2021  | 2

I want it to look like this:

client_id  | start_date  | end_date    | position  | source  | target
1234       | 16-07-2019  | 27-03-2021  | 3         | 3       | 2
1234       | 18-07-2021  | 09-10-2021  | 1         | 1       | 2
1234       | 28-03-2021  | 17-07-2021  | 2         | 2       | 1
1234       | 10-10-2021  | 20-11-2021  | 2         | 2       | 4

Value 4 is the value that I use as the "exit" in the flow.

I have no idea how to do this.

Background: the source and target values are position values, determined by the order of start_date and end_date. For example, in the first row the source is position 3 and the target is position 2, because after that row's end date the client moved from position 3 to position 2.

CodePudding user response:

The source and target follow from each client's rows in date order, so you can rank the rows by start date within each client and look up the position of the next row.

import pandas as pd

columns = ["client_id", "start_date", "end_date", "position"]

data = [
   ["1234","16-07-2019","27-03-2021",3],
   ["1234","18-07-2021","09-10-2021",1],
   ["1234","28-03-2021","17-07-2021",2],
   ["1234","10-10-2021","20-11-2021",2],
   ["5678","16-07-2019","27-03-2021",3],
   ["5678","18-07-2021","09-10-2021",1],
   ["5678","28-03-2021","17-07-2021",2],
   ["5678","10-10-2021","20-11-2021",2],
]

df = pd.DataFrame(
    data,
    columns=columns
)

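# The dates are day-month-year, so pass the format explicitly.
df = df.assign(
    start_date = pd.to_datetime(df["start_date"], format="%d-%m-%Y"),
    end_date = pd.to_datetime(df["end_date"], format="%d-%m-%Y")
)
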
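# Rank each client's rows by start date (1 = earliest).
sdf = df.assign(
    rank=df.groupby("client_id")["start_date"].rank()
)

# The rank of the row that follows in time.
sdf = sdf.assign(
    next_rank=sdf["rank"] + 1
)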

combine_result = pd.merge(sdf,
                          sdf[["client_id", "position", "rank"]],
                          left_on=["client_id", "next_rank"],
                          right_on=["client_id", "rank"],
                          how="left",
                          suffixes=["", "_next"]
                          ).fillna({"position_next": 4})

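# Keep position, add source and target, and restore the integer dtype lost to the NaN fill.
result = combine_result.assign(
    source=combine_result["position"],
    target=combine_result["position_next"].astype(int)
)[["client_id", "start_date", "end_date", "position", "source", "target"]].sort_values(["client_id", "start_date"])
print(result)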
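With 3 million rows, the self-merge above can probably be avoided. The following is a minimal sketch of the same idea using sort_values and groupby(...).shift(-1), not a tested drop-in replacement. It assumes the dates are day-month-year strings as in the question and that the DataFrame index is unique, because the result is aligned back by index. The names add_source_target and EXIT_POSITION are made up for this illustration.

```python
import pandas as pd

EXIT_POSITION = 4  # the question's marker for "exit" (hypothetical constant name)


def add_source_target(df: pd.DataFrame, exit_position: int = EXIT_POSITION) -> pd.DataFrame:
    """Return a copy of df with `source` and `target` columns added.

    Assumes `start_date` holds day-month-year strings such as "16-07-2019"
    and that df has a unique index.
    """
    out = df.copy()
    # Parse the start dates only to establish chronological order.
    start = pd.to_datetime(out["start_date"], format="%d-%m-%Y")
    ordered = out.assign(_start=start).sort_values(["client_id", "_start"])
    # Within each client, the target is the position of the next row in time.
    next_position = ordered.groupby("client_id")["position"].shift(-1)
    out["source"] = out["position"]
    # Assignment aligns on the index, so the original row order is preserved.
    out["target"] = next_position.fillna(exit_position).astype(int)
    return out


df = pd.DataFrame(
    [
        ["1234", "16-07-2019", "27-03-2021", 3],
        ["1234", "18-07-2021", "09-10-2021", 1],
        ["1234", "28-03-2021", "17-07-2021", 2],
        ["1234", "10-10-2021", "20-11-2021", 2],
    ],
    columns=["client_id", "start_date", "end_date", "position"],
)

result = add_source_target(df)
print(result)
```

Because the shifted values are assigned back by index, the rows stay in their original order, which matches the layout asked for in the question. Each client's last row has no following row, so it receives the exit value.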