I'm trying to alter my dataframe to create a Sankey diagram.
I have 3 million rows like this:
client_id | start_date | end_date   | position
1234      | 16-07-2019 | 27-03-2021 | 3
1234      | 18-07-2021 | 09-10-2021 | 1
1234      | 28-03-2021 | 17-07-2021 | 2
1234      | 10-10-2021 | 20-11-2021 | 2
I want it to look like this:
client_id | start_date | end_date   | position | source | target
1234      | 16-07-2019 | 27-03-2021 | 3        | 3      | 2
1234      | 18-07-2021 | 09-10-2021 | 1        | 1      | 2
1234      | 28-03-2021 | 17-07-2021 | 2        | 2      | 1
1234      | 10-10-2021 | 20-11-2021 | 2        | 2      | 4
Value 4 is the value I use for "exit" in the flow.
I have no idea how to do this.
Background: source and target are position values, ordered by start_date and end_date. For example, in the first row the source is position 3 and the target is position 2, because after that row's end date the client moved from position 3 to position 2.
CodePudding user response:
Source and target are determined by the date order within each client, so you can rank each client's rows by start_date and look up the position of the row with the next rank.
import pandas as pd

columns = ["client_id", "start_date", "end_date", "position"]
data = [
    ["1234", "16-07-2019", "27-03-2021", 3],
    ["1234", "18-07-2021", "09-10-2021", 1],
    ["1234", "28-03-2021", "17-07-2021", 2],
    ["1234", "10-10-2021", "20-11-2021", 2],
    ["5678", "16-07-2019", "27-03-2021", 3],
    ["5678", "18-07-2021", "09-10-2021", 1],
    ["5678", "28-03-2021", "17-07-2021", 2],
    ["5678", "10-10-2021", "20-11-2021", 2],
]
df = pd.DataFrame(data, columns=columns)

# The dates are day-first, so pass the format explicitly.
df = df.assign(
    start_date=pd.to_datetime(df["start_date"], format="%d-%m-%Y"),
    end_date=pd.to_datetime(df["end_date"], format="%d-%m-%Y"),
)

# Rank each client's rows by start_date (1 = earliest).
sdf = df.assign(rank=df.groupby("client_id")["start_date"].rank())
sdf = sdf.assign(next_rank=sdf["rank"] + 1)

# Self-merge: attach the position of the row with the next rank.
# The last row per client has no match, so it gets the exit value 4.
combine_result = pd.merge(
    sdf,
    sdf[["client_id", "position", "rank"]],
    left_on=["client_id", "next_rank"],
    right_on=["client_id", "rank"],
    how="left",
    suffixes=("", "_next"),
).fillna({"position_next": 4})

result = (
    combine_result[["client_id", "start_date", "end_date", "position", "position_next"]]
    .rename({"position": "source", "position_next": "target"}, axis=1)
    .astype({"target": int})  # fillna leaves floats; cast back to int
    .sort_values(["client_id", "start_date"])
)
print(result)
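As a possible alternative to the rank-and-merge approach, here is a minimal sketch that sorts by client and start date and then uses groupby with shift(-1) to pull up the next position. It avoids the self-merge, which may matter with 3 million rows, though I have not benchmarked it. The helper names add_source_target and link_counts, and the exit_value parameter, are made up for illustration. It assumes the dates are day-first strings as in the question, and it keeps the position column alongside source and target.

```python
import pandas as pd


def add_source_target(df: pd.DataFrame, exit_value: int = 4) -> pd.DataFrame:
    """Return a copy of df with source/target columns (hypothetical helper)."""
    out = df.copy()
    # Assumption: dates are day-first strings, as in the question.
    out["start_date"] = pd.to_datetime(out["start_date"], format="%d-%m-%Y")
    out["end_date"] = pd.to_datetime(out["end_date"], format="%d-%m-%Y")
    out = out.sort_values(["client_id", "start_date"]).reset_index(drop=True)
    out["source"] = out["position"]
    # shift(-1) within each client pulls up the next period's position;
    # a client's last period has no successor and gets the exit value.
    out["target"] = (
        out.groupby("client_id")["position"].shift(-1).fillna(exit_value).astype(int)
    )
    return out


def link_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count rows per (source, target) pair (hypothetical helper)."""
    return df.groupby(["source", "target"]).size().reset_index(name="value")


sample = pd.DataFrame(
    {
        "client_id": ["1234"] * 4,
        "start_date": ["16-07-2019", "18-07-2021", "28-03-2021", "10-10-2021"],
        "end_date": ["27-03-2021", "09-10-2021", "17-07-2021", "20-11-2021"],
        "position": [3, 1, 2, 2],
    }
)
result = add_source_target(sample)
links = link_counts(result)
print(result)
print(links)
```

The links frame has one row per (source, target) pair with a count in value, which is the shape most Sankey plotting libraries expect for their link data.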