Home > database >  merge two data frame if start and end date of df2 comes in range of start and end date of df1 in pyt
merge two data frame if start and end date of df2 comes in range of start and end date of df1 in pyt

Time:01-11

i have two dataframe df1 and df2

df1=

id      start        end
 a  1/12/2022 18/12/2022
 a 19/12/2022 25/12/2022
 a 26/12/2022 31/12/2022
 b 01/12/2022 20/12/2022
 b 21/12/2022 31/12/2022
 c 01/12/2022 31/12/2022
 d 01/12/2022 15/12/2022
 d 16/12/2022 31/12/2022

and second data frame as

df2

id    start_2      end_2  number
 a 15/12/2022 15/12/2022       1
 b 17/12/2022 19/12/2022       3
 b 25/12/2022 27/12/2022       2
 c 12/12/2022 12/12/2022       1
 d 03/12/2022 04/12/2022       2
 d 25/12/2022 25/12/2022       1

I want to merge both dataframe left join (df1 and df2) by id . And adjust column 'number' in same date range(start and end date) coming in df1 . Like if in df2 id 'a' has number 1 it should come in first row of 'a'(1/12/2022 to 18/12/2022) not in other slot. Other slot should be zero. Like below

Resulted df

id      start        end  number
 a  1/12/2022 18/12/2022       1
 a 19/12/2022 25/12/2022       0
 a 26/12/2022 31/12/2022       0
 b 01/12/2022 20/12/2022       3
 b 21/12/2022 31/12/2022       2
 c 01/12/2022 31/12/2022       1
 d 01/12/2022 15/12/2022       2
 d 16/12/2022 31/12/2022       1

Note if two number lies in same slot of df1,there should be groupby sum.

CodePudding user response:

This is a work around. After merging you set start and end condition then make a good use of .loc and groupby

df1["start"] = pd.to_datetime(df1["start"], dayfirst=True)
df1["end"] = pd.to_datetime(df1["end"], dayfirst=True)
df2["start_2"] = pd.to_datetime(df2["start_2"], dayfirst=True)
df2["end_2"] = pd.to_datetime(df2["end_2"], dayfirst=True)


merged_df = pd.merge(df1, df2, on="id", how="left")
merged_df["number_adj"] = 0

start_condition = (merged_df["start_2"] >= merged_df["start"]) & (merged_df["start_2"] <= merged_df["end"])
end_condition = (merged_df["end_2"] >= merged_df["start"]) & (merged_df["end_2"] <= merged_df["end"])

merged_df.loc[start_condition | end_condition, "number_adj"] = merged_df["number"]
merged_df = merged_df.groupby(["id", "start", "end"]).sum().reset_index()
merged_df.drop("number", axis=1, inplace=True)
merged_df.rename(columns={"number_adj": "number"}, inplace=True)

print(merged_df)

Output:


  id      start        end  number
0  a 2022-12-01 2022-12-18       1
1  a 2022-12-19 2022-12-25       0
2  a 2022-12-26 2022-12-31       0
3  b 2022-12-01 2022-12-20       3
4  b 2022-12-21 2022-12-31       2
5  c 2022-12-01 2022-12-31       1
6  d 2022-12-01 2022-12-15       2
7  d 2022-12-16 2022-12-31       1

CodePudding user response:

You can use concat and the groupby with the size() method.

df = pd.concat([df1, df2])
df.groupby(["start", "end"]).size()

CodePudding user response:

You can merge on id then filter out your list:

# Convert to DatetimeIndex if necessary
df1['start'] = pd.to_datetime(df1['start'], dayfirst=True)
df1['end'] = pd.to_datetime(df1['end'], dayfirst=True)
df2['start_2'] = pd.to_datetime(df2['start_2'], dayfirst=True)
df2['end_2'] = pd.to_datetime(df2['end_2'], dayfirst=True)

# Merge on id, reset_index to preserve original index on merge
out = df1.reset_index().merge(df2, on='id', how='left')

# Check intervals
out['indicator'] = (out['start'] < out['start_2']) & (out['end_2'] < out['end'])

# Filter the list and set to 0 other slots
out = out.loc[out.groupby('index')['indicator'].idxmax()]
out.loc[~out['indicator'], 'number'] = 0

# Get the final dataframe
out = out[df1.columns.tolist()   ['number']].set_index(df1.index)

Output:

>>> out
  id      start        end  number
0  a 2022-12-01 2022-12-18       1
1  a 2022-12-19 2022-12-25       0
2  a 2022-12-26 2022-12-31       0
3  b 2022-12-01 2022-12-20       3
4  b 2022-12-21 2022-12-31       2
5  c 2022-12-01 2022-12-31       1
6  d 2022-12-01 2022-12-15       0
7  d 2022-12-16 2022-12-31       1
  • Related