Home > Software design >  Overlaping dates in pandas dataframe
Overlaping dates in pandas dataframe

Time:10-23

ID  START DATE  END DATE
5194    2019-05-15  2019-05-31
5193    2017-02-08  2017-04-02
5193    2017-02-15  2017-04-10
5193    2021-04-01  2021-05-15
5191    2020-10-01  2020-11-20
5191    2019-02-28  2019-04-20
5188    2018-10-01  2018-11-30

i have a dataframe(this is just a part of it) , When the id value of the previous row equals the id value of the next row, i want to check if the dates of the 2 rows overlap, and if so i want to create a new row that keeps the longest date and drops the old ones, ie when the ID is 5193 i want my new row to be ID: 5193, START DATE: 2017-02-08 , END DATE: 2017-04-10 !!

Is that even doable? , tried to approach it with midle point of a date but didnt get any results! Any suggestion would be highly appreciated

CodePudding user response:

Try with groupby and agg

import pandas as pd
a = """5194    2019-05-15  2019-05-31
5193    2017-02-08  2017-04-02
5193    2017-02-15  2017-04-10
5193    2021-04-01  2021-05-15
5191    2020-10-01  2020-11-20
5191    2019-02-28  2019-04-20
5188    2018-10-01  2018-11-30
"""
df = pd.DataFrame([i.split() for i in a.splitlines()], columns=["ID", "START DATE", "END DATE"])
df = df.assign(part_start_date=lambda x: x["START DATE"].astype(str).str[:7]).groupby(["ID", "part_start_date"]).agg({"START DATE": "min", "END DATE": "max"}).reset_index().drop("part_start_date", axis=1)


# output. Longest Date will be where start date is min and end_date is max
    ID  START DATE  END DATE
0   5188    2018-10-01  2018-11-30
1   5191    2019-02-28  2019-04-20
2   5191    2020-10-01  2020-11-20
3   5193    2017-02-08  2017-04-10
4   5193    2021-04-01  2021-05-15
5   5194    2019-05-15  2019-05-31
  • Related