ID START DATE END DATE
5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
i have a dataframe(this is just a part of it) , When the id value of the previous row equals the id value of the next row, i want to check if the dates of the 2 rows overlap, and if so i want to create a new row that keeps the longest date and drops the old ones, ie when the ID is 5193 i want my new row to be ID: 5193, START DATE: 2017-02-08 , END DATE: 2017-04-10 !!
Is that even doable? , tried to approach it with midle point of a date but didnt get any results! Any suggestion would be highly appreciated
CodePudding user response:
Try with groupby
and agg
import pandas as pd
a = """5194 2019-05-15 2019-05-31
5193 2017-02-08 2017-04-02
5193 2017-02-15 2017-04-10
5193 2021-04-01 2021-05-15
5191 2020-10-01 2020-11-20
5191 2019-02-28 2019-04-20
5188 2018-10-01 2018-11-30
"""
df = pd.DataFrame([i.split() for i in a.splitlines()], columns=["ID", "START DATE", "END DATE"])
df = df.assign(part_start_date=lambda x: x["START DATE"].astype(str).str[:7]).groupby(["ID", "part_start_date"]).agg({"START DATE": "min", "END DATE": "max"}).reset_index().drop("part_start_date", axis=1)
# output. Longest Date will be where start date is min and end_date is max
ID START DATE END DATE
0 5188 2018-10-01 2018-11-30
1 5191 2019-02-28 2019-04-20
2 5191 2020-10-01 2020-11-20
3 5193 2017-02-08 2017-04-10
4 5193 2021-04-01 2021-05-15
5 5194 2019-05-15 2019-05-31