How to handle duplicate dates in dates ranges in python-CodePudding

Imagine an employee that has worked for different companies and you want to know how many months he worked for each company. The problem is that he may have worked for two companies, overlapping some time (case of companies D and C). This overlapping time shall not be included. Another thing is that in the each month he worked, it counts as one month, no matter how long he worked, for example, in company A, duration shall be 3 months (as he worked in months 1, 2 and 3), although the real duration would be 2 months.

    Company    Start              End       Duration (months)
       A      2021-01-01       2021-03-01
       B      2021-03-03       2021-06-07
       C      2021-06-10       2021-08-28
       D      2021-04-10       2021-10-02

I thought of something like this:

duration_days = end - start # dates are datetime object
duration_months = duration_days.days/30 # not sure if it is the right way (months with 28 days?)
day = start
if day<= end:
    duration_months  = 1
    day = day   relativedelta(months= 1)
df.append()

I did not figure out how to remove the duplicates and I am not sure if my approach above is the best.

Thanks

CodePudding user response：

You could use the dateutil package and dateutil.relativedelta:

import io
import pandas as pd

import calendar
from datetime import datetime
from dateutil.relativedelta import relativedelta

data = io.StringIO("""
Company,Start,End
A,2021-01-01,2021-03-01
B,2021-03-03,2021-06-07
C,2021-06-10,2021-08-28
D,2021-04-10,2021-10-02
""")

df = pd.read_csv(data)

def to_datetime(date):
    return datetime.strptime(date, "%Y-%m-%d")

df["Start"] = df["Start"].apply(to_datetime)
df["End"] = df["End"].apply(to_datetime)

def get_duration(start, end):

    left = df[(df["End"] > start) & (df["Start"] < start)]
    right = df[(df["Start"] > start) & (df["End"] > end)]
    overlaps = df[(df["Start"] < start) & (df["End"] > end)]
    within = df[(df["Start"] > start) & (df["End"] < end)]
    
    if not left.empty:
        start = left["End"].max()

    if not right.empty:
        end = right["Start"].min()
    
    if not overlaps.empty:
        return 0
    
    if not within.empty:
        ...  # still to be solved
    
    last_day_in_month = calendar.monthrange(end.year, end.month)[1]

    end = end.replace(day=last_day_in_month)
    start = start.replace(day=1)

    return relativedelta(end, start).months

df["Duration"] = df.apply(lambda row: get_duration(row["Start"], row["End"]), axis=1)

Output for print(df) would be:

  Company      Start        End  Duration
0       A 2021-01-01 2021-03-01         2
1       B 2021-03-03 2021-06-07         1
2       C 2021-06-10 2021-08-28         0
3       D 2021-04-10 2021-10-02         4

The problem is that he may have worked for two companies, overlapping some time (case of companies D and C). This overlapping time shall not be included.

I read this like: No time is considered while working for 2 or more companies. NB. The example above handles 3 of 4 overlapping cases.

However, I assume you want the time allocated once. In that case, you may be able to achieve this by only handling two overlapping cases:

def get_duration(start, end):

    right = df[(df["Start"] > start) & (df["End"] > end)]
    overlaps = df[(df["Start"] < start) & (df["End"] > end)]
    
    if not right.empty:
        end = right["Start"].min()
    
    if not overlaps.empty:
        return 0
    
    last_day_in_month = calendar.monthrange(end.year, end.month)[1]

    end = end.replace(day=last_day_in_month)
    start = start.replace(day=1)

    return relativedelta(end, start).months

Output for print(df) would be:

  Company      Start        End  Duration
0       A 2021-01-01 2021-03-01         2
1       B 2021-03-03 2021-06-07         1
2       C 2021-06-10 2021-08-28         0
3       D 2021-04-10 2021-10-02         6

Another thing is that in the each month he worked, it counts as one month, no matter how long he worked [...]

You can achieve this with:

last_day_in_month = calendar.monthrange(end.year, end.month)[1]

end = end.replace(day=last_day_in_month)
start = start.replace(day=1)

months = relativedelta(end, start).months

Note: Depending on your data and the amount of the data, the example above may not perform well. However, hopefully my answer gives you an idea.

CodePudding user response：

I suggest making a new variable for each date and then when your done append all of the variables to a list! Like this:

duration_days = end - start # dates are datetime object
duration_months = duration_days.days/30
day = start
if day<= end:
    duration_months  = 1
    day = day   relativedelta(months= 1)
    date1 = (duration_months)
    date2 = (duration_months)
    date3 = (duration_months)
df.append(date1, date2, date3)

Tell me if that works!