Optimize ways to get the first date of week dynamically in pandas without using lambda function-CodePudding

I have a dataframe df_purchase. It has the column BusinessTime in datetime format. My purpose is to get the date from each record in the BusinessTime column and find the first date of the week corresponding to that date, then add it to the InvoiceWeek column. The important things is that the first date of the week can be changed dynamically, at any weekdays (Mon-Sun).

This is an example of the dataframe df_purchase (this will be the input):

    BusinessTime
0   2021-03-05
1   2021-03-29
2   2021-03-19
3   2021-03-25
4   2021-03-29
5   2021-03-29

For example: I want to set the first day of the week as Wednesday (which means the week will start from Wed-Tue). The result will be: BusinessTime (2021-03-05) -> InvoiceWeek (2021-03-03)

This is the code that I apply for this problem:

def get_first_day_of_week(date, selection_weekday):
    '''
    Get the first day of week of a given date based on the selection of any week day
    
    Selection_weekday    Week definition
    0                        Mon - Sun
    1                        Tue - Mon
    2                        Wed - Tue
    3                        Thu - Wed
    4                        Fri - Thu
    5                        Sat - Fri
    6                        Sun - Sat
    '''
    
    # Check the current week day of the given date
    check_current_weekday = date.dayofweek
    
    # Find out the differences between the current week day and the selection week day
    diff_days = selection_weekday - check_current_weekday
    
    # All of the date will be parsed to Monday first
    convert_date = date - timedelta(days=check_current_weekday)
    
    # Perform if the selection day is not Monday
    if not(selection_weekday == 0):
        
        # The weekday will be in previous time plus the difference 
        if diff_days > 0:
            convert_date = date - timedelta(days=7)   timedelta(days=diff_days)
        
        # The weekday will be in current week minus the difference
        elif diff_days < 0:
            convert_date = date   timedelta(days=diff_days)
        
        # The weekday will be that date
        else:
            convert_date = date
    
    return convert_date


def select_first_day_of_week():
    df_purchase['InvoiceWeek'] = df_purchase.loc[:, "BusinessTime"].apply(lambda x: get_first_day_of_week(x, SELECTION_WEEKDAY))
    return df_purchase


SELECTION_WEEKDAY = 2 # Week will be from Wed to Tue

select_first_day_of_week()

This is an example of the dataframe result:

    BusinessTime    InvoiceWeek
0   2021-03-05      2021-03-03
1   2021-03-29      2021-03-24
2   2021-03-19      2021-03-17
3   2021-03-25      2021-03-24
4   2021-03-29      2021-03-24
5   2021-03-29      2021-03-24

Due to the usage of lambda, the performance is really slow when having thousands to millions of records. Are there any ways that I can optimize the performance of this code or to change another solution? I have searched for resample() in pandas but it seems it doesn't support my given problem. Thanks everyone.

CodePudding user response：

How about

period_map = {
    0: "W-Sun",
    1: "W-Mon",
    2: "W-Tue",
    3: "W-Wed",
    4: "W-Thu",
    5: "W-Fri",
    6: "W-Sat",
}

df["InvoiceWeek"] = (
    df["BusinessTime"]
    .dt.to_period(period_map[SELECTION_DAY])
    .dt.start_time
)

Then you'd get

  BusinessTime InvoiceWeek
0   2021-03-05  2021-03-03
1   2021-03-29  2021-03-24
2   2021-03-19  2021-03-17
3   2021-03-25  2021-03-24
4   2021-03-29  2021-03-24
5   2021-03-29  2021-03-24