I have a dataframe df_purchase. It has the column BusinessTime in datetime format. My purpose is to get the date from each record in the BusinessTime column and find the first date of the week corresponding to that date, then add it to the InvoiceWeek column. The important things is that the first date of the week can be changed dynamically, at any weekdays (Mon-Sun).
This is an example of the dataframe df_purchase (this will be the input):
BusinessTime
0 2021-03-05
1 2021-03-29
2 2021-03-19
3 2021-03-25
4 2021-03-29
5 2021-03-29
For example: I want to set the first day of the week as Wednesday (which means the week will start from Wed-Tue). The result will be: BusinessTime (2021-03-05) -> InvoiceWeek (2021-03-03)
This is the code that I apply for this problem:
def get_first_day_of_week(date, selection_weekday):
'''
Get the first day of week of a given date based on the selection of any week day
Selection_weekday Week definition
0 Mon - Sun
1 Tue - Mon
2 Wed - Tue
3 Thu - Wed
4 Fri - Thu
5 Sat - Fri
6 Sun - Sat
'''
# Check the current week day of the given date
check_current_weekday = date.dayofweek
# Find out the differences between the current week day and the selection week day
diff_days = selection_weekday - check_current_weekday
# All of the date will be parsed to Monday first
convert_date = date - timedelta(days=check_current_weekday)
# Perform if the selection day is not Monday
if not(selection_weekday == 0):
# The weekday will be in previous time plus the difference
if diff_days > 0:
convert_date = date - timedelta(days=7) timedelta(days=diff_days)
# The weekday will be in current week minus the difference
elif diff_days < 0:
convert_date = date timedelta(days=diff_days)
# The weekday will be that date
else:
convert_date = date
return convert_date
def select_first_day_of_week():
df_purchase['InvoiceWeek'] = df_purchase.loc[:, "BusinessTime"].apply(lambda x: get_first_day_of_week(x, SELECTION_WEEKDAY))
return df_purchase
SELECTION_WEEKDAY = 2 # Week will be from Wed to Tue
select_first_day_of_week()
This is an example of the dataframe result:
BusinessTime InvoiceWeek
0 2021-03-05 2021-03-03
1 2021-03-29 2021-03-24
2 2021-03-19 2021-03-17
3 2021-03-25 2021-03-24
4 2021-03-29 2021-03-24
5 2021-03-29 2021-03-24
Due to the usage of lambda, the performance is really slow when having thousands to millions of records. Are there any ways that I can optimize the performance of this code or to change another solution? I have searched for resample() in pandas but it seems it doesn't support my given problem. Thanks everyone.
CodePudding user response:
How about
period_map = {
0: "W-Sun",
1: "W-Mon",
2: "W-Tue",
3: "W-Wed",
4: "W-Thu",
5: "W-Fri",
6: "W-Sat",
}
df["InvoiceWeek"] = (
df["BusinessTime"]
.dt.to_period(period_map[SELECTION_DAY])
.dt.start_time
)
Then you'd get
BusinessTime InvoiceWeek
0 2021-03-05 2021-03-03
1 2021-03-29 2021-03-24
2 2021-03-19 2021-03-17
3 2021-03-25 2021-03-24
4 2021-03-29 2021-03-24
5 2021-03-29 2021-03-24