I am trying to replace some hardcoded SQL queries related to timezone changes with a more dynamic/data-driven Python script. I have a dataset that looks like this spreadsheet below. WEEK_START/DAY/MONTH is the week, day, and month when daylight savings time begins (for example Canberra starts the first Sunday of April while Vienna is the last Sunday of March). The end variables are in the same format and display when it ends.
Here is the issue. I have seen solutions for specific use cases such as this, finding the last Sunday of the month:
current_year=today.year
current_month=today.month
current_day=today.day
month = calendar.monthcalendar(current_year, current_month)
day_of_month = max(month[-1][calendar.SUNDAY], month[-2][calendar.SUNDAY])
print(day_of_month)
31
This tells me that the last day of this month is the 31st. I can adjust the attributes for one given month/scenario, but how would I make a column for each and every row (city) to retrieve each? That is, several cities that change times on different dates? I thought if I could set attributes in day_of_month in an apply function it would work but when I do something like weekday='SUNDAY' it returns an error because of course the string 'SUNDAY' is not the same as SUNDAY the attribute of calendar. My SQL queries are grouped by cities that change on the same day but ideally anyone would be able to edit the CSV that loads the above dataset as needed and then each day the script would run once to see if today is between the start and end of daylight savings. We might have new cities to add in the future. I'm confident in doing that bit but quite lost on how to retrieve the dates for a given year.
My alternate, less resilient, option is to look at the distinct list of potential dates (last Sunday of March, first Sunday of April, etc.), write code to retrieve each one upfront (as in the above snippet above), and assign the dates in that way. I say that this is less resilient because if a city is added that does not fit in an existing group for time changes, the code would need to be altered as well.
So stackoverflow, is there a way to do this in a data driven way in pandas through an apply or something similar? Thanks in advance.
CodePudding user response:
Basically I think you have most of what you need. Just map the WEEK_START / WEEK_END column {-1, 1} to last or first day of month, put it all in a function and apply
it to each row. EX:
import calendar
import operator
import pandas as pd
def get_date(year: int, month: int, dayname: str, first=-1) -> pd.Timestamp:
"""
get the first or last day "dayname" in given month and year.
returns last by default.
"""
daysinmonth = calendar.monthcalendar(year, month)
getday = operator.attrgetter(dayname.upper())
if first == 1:
day = daysinmonth[0][getday(calendar)]
else:
day = max(daysinmonth[-1][getday(calendar)], daysinmonth[-2][getday(calendar)])
return pd.Timestamp(year, month, day)
year = 2021 # we need a year...
df['date_start'] = df.apply(lambda row: get_date(year,
row['MONTH_START'],
row['DAY_START'],
row['WEEK_START']), # selects first or last
axis=1) # to each row
df['date_end'] = df.apply(lambda row: get_date(year,
row['MONTH_END'],
row['DAY_END'],
row['WEEK_END']),
axis=1)
giving you for the sample data
df[['CITY', 'date_start', 'date_end']]
CITY date_start date_end
0 Canberra 2021-04-04 2021-10-03
1 Melbourne 2021-04-04 2021-10-03
2 Sydney 2021-04-04 2021-10-03
3 Kitzbuhel 2021-03-28 2021-10-31
4 Vienna 2021-03-28 2021-10-31
5 Antwerp 2021-03-28 2021-10-31
6 Brussels 2021-03-28 2021-10-31
7 Louvain-la-Neuve 2021-03-28 2021-10-31
Once you start working with time zones and DST transitions, Q: Is there a way to infer in Python if a date is the actual day in which the DST (Daylight Saving Time) change is made? might also be interesting to you.