I have a Pandas DataFrame with daily data that I'm trying to group by week number to sum some columns, and I notice that when years do not begin on Sunday, the data for the week spanning the end of one year and the beginning of the next do not cleanly sum, instead being broken into two groups. My code to create a week number column is as follows:
df = pd.DataFrame([
{'Date': 2020-03-01, 'population': 100, 'admission': 10, 'release': 10, 'feature_id': 'abc'},
{'Date': 2020-03-02, 'population': 100, 'admission': 10, 'release': 10, 'feature_id': 'abc'},
...
])
df['week_num'] = df['Date'].dt.strftime('%Y-%U')
by_week = df.groupby(["week_num", "feature_id"])[["population", "admission", "release"]].sum().reset_index()
When I look at my grouped data, I see the following:
...,
{
'week_num': '2020-52',
'start_of_week': '2020-12-27T00:00:00.000Z',
'end_of_week': '2021-01-02T00:00:00.000Z',
'population': 1000,
'admission': 100,
'release': 150
},
{
'week_num': '2021-00',
'start_of_week': '2020-12-27T00:00:00.000Z',
'end_of_week': '2021-01-02T00:00:00.000Z',
'population': 200,
'admission': 120,
'release': 90
},
...
In reality, the summed values for the week from 2020-12-27 to 2021-01-02 should be a single group, population=1000 200=1200, admission=100 120=220, release=150 90=240
.
Is there a clean way to create a week_num
column that takes into account this end-of-year hiccup and sums a single week?
Edit
It appears the issue may derive from the application of a Week
class to extract start and end of week, as follows:
class Week:
def __init__(self, year, week_num):
week_start = str(year) " " str(week_num) " 0"
week_end = str(year) " " str(week_num) " 6"
self.first_date = dt.datetime.strptime(week_start, '%Y %U %w')
self.last_date = dt.datetime.strptime(week_end, '%Y %U %w')
...
by_week["start_of_week"] = by_week["iso_week_num"].apply(lambda w: Week(*w.split("-")).first_date)
Final from answer
df["week_num"] = df["Date"].dt.strftime("%G-%V")
by_week = df.groupby(["week_num", "feature_id"])[["population", "admission", "release"]].sum().reset_index()
by_week["start_of_week"] = by_week["week_num"].apply(lambda d: datetime.strptime(d "1", "%G-%V%u"))
by_week["end_of_week"] = by_week["week_num"].apply(lambda d: datetime.strptime(d "7", "%G-%V%u"))
CodePudding user response:
%U breaks the week around new year as per documentation https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
Replace %U with %V and %Y with %G for the original code to work as expected (from python 3.6 onward).
For python < 3.6 use isocalendar():
df = pd.DataFrame([
{'Date': pd.Timestamp('2020-12-20') pd.Timedelta(n, 'day'), 'population': 100, 'admission': 10, 'release': 10,
'feature_id': 'abc'}
for n in range(21)
])
df = pd.concat([df, (df['Date'] pd.Timedelta(1, 'day')).dt.isocalendar()], axis=1)
df['start_of_week'] = df.apply(lambda x: dt.datetime.fromisocalendar(x.year, x.week, 1), axis=1)
df['end_of_week'] = df.apply(lambda x: dt.datetime.fromisocalendar(x.year, x.week, 7), axis=1)
df["iso_week_num"] = (df['Date'] pd.Timedelta(1, 'day')).apply(lambda t: f'{t.isocalendar().year}-{t.isocalendar().week}')
by_iso_week = df.groupby(["year", "week", "feature_id"])[["population", "admission", "release"]].sum().reset_index()
df["start_of_week2"] = df["iso_week_num"].apply(lambda w: Week(*w.split("-")).first_date)