I'm trying to make a dataframe made of job records that looks something like this
id | start date | End Date |
---|---|---|
01 | 2023-01-30 | 2023-02-19 |
02 | 2023-06-12 | 2023-06-25 |
My goal is to expand each record so that there are x many rows for each id where x is the number of weeks between start and end dates.
It should look something like this:
id | start date | End Date | Week | Week Count | Year | Week # | Period |
---|---|---|---|---|---|---|---|
01 | 2023-01-30 | 2023-02-19 | 2023-01-30 | 1 | 2023 | 5 | 2023-5 |
01 | 2023-01-30 | 2023-02-19 | 2023-02-06 | 2 | 2023 | 6 | 2023-6 |
01 | 2023-01-30 | 2023-02-19 | 2023-02-13 | 3 | 2023 | 7 | 2023-7 |
02 | 2023-06-12 | 2023-06-25 | 2023-06-12 | 1 | 2023 | 24 | 2023-24 |
02 | 2023-06-12 | 2023-06-25 | 2023-06-19 | 2 | 2023 | 25 | 2023-25 |
I tried manually calculating the number of weeks before dates and then expanding the dataframe using
df = df.loc[df.index.repeat(df['# of weeks')].reset_index(drop=True)
I was wondering if there was a better way to automatically extract the number of weeks between dates and each week for every row per job record?
Trying the method here's the code I used and the resulting error:
df['Id'] = df['Id'].astype(str)
df['weeks'] = np.ceil((df['End_Date__c'] - df['Start_Date__c']) / np.timedelta64(1, "W"))
exp_df = pd.concat([pd.DataFrame({
'Id': np.repeat(idx 1, weeks),
'Start_Date__c': start,
'End_Date__c ': end,
'Week': pd.date_range(start, periods=weeks, freq='W-MON'),
'Week Count': np.arange(1, weeks 1),
'Year': pd.date_range(start, periods=weeks, freq='W-MON').year,
'Week #': pd.date_range(start, periods=weeks, freq='W-MON').strftime('%U'),
'Period': pd.date_range(start, periods=weeks, freq='W-MON').strftime('%Y-%U')
}) for idx, (start, end, weeks) in df.loc[:, ['Start_Date__c', 'End_Date__c', 'End_Date__c']].iterrows()])
exp_df
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 4
1 df['Id'] = df['Id'].astype(str)
2 df['weeks'] = np.ceil((df['End_Date__c'] - df['Start_Date__c']) / np.timedelta64(1, "W"))
----> 4 exp_df = pd.concat([pd.DataFrame({
5 'Id': np.repeat(idx 1, weeks),
6 'Start_Date__c': start,
7 'End_Date__c ': end,
8 'Week': pd.date_range(start, periods=weeks, freq='W-MON'),
9 'Week Count': np.arange(1, weeks 1),
10 'Year': pd.date_range(start, periods=weeks, freq='W-MON').year,
11 'Week #': pd.date_range(start, periods=weeks, freq='W-MON').strftime('%U'),
12 'Period': pd.date_range(start, periods=weeks, freq='W-MON').strftime('%Y-%U')
13 }) for idx, (start, end, weeks) in df.loc[:, ['Start_Date__c', 'End_Date__c', 'End_Date__c']].iterrows()])
15 exp_df
Cell In[11], line 5, in <listcomp>(.0)
1 df['Id'] = df['Id'].astype(str)
2 df['weeks'] = np.ceil((df['End_Date__c'] - df['Start_Date__c']) / np.timedelta64(1, "W"))
4 exp_df = pd.concat([pd.DataFrame({
----> 5 'Id': np.repeat(idx 1, weeks),
6 'Start_Date__c': start,
7 'End_Date__c ': end,
8 'Week': pd.date_range(start, periods=weeks, freq='W-MON'),
9 'Week Count': np.arange(1, weeks 1),
10 'Year': pd.date_range(start, periods=weeks, freq='W-MON').year,
11 'Week #': pd.date_range(start, periods=weeks, freq='W-MON').strftime('%U'),
12 'Period': pd.date_range(start, periods=weeks, freq='W-MON').strftime('%Y-%U')
13 }) for idx, (start, end, weeks) in df.loc[:, ['Start_Date__c', 'End_Date__c', 'End_Date__c']].iterrows()])
15 exp_df
File <__array_function__ internals>:5, in repeat(*args, **kwargs)
File /opt/anaconda3/lib/python3.9/site-packages/numpy/core/fromnumeric.py:479, in repeat(a, repeats, axis)
436 @array_function_dispatch(_repeat_dispatcher)
437 def repeat(a, repeats, axis=None):
438 """
439 Repeat elements of an array.
440
(...)
477
478 """
--> 479 return _wrapfunc(a, 'repeat', repeats, axis=axis)
File /opt/anaconda3/lib/python3.9/site-
packages/numpy/core/fromnumeric.py:54, in _wrapfunc(obj, method, *args, **kwds)
52 bound = getattr(obj, method, None)
53 if bound is None:
---> 54 return _wrapit(obj, method, *args, **kwds)
56 try:
57 return bound(*args, **kwds)
File /opt/anaconda3/lib/python3.9/site-packages/numpy/core/fromnumeric.py:43, in _wrapit(obj, method, *args, **kwds)
41 except AttributeError:
42 wrap = None
---> 43 result = getattr(asarray(obj), method)(*args, **kwds)
44 if wrap:
45 if not isinstance(result, mu.ndarray):
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Timestamp'
CodePudding user response:
Here is an approach using numpy and pd.concat()
. import numpy as np
df['start date'] = pd.to_datetime(df['start date'])
df['End Date'] = pd.to_datetime(df['End Date'])
df['weeks'] = np.ceil((df['End Date'] - df['start date']) / np.timedelta64(1, "W"))
exp_df = pd.concat([pd.DataFrame({
'id': np.repeat(idx, weeks),
'start date': start,
'End Date': end,
'Week': pd.date_range(start, periods=weeks, freq='W-MON'),
'Week Count': np.arange(1, weeks 1),
'Year': pd.date_range(start, periods=weeks, freq='W-MON').year,
'Week #': pd.date_range(start, periods=weeks, freq='W-MON').strftime('%U'),
'Period': pd.date_range(start, periods=weeks, freq='W-MON').strftime('%Y-%U')
}) for idx, (start, end, weeks) in df.loc[:, ['start date', 'End Date', 'weeks']].iterrows()])
exp_df = exp_df.reset_index(drop=True)
print(exp_df)
Other methods used:
np.ceil()
df.loc()
id start date End Date Week Week Count Year Week # Period
0 1 2023-01-30 2023-02-19 2023-01-30 1.0 2023 05 2023-05
1 1 2023-01-30 2023-02-19 2023-02-06 2.0 2023 06 2023-06
2 1 2023-01-30 2023-02-19 2023-02-13 3.0 2023 07 2023-07
3 2 2023-06-12 2023-06-25 2023-06-12 1.0 2023 24 2023-24
4 2 2023-06-12 2023-06-25 2023-06-19 2.0 2023 25 2023-25