I have a dataframe (df_obs) indexed by date containing one column with a length in number of days of something being observed:
Date duration
2012-01-01 3
2013-04-01 26
2014-05-01 14
2016-01-01 297
I am trying to repurpose this dataframe to provide a binary indication of whether an observation happens on each day, so the result would look like this:
Date Obs
2012-01-01 1
2012-01-02 1
2012-01-03 1
2012-01-04 0
2012-01-05 0
I had a script to do this, but since performing updates it won't wor - this is what I had used before:
Time = pd.DataFrame(pd.date_range(start='01/01/2012', end='31/12/2019'))
obs = np.zeros()
for d in df_obs.itertuples():
ilong = np.argwhere(Time.date == d.Index)[0][0]
obs[ilong:ilong d.duration] = 1
This now returns the following error:
ValueError: Length of values (1) does not match length of index (2921)
Any pointers of what edits I need to make this work?
CodePudding user response:
You can use:
s = df['Date'].repeat(df['duration'])
dates = (pd.to_datetime(s)
.add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='d'))
)
(pd.DataFrame({'Obs': 1}, index=dates)
.reindex(pd.date_range(df['Date'].min(), df['Date'].max()),
fill_value=0
)
.rename_axis('Date').reset_index()
)
Output:
Date Obs
0 2012-01-01 1
1 2012-01-02 1
2 2012-01-03 1
3 2012-01-04 0
4 2012-01-05 0
... ... ...
1457 2015-12-28 0
1458 2015-12-29 0
1459 2015-12-30 0
1460 2015-12-31 0
1461 2016-01-01 1
CodePudding user response:
Thank you to @mozway for the above suggestion. Using the above method I encountered problems with a duplicate index which prevented me from reindexing, so I made some modifications. The code below produced what I was looking for, probably not very efficient,
Time = pd.DataFrame(pd.date_range(start='01/01/2012', end='31/12/2019'))
s = df_obs['date'].repeat(df_obs['duration'])
dates = pd.to_datetime(s).add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='d'))
obs = pd.DataFrame({'Obs': 1}, index=dates)
obs = obs.loc[~obs.index.duplicated(), :]
obs = obs.reindex(Time, fill_value=0)