Home > Enterprise >  Pandas/Numpy: how can I convert a count of observations over a period of time/days to a binary indic
Pandas/Numpy: how can I convert a count of observations over a period of time/days to a binary indic

Time:11-30

I have a dataframe (df_obs) indexed by date containing one column with a length in number of days of something being observed:

Date         duration
2012-01-01   3
2013-04-01   26
2014-05-01   14
2016-01-01   297

I am trying to repurpose this dataframe to provide a binary indication of whether an observation happens on each day, so the result would look like this:

Date         Obs
2012-01-01   1
2012-01-02   1
2012-01-03   1
2012-01-04   0
2012-01-05   0

I had a script to do this, but since performing updates it won't wor - this is what I had used before:

Time = pd.DataFrame(pd.date_range(start='01/01/2012', end='31/12/2019'))

obs = np.zeros()
for d in df_obs.itertuples():
    ilong = np.argwhere(Time.date == d.Index)[0][0]
    obs[ilong:ilong d.duration] = 1

This now returns the following error:

ValueError: Length of values (1) does not match length of index (2921)

Any pointers of what edits I need to make this work?

CodePudding user response:

You can use:

s = df['Date'].repeat(df['duration'])
dates = (pd.to_datetime(s)
           .add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='d'))
         )

(pd.DataFrame({'Obs': 1}, index=dates)
   .reindex(pd.date_range(df['Date'].min(), df['Date'].max()),
            fill_value=0
           )
   .rename_axis('Date').reset_index()
)

Output:

           Date  Obs
0    2012-01-01    1
1    2012-01-02    1
2    2012-01-03    1
3    2012-01-04    0
4    2012-01-05    0
...         ...  ...
1457 2015-12-28    0
1458 2015-12-29    0
1459 2015-12-30    0
1460 2015-12-31    0
1461 2016-01-01    1

CodePudding user response:

Thank you to @mozway for the above suggestion. Using the above method I encountered problems with a duplicate index which prevented me from reindexing, so I made some modifications. The code below produced what I was looking for, probably not very efficient,

 Time = pd.DataFrame(pd.date_range(start='01/01/2012', end='31/12/2019'))
 s = df_obs['date'].repeat(df_obs['duration'])
 dates = pd.to_datetime(s).add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='d'))
 obs = pd.DataFrame({'Obs': 1}, index=dates) 
 obs = obs.loc[~obs.index.duplicated(), :]
 obs = obs.reindex(Time, fill_value=0)
  • Related