transform event based data into time series data with pandas using groupby and reindex-CodePudding

We want to transform event-based data into multiple time series.

As an example we use pandas to plot some graphics of the changes in salary per employee in a company over time. An event of a change in salary is a entry in a table with a date, a name and the new salary.

           employee  salary
date                       
2000-01-01     anna    4500
2003-01-01      oli    5000
2010-01-01     anna    6500
2012-01-01     lena    5000
2013-01-01      oli    7000
2016-01-01     lena    6500
2017-01-09      joe    5000
2018-01-09    peter    5000
2019-01-09      joe    5500
2019-01-31     lena       0
2020-01-01     anna    8500
2020-01-09    peter    5500
2021-01-09      joe    6000
2022-02-28    peter       0

The changes happen in irregularly-spaced intervals thus to work with the data we want reindex to a common regularly-spaced index and then do fill operations on missing data points.

time_series_index = pd.date_range(df_events.index.min(), df_events.index.max())
df_time_series = pd.DataFrame()
for name, group in df_events.groupby('employee'):
   time_series = group['salary'].reindex(time_series_index) 
   time_series = time_series.ffill().fillna(0)
   df_time_series[name] = time_series   
print(df_time_series)

              anna     joe  lena     oli   peter
2000-01-01  4500.0     0.0   0.0     0.0     0.0
2000-01-02  4500.0     0.0   0.0     0.0     0.0
2000-01-03  4500.0     0.0   0.0     0.0     0.0
2000-01-04  4500.0     0.0   0.0     0.0     0.0
2000-01-05  4500.0     0.0   0.0     0.0     0.0
...            ...     ...   ...     ...     ...
2022-02-24  8500.0  6000.0   0.0  7000.0  5500.0
2022-02-25  8500.0  6000.0   0.0  7000.0  5500.0
2022-02-26  8500.0  6000.0   0.0  7000.0  5500.0
2022-02-27  8500.0  6000.0   0.0  7000.0  5500.0
2022-02-28  8500.0  6000.0   0.0  7000.0     0.0

The loop above does the job of reindexing to a common index.

Now the question arose whether the approach is state-of-the-art or if there is more compact and straight forward way to do it. We assume the problem of transformation of events to time series is a common problem and therefore we expected there would be a standard to solve these kind of problems.

We tried to make it compact by removing the loop as follows.

df_time_series = df_events.groupby('employee')['salary'].reindex(time_series_index)

It throws AttributeError:

AttributeError: 'SeriesGroupBy' object has no attribute 'reindex'

CodePudding user response：

This should work. If your index is already a datetime index, then you do not need the .rename(pd.to_datetime) part

(df.rename(pd.to_datetime)
.set_index('employee',append = True)
.unstack()
.asfreq('D')
.ffill()
.fillna(0))

Output:

            salary                             
employee      anna     joe lena     oli   peter
2000-01-01  4500.0     0.0  0.0     0.0     0.0
2000-01-02  4500.0     0.0  0.0     0.0     0.0
2000-01-03  4500.0     0.0  0.0     0.0     0.0
2000-01-04  4500.0     0.0  0.0     0.0     0.0
2000-01-05  4500.0     0.0  0.0     0.0     0.0
...            ...     ...  ...     ...     ...
2022-02-24  8500.0  6000.0  0.0  7000.0  5500.0
2022-02-25  8500.0  6000.0  0.0  7000.0  5500.0
2022-02-26  8500.0  6000.0  0.0  7000.0  5500.0
2022-02-27  8500.0  6000.0  0.0  7000.0  5500.0
2022-02-28  8500.0  6000.0  0.0  7000.0     0.0