We want to transform event-based data into multiple time series.
As an example we use pandas to plot some graphics of the changes in salary per employee in a company over time. An event of a change in salary is a entry in a table with a date, a name and the new salary.
employee salary
date
2000-01-01 anna 4500
2003-01-01 oli 5000
2010-01-01 anna 6500
2012-01-01 lena 5000
2013-01-01 oli 7000
2016-01-01 lena 6500
2017-01-09 joe 5000
2018-01-09 peter 5000
2019-01-09 joe 5500
2019-01-31 lena 0
2020-01-01 anna 8500
2020-01-09 peter 5500
2021-01-09 joe 6000
2022-02-28 peter 0
The changes happen in irregularly-spaced intervals thus to work with the data we want reindex to a common regularly-spaced index and then do fill operations on missing data points.
time_series_index = pd.date_range(df_events.index.min(), df_events.index.max())
df_time_series = pd.DataFrame()
for name, group in df_events.groupby('employee'):
time_series = group['salary'].reindex(time_series_index)
time_series = time_series.ffill().fillna(0)
df_time_series[name] = time_series
print(df_time_series)
anna joe lena oli peter
2000-01-01 4500.0 0.0 0.0 0.0 0.0
2000-01-02 4500.0 0.0 0.0 0.0 0.0
2000-01-03 4500.0 0.0 0.0 0.0 0.0
2000-01-04 4500.0 0.0 0.0 0.0 0.0
2000-01-05 4500.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
2022-02-24 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-25 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-26 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-27 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-28 8500.0 6000.0 0.0 7000.0 0.0
The loop above does the job of reindexing to a common index.
Now the question arose whether the approach is state-of-the-art or if there is more compact and straight forward way to do it. We assume the problem of transformation of events to time series is a common problem and therefore we expected there would be a standard to solve these kind of problems.
We tried to make it compact by removing the loop as follows.
df_time_series = df_events.groupby('employee')['salary'].reindex(time_series_index)
It throws AttributeError:
AttributeError: 'SeriesGroupBy' object has no attribute 'reindex'
CodePudding user response:
This should work. If your index is already a datetime index, then you do not need the .rename(pd.to_datetime)
part
(df.rename(pd.to_datetime)
.set_index('employee',append = True)
.unstack()
.asfreq('D')
.ffill()
.fillna(0))
Output:
salary
employee anna joe lena oli peter
2000-01-01 4500.0 0.0 0.0 0.0 0.0
2000-01-02 4500.0 0.0 0.0 0.0 0.0
2000-01-03 4500.0 0.0 0.0 0.0 0.0
2000-01-04 4500.0 0.0 0.0 0.0 0.0
2000-01-05 4500.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
2022-02-24 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-25 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-26 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-27 8500.0 6000.0 0.0 7000.0 5500.0
2022-02-28 8500.0 6000.0 0.0 7000.0 0.0