Home > Blockchain >  Pandas: enter missing rows in a dataframe
Pandas: enter missing rows in a dataframe

Time:12-02

I'm collecting time series data, but sometimes for some time points there is no data to be collected. Just say for example I am collecting data across four time points, I might get a dataframe like this:

df_ = pd.DataFrame({'group': ['A']*3 ['B']*3,
                    'time': [1,2,4,1,3,4],
                    'value': [100,105,111,200,234,222]})

sometimes there is a datapoint missing and so there is no row for that point, I would like groupby and to forward fill with the previous value to create a new row form which would look like this:

df_missing_completed = pd.DataFrame({'group': ['A']*4 ['B']*4,
                                     'time': [1,2,3,4,1,2,3,4],
                                     'value': [100, 101, 105,111,200, 202, 234,222]})

I had the idea that I could create an new dataframe as a template with all the dates and time points, without any values, join it with the real data which would induce NA's, and do a ffillon the value column to fill in the missing data, like below:

df_template = pd.DataFrame({'group': ['A']*4 ['B']*4,
                                 'time': [1,2,3,4,1,2,3,4]})
df_final = pd.merge(df_template, df_, on = ['group', 'time'], how='left')
df_final['filled_values'] = df_final['value'].fillna(method='ffill')

but this seems like a messy solution, and with the real data the df_templete will be more complex to create. Does anyone know a better one? Thanks!

CodePudding user response:

I would use:

(df_.pivot(index='time', columns='group', values='value')
    # reindex only of you want to add missing times for all groups
    .reindex(range(df_['time'].min(), df_['time'].max() 1))
    .ffill().unstack().reset_index(name='value')
)

Output:

  group  time  value
0     A     1  100.0
1     A     2  105.0
2     A     3  105.0
3     A     4  111.0
4     B     1  200.0
5     B     2  200.0
6     B     3  234.0
7     B     4  222.0

CodePudding user response:

Instead of a template dataframe you could create a new index and then reindex with ffill:

new_idx = pd.MultiIndex.from_product([list('AB'), range(1,5)], names=['group', 'time'])
df_.set_index(['group', 'time']).reindex(new_idx, method='ffill').reset_index()

The result keeps the datatype of the value column:

  group  time  value
0     A     1    100
1     A     2    105
2     A     3    105
3     A     4    111
4     B     1    200
5     B     2    200
6     B     3    234
7     B     4    222
  • Related