Add missing rows in pandas DataFrame

Time:04-24

I have a DataFrame that looks like this:

import pandas as pd

df = pd.DataFrame.from_dict({'id':     [1, 2, 1, 1, 2, 3],
                             'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
                             'time':   ['10:00:00', '12:00:00', '10:00:05',
                                        '10:00:07', '12:00:03', '15:00:00']})

What I want to get is:

out = pd.DataFrame.from_dict({'id':     [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3],
                              'reward': [0.1, 0, 0, 0, 0, 0.15, 0, 0.05,
                                         0.25, 0, 0, 0.4, 0.45],
                              'time':   ['10:00:00', '10:00:01', '10:00:02', '10:00:03',
                                         '10:00:04', '10:00:05', '10:00:06', '10:00:07',
                                         '12:00:00', '12:00:01', '12:00:02', '12:00:03',
                                         '15:00:00']})

In short: for each id, insert the missing one-second rows with reward 0. How do I do this? I wrote a version with a loop, but it will be prohibitively slow for my use case, which has several million rows.
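The question doesn't include the loop-based attempt, but a typical per-second loop would look something like the hypothetical sketch below, which illustrates why it scales poorly: it does one Python-level iteration and dict lookup per second per id.

```python
import pandas as pd

df = pd.DataFrame({'id':     [1, 2, 1, 1, 2, 3],
                   'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
                   'time':   ['10:00:00', '12:00:00', '10:00:05',
                              '10:00:07', '12:00:03', '15:00:00']})
df['time'] = pd.to_datetime(df['time'])

# Hypothetical slow baseline: for each id, walk second by second
# from the first to the last observed timestamp.
rows = []
for gid, g in df.groupby('id'):
    existing = dict(zip(g['time'], g['reward']))  # observed rewards by timestamp
    t = g['time'].min()
    while t <= g['time'].max():
        rows.append({'id': gid, 'reward': existing.get(t, 0.0), 'time': t})
        t += pd.Timedelta(seconds=1)
out = pd.DataFrame(rows)
```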

CodePudding user response:

Here's one way using groupby.apply, where date_range fills in the per-second times for each id. We then merge the expanded times back into df and fill in the missing values of the other columns:

df['time'] = pd.to_datetime(df['time'])

# per-id range of seconds from the first to the last observed time, one row per second
times = (df.groupby('id')['time']
           .apply(lambda x: pd.date_range(x.iat[0], x.iat[-1], freq='s'))
           .explode())

out = df.merge(times, how='right')
out['id'] = out['id'].ffill().astype(int)   # rows come back in time order, so ffill restores id
out['reward'] = out['reward'].fillna(0)

Output:

    id  reward                time
0    1    0.10 2022-04-23 10:00:00
1    1    0.00 2022-04-23 10:00:01
2    1    0.00 2022-04-23 10:00:02
3    1    0.00 2022-04-23 10:00:03
4    1    0.00 2022-04-23 10:00:04
5    1    0.15 2022-04-23 10:00:05
6    1    0.00 2022-04-23 10:00:06
7    1    0.05 2022-04-23 10:00:07
8    2    0.25 2022-04-23 12:00:00
9    2    0.00 2022-04-23 12:00:01
10   2    0.00 2022-04-23 12:00:02
11   2    0.40 2022-04-23 12:00:03
12   3    0.45 2022-04-23 15:00:00
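An equivalent route, not shown in the answer above but sketched here for comparison, is a per-id resample: after moving 'time' into the index, each group can be upsampled to one-second frequency, with the gaps appearing as NaN and then zero-filled.

```python
import pandas as pd

df = pd.DataFrame({'id':     [1, 2, 1, 1, 2, 3],
                   'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
                   'time':   ['10:00:00', '12:00:00', '10:00:05',
                              '10:00:07', '12:00:03', '15:00:00']})
df['time'] = pd.to_datetime(df['time'])

# Upsample each id's reward series to 1-second frequency;
# seconds with no observation come back as NaN and are set to 0.
out = (df.set_index('time')
         .groupby('id')['reward']
         .resample('1s')
         .asfreq()
         .fillna(0)
         .reset_index())
```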

CodePudding user response:

One option is complete from pyjanitor, which abstracts the process:

# the dev version has some performance improvements
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

df['time'] = pd.to_datetime(df['time'])

# mapping that expands 'time' into a full 1-second range per group
new_time = {'time': lambda t: pd.date_range(t.min(), t.max(), freq='1s')}

# generate the expanded rows
df.complete(new_time, by='id', fill_value=0)

    id  reward                time
0    1    0.10 2022-04-24 10:00:00
1    1    0.00 2022-04-24 10:00:01
2    1    0.00 2022-04-24 10:00:02
3    1    0.00 2022-04-24 10:00:03
4    1    0.00 2022-04-24 10:00:04
5    1    0.15 2022-04-24 10:00:05
6    1    0.00 2022-04-24 10:00:06
7    1    0.05 2022-04-24 10:00:07
8    2    0.25 2022-04-24 12:00:00
9    2    0.00 2022-04-24 12:00:01
10   2    0.00 2022-04-24 12:00:02
11   2    0.40 2022-04-24 12:00:03
12   3    0.45 2022-04-24 15:00:00

Another option, which may be faster, combines groupby, explode and merge:

# get the min and max dates
temp = df.groupby('id').time.agg(['min', 'max'])

# generate list of dates
outcome = [pd.date_range(start, end, freq='1S') 
           for start, end in 
           zip(temp['min'], temp['max'])]

outcome = pd.Series(outcome, index = temp.index).rename('time').explode()

# merge back to the original df and restore the column order
(pd
.merge(outcome, df, on = ['id', 'time'], how = 'outer')
.fillna({'reward': 0})
.loc[:, df.columns]
)

    id  reward                time
0    1    0.10 2022-04-24 10:00:00
1    1    0.00 2022-04-24 10:00:01
2    1    0.00 2022-04-24 10:00:02
3    1    0.00 2022-04-24 10:00:03
4    1    0.00 2022-04-24 10:00:04
5    1    0.15 2022-04-24 10:00:05
6    1    0.00 2022-04-24 10:00:06
7    1    0.05 2022-04-24 10:00:07
8    2    0.25 2022-04-24 12:00:00
9    2    0.00 2022-04-24 12:00:01
10   2    0.00 2022-04-24 12:00:02
11   2    0.40 2022-04-24 12:00:03
12   3    0.45 2022-04-24 15:00:00
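For completeness, here is the merge-based option assembled into one self-contained, pyjanitor-free snippet. The names `temp`, `grid` and `out` are illustrative; `reset_index` makes id an explicit merge key, and the exploded times are converted back to datetime64 because explode returns object dtype, which merge cannot match against a datetime column.

```python
import pandas as pd

df = pd.DataFrame({'id':     [1, 2, 1, 1, 2, 3],
                   'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
                   'time':   ['10:00:00', '12:00:00', '10:00:05',
                              '10:00:07', '12:00:03', '15:00:00']})
df['time'] = pd.to_datetime(df['time'])

# per-id first/last timestamp
temp = df.groupby('id')['time'].agg(['min', 'max'])

# full 1-second grid per id; one row per (id, second)
grid = (temp.apply(lambda r: pd.date_range(r['min'], r['max'], freq='1s'), axis=1)
            .rename('time')
            .explode()
            .reset_index())
grid['time'] = pd.to_datetime(grid['time'])  # explode yields object dtype

# left-merge the observed rewards onto the grid and zero-fill the gaps
out = grid.merge(df, on=['id', 'time'], how='left').fillna({'reward': 0})
```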