In a process we construct objects of different types, e.g. of type R and Q. The value of each R item shall then be projected onto the Q items that follow it, up to the next R item.
The following code generates example data.
import pandas as pd
import numpy as np
np.random.seed(42)
N = 16
item_types = ['Q', 'R']
item_type = np.random.choice(item_types, N)
data_a = np.where(item_type == 'R', np.arange(N), np.nan)
index_of_a_R_type = np.argwhere(item_type == 'R')[-3][0]
data_a[index_of_a_R_type] = np.nan
df = pd.DataFrame(zip(data_a, item_type), columns=['a', 'type'])
At the console the df looks as follows.
a type
0 NaN Q
1 1.0 R
2 NaN Q
3 NaN Q
4 NaN Q
5 NaN R
6 NaN Q
7 NaN Q
8 NaN Q
9 9.0 R
10 NaN Q
11 NaN Q
12 NaN Q
13 NaN Q
14 14.0 R
15 NaN Q
The projection can be achieved with a forward fill (the method pad, an alias of ffill). There is one issue with this approach: if an R item holds a NaN value, the fill overwrites it with the previous R value, which should be prevented. In this case the NaN should be projected onto the following Q items, not overwritten. This behavior can be achieved by replacing those NaN values with a key value, forward filling, and then replacing the key value with NaN again.
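The overwriting issue can be reproduced on a small frame. This is only a minimal sketch: df_demo and naive are hypothetical names used for illustration, and ffill is used as the current spelling of pad.

```python
import numpy as np
import pandas as pd

# Small stand-in for the example data: the R row at position 3 holds NaN.
df_demo = pd.DataFrame({
    "a": [np.nan, 1.0, np.nan, np.nan, np.nan, 5.0],
    "type": ["Q", "R", "Q", "R", "Q", "R"],
})

# A plain forward fill carries 1.0 across the R row at position 3
# and the Q row after it, although both should stay NaN.
naive = df_demo["a"].ffill()
print(naive.tolist())
```

Positions 3 and 4 end up with 1.0 here, which is exactly the overwrite that the key-value workaround is meant to avoid.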
Here is the function that does the intended projection.
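def project_R_on_Q(df):
    # Sentinel value; it must not occur in the real data.
    REPLACEMENT_VALUE = -999
    # R rows whose value is NaN: these NaNs must be projected, not filled.
    mask_R_and_nan = (df['type'] == 'R') & df['a'].isna()
    df.loc[mask_R_and_nan, 'a'] = REPLACEMENT_VALUE
    # Forward fill; ffill replaces the deprecated alias pad.
    df['a'] = df['a'].ffill()
    # Restore NaN; assign back instead of using inplace on a column selection.
    df['a'] = df['a'].replace(REPLACEMENT_VALUE, np.nan)
    return df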
df = project_R_on_Q(df)
The output on the console looks like this.
a type
0 NaN Q
1 1.0 R
2 1.0 Q
3 1.0 Q
4 1.0 Q
5 NaN R
6 NaN Q
7 NaN Q
8 NaN Q
9 9.0 R
10 9.0 Q
11 9.0 Q
12 9.0 Q
13 9.0 Q
14 14.0 R
15 14.0 Q
Although this works, it appears to be a rather poor solution to a common use case. Is there a more straightforward approach to this problem?
Answer:
You can use groupby.ffill on groups that each start with an R row:
df.groupby(df['type'].eq('R').cumsum()).ffill()
output:
a type
0 NaN Q
1 1.0 R
2 1.0 Q
3 1.0 Q
4 1.0 Q
5 NaN R
6 NaN Q
7 NaN Q
8 NaN Q
9 9.0 R
10 9.0 Q
11 9.0 Q
12 9.0 Q
13 9.0 Q
14 14.0 R
15 14.0 Q
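To see why the grouping works, it can help to look at the group key itself. The sketch below rebuilds the example data by hand, so it does not depend on the random seed, and applies the fill to column a only. The names group_key and projected are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Same layout as the example data: R rows at positions 1, 5, 9 and 14,
# with the R row at position 5 holding NaN.
types = list("QRQQQRQQQRQQQQRQ")
values = [np.nan, 1.0, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
          np.nan, 9.0, np.nan, np.nan, np.nan, np.nan, 14.0, np.nan]
df = pd.DataFrame({"a": values, "type": types})

# The running count of R rows is the group key: it increases by one
# at every R row, so each R row and the Q rows after it share a key.
group_key = df["type"].eq("R").cumsum()

# Forward-fill only within each group, so a NaN on an R row is never
# overwritten by a value from the previous group.
projected = df["a"].groupby(group_key).ffill()

print(pd.DataFrame({"key": group_key, "a": projected, "type": df["type"]}))
```

Assigning the result back with df['a'] = projected should give the same column as the key-value round trip, without needing a sentinel that must not occur in the data.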