DataFrame successive projection of one type onto the other using pad

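In a process we construct different types of objects, e.g. of type R and Q. One type (R) shall then be projected onto the other type (Q) successively: the value of each R row is carried forward onto the Q rows that follow it, until the next R row appears.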

The following code generates example data.

import pandas as pd
import numpy as np

np.random.seed(42)
N = 16
item_types = ['Q', 'R']

item_type = np.random.choice(item_types, N)
data_a = np.where(item_type == 'R', np.arange(N), np.nan)
index_of_a_R_type = np.argwhere(item_type == 'R')[-3][0]
data_a[index_of_a_R_type] = np.nan
df = pd.DataFrame(zip(data_a, item_type), columns=['a', 'type'])

At the console the df looks as follows.

       a type
0    NaN    Q
1    1.0    R
2    NaN    Q
3    NaN    Q
4    NaN    Q
5    NaN    R
6    NaN    Q
7    NaN    Q
8    NaN    Q
9    9.0    R
10   NaN    Q
11   NaN    Q
12   NaN    Q
13   NaN    Q
14  14.0    R
15   NaN    Q

The projection can be achieved with the method pad (forward fill). There is one issue with this approach: if an R row holds a NaN value, the fill overwrites it with the value of the previous R row, which should be prevented. In this case the NaN itself should be projected onto the following Q rows, not overwritten. This behavior can be achieved by replacing those NaN values with a key value, padding, and then replacing the key value with NaN again.

Here is the function that does the intended projection.

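def project_R_on_Q(df):
    # Sentinel value; it must not occur in the real data.
    REPLACEMENT_VALUE = -999
    # Protect NaN values in R rows from being overwritten by the fill.
    mask_R_and_nan = (df['type'] == 'R') & df['a'].isna()
    df.loc[mask_R_and_nan, 'a'] = REPLACEMENT_VALUE
    # ffill() is the current name for pad(), which is deprecated in pandas 2.x.
    df['a'] = df['a'].ffill()
    # Assign the result back instead of calling replace(inplace=True) on a column selection.
    df['a'] = df['a'].replace(REPLACEMENT_VALUE, np.nan)
    return df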

df = project_R_on_Q(df)

The output on the console looks like this.

       a type
0    NaN    Q
1    1.0    R
2    1.0    Q
3    1.0    Q
4    1.0    Q
5    NaN    R
6    NaN    Q
7    NaN    Q
8    NaN    Q
9    9.0    R
10   9.0    Q
11   9.0    Q
12   9.0    Q
13   9.0    Q
14  14.0    R
15  14.0    Q

Although this works, it appears to be a rather poor solution to a common use case. We wonder if there is a more straightforward approach to this problem.

User response:

You can use groupby.ffill, with each group starting at an R row, so that values are only filled forward within their own group:

df.groupby(df['type'].eq('R').cumsum()).ffill()

output:

       a type
0    NaN    Q
1    1.0    R
2    1.0    Q
3    1.0    Q
4    1.0    Q
5    NaN    R
6    NaN    Q
7    NaN    Q
8    NaN    Q
9    9.0    R
10   9.0    Q
11   9.0    Q
12   9.0    Q
13   9.0    Q
14  14.0    R
15  14.0    Q
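
To see why the grouping works, it can help to look at the group ids on their own. The following is a minimal sketch, not part of the original answer: it rebuilds the example frame by hand instead of using the random generator, and the names group_id and result are chosen here for illustration only. It also assigns the filled column back with assign, which is an assumption about how one might want to use the result.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question by hand (same values as above).
types = list("QRQQQRQQQRQQQQRQ")
a = [np.nan] * len(types)
a[1], a[9], a[14] = 1.0, 9.0, 14.0  # the R row at index 5 keeps its NaN
df = pd.DataFrame({"a": a, "type": types})

# Every R row increments the running count, so each R row opens a new group
# that also contains the Q rows following it.
group_id = df["type"].eq("R").cumsum()
print(group_id.tolist())
# [0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4]

# Forward-fill only within each group. The NaN in the R row at index 5 starts
# its own group, so the value 1.0 from the previous group cannot reach it.
result = df.assign(a=df.groupby(group_id)["a"].ffill())
print(result)
```

Because the fill never crosses a group boundary, no sentinel value is needed, and the approach does not depend on a value such as -999 being absent from the data.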