Efficient way to fill pandas cells on condition-CodePudding

I have a pandas dataframe in the following format:

  col1 col2
0    a    A
1    a
2    a    A
3    a
4    b
5    b
6    c    A
7    c

I want to search for col1 values which have at least N corresponding col2 == 'A' values. In such cases, I want to fill in the rest of corresponding cells in col2 column with 'A's.

Let's see a concrete example. Let's suppose that N=1. In this case, we check if there is at least one 'A' in col2 for the various col1 values. For col1='a', there are 2 'A's in col2, hence this case meets the condition. Let's thus fill rows with index 1 and 3 in col2 with 'A's. Moving on, we see that for col1='b', there are no 'A's in col2, hence no need to fill anything here. Lastly, we see that for col1='c' there is exactly one 'A' in col2. As this also meets the condition, we will fill in row with index 7 in col2 too with 'A'.

Similarly, when the N threshold was set to 2, new 'A's would only be added to rows with index 1 and 3 in col2, but not in the last row (index 7).

I am assuming there might be an efficient, vectorized way to go about this. For now, I could only think of iterating through the dataframe, which doesn't scale well for my original dataset with 10s of thousands of rows.

This is what I tried so far. This works for the N=1 case, although I didn't manage to figure out how to generalize it to any N thresholds, let alone how to perhaps implement this more nicely:

df = df.sort_values(['col1','col2'], ascending=[True,False]).reset_index(drop=True)
for idx, row in df.iloc[1:,:].iterrows():
    if df.loc[idx,'col1'] == df.loc[idx-1,'col1'] and df.loc[idx,'col2']=='' and df.loc[idx-1,'col2']!='':
        df.loc[idx,'col2'] = df.loc[idx-1,'col2']

Code to reproduce dataframe:

df = pd.DataFrame(
    [['a','A'],['a',''],['a','A'],['a',''],['b',''],['b',''],['c','A'],['c','']], 
    columns=['col1','col2']
)

CodePudding user response：

Is this what you're looking for?

N = 1
value = 'A'

df.loc[df.groupby('col1')['col2'].transform(lambda x: sum(x == value) >= N), 'col2'] = value

print(df)

  col1 col2
0    a    A
1    a    A
2    a    A
3    a    A
4    b     
5    b     
6    c    A
7    c    A

... then with N = 2 ...

  col1 col2
0    a    A
1    a    A
2    a    A
3    a    A
4    b     
5    b     
6    c    A
7    c

CodePudding user response：

Here's a solution that avoids lambda functions:

N = 1
V = 'A'

df['col2'] = df['col1'].map(df.groupby('col1')['col2'].value_counts().swaplevel()[V].ge(N).map({True:V})).fillna('')

Output:

>>> df
  col1 col2
0    a    A
1    a    A
2    a    A
3    a    A
4    b     
5    b     
6    c    A
7    c    A

CodePudding user response：

Here's a solution that operates on numpy arrays:

def fill_col2(df, N=1):
    df_numpy = df.to_numpy()
    for val in np.unique(df_numpy[:,0]):
        if np.sum(df_numpy[df_numpy[:,0]==val, 1]=='A') >= N:
            df_numpy[df_numpy[:,0]==val, 1] = 'A'
    return df_numpy[:,1]

df['col2'] = fill_col2(df[['col1', 'col2']])

Output:

  col1 col2
0    a    A
1    a    A
2    a    A
3    a    A
4    b     
5    b     
6    c    A
7    c    A