Pandas dataframe - Set boolean column based on groupby count result-CodePudding

I have a dataframe like the following, multi-indexed on filename and number and containing a column with a label name.

arr = [['filename1', 1, 'label1'],
       ['filename1', 2, 'label1'],
       ['filename1', 3, 'label2'],
       ['filename2', 1, 'label2'],
       ['filename2', 2, 'label2'],
       ['filename2', 3, 'label1']]

df = pd.DataFrame(arr, columns=['filename', 'number', 'label']).set_index(['filename', 'number'])

All I want to do is to create another boolean column that is True if the number of occurrences of label2 within the same filename is greater or equal to 2.

In this case, the expected output would be:

                       label    selected
filename    number  
filename1   1          label1   False  # only 1x label2 in filename1
            2          label1   False
            3          label2   False
filename2   1          label2   True   # 2x label2 in filename2
            2          label2   True
            3          label1   True

What I managed so far: I got the list of filenames containing at least 2 times label2 with the following:

selected_filenames = df.loc[df['label'] == 'label2', 'label'].groupby(level=0).count().ge(2)

...but I didn't manage to apply the result from groupby().count().ge() back to the original dataframe.

I had also a look at pd.transform but I don't want to use any transformation function, just broadcast the groupby result as is to all the rows.

CodePudding user response：

Let us try transform with join

out = df.join(df.eq('label2')['label'].groupby(level=0).transform('sum').ge(2).to_frame('selected'))
Out[350]: 
                   label  selected
filename  number                  
filename1 1       label1     False
          2       label1     False
          3       label2     False
filename2 1       label2      True
          2       label2      True
          3       label1      True

CodePudding user response：

Starting from your command, you can just select the rows with the "selected" filenames and assign True (this will broadcast to all rows), then fillna the others with False:

s = df.loc[df['label'] == 'label2', 'label'].groupby(level=0).count().ge(2)
df.loc[s[s].index, 'selected'] = True
df['selected'] = df['selected'].fillna(False)

output:

                   label  selected
filename  number                  
filename1 1       label1     False
          2       label1     False
          3       label2     False
filename2 1       label2      True
          2       label2      True
          3       label1      True