I have a dataframe like the following, multi-indexed on filename and number and containing a column with a label name.
arr = [['filename1', 1, 'label1'],
['filename1', 2, 'label1'],
['filename1', 3, 'label2'],
['filename2', 1, 'label2'],
['filename2', 2, 'label2'],
['filename2', 3, 'label1']]
df = pd.DataFrame(arr, columns=['filename', 'number', 'label']).set_index(['filename', 'number'])
All I want to do is to create another boolean column that is True
if the number of occurrences of label2
within the same filename is greater or equal to 2.
In this case, the expected output would be:
label selected
filename number
filename1 1 label1 False # only 1x label2 in filename1
2 label1 False
3 label2 False
filename2 1 label2 True # 2x label2 in filename2
2 label2 True
3 label1 True
What I managed so far: I got the list of filenames containing at least 2 times label2
with the following:
selected_filenames = df.loc[df['label'] == 'label2', 'label'].groupby(level=0).count().ge(2)
...but I didn't manage to apply the result from groupby().count().ge()
back to the original dataframe.
I had also a look at pd.transform but I don't want to use any transformation function, just broadcast the groupby result as is to all the rows.
CodePudding user response:
Let us try transform
with join
out = df.join(df.eq('label2')['label'].groupby(level=0).transform('sum').ge(2).to_frame('selected'))
Out[350]:
label selected
filename number
filename1 1 label1 False
2 label1 False
3 label2 False
filename2 1 label2 True
2 label2 True
3 label1 True
CodePudding user response:
Starting from your command, you can just select the rows with the "selected" filenames and assign True (this will broadcast to all rows), then fillna
the others with False:
s = df.loc[df['label'] == 'label2', 'label'].groupby(level=0).count().ge(2)
df.loc[s[s].index, 'selected'] = True
df['selected'] = df['selected'].fillna(False)
output:
label selected
filename number
filename1 1 label1 False
2 label1 False
3 label2 False
filename2 1 label2 True
2 label2 True
3 label1 True