I have seen a lot of similar questions, but none of them seem to work for my case. I'm pretty sure this is just a groupby transform, but I keep getting KeyError along with axis issues. I am trying to group by filename and count the rows where pred != gt.
For example, index 2 is the only mismatch for f1.wav, so 1, and indices (13, 14, 18) for f2.wav, so 3.
import pandas as pd
df = pd.DataFrame([{'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 2, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f1.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 0, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 0, 'gt': 2, 'filename': 'f2.wav'}, {'pred': 2, 'gt': 0, 'filename': 'f2.wav'}])
pred gt filename
0 0 0 f1.wav
1 0 0 f1.wav
2 2 0 f1.wav
3 0 0 f1.wav
4 0 0 f1.wav
5 0 0 f1.wav
6 0 0 f1.wav
7 0 0 f1.wav
8 0 0 f1.wav
9 0 0 f1.wav
10 0 0 f2.wav
Expected output
pred gt filename counts
0 0 0 f1.wav 1
1 0 0 f1.wav 1
2 2 0 f1.wav 1
3 0 0 f1.wav 1
4 0 0 f1.wav 1
5 0 0 f1.wav 1
6 0 0 f1.wav 1
7 0 0 f1.wav 1
8 0 0 f1.wav 1
9 0 0 f1.wav 1
10 0 0 f2.wav 3
11 0 0 f2.wav 3
12 2 2 f2.wav 3
13 0 2 f2.wav 3
14 0 2 f2.wav 3
15 0 0 f2.wav 3
16 0 0 f2.wav 3
17 2 2 f2.wav 3
18 0 2 f2.wav 3
19 2 0 f2.wav 3
I was thinking
df.groupby('filename').transform(lambda x: x['pred'].ne(x['gt']).sum(), axis=1)
but I get TypeError: Transform function invalid for data types
CodePudding user response:
.transform operates on each column individually, so you won't be able to access both 'pred' and 'gt' in a transform operation.
This leaves you with 2 options:
- aggregate, then reindex or join back to the original shape
- pre-compute the boolean array and .transform on that
Approach 2 will probably be the fastest here:
df['counts'] = (
(df['pred'] != df['gt'])
.groupby(df['filename']).transform('sum')
)
print(df)
pred gt filename counts
0 0 0 f1.wav 1
1 0 0 f1.wav 1
2 2 0 f1.wav 1
3 0 0 f1.wav 1
4 0 0 f1.wav 1
5 0 0 f1.wav 1
6 0 0 f1.wav 1
7 0 0 f1.wav 1
8 0 0 f1.wav 1
9 0 0 f1.wav 1
10 0 0 f2.wav 4
11 0 0 f2.wav 4
12 2 2 f2.wav 4
13 0 2 f2.wav 4
14 0 2 f2.wav 4
15 0 0 f2.wav 4
16 0 0 f2.wav 4
17 2 2 f2.wav 4
18 0 2 f2.wav 4
19 2 0 f2.wav 4
Note that f2.wav has 4 instances where 'pred' != 'gt' (index 13, 14, 18, 19), not 3 as shown in the expected output of the question.
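For completeness, the first option listed above (aggregate, then join back) might look like the sketch below. The small frame and the name per_file are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Small stand-in frame (hypothetical data, same column names as the question).
df = pd.DataFrame({
    "pred":     [0, 2, 0, 0, 0, 2],
    "gt":       [0, 0, 0, 2, 2, 2],
    "filename": ["f1.wav", "f1.wav", "f1.wav", "f2.wav", "f2.wav", "f2.wav"],
})

# Step 1: aggregate to one mismatch count per file.
per_file = (df["pred"] != df["gt"]).groupby(df["filename"]).sum()

# Step 2: broadcast the per-file count back onto every row of that file.
df["counts"] = df["filename"].map(per_file)

print(df)
```

This takes two steps instead of one, but it leaves the per-file summary (per_file) available if it is needed on its own.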
CodePudding user response:
Assuming df is the dataframe OP shares in the question, in order to group by filename and find the rows where pred != gt, one can use pandas.DataFrame.groupby and pandas.core.groupby.GroupBy.apply as follows
df2 = df.groupby('filename').apply(lambda x: x[x['pred'] != x['gt']])
[Out]:
pred gt filename
filename
f1.wav 2 2 0 f1.wav
f2.wav 13 0 2 f2.wav
14 0 2 f2.wav
18 0 2 f2.wav
19 2 0 f2.wav
Assuming one wants to count the number of occurrences for each filename: after the previous operation, filename is both an index level and a column label, which makes grouping by name ambiguous. Since OP wants a column named count that numbers the items in each group, one has to group by level (one of the various parameters groupby accepts) and then use pandas.core.groupby.GroupBy.cumcount. (Note: as opposed to the accepted answer, this approach counts sequentially within each group instead of repeating the group total.)
df2['count'] = df2.groupby(level=0).cumcount() + 1  # The + 1 makes the count start at 1 instead of 0.
[Out]:
pred gt filename count
filename
f1.wav 2 2 0 f1.wav 1
f2.wav 13 0 2 f2.wav 1
14 0 2 f2.wav 2
18 0 2 f2.wav 3
19 2 0 f2.wav 4
A one-liner would look like the following
df2['count'] = df.groupby('filename').apply(lambda x: x[x['pred'] != x['gt']]).groupby(level=0).cumcount() + 1
[Out]:
pred gt filename count
filename
f1.wav 2 2 0 f1.wav 1
f2.wav 13 0 2 f2.wav 1
14 0 2 f2.wav 2
18 0 2 f2.wav 3
19 2 0 f2.wav 4
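To make the "counts sequentially" remark concrete, here is a small sketch contrasting cumcount with a transform that repeats the group total. The frame mism and the column names running and total are hypothetical, and filename is kept as a plain column here rather than an index level to keep the example short.

```python
import pandas as pd

# Hypothetical frame holding only the mismatched rows.
mism = pd.DataFrame({
    "filename": ["f1.wav", "f2.wav", "f2.wav", "f2.wav", "f2.wav"],
    "pred":     [2, 0, 0, 0, 2],
    "gt":       [0, 2, 2, 2, 0],
})

# Running position within each file: 1 for f1.wav, then 1..4 for f2.wav.
mism["running"] = mism.groupby("filename").cumcount() + 1

# Group total repeated on every row: 1 for f1.wav, 4 for each f2.wav row.
mism["total"] = mism.groupby("filename")["pred"].transform("size")

print(mism)
```

The last row of each group in running equals the value that total repeats, so either column can be used to recover the per-file count.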
If having the count in a separate column is not a requirement, then, taking df2 as the dataframe right after the first operation in this answer (when df2 was created), one can simply use the following, which gives a more high-level overview
df3 = df2.groupby(level=0).count().iloc[:, 0]
[Out]:
filename
f1.wav 1
f2.wav 4
Name: pred, dtype: int64
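One caveat with filtering first and counting afterwards is that a file with no mismatches drops out of the result entirely. The sketch below illustrates this on made-up data that includes a hypothetical f3.wav with zero mismatches; summing the boolean mask per file keeps that file with a count of 0.

```python
import pandas as pd

# Hypothetical data: f3.wav has no mismatches at all.
df = pd.DataFrame({
    "pred":     [0, 2, 0, 0, 1],
    "gt":       [0, 0, 2, 2, 1],
    "filename": ["f1.wav", "f1.wav", "f2.wav", "f2.wav", "f3.wav"],
})

mismatch = df["pred"] != df["gt"]

# Filter first, then count: f3.wav disappears from the result.
filtered_counts = df[mismatch].groupby("filename").size()

# Sum the boolean mask per file: f3.wav is kept with a count of 0.
summed_counts = mismatch.groupby(df["filename"]).sum()

print(filtered_counts)
print(summed_counts)
```

Which behaviour is preferable depends on whether files without errors should appear in the summary.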