Is it possible to compare values on the same dataframe and set a new column to count the repetitions with pandas?
From:
id | column1 | column2
0 | a | a
1 | b | b
2 | a | c
3 | c | c
4 | b | b
To:
id | column1 | column2 | count
0 | a | a | 1
1 | b | b | 2
2 | a | c | 0
3 | c | c | 1
4 | b | b | 2
CodePudding user response:
You can use filter
to get the columns of interest, identify the rows with all identical values and use it to create a custom group for a groupby.transform('size')
:
# filter columns in "column"
df2 = df.filter(like='column')
# keep rows with identical columns
s = df2.iloc[:, 0].where(df2.eq(s, axis=0).all(axis=1))
# groupby unique value and get the count
df['count'] = df.groupby(s).transform('size').fillna(0, downcast='infer')
NB. this is working with an arbitrary number of columns
If you really have only two columns, you can simplify to:
m = df['column1'].eq(df['column2'])
df['count'] = (m.groupby(df['column1'].where(m))
.transform('size').fillna(0, downcast='infer')
)
Output:
id column1 column2 count
0 0 a a 1
1 1 b b 2
2 2 a c 0
3 3 c c 1
4 4 b b 2
CodePudding user response:
Compare both columns to helper m
column and use GroupBy.transform
with sum
:
df['count'] = (df.assign(m=df.column1.eq(df.column2))
.groupby(['column1', 'column2'])['m']
.transform('sum'))
print (df)
id column1 column2 count
0 0 a a 1
1 1 b b 2
2 2 a c 0
3 3 c c 1
4 4 b b 2