Given a DataFrame df
as follows:
id target1 target2 value1 value2 value3
0 1 1 -1 NaN 0.0 NaN
1 2 -1 1 1.0 1.0 1.0
2 3 1 -1 1.0 -1.0 -1.0
3 4 1 -1 1.0 1.0 1.0
4 5 -1 -1 -1.0 1.0 1.0
5 6 -1 -1 1.0 -1.0 -1.0
6 7 -1 1 -1.0 NaN 1.0
Let's say for columns target1
and target2
, I hope to know which value columns (value1, value2, value3, value4, etc.) have more same values with them respectively. Please note if value
column's values are NaN
s, it will not be included in the comparison.
For example, for target1-value1
, we find id is 3, 4, 5 and 7, they have same values, so the ratio is 4/6 = 0.666667
(id==1
will not be included since value1
is NaN
for that cell). Same logic for other columns.
So my question is how could I get the expected result below with Pandas?
value_cols target1 target2
0 value1 0.666667 0.333333
1 value2 0.333333 0.500000
2 value3 0.333333 0.666667
CodePudding user response:
You could use apply
. The general idea is to iterate over the "target" columns and count the number of matches with "value" columns (with eq
on axis sum
); then find the ratio by dividing the totals found with the number of non-NaN rows for each "value" column:
val_cols = ['value1','value2','value3']
totals = df[val_cols].notna().sum()
out = df[['target1','target2']].apply(lambda col: df[val_cols].eq(col, axis=0).sum() / totals)
# out.rename_axis('value_cols').reset_index()
Output:
target1 target2
value1 0.666667 0.333333
value2 0.333333 0.500000
value3 0.333333 0.666667