Calculate the ratio of specific columns with the same values as other columns in Pandas-CodePudding

Given a DataFrame df as follows:

   id  target1  target2  value1  value2  value3
0   1        1       -1     NaN     0.0     NaN
1   2       -1        1     1.0     1.0     1.0
2   3        1       -1     1.0    -1.0    -1.0
3   4        1       -1     1.0     1.0     1.0
4   5       -1       -1    -1.0     1.0     1.0
5   6       -1       -1     1.0    -1.0    -1.0
6   7       -1        1    -1.0     NaN     1.0

Let's say for columns target1 and target2, I hope to know which value columns (value1, value2, value3, value4, etc.) have more same values with them respectively. Please note if value column's values are NaNs, it will not be included in the comparison.

For example, for target1-value1, we find id is 3, 4, 5 and 7, they have same values, so the ratio is 4/6 = 0.666667 (id==1 will not be included since value1 is NaN for that cell). Same logic for other columns.

So my question is how could I get the expected result below with Pandas?

  value_cols   target1   target2
0     value1  0.666667  0.333333
1     value2  0.333333  0.500000
2     value3  0.333333  0.666667

CodePudding user response：

You could use apply. The general idea is to iterate over the "target" columns and count the number of matches with "value" columns (with eq on axis sum); then find the ratio by dividing the totals found with the number of non-NaN rows for each "value" column:

val_cols = ['value1','value2','value3']
totals = df[val_cols].notna().sum()
out = df[['target1','target2']].apply(lambda col: df[val_cols].eq(col, axis=0).sum() / totals)

# out.rename_axis('value_cols').reset_index()

Output:

         target1   target2
value1  0.666667  0.333333
value2  0.333333  0.500000
value3  0.333333  0.666667