To build an ensemble model, I want to create a table with all the results of a classification. Then, for each row, I want to count the number of distinct values and find the most frequent value.
Let's say the initial table looks like:
---- -------- -------- --------
| | col1 | col2 | col3 |
|---- -------- -------- --------|
| 0 | 1 | 2 | 3 | <- 3 different values, no most frequent one, take largest (3)
| 1 | 2 | 2 | 2 | <- 1 value, 2 is most frequent
| 2 | 3 | 2 | 2 | <- 2 values, 2 is most frequent
---- -------- -------- --------
If there is no single most frequent value, as in row 0 of this example, it should take the largest one (here, 3).
Final result should look like:
---- -------- -------- -------- -------------------- -----------------
| | col1 | col2 | col3 | different_values | most_frequent |
|---- -------- -------- -------- -------------------- -----------------|
| 0 | 1 | 2 | 3 | 3 | 3 |
| 1 | 2 | 2 | 2 | 1 | 2 |
| 2 | 3 | 2 | 2 | 2 | 2 |
---- -------- -------- -------- -------------------- -----------------
I know how to solve it column by column, but I'm struggling with row by row.
MWE
Data:
import pandas as pd
df = pd.DataFrame({
"col1":[1,2,3],
"col2":[2,2,2],
"col3":[3,2,2]
})
Result:
df["different_values"] = [3,1,2]
df["most_frequent"] = [3, 2, 2]
CodePudding user response:
Check nunique and mode, both with axis=1:
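cols = ["col1", "col2", "col3"]  # restrict to the original columns
# count the distinct values in each row
df["different_values"] = df[cols].nunique(axis=1)
# mode(axis=1) returns one column per tied value (NaN-padded),
# so take the row-wise max to pick the largest on ties
df["most_frequent"] = df[cols].mode(axis=1).max(axis=1).astype(int)
#df[cols].mode(axis=1).min(axis=1) # to get the smallest instead
df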
Out[73]:
col1 col2 col3 different_values most_frequent
0 1 2 3 3 3
1 2 2 2 1 2
2 3 2 2 2 2
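For reference, here is a minimal self-contained sketch of the same idea wrapped in a function. The helper name add_row_stats and its cols parameter are my own illustration, not part of pandas, and the cast back to int assumes the class labels are integers. It relies on mode(axis=1) returning one column per tied value, padded with NaN, so a row-wise max picks the largest value when several are equally frequent.

```python
import pandas as pd


def add_row_stats(df, cols):
    """Return a copy of df with per-row distinct counts and the row mode.

    Ties between equally frequent values are broken by taking the largest.
    The name `add_row_stats` and the `cols` argument are illustrative only.
    """
    out = df.copy()
    values = df[cols]
    # Number of distinct values in each row.
    out["different_values"] = values.nunique(axis=1)
    # One column per tied mode, NaN-padded; max() skips NaN by default.
    modes = values.mode(axis=1)
    # Assumption: labels are integers, so cast back from float.
    out["most_frequent"] = modes.max(axis=1).astype(int)
    return out


df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [2, 2, 2],
    "col3": [3, 2, 2],
})
result = add_row_stats(df, ["col1", "col2", "col3"])
print(result)
```

Passing the column list explicitly keeps the two derived columns out of the row-wise computation, so the function can be re-run on a frame that already has them.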