I have a dataframe like:
import pandas as pd
lst = [["High", "A"], ["High", "A"], ["High", "B"], ["Medium", "A"], ["Medium", "B"], ["Medium", "C"]]
df = pd.DataFrame(lst, columns=["Class", "Grade"])
I need to get the mode (majority vote) of "Grade" in each "Class". If it's a tie vote, assign "x".
Below is what I expect to get:
Class | Grade | Majority_vote |
---|---|---|
High | A | A |
High | A | A |
High | B | A |
Medium | A | x |
Medium | B | x |
Medium | C | x |
This is my code:
df['majority_vote'] = df.groupby(['Class'])['Grade'].transform(lambda x: x.mode()[0])
I expected the code to return NaN when there is a tie, so that I could replace NaN with "x" afterwards.
However, what I get is below:
Class | Grade | Majority_vote |
---|---|---|
High | A | A |
High | A | A |
High | B | A |
Medium | A | A |
Medium | B | A |
Medium | C | A |
For class "Medium", the code returns the first element ("A") instead of NaN.
Any other method is appreciated. Could you please help me? Thank you in advance.
CodePudding user response:
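The issue with using x.mode()[0] is that mode() returns every value tied for the highest count, in sorted order. pd.Series(['A', 'B', 'C']).mode() evaluates to ['A', 'B', 'C'], so [0] picks "A" even though there is no majority. Meanwhile, pd.Series(['A', 'A', 'B']).mode() evaluates to ['A'].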
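To see this behavior in isolation, here is a small sketch using the same two example Series. The variable names `tied` and `clear` are only for illustration.

```python
import pandas as pd

# Three values, each appearing once: all three are modes (a tie).
tied = pd.Series(["A", "B", "C"]).mode()

# "A" appears twice, so it is the only mode.
clear = pd.Series(["A", "A", "B"]).mode()

print(tied.tolist())   # ['A', 'B', 'C']
print(clear.tolist())  # ['A']

# Indexing with [0] hides the tie: it just takes the first of the tied values.
print(tied[0])         # A
```

So the number of values returned by mode() is what distinguishes a tie from a clear majority.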
Here is a function that will return the mode (if there is only one) and "x" if there is a tie (i.e., multiple modes).
import pandas as pd
lst = [["High", "A"], ["High", "A"], ["High", "B"],["Medium", "A"], ["Medium", "B"], ["Medium", "C"]]
df = pd.DataFrame(lst, columns=["Class", "Grade"])
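def get_mode_or_x(series):
    # mode() returns every value tied for the highest count
    mode = series.mode()
    if mode.size == 1:
        # exactly one mode: a clear majority
        return mode[0]
    # more than one mode: a tie
    return "x"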
df.loc[:, "majority_vote"] = df.groupby("Class")["Grade"].transform(get_mode_or_x)
index | Class | Grade | majority_vote |
---|---|---|---|
0 | High | A | A |
1 | High | A | A |
2 | High | B | A |
3 | Medium | A | x |
4 | Medium | B | x |
5 | Medium | C | x |
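Since the question asks about other methods, here is one possible alternative, offered as a sketch rather than a recommendation. It counts the grades per class with value_counts(), computes one vote per class with agg(), and maps that result back onto the rows. The helper name `vote` and the intermediate `per_class` are made up for this example.

```python
import pandas as pd

lst = [["High", "A"], ["High", "A"], ["High", "B"],
       ["Medium", "A"], ["Medium", "B"], ["Medium", "C"]]
df = pd.DataFrame(lst, columns=["Class", "Grade"])

def vote(series):
    # Count how often each grade occurs in this class.
    counts = series.value_counts()
    # Keep only the grades that share the highest count.
    top = counts[counts == counts.max()]
    # One winner: return it. Several winners: it is a tie.
    return top.index[0] if len(top) == 1 else "x"

# One result per class, indexed by Class.
per_class = df.groupby("Class")["Grade"].agg(vote)

# Broadcast the per-class result back to every row.
df["majority_vote"] = df["Class"].map(per_class)
print(df)
```

This keeps the per-class result around in `per_class`, which may be handy if you also want a summary with one row per class.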