I have the following dataframe:
df
Place Target
1 A 0
2 B 0
3 C 1
4 B 0
5 F 1
6 Z 0
df['Target'].value_counts()
0 4
1 2
What I want is a new df in which every value of the Target column has the same value_counts, equal to the count of the minority value (here, 2).
One desired df would look like:
df2
Place Target
1 A 0
2 B 0
3 C 1
5 F 1
df2['Target'].value_counts()
0 2
1 2
Note that the selection (or suppression) process can be done randomly. Thank you for your help!
CodePudding user response:
Here's a solution using groupby and head:
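import pandas as pd

df = pd.DataFrame({'Place': ['A', 'B', 'C', 'B', 'F', 'Z'], 'Target': [0, 0, 1, 0, 1, 0]})
v_counts = df.Target.value_counts()
minimum = min(v_counts)
# head(n) keeps the first n rows of each group, so this selection is deterministic, not random
df2 = df.groupby('Target').head(minimum)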
Output:
Place Target
0 A 0
1 B 0
2 C 1
4 F 1
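Since the question allows the selection to be random, the same groupby idea can be sketched with `sample` instead of `head`. This is a minimal variant, assuming pandas 1.1 or newer (where `DataFrameGroupBy.sample` is available); the `random_state=0` seed is an arbitrary choice made only so the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame({'Place': ['A', 'B', 'C', 'B', 'F', 'Z'],
                   'Target': [0, 0, 1, 0, 1, 0]})

# Size of the smallest class in the Target column
minimum = int(df['Target'].value_counts().min())

# Draw `minimum` random rows from each Target group (without replacement),
# then restore the original row order
df2 = (
    df.groupby('Target')
      .sample(n=minimum, random_state=0)  # seed is an assumption, for reproducibility
      .sort_index()
)

print(df2['Target'].value_counts())
```

Dropping `random_state` gives a different balanced subset on each run.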
CodePudding user response:
A bit late to the party, but here's an alternative solution using random undersampling:
import numpy as np
import pandas as pd


def random_undersampling(df: pd.DataFrame, target_column: str) -> pd.DataFrame:
    # Count the occurrences of each value in the target column
    value_counts = df[target_column].value_counts()
    # Size of the least represented value (minority class)
    minority_size = value_counts.min()
    rng = np.random.default_rng()
    index_keep = np.array([
        # Choose random index labels among the rows holding this value.
        # Note: `replace=False` ensures that no row is drawn twice
        rng.choice(
            a=df[df[target_column] == value].index,
            size=minority_size,
            replace=False
        )
        for value in value_counts.index
    ]).flatten()  # Flatten the per-class arrays into one 1-D array of labels
    # Select by label (`.loc`), since `index_keep` holds index labels, not positions:
    return df.loc[index_keep]
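One detail worth spelling out for this approach: the arrays returned by `choice` hold index labels, so a label-based lookup is what matches them when the index is not 0..n-1, as with the question's frame (indexed 1 to 6). Below is a minimal self-contained sketch of that case. The fixed seed is an assumption made only for reproducibility, and `np.concatenate` is used in place of `flatten` as an equivalent way to join the per-class arrays.

```python
import numpy as np
import pandas as pd

# The question's frame, indexed 1..6 rather than 0..5
df = pd.DataFrame(
    {'Place': ['A', 'B', 'C', 'B', 'F', 'Z'], 'Target': [0, 0, 1, 0, 1, 0]},
    index=range(1, 7),
)

rng = np.random.default_rng(0)  # seed is an assumption, for reproducibility
minority_size = int(df['Target'].value_counts().min())

# Draw index *labels* for each class, without replacement
labels = np.concatenate([
    rng.choice(group.index, size=minority_size, replace=False)
    for _, group in df.groupby('Target')
])

# Label-based lookup: the drawn values are labels, not positions
balanced = df.loc[labels]

print(balanced['Target'].value_counts())
```

With a positional lookup, a drawn label such as 6 would be out of bounds on this six-row frame, and smaller labels would silently pick the wrong rows.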