Pandas : Select rows from dataframe having equal value_counts of a specific column

Time:08-11

I have the following dataframe:

df

    Place  Target
1    A       0
2    B       0
3    C       1
4    B       0
5    F       1
6    Z       0

df['Target'].value_counts()

0   4
1   2

What I want is a new df in which every value of the column has the same count, equal to that of the minority value (here, 2).

One desired df would look like:

df2

    Place   Target
1     A       0
2     B       0
3     C       1
5     F       1

df2['Target'].value_counts()

0   2
1   2

Note that the selection (or suppression) process can be done randomly. Thank you for your help!

CodePudding user response:

Here's a solution using groupby and head:

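import pandas as pd

df = pd.DataFrame({'Place': ['A', 'B', 'C', 'B', 'F', 'Z'], 'Target': [0, 0, 1, 0, 1, 0]})

# Size of the smallest class in the Target column
v_counts = df.Target.value_counts()
minimum = v_counts.min()

# Keep the first `minimum` rows of each Target group (original row order is preserved)
df2 = df.groupby('Target').head(minimum)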

Output:

  Place  Target
0     A       0
1     B       0
2     C       1
4     F       1
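
Note that head always keeps the first rows of each group, so the selection above is deterministic. Since the question allows a random selection, here is a minimal sketch of a random variant, assuming pandas 1.1 or later (where groupby objects have a sample method). The random_state value is an arbitrary seed chosen only to make the illustration reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    'Place': ['A', 'B', 'C', 'B', 'F', 'Z'],
    'Target': [0, 0, 1, 0, 1, 0],
})

# Size of the smallest class in the Target column
minimum = int(df['Target'].value_counts().min())

# Draw `minimum` rows at random (without replacement) from each Target group.
# random_state=0 is an arbitrary seed; drop it to get a different draw each run.
df2 = df.groupby('Target').sample(n=minimum, random_state=0)

print(df2['Target'].value_counts())
```

Every class ends up with `minimum` rows, but which rows of the majority class survive depends on the seed.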

CodePudding user response:

A bit late to the party, but here's an alternative solution using random undersampling:

import numpy as np
import pandas as pd

def random_undersampling(df: pd.DataFrame, target_column: str) -> pd.DataFrame:
    # Count how many rows each value of the target column has
    value_counts = df[target_column].value_counts()
    # Size of the least represented value (minority class)
    minority_size = value_counts.min()

    rng = np.random.default_rng()
    index_keep = np.array([
        # Choose `minority_size` index labels at random for each value.
        # Note: `replace=False` guarantees that no row is picked twice.
        rng.choice(
            a=df[df[target_column] == value].index,
            size=minority_size,
            replace=False
        )
        for value in value_counts.index
    ]).flatten()  # Merge the per-value selections into one 1-D array

    # The chosen values are index labels, not positions, so select with `.loc`
    return df.loc[index_keep]