How to find categorical data where one category (including NaN) represents at least 80% of all categ-CodePudding

I have Pandas DataFrame in Python like below:

COL1	COL2	COL3
ABC	11	NaN
NaN	10	NaN
ABC	11	NaN
ABC	11	NaN
DDD	12	NaN
ABC	NaN	GAME

And I need to create list of variables, where one category represents >= 80% of all categories of a given categorical variable. So I need to:

Select only categorical variables
Make value_counts(dropna=False), because I need to include as categories also missing variables
Create list of variables from above DataFrame where one category represents >= 80% of all categories of a given categorical variable

So, as a result I need something like: my_list = ["COL3"], because --> (5xNaN) / 6 rows = 0.83

How can I do that in Python Pandas ?

CodePudding user response：

# Import pandas
import pandas as pd

# Load the data
df = pd.DataFrame({'COL1': ['ABC', 'NaN', 'ABC', 'ABC', 'DDD', 'ABC'],
                   'COL2': [11, 10, 11, 11, 12, 'NaN'],
                   'COL3': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'GAME']})

# Initialize the list to store the variables
my_list = []

# Iterate over the categorical variables
for cat_col in df.select_dtypes(include='object'):
    # Get the value counts for the variable, including missing values
    counts = df[cat_col].value_counts(dropna=False)
    # Calculate the proportion of the most common category
    prop = counts.iloc[0] / counts.sum()
    # If the proportion is greater than or equal to 0.8, add the variable to the list
    if prop >= 0.8:
        my_list.append(cat_col)

# Print the list of variables
print(my_list)

CodePudding user response：

While @Jan answer works, I would like to propose a pandas implementation avoiding using a for loop:

#Selects the categories
categories = df.select_dtypes(include='object')
#Fill the nan values with a placeholder 
#(assuming the nans are np.nans and not string representation of nan)
categories = categories.fillna("NO DATA")
#Describe the df to compute the frequencies
description = categories.describe().transpose()
#Select only the ones with highest freq superior to 0.8
my_list = description[description["freq"] > description["count"]*0.8].index.values

Output (in my_list):

array(['COL3'], dtype=object)

CodePudding user response：

If you would like to use a one-liner, you can apply a lambda function:

df = pd.DataFrame({'COL1': ['ABC', 'NaN', 'ABC', 'ABC', 'DDD', 'ABC'],
               'COL2': [11, 10, 11, 11, 12, 'NaN'],
               'COL3': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'GAME']})

df.apply(lambda x: (100*x.value_counts(dropna=False).iloc[0]/x.shape[0])).to_frame('% One Value')