I have Pandas DataFrame in Python like below:
COL1 | COL2 | COL3 |
---|---|---|
ABC | 11 | NaN |
NaN | 10 | NaN |
ABC | 11 | NaN |
ABC | 11 | NaN |
DDD | 12 | NaN |
ABC | NaN | GAME |
And I need to create list of variables, where one category represents >= 80% of all categories of a given categorical variable. So I need to:
- Select only categorical variables
- Make value_counts(dropna=False), because I need to include as categories also missing variables
- Create list of variables from above DataFrame where one category represents >= 80% of all categories of a given categorical variable
So, as a result I need something like: my_list = ["COL3"]
, because --> (5xNaN) / 6 rows = 0.83
How can I do that in Python Pandas ?
CodePudding user response:
# Import pandas
import pandas as pd
# Load the data
df = pd.DataFrame({'COL1': ['ABC', 'NaN', 'ABC', 'ABC', 'DDD', 'ABC'],
'COL2': [11, 10, 11, 11, 12, 'NaN'],
'COL3': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'GAME']})
# Initialize the list to store the variables
my_list = []
# Iterate over the categorical variables
for cat_col in df.select_dtypes(include='object'):
# Get the value counts for the variable, including missing values
counts = df[cat_col].value_counts(dropna=False)
# Calculate the proportion of the most common category
prop = counts.iloc[0] / counts.sum()
# If the proportion is greater than or equal to 0.8, add the variable to the list
if prop >= 0.8:
my_list.append(cat_col)
# Print the list of variables
print(my_list)
CodePudding user response:
While @Jan answer works, I would like to propose a pandas implementation avoiding using a for loop:
#Selects the categories
categories = df.select_dtypes(include='object')
#Fill the nan values with a placeholder
#(assuming the nans are np.nans and not string representation of nan)
categories = categories.fillna("NO DATA")
#Describe the df to compute the frequencies
description = categories.describe().transpose()
#Select only the ones with highest freq superior to 0.8
my_list = description[description["freq"] > description["count"]*0.8].index.values
Output (in my_list):
array(['COL3'], dtype=object)
CodePudding user response:
If you would like to use a one-liner, you can apply a lambda function:
df = pd.DataFrame({'COL1': ['ABC', 'NaN', 'ABC', 'ABC', 'DDD', 'ABC'],
'COL2': [11, 10, 11, 11, 12, 'NaN'],
'COL3': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'GAME']})
df.apply(lambda x: (100*x.value_counts(dropna=False).iloc[0]/x.shape[0])).to_frame('% One Value')