I have DataFrame like below:
COL1 | COL2 | COL3 | ... | COLn
-----|------|------|------|----
111 | A | Y | ... | ...
222 | A | Y | ... | ...
333 | B | Z | ... | ...
444 | C | Z | ... | ...
555 | D | P | ... | ...
And I need to do dummy coding (pandas.get_dummies()
) only for categories which share in variable is at least 40%.
So, in COL2 only category "A" share in variable is at least 40%, in COL3 categories "Y" and "Z" share is at least 40% in variable.
So, as a result I need output like below:
COL1 | COL2_A | COL3_Y | COL_Z | ... | COLn
-------|---------|---------|-------|------|-------
111 | 1 | 1 | 0 | ... | ...
222 | 1 | 1 | 0 | ... | ...
333 | 0 | 0 | 1 | ... | ...
444 | 0 | 0 | 1 | ... | ...
555 | 0 | 0 | 0 | ... | ...
reproducible example of source df below:
df = pd.DataFrame()
df["COL1"] = [111,222,333,444,555]
df["COL2"] = ["A", "A", "B", "C", "D"]
df["COL3"] = ["Y", "Y", "Z", "Z", "P"]
df
CodePudding user response:
Make dummy variables and remove some of them whose occurence ratio is smaller than 0.4 as follows.
import pandas as pd
df = pd.DataFrame()
df["COL1"] = [111,222,333,444,555]
df["COL2"] = ["A", "A", "B", "C", "D"]
df["COL3"] = ["Y", "Y", "Z", "Z", "P"]
origin_cols = df.columns
df = pd.get_dummies(df)
dummy_vars = [col for col in df.columns if (col.split('_')[0] in origin_cols) and ('_' in col)]
cond = df[dummy_vars].sum() / len(df) < 0.4
useless_cols = cond[cond].index
df = df.drop(useless_cols, axis = 1)