I have the following problem. My DataFrame looks like this (only with 100.000 entries):
col_1 col_2 col_3
green yellow red
yellow green purple
green yellow red
yellow brown green
red yellow purple
red green yellow
What I want though, is that all the greens are in one column, all the reds and all the yellows, etc. So it should look like this:
col_1 col_2 col_3 col_4 col_5
green yellow red
green yellow purple
green yellow red
green yellow brown
yellow red purple
green yellow red
How do I do this? Thanks in advance.
CodePudding user response:
Here's one approach: With get_dummies
convert it to one-hot encoded columns; sum across the columns and use np.where
to populate the DataFrame with column names. Finally, fix the column names:
s = pd.get_dummies(df)
s.columns = [c.split('_')[-1] for c in s.columns]
s = s.groupby(level=0, axis=1).sum()
out = s.apply(lambda c: np.where(c, c.name, '')).rename(columns=dict(zip(s.columns, ['col5','col1','col4','col3','col2']))).sort_index(axis=1)
Output:
col1 col2 col3 col4 col5
0 green yellow red
1 green yellow purple
2 green yellow red
3 green yellow brown
4 yellow red purple
5 green yellow red
CodePudding user response:
Here is one approach with pandas.get_dummies
or str.get_dummies
:
# credit https://stackoverflow.com/a/71143503
df2 = df.apply('|'.join, axis=1).str.get_dummies()
out = df2*df2.columns
or
df2 = (
df.apply(lambda c: pd.get_dummies(c).stack())
.max(1)
.unstack()
.astype(int)
)
out = df2*df2.columns
output:
brown green purple red yellow
0 green red yellow
1 green purple yellow
2 green red yellow
3 brown green yellow
4 purple red yellow
5 green red yellow
alternative output:
df2 = df.apply('|'.join, axis=1).str.get_dummies()
out = df2*df2.columns
out.columns = [f'col_{i}' for i,_ in enumerate(out, start=1)]
output:
col_1 col_2 col_3 col_4 col_5
0 green red yellow
1 green purple yellow
2 green red yellow
3 brown green yellow
4 purple red yellow
5 green red yellow