Pandas Dataframe create columns by grouping cells by values-CodePudding

I have the following problem. My DataFrame looks like this (only with 100.000 entries):

col_1      col_2        col_3
green      yellow       red
yellow     green        purple
green      yellow       red
yellow     brown        green
red        yellow       purple
red        green        yellow

What I want though, is that all the greens are in one column, all the reds and all the yellows, etc. So it should look like this:

col_1      col_2        col_3      col_4      col_5 
green      yellow       red
green      yellow                  purple
green      yellow       red
green      yellow                             brown
           yellow       red        purple
green      yellow       red

How do I do this? Thanks in advance.

CodePudding user response：

Here's one approach: With get_dummies convert it to one-hot encoded columns; sum across the columns and use np.where to populate the DataFrame with column names. Finally, fix the column names:

s = pd.get_dummies(df)
s.columns = [c.split('_')[-1] for c in s.columns]
s = s.groupby(level=0, axis=1).sum()
out = s.apply(lambda c: np.where(c, c.name, '')).rename(columns=dict(zip(s.columns, ['col5','col1','col4','col3','col2']))).sort_index(axis=1)

Output:

    col1    col2 col3    col4   col5
0  green  yellow  red               
1  green  yellow       purple       
2  green  yellow  red               
3  green  yellow               brown
4         yellow  red  purple       
5  green  yellow  red

CodePudding user response：

Here is one approach with pandas.get_dummies or str.get_dummies:

# credit https://stackoverflow.com/a/71143503
df2 = df.apply('|'.join, axis=1).str.get_dummies()
out = df2*df2.columns

df2 = (
 df.apply(lambda c: pd.get_dummies(c).stack())
   .max(1)
   .unstack()
   .astype(int)
)
out = df2*df2.columns

output:

   brown  green  purple  red  yellow
0         green          red  yellow
1         green  purple       yellow
2         green          red  yellow
3  brown  green               yellow
4                purple  red  yellow
5         green          red  yellow

alternative output:

df2 = df.apply('|'.join, axis=1).str.get_dummies()

out = df2*df2.columns

out.columns = [f'col_{i}' for i,_ in enumerate(out, start=1)]

output:

   col_1  col_2   col_3 col_4   col_5
0         green           red  yellow
1         green  purple        yellow
2         green           red  yellow
3  brown  green                yellow
4                purple   red  yellow
5         green           red  yellow