How to change the column order according to the category list?-CodePudding

I have a dataframe df and the column cats is categorical, as:

> df['cats'] = pd.Categorical(df['cats'], categories=['A ', 'A', 'B ', 'B', 'C', '-'], ordered=True)

From df I got a df1 like the following table:

	groups	cats	value_name	...
0	group1	-	0.04	...
1	group1	A	0.02	...
2	group1	A	0.12	...
3	group1	B	0.00	...
4	group1	B	0.25	...
5	group2	-	0.00	...
6	group2	A	0.05	...
7	group2	A	0.30	...
8	group2	B	0.09	...
9	group2	B	0.04	...
10	group3	-	0.13	...
...	...	...	...	...

Then I used df.pivot() and changed the MultiIndex columns to simple columns by renaming them:

> df2 = df1.iloc[:,0:2].set_index('groups').pivot(columns='value_name')
> df2.columns = df2.columns.get_level_values(1)

The resulting dataframe is:

cats	-	A	A	B	B
groups
group1	0.04	0.02	0.12	0.00	0.25
group2	0.00	0.05	0.30	0.09	0.04
group3	0.13	0.06	0.09	0.20	0.00

This is the target dataframe:

cats	A	A	B	B	-
groups
group1	0.12	0.02	0.25	0.00	0.04
group2	0.30	0.05	0.04	0.09	0.00
group3	0.09	0.06	0.00	0.20	0.13

My target dataframe should have ordered columns, because I need it later in plots and the columns will be x-axis. The problem is that the cats values may not always be complete as the category I defined before. In this case, 'C' is missing in df1. I don't want to change the order by hand (slicing dataframes and concatenate them again) because I will generate many similar dfs in loops and the cats values are not the same.

Follow-up question:

In the process, I had to convert the category to string or int in order to do some calculations, and convert back to category. Sometimes, however, df['cats'] = pd.Categorical(df['cats'], categories=cats_list, ordered=True) can't replace the cats values as it supposed to.

There is another cats, which is a list of numbers. When I convert it back to category, some cats values (not all), that correspond to NaN or 0 from the calculated columns, become NaN. (It doesn't matter if it has long format or wide format. I'll just use the wide format here as an example.)

Before:

cats	0	10	20	50	100
groups
group1
group2

After:

cats	0	10	20	50	NaN
groups
group1
group2

Does anyone know what might cause this?

CodePudding user response：

For me working DataFrame.pivot with DataFrame.sort_index for correct ordering columns:

df['cats'] = pd.Categorical(df['cats'], 
                            categories=['A ', 'A', 'B ', 'B', 'C', '-'], 
                            ordered=True)

df2 = df.pivot('groups','cats','value_name').sort_index(axis=1)
print (df2)
cats      A      A    B      B     -
groups                              
group1  0.12  0.02  0.25  0.00  0.04
group2  0.30  0.05  0.04  0.09  0.00
group3   NaN   NaN   NaN   NaN  0.13

CodePudding user response：

Your pivot seems incorrect, you should fix the parameters and combine with sort_index on axis=1:

df2 = (df
 .pivot(index='groups', columns='cats', values='value_name')
 .sort_index(axis=1)
)

output:

cats      A      A    B      B     -
groups                              
group1  0.12  0.02  0.25  0.00  0.04
group2  0.30  0.05  0.04  0.09  0.00
group3   NaN   NaN   NaN   NaN  0.13

You can check that you have an ordered CategoricalIndex as column:

df2.columns
CategoricalIndex(['A ', 'A', 'B ', 'B', '-'],
                 categories=['A ', 'A', 'B ', 'B', 'C', '-'],
                 ordered=True, dtype='category', name='cats')