Home > OS >  How to change the column order according to the category list?
How to change the column order according to the category list?

Time:06-30

I have a dataframe df and the column cats is categorical, as:

> df['cats'] = pd.Categorical(df['cats'], categories=['A ', 'A', 'B ', 'B', 'C', '-'], ordered=True)

From df I got a df1 like the following table:

groups cats value_name ...
0 group1 - 0.04 ...
1 group1 A 0.02 ...
2 group1 A 0.12 ...
3 group1 B 0.00 ...
4 group1 B 0.25 ...
5 group2 - 0.00 ...
6 group2 A 0.05 ...
7 group2 A 0.30 ...
8 group2 B 0.09 ...
9 group2 B 0.04 ...
10 group3 - 0.13 ...
... ... ... ... ...

Then I used df.pivot() and changed the MultiIndex columns to simple columns by renaming them:

> df2 = df1.iloc[:,0:2].set_index('groups').pivot(columns='value_name')
> df2.columns = df2.columns.get_level_values(1)

The resulting dataframe is:

cats - A A B B
groups
group1 0.04 0.02 0.12 0.00 0.25
group2 0.00 0.05 0.30 0.09 0.04
group3 0.13 0.06 0.09 0.20 0.00

This is the target dataframe:

cats A A B B -
groups
group1 0.12 0.02 0.25 0.00 0.04
group2 0.30 0.05 0.04 0.09 0.00
group3 0.09 0.06 0.00 0.20 0.13

My target dataframe should have ordered columns, because I need it later in plots and the columns will be x-axis. The problem is that the cats values may not always be complete as the category I defined before. In this case, 'C' is missing in df1. I don't want to change the order by hand (slicing dataframes and concatenate them again) because I will generate many similar dfs in loops and the cats values are not the same.

Follow-up question:

In the process, I had to convert the category to string or int in order to do some calculations, and convert back to category. Sometimes, however, df['cats'] = pd.Categorical(df['cats'], categories=cats_list, ordered=True) can't replace the cats values as it supposed to.

There is another cats, which is a list of numbers. When I convert it back to category, some cats values (not all), that correspond to NaN or 0 from the calculated columns, become NaN. (It doesn't matter if it has long format or wide format. I'll just use the wide format here as an example.)

Before:

cats 0 10 20 50 100
groups
group1
group2

After:

cats 0 10 20 50 NaN
groups
group1
group2

Does anyone know what might cause this?

CodePudding user response:

For me working DataFrame.pivot with DataFrame.sort_index for correct ordering columns:

df['cats'] = pd.Categorical(df['cats'], 
                            categories=['A ', 'A', 'B ', 'B', 'C', '-'], 
                            ordered=True)

df2 = df.pivot('groups','cats','value_name').sort_index(axis=1)
print (df2)
cats      A      A    B      B     -
groups                              
group1  0.12  0.02  0.25  0.00  0.04
group2  0.30  0.05  0.04  0.09  0.00
group3   NaN   NaN   NaN   NaN  0.13

CodePudding user response:

Your pivot seems incorrect, you should fix the parameters and combine with sort_index on axis=1:

df2 = (df
 .pivot(index='groups', columns='cats', values='value_name')
 .sort_index(axis=1)
)

output:

cats      A      A    B      B     -
groups                              
group1  0.12  0.02  0.25  0.00  0.04
group2  0.30  0.05  0.04  0.09  0.00
group3   NaN   NaN   NaN   NaN  0.13

You can check that you have an ordered CategoricalIndex as column:

df2.columns
CategoricalIndex(['A ', 'A', 'B ', 'B', '-'],
                 categories=['A ', 'A', 'B ', 'B', 'C', '-'],
                 ordered=True, dtype='category', name='cats')
  • Related