I have a dataframe df
and the column cats
is categorical, as:
> df['cats'] = pd.Categorical(df['cats'], categories=['A ', 'A', 'B ', 'B', 'C', '-'], ordered=True)
From df
I got a df1
like the following table:
groups | cats | value_name | ... | |
---|---|---|---|---|
0 | group1 | - | 0.04 | ... |
1 | group1 | A | 0.02 | ... |
2 | group1 | A | 0.12 | ... |
3 | group1 | B | 0.00 | ... |
4 | group1 | B | 0.25 | ... |
5 | group2 | - | 0.00 | ... |
6 | group2 | A | 0.05 | ... |
7 | group2 | A | 0.30 | ... |
8 | group2 | B | 0.09 | ... |
9 | group2 | B | 0.04 | ... |
10 | group3 | - | 0.13 | ... |
... | ... | ... | ... | ... |
Then I used df.pivot()
and changed the MultiIndex
columns to simple columns by renaming them:
> df2 = df1.iloc[:,0:2].set_index('groups').pivot(columns='value_name')
> df2.columns = df2.columns.get_level_values(1)
The resulting dataframe is:
cats | - | A | A | B | B |
---|---|---|---|---|---|
groups | |||||
group1 | 0.04 | 0.02 | 0.12 | 0.00 | 0.25 |
group2 | 0.00 | 0.05 | 0.30 | 0.09 | 0.04 |
group3 | 0.13 | 0.06 | 0.09 | 0.20 | 0.00 |
This is the target dataframe:
cats | A | A | B | B | - |
---|---|---|---|---|---|
groups | |||||
group1 | 0.12 | 0.02 | 0.25 | 0.00 | 0.04 |
group2 | 0.30 | 0.05 | 0.04 | 0.09 | 0.00 |
group3 | 0.09 | 0.06 | 0.00 | 0.20 | 0.13 |
My target dataframe should have ordered columns, because I need it later in plots and the columns will be x-axis. The problem is that the cats
values may not always be complete as the category I defined before. In this case, 'C'
is missing in df1
. I don't want to change the order by hand (slicing dataframes and concatenate them again) because I will generate many similar df
s in loops and the cats
values are not the same.
Follow-up question:
In the process, I had to convert the category
to string
or int
in order to do some calculations, and convert back to category
. Sometimes, however, df['cats'] = pd.Categorical(df['cats'], categories=cats_list, ordered=True)
can't replace the cats
values as it supposed to.
There is another cats
, which is a list of numbers. When I convert it back to category
, some cats
values (not all), that correspond to NaN
or 0
from the calculated columns, become NaN
. (It doesn't matter if it has long format or wide format. I'll just use the wide format here as an example.)
Before:
cats | 0 | 10 | 20 | 50 | 100 |
---|---|---|---|---|---|
groups | |||||
group1 | |||||
group2 |
After:
cats | 0 | 10 | 20 | 50 | NaN |
---|---|---|---|---|---|
groups | |||||
group1 | |||||
group2 |
Does anyone know what might cause this?
CodePudding user response:
For me working DataFrame.pivot
with DataFrame.sort_index
for correct ordering columns:
df['cats'] = pd.Categorical(df['cats'],
categories=['A ', 'A', 'B ', 'B', 'C', '-'],
ordered=True)
df2 = df.pivot('groups','cats','value_name').sort_index(axis=1)
print (df2)
cats A A B B -
groups
group1 0.12 0.02 0.25 0.00 0.04
group2 0.30 0.05 0.04 0.09 0.00
group3 NaN NaN NaN NaN 0.13
CodePudding user response:
Your pivot
seems incorrect, you should fix the parameters and combine with sort_index
on axis=1
:
df2 = (df
.pivot(index='groups', columns='cats', values='value_name')
.sort_index(axis=1)
)
output:
cats A A B B -
groups
group1 0.12 0.02 0.25 0.00 0.04
group2 0.30 0.05 0.04 0.09 0.00
group3 NaN NaN NaN NaN 0.13
You can check that you have an ordered CategoricalIndex as column:
df2.columns
CategoricalIndex(['A ', 'A', 'B ', 'B', '-'],
categories=['A ', 'A', 'B ', 'B', 'C', '-'],
ordered=True, dtype='category', name='cats')