Home > Back-end >  Is there a good way to turn a pandas column multiindex into a column category?
Is there a good way to turn a pandas column multiindex into a column category?

Time:05-11

Probably easier to understand with an example so here we go:

random = np.random.uniform(size=(3))

what_i_have = pd.DataFrame({ 
    ('a', 'a'): random,
    ('b', 'b1'): np.linspace(3, 5, 3),
    ('b', 'b2'): np.linspace(6, 8, 3),
    ('b', 'b3'): np.linspace(9, 11, 3)
})

what_i_want = pd.DataFrame({
    ('a', 'a'): np.concatenate((random, random, random)),
    ('b', 'b_category'): ['b1']*3   ['b2']*3   ['b3']*3,
    ('b', 'b_value'): np.linspace(3, 11, 9)
})

print(what_i_have)
print('----------------------------------')
print(what_i_want)

Output:

          a    b           
          a   b1   b2    b3
0  0.587075  3.0  6.0   9.0
1  0.798710  4.0  7.0  10.0
2  0.206860  5.0  8.0  11.0
----------------------------------
          a          b        
          a b_category b_value
0  0.587075         b1     3.0
1  0.798710         b1     4.0
2  0.206860         b1     5.0
3  0.587075         b2     6.0
4  0.798710         b2     7.0
5  0.206860         b2     8.0
6  0.587075         b3     9.0
7  0.798710         b3    10.0
8  0.206860         b3    11.0

My issue is that my data doesn't just have b1 b2 b3, it also has b4, b5, b6... All the way to about b90. The obvious solution would be to make a loop creating 90 dataframes, one for each category, then concatenating them into one dataframe, but I imagine there must be a better way of doing it.

edit:

what_i_have.unstack() doesn't really solve the issue, as can be seen below. It could be an intermediate step but there's still some work to do with this result before reaching what I want and I don't see much of an advantage in doing this over the loop solution I've previously mentioned:

a  a   0     0.587075
       1     0.798710
       2     0.206860
b  b1  0     3.000000
       1     4.000000
       2     5.000000
   b2  0     6.000000
       1     7.000000
       2     8.000000
   b3  0     9.000000
       1    10.000000
       2    11.000000

CodePudding user response:

Keeping the MultiIndex, might be a better way out there though:

df = df.melt(id_vars=[df.columns[0]], var_name=['b','b_category'], value_name='b_value')
a = df[[('a','a')]]
b = df[['b', 'b_category', 'b_value']].pivot(columns='b').swaplevel(0,1, axis=1)
df = pd.concat([a, b], axis=1)
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)

Output:

          a          b
          a b_category b_value
0  0.737076         b1     3.0
1  0.718409         b1     4.0
2  0.269516         b1     5.0
3  0.737076         b2     6.0
4  0.718409         b2     7.0
5  0.269516         b2     8.0
6  0.737076         b3     9.0
7  0.718409         b3    10.0
8  0.269516         b3    11.0

CodePudding user response:

what_i_have.columns = what_i_have.columns.droplevel()
what_i_have.melt(id_vars=['a1'],value_vars=['b1','b2','b3'],var_name='b_category', value_name='b_value')

I think these commands should solve your issue. You lose the multilevel due to the use of the melt. Not sure how to get around that at the moment but otherwise this should meet your requirements.

output

output

CodePudding user response:

Pull out b and melt:

out = what_i_have.pop('b')

# we need this length to extend 
# what_i_have
size = out.columns.size

out = out.melt()

out.columns = ['b_category', 'b_value']

out.columns = [['b', 'b'], out.columns]

Reindex what is left of what_i_have and concatenate:

index = np.tile(what_i_have.a.index, size)

what_i_want = what_i_have.reindex(index).reset_index(drop = True)

pd.concat([what_i_want, out], axis = 1)

          a          b
          a b_category b_value
0  0.883754         b1     3.0
1  0.172427         b1     4.0
2  0.631352         b1     5.0
3  0.883754         b2     6.0
4  0.172427         b2     7.0
5  0.631352         b2     8.0
6  0.883754         b3     9.0
7  0.172427         b3    10.0
8  0.631352         b3    11.0
  • Related