Home > other >  Pandas groupby two columns and expand the third
Pandas groupby two columns and expand the third

Time:11-11

I have a Pandas dataframe with the following structure:

A       B       C
a       b       1
a       b       2
a       b       3
c       d       7
c       d       8
c       d       5
c       d       6
c       d       3
e       b       4
e       b       3
e       b       2
e       b       1

And I will like to transform it into this:

A       B       C1      C2      C3      C4      C5
a       b       1       2       3       NAN     NAN
c       d       7       8       5       6       3
e       b       4       3       2       1       NAN

In other words, something like groupby A and B and expand C into different columns.

Knowing that the length of each group is different.

C is already ordered

Shorter groups can have NAN or NULL values (empty), it does not matter.

SOLUTION:

The accepted solution but avoiding the deprecation warning:

N = 3
g  = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N)   1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
        .add_prefix('C_')
        .astype('Int64')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())

Thanks!

CodePudding user response:

Use GroupBy.cumcount with add 1 for starting new columns names by 1, then pass to DataFrame.pivot, add DataFrame.add_prefix for rename columns names with remove columns name by DataFrame.rename_axis and convert MultiIndex by DataFrame.reset_index to columns A,B:

df['g'] = df.groupby(['A','B']).cumcount().add(1)

df = df.pivot(['A','B'], 'g', 'C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
   A  B   C1   C2   C3   C4   C5
0  a  b  1.0  2.0  3.0  NaN  NaN
1  c  d  7.0  8.0  5.0  6.0  3.0
2  e  b  4.0  3.0  2.0  1.0  NaN

If need integers add DataFrame.astype with Int64:

df['g'] = df.groupby(['A','B']).cumcount().add(1)

df = (df.pivot(['A','B'], 'g', 'C')
        .add_prefix('C')
        .astype('Int64')
        .rename_axis(columns=None)
        .reset_index())
print (df)
   A  B  C1  C2  C3    C4    C5
0  a  b   1   2   3  <NA>  <NA>
1  c  d   7   8   5     6     3
2  e  b   4   3   2     1  <NA>

EDIT: If need maximum N new columns it means A,B are duplicated, need helper groups g1, g2 with integer and modulo division and add to helper new level in index:

N = 4
g  = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N)   1
df = (df.pivot(['A','B','g1'], 'g2', 'C')
        .add_prefix('C')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())
print (df)
   A  B   C1   C2   C3   C4
0  a  b  1.0  2.0  3.0  NaN
1  c  d  7.0  8.0  5.0  6.0
2  c  d  3.0  NaN  NaN  NaN
3  e  b  4.0  3.0  2.0  1.0 
  • Related