Home > Software engineering >  How to groupby a column but keep all rows as columns
How to groupby a column but keep all rows as columns

Time:05-06

I have a dataframe that was a result of a join operation. This operation had multiple matches, resulting in multiple rows. I want to move resulting match rows to be moved in to columns. Here is an example:

import pandas as pd
a = pd.DataFrame([[111,2,3]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111,'999','some data'],
                  [111,'999888','some more data']],
                  columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')

I get:

    id      var1    var2    B       C
0   111     2       3       999     some data
1   111     2       3       999888  some more data

but really I want:

    id  var1    var2    B   C           B       C
0   111 2       3       999 some data   999888  some more data

I was thinking pivot was what I wanted but it makes the value the columns, not what I want. How can I achieve this and what is this operation called?

EDIT: To clarify, I don't care about the column names, could be b1 and b2 etc.

CodePudding user response:

Use:

c.groupby('id').agg({'var1': lambda x: x[0], 'var2': lambda x: x[0], 'B': lambda x: list(x), 'C': lambda x: list(x)})

CodePudding user response:

IIUC, you can first reshape your dataframe b to force having duplicated columns, then join to a:

b2 = (b
  .assign(col=b.groupby('id').cumcount())
  .pivot(index='id', columns='col')
  .sort_index(level='col', axis=1, sort_remaining=False)
  .droplevel('col', axis=1)
)

#        B          C       B               C
# id                                         
# 111  999  some data  999888  some more data


c = a.join(b2, on='id')

#     id  var1  var2    B          C       B               C
# 0  111     2     3  999  some data  999888  some more data

with non-duplicated column names:

b2 = (b.assign(col=b.groupby('id').cumcount().add(1))
  .pivot(index='id', columns='col')
  .sort_index(level='col', axis=1, sort_remaining=False)
  .pipe(lambda d: d.set_axis(d.columns.map(lambda x: '_'.join(map(str,x))),
                             axis=1))
)

#      B_1        C_1     B_2             C_2
# id                                         
# 111  999  some data  999888  some more data


c = a.join(b2, on='id')

#     id  var1  var2  B_1        C_1     B_2             C_2
# 0  111     2     3  999  some data  999888  some more data

CodePudding user response:

Let us stack and unstack to reshape the dataframe:

k = ['id', 'var1', 'var2']
c = c.set_index([*k, c.groupby(k).cumcount().add(1)]).stack().unstack([-1, -2])
c.columns = c.columns.map('{0[0]}_{0[1]}'.format)

Result

print(c)

               B_1        C_1     B_2             C_2
id  var1 var2                                        
111 2    3     999  some data  999888  some more data

CodePudding user response:

A solution in brief:

c = pd.concat([
    c.groupby('id').nth(0).drop(columns=['B','C']).reset_index(),
    c[['B','C']].unstack().to_frame().swaplevel().sort_index().T
    ], axis=1)
c = c.rename(columns={col:col[1] if isinstance(col, tuple) else col for col in c.columns})

Output:

    id  var1  var2    B          C       B               C
0  111     2     3  999  some data  999888  some more data

Some details / variations:

To get a result without duplicate column names, instead having each column in each "result set" be a tuple of the result set number (0, 1, ...) and the column name (B, C):

import pandas as pd
a = pd.DataFrame([[111,2,3]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111,'999','some data'], [111,'999888','some more data']], columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
print(c)

x = c.groupby('id').nth(0).drop(columns=['B','C']).reset_index()
y = c[['B','C']].unstack().to_frame().swaplevel().sort_index().T
c = pd.concat([x, y], axis=1)
print(c)

Output:

    id  var1  var2 (0, B)     (0, C)  (1, B)          (1, C)
0  111     2     3    999  some data  999888  some more data

After this, if you decide you want each B, C result to be in a pair of columns with duplicate names (B, C) you can then do this:

c.columns = [col[1] if isinstance(col, tuple) else col for col in c.columns]

... which will make c look like this:

    id  var1  var2    B          C       B               C
0  111     2     3  999  some data  999888  some more data
  • Related