Home > OS >  How can I groupby row with multi-column with pandas?
How can I groupby row with multi-column with pandas?

Time:11-18

I'm beginner of pandas so I have a question below. There's a lot of answers about groupby rows but I can't find the answer what I want.

anyway my datatable is below.

   COLUMN1    COLUMN2   COLUMN3
0  APPLE      RED       JOHN, JANE
1  BANANA     YELLOW    SMITH
1  BANANA     YELLOW    EMILY
2  GRAPE      VIOLET    JESSICA
2  GRAPE      VIOLET    REIRA
2  GRAPE      VIOLET    EMMA
2  GRAPE      PURPLE    JOE
2  GRAPE      PURPLE    LISA
3  MELON      GREEN     RIO
3  MELON      GREEN     REIRA
.. 

and I want to get this table. (edit : EXCEPT YELLOW)

   COLUMN1    COLUMN2   COLUMN3
0  APPLE      RED       JOHN, JANE
1  BANANA     YELLOW    SMITH
1  BANANA     YELLOW    EMILY
2  GRAPE      VIOLET    JESSICA, REIRA, EMMA
2  GRAPE      PURPLE    JOE, LISA
3  MELON      GREEN     RIO, REIRA
.. 

How can I get this? Please give me a hint or answer then I'll appreciate a lot.. thank you.

CodePudding user response:

must make reproducible example for answer

EXAMPLE

data = [['APPLE', 'RED', 'JOHN, JANE'],
        ['BANANA', 'YELLOW', 'SMITH'],
        ['BANANA', 'YELLOW', 'EMILY'],
        ['GRAPE', 'VIOLET', 'JESSICA'],
        ['GRAPE', 'VIOLET', 'REIRA'],
        ['GRAPE', 'VIOLET', 'EMMA'],
        ['GRAPE', 'PURPLE', 'JOE'],
        ['GRAPE', 'PURPLE', 'LISA'],
        ['MELON', 'GREEN', 'RIO'],
        ['MELON', 'GREEN', 'REIRA']]
df = pd.DataFrame(data, index=[0, 1, 1, 2, 2, 2, 2, 2, 3, 3], columns=['col1', 'col2', 'col3'])

output(df):

    col1    col2    col3
0   APPLE   RED     JOHN, JANE
1   BANANA  YELLOW  SMITH
1   BANANA  YELLOW  EMILY
2   GRAPE   VIOLET  JESSICA
2   GRAPE   VIOLET  REIRA
2   GRAPE   VIOLET  EMMA
2   GRAPE   PURPLE  JOE
2   GRAPE   PURPLE  LISA
3   MELON   GREEN   RIO
3   MELON   GREEN   REIRA



First
make col3 to list

df1 = df.groupby([df.index, 'col1', 'col2']).agg(list).reset_index()

output(df1):

    level_0 col1    col2    col3
0   0       APPLE   RED     [JOHN, JANE]
1   1       BANANA  YELLOW  [SMITH, EMILY]
2   2       GRAPE   PURPLE  [JOE, LISA]
3   2       GRAPE   VIOLET  [JESSICA, REIRA, EMMA]
4   3       MELON   GREEN   [RIO, REIRA]



Second
join col3 except yellow and explode yellow

df1.assign(col3=df1.apply(lambda x: ','.join(x['col3']) if x['col2'] != 'YELLOW' else x['col3'] , axis=1)).explode('col3')

result:

    level_0 col1    col2    col3
0   0       APPLE   RED     JOHN, JANE
1   1       BANANA  YELLOW  SMITH
1   1       BANANA  YELLOW  EMILY
2   2       GRAPE   PURPLE  JOE,LISA
3   2       GRAPE   VIOLET  JESSICA,REIRA,EMMA
4   3       MELON   GREEN   RIO,REIRA

if you want level_0 to index, use set_index or set_axis

and at next time make reproducible exmaple for answer.

CodePudding user response:

import pandas as pd
df = pd.DataFrame({'col1': ['Apple', 'Banana', 'Banana', 'Grape', 'Grape', 'Grape', 'Apple'], 'col2': ['Red', 'Yellow', 'Yellow', 'Violet', 'Violet', 'Purple', 'Red'], 'col3':['John, Jane', 'Smith', 'Emily', 'Jecica', 'Reira', 'Joe', 'Rio']})
df2 = df.groupby(['col1', 'col2'])['col3'].apply(list).reset_index()
df2['col3'] = df2['col3'].apply(lambda x: ', '.join(map(str, x)))
df2

To avoid splitting Yellow of Banana,

import pandas as pd
df = pd.DataFrame({'col1': ['Apple', 'Banana', 'Banana', 'Grape', 'Grape', 'Grape', 'Apple'], 'col2': ['Red', 'Yellow', 'Yellow*', 'Violet', 'Violet', 'Purple', 'Red'], 'col3':['John, Jane', 'Smith', 'Emily', 'Jecica', 'Reira', 'Joe', 'Rio']})
df2 = df.groupby(['col1', 'col2'])['col3'].apply(list).reset_index()
df2['col3'] = df2['col3'].apply(lambda x: ', '.join(map(str, x)))
df2['col2'] = df2['col2'].replace('Yellow*', 'Yellow')
df2
  • Related