Home > OS >  Split cells into multiple rows and make groupby counts in Pandas
Split cells into multiple rows and make groupby counts in Pandas

Time:10-22

I am trying to split the cells into multiple rows by comma and make the groupby count. One complication is that sometimes I get strange whitespace after the split (I don't understand why and I cannot replicate the strange case). This would make the groupby count wrong. To overcome this, I could strip out the whitespace after each split. My question is how to make the process more 'integrated' - accommodating the whitespace stripping and fewer repeated lines.

data = {'place':["US", "Japan", "UK", "Japan", "UK"],
        'colour':["red", "yellow", "blue", "red", "green"],
        'fruit1':["organge, apple", "apple", "organge, dragon fruit, watermelon", "organge, others", "others"],
        'fruit2':["apple, organge", "others", "watermelon, dragon fruit, organge", "watermelon", "others"]}
df = pd.DataFrame(data)
df

dt = (df.set_index(['place', 'colour', 'fruit1'])
        .apply(lambda x: x.str.split(',').explode())
        .reset_index()) 

dt2 = (dt.set_index(['place', 'colour', 'fruit2'])
        .apply(lambda x: x.str.split(',').explode())
        .reset_index()) 

dt2['fruit1'] = dt2['fruit1'].str.strip()
dt2['fruit2'] = dt2['fruit2'].str.strip()

dt2.groupby(['fruit1','fruit2']).size().reset_index(name='counts')

Expected output:

           fruit1         fruit2  counts
0           apple        organge       1
1           apple          apple       1
2    dragon fruit   dragon fruit       1
3    dragon fruit        organge       1
4    dragon fruit     watermelon       1
5          others     watermelon       1
6      watermelon   dragon fruit       1
7      watermelon        organge       1
8      watermelon     watermelon       1
9           apple         others       1
10        organge   dragon fruit       1
11        organge        organge       2
12        organge          apple       1
13        organge     watermelon       2
14         others         others       1

CodePudding user response:

You can split by , with space both columns and then create product of them, last count them:

dt = df[['fruit1','fruit2']].apply(lambda x: x.str.split(', '))
        
from  itertools import product
dt = pd.DataFrame([j for i in dt.to_numpy() for j in product(*i)],
                   columns=['fruit1','fruit2'])

df = dt.groupby(['fruit1','fruit2']).size().reset_index(name='counts')
   
print (df)
          fruit1        fruit2  counts
0          apple         apple       1
1          apple       organge       1
2          apple        others       1
3   dragon fruit  dragon fruit       1
4   dragon fruit       organge       1
5   dragon fruit    watermelon       1
6        organge         apple       1
7        organge  dragon fruit       1
8        organge       organge       2
9        organge    watermelon       2
10        others        others       1
11        others    watermelon       1
12    watermelon  dragon fruit       1
13    watermelon       organge       1
14    watermelon    watermelon       1
  • Related