How can I aggregate rows together according to a selected column using a pandas DataFrame

This is my first question on Stack Overflow, so I'll present a simplified version of the problem I'm facing. I am trying to clean a dataset for a user-based collaborative filtering recommendation system.

Here's an oversimplification of my dataset that covers all the use cases:

data = pd.DataFrame({'name':    ['John' ,'Jane' ,'Joe'  ,'John' ,'Jane' ,   'Joe'],
                     'movie1':  [''     , 'bad' , 'avg' , 'nice', ''    , ''    ],
                     'movie2':  ['good' , ''    , ''    , ''    , 'poor', ''    ],
                     'movie3':  ['bad'  , ''    , 'good', ''    , ''    , ''    ],
                     })

From how I sourced my data, I know that even though John, Jane and Joe may appear any number of times, they will never have more than one rating for any given movie.

I want to be able to aggregate repeated users into a single row so that my output in terminal looks like this:

   name movie1 movie2 movie3
0  John   nice   good    bad
1  Jane    bad   poor
2   Joe    avg          good

This problem is very similar to the question "How can I "merge" rows by same value in a column in Pandas with aggregation functions?", but the difference is that I'm dealing with strings rather than numbers, so I can't use the aggregation functions suggested there.

My real dataset has 4260 columns and 24169 rows, so I can't apply something like df.groupby(['name','month'])['text'].apply(','.join).reset_index() (from "Concatenate strings from several rows using Pandas groupby"), because it isn't feasible to write out all the column names.
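
Listing the columns may not actually be necessary: when no column selection is given, groupby applies the aggregation to every non-key column. Below is a minimal sketch on the toy data above. It assumes, as stated earlier, that each user has at most one non-empty rating per movie, so joining a group's strings with an empty separator leaves just that rating. The variable name joined is only illustrative.

```python
import pandas as pd

data = pd.DataFrame({'name':   ['John', 'Jane', 'Joe', 'John', 'Jane', 'Joe'],
                     'movie1': ['', 'bad', 'avg', 'nice', '', ''],
                     'movie2': ['good', '', '', '', 'poor', ''],
                     'movie3': ['bad', '', 'good', '', '', ''],
                     })

# No column list is given, so ''.join is applied to every column
# except the grouping key. sort=False keeps users in order of
# first appearance instead of sorting them alphabetically.
joined = data.groupby('name', sort=False).agg(''.join).reset_index()
print(joined)
```

Calling a Python function once per group and column is likely to be slow on a frame with thousands of columns, so this is more a demonstration of the idea than a recommendation for the full dataset.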

I tried following the answers to "Pandas | merge rows with same id", but I either got errors or my dataframe stayed the same.

Even though it didn't make sense logically, I tried data.groupby('name').ffill().drop_duplicates('name', keep='last') and got the following error: KeyError: Index(['name'], dtype='object')

Passing as_index=False to groupby gave me the exact same error: data.groupby('name', as_index=False).ffill().reset_index().drop_duplicates('name', keep='last')
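
The KeyError most likely arises because the frame returned by groupby(...).ffill() does not contain the grouping column, so drop_duplicates('name') has nothing to look up. A small sketch that checks this on a reduced version of the toy data (the variable name filled is only illustrative, and the behaviour is described for recent pandas versions):

```python
import pandas as pd

data = pd.DataFrame({'name':   ['John', 'Jane', 'John'],
                     'movie1': ['', 'bad', 'nice']})

# ffill is a per-group transformation: the result keeps the original
# row index but leaves out the grouping key.
filled = data.groupby('name').ffill()
print(filled.columns.tolist())  # 'name' is not among these columns
```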

The closest I've gotten is this: data = data.groupby('name', as_index=False).apply(lambda x: x.fillna(method='ffill').iloc[0])

The output only drops the repeated rows; it doesn't merge their ratings into the remaining rows:

   name movie1 movie2 movie3
0  Jane    bad              
1   Joe    avg          good
2  John          good    bad
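
A likely reason the ratings are not merged is that fillna only fills missing values (NaN or None), and an empty string does not count as missing. In addition, a forward fill only copies earlier values into later rows, so taking the first row with .iloc[0] would never pick up a rating that appears further down. The sketch below illustrates both points on a single column; the names ratings and as_missing are only illustrative.

```python
import numpy as np
import pandas as pd

ratings = pd.Series(['', 'nice', ''])

# Empty strings are ordinary values, so there is nothing for fillna to fill.
print(ratings.isna().tolist())

# After converting '' to NaN the gaps become visible to pandas.
as_missing = ratings.replace('', np.nan)
print(as_missing.isna().tolist())

# A backward fill (not a forward fill) brings the later rating into row 0.
print(as_missing.bfill().iloc[0])
```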

Complete Code:

import pandas as pd

data = pd.DataFrame({'name':    ['John' ,'Jane' ,'Joe'  ,'John' ,'Jane' ,   'Joe'],
                     'movie1':  [''     , 'bad' , 'avg' , 'nice', ''    , ''    ],
                     'movie2':  ['good' , ''    , ''    , ''    , 'poor', ''    ],
                     'movie3':  ['bad'  , ''    , 'good', ''    , ''    , ''    ],
                     })
print('Baseline:')
print(data.head())

#data = data.join(data['name'])
#data.groupby('name').ffill().drop_duplicates('name' ,keep='last')
#data.groupby('name', as_index=False).ffill().reset_index().drop_duplicates('name', keep='last')
data = data.groupby('name', as_index=False).apply(lambda x: x.fillna(method='ffill').iloc[0])
#data.groupby('name').ffill().drop_duplicates('name', keep='last')
#data =  data.groupby(['name'])[['movie1','movie2','movie3']].apply('.'.join).reset_index()
print('End result:')
print(data.head())

CodePudding user response：

IIUC, you can use GroupBy.first. The trick is to replace the empty strings with NaN, take the first valid value per group, and then turn any remaining NaN back into empty strings (this needs import numpy as np):

>>> data.replace('', np.nan).groupby('name', as_index=False, sort=False).first().fillna('')

   name movie1 movie2 movie3
0  John   nice   good    bad
1  Jane    bad   poor       
2   Joe    avg          good
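
For completeness, here is the same approach as a self-contained script that can be run against the toy data from the question. The variable name result is only illustrative. No column names other than the key are listed, so it should carry over to a frame with many movie columns.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'name':   ['John', 'Jane', 'Joe', 'John', 'Jane', 'Joe'],
                     'movie1': ['', 'bad', 'avg', 'nice', '', ''],
                     'movie2': ['good', '', '', '', 'poor', ''],
                     'movie3': ['bad', '', 'good', '', '', ''],
                     })

result = (
    data.replace('', np.nan)                               # mark empty ratings as missing
        .groupby('name', as_index=False, sort=False)       # keep 'name' as a column, keep input order
        .first()                                           # first non-missing value per column and user
        .fillna('')                                        # movies nobody in the group rated go back to ''
)
print(result)
```

Because each user has at most one rating per movie, taking the first non-missing value per group loses no information.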