This is my first question on Stack Overflow, so I will simplify the problem I have at the moment. I am trying to clean a dataset for a user-based collaborative filtering recommendation system.
Here's an oversimplification of the dataset I have, covering all the use cases:
data = pd.DataFrame({'name': ['John' ,'Jane' ,'Joe' ,'John' ,'Jane' , 'Joe'],
'movie1': ['' , 'bad' , 'avg' , 'nice', '' , '' ],
'movie2': ['good' , '' , '' , '' , 'poor', '' ],
'movie3': ['bad' , '' , 'good', '' , '' , '' ],
})
From how I sourced my data, I know that even though John, Jane and Joe might appear any number of times, they will never have more than one rating for any given movie.
I want to be able to aggregate repeated users into a single row so that my output in terminal looks like this:
   name movie1 movie2 movie3
0  John   nice   good    bad
1  Jane    bad   poor
2   Joe    avg          good
This problem is very similar to the question "How can I "merge" rows by same value in a column in Pandas with aggregation functions?", but the difference is that I'm dealing with string objects rather than numbers, so I can't use numeric aggregation functions.
My real dataset has 4260 columns and 24169 rows, so I can't apply something like the following (from "Concatenate strings from several rows using Pandas groupby"), because it's not feasible to write out all the column names:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
I tried following the answers to the question "Pandas | merge rows with same id", but I either got errors or my dataframe stayed the same.
Even though it didn't make logical sense, I tried using
data.groupby('name').ffill().drop_duplicates('name', keep='last')
and got the following error: KeyError: Index(['name'], dtype='object')
Passing as_index=False to the groupby gave me the exact same error:
data.groupby('name', as_index=False).ffill().reset_index().drop_duplicates('name', keep='last')
The closest I've gotten is this:
data = data.groupby('name', as_index=False).apply(lambda x: x.fillna(method='ffill').iloc[0])
It only drops the repeated rows; it doesn't merge their ratings into the row that remains:
   name movie1 movie2 movie3
0  Jane    bad
1   Joe    avg          good
2  John          good    bad
Complete Code:
import pandas as pd
data = pd.DataFrame({'name': ['John' ,'Jane' ,'Joe' ,'John' ,'Jane' , 'Joe'],
'movie1': ['' , 'bad' , 'avg' , 'nice', '' , '' ],
'movie2': ['good' , '' , '' , '' , 'poor', '' ],
'movie3': ['bad' , '' , 'good', '' , '' , '' ],
})
print('Baseline:')
print(data.head())
#data = data.join(data['name'])
#data.groupby('name').ffill().drop_duplicates('name' ,keep='last')
#data.groupby('name', as_index=False).ffill().reset_index().drop_duplicates('name', keep='last')
data = data.groupby('name', as_index=False).apply(lambda x: x.fillna(method='ffill').iloc[0])
#data.groupby('name').ffill().drop_duplicates('name', keep='last')
#data = data.groupby(['name'])[['movie1','movie2','movie3']].apply('.'.join).reset_index()
print('End result:')
print(data.head())
Answer:
IIUC, you can use groupby followed by first. The trick is to replace the empty strings with NaN, then restore them after selecting the first valid value per group (np here is numpy, imported as np):
>>> data.replace('', np.nan).groupby('name', as_index=False, sort=False).first().fillna('')
   name movie1 movie2 movie3
0  John   nice   good    bad
1  Jane    bad   poor
2   Joe    avg          good
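For completeness, here is the same idea as a self-contained script, as a minimal sketch. The helper name merge_rows is hypothetical and only for illustration. It relies on the assumption stated in the question that each user has at most one non-empty rating per movie, since first() keeps only the first non-missing value in each column. No column names need to be listed, so it should carry over to a wide frame.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'name':   ['John', 'Jane', 'Joe', 'John', 'Jane', 'Joe'],
    'movie1': ['', 'bad', 'avg', 'nice', '', ''],
    'movie2': ['good', '', '', '', 'poor', ''],
    'movie3': ['bad', '', 'good', '', '', ''],
})

def merge_rows(df, key='name'):
    # Hypothetical helper: collapse repeated keys into a single row.
    return (
        df.replace('', np.nan)                      # empty strings become missing values
          .groupby(key, as_index=False, sort=False) # keep first-appearance order of the keys
          .first()                                  # first non-missing value per column
          .fillna('')                               # restore empty strings for unrated movies
    )

result = merge_rows(data)
print(result)
```

The plain ffill attempts leave the frame unchanged because an empty string is not treated as missing, so there is nothing to fill until the empty strings are converted to NaN.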