I have two dataframes that contain unique IDs, a review column (positive in one, negative in the other) and a rating column (again, one positive and one negative):
df1:
unique_id | pos_review | pos_rating |
---|---|---|
1 | "Great, would recommend... | 8 |
1 | "Really cool, you should go... | 7 |
2 | "I had a great time, you.. | 9 |
3 | "Good way to spend your night... | 8 |
4 | "I might go again for how good it was... | 9 |
df2:
unique_id | neg_review | neg_rating |
---|---|---|
1 | "Really boring... | 4 |
2 | "I'll never try this again... | 2 |
2 | "I would not recommend.. | 3 |
3 | "Could have been better... | 4 |
3 | "No one should ever go... | 1 |
I'm trying to combine the two so that the unique IDs line up; if one dataframe has more reviews for an ID than the other, the missing review should become a NaN value, which I'd later replace with "No Review". So ideally I'd end up with:
df3:
unique_id | pos_review | pos_rating | neg_review | neg_rating |
---|---|---|---|---|
1 | "Great, would recommend... | 8 | "Really boring... | 4 |
1 | "Really cool, you should go... | 7 | NaN | NaN |
2 | "I had a great time, you.. | 9 | "I'll never try this again... | 2 |
2 | NaN | NaN | "I would not recommend.. | 3 |
3 | "Good way to spend your night... | 8 | "Could have been better... | 4 |
3 | NaN | NaN | "No one should ever go... | 1 |
4 | "I might go again for how good it was... | 9 | NaN | NaN |
I've tried using df3 = df1.merge(df2, on='unique_id', how='inner'), but that just repeats a review from my df1 for every review in my df2, like this (looking at unique_id 2 below):
unique_id | pos_review | pos_rating | neg_review | neg_rating |
---|---|---|---|---|
1 | "Great, would recommend... | 8 | "Really boring... | 4 |
1 | "Really cool, you should go... | 7 | NaN | NaN |
2 | "I had a great time, you.. | 9 | "I'll never try this again... | 2 |
2 | "I had a great time, you.. | 9 | "I would not recommend.. | 3 |
3 | "Good way to spend your night... | 8 | "Could have been better... | 4 |
3 | NaN | NaN | "No one should ever go... | 1 |
4 | "I might go again for how good it was... | 9 | NaN | NaN |
Any ideas on how I can get the above df3?
CodePudding user response:
You need a cumulative counter, built with groupby().cumcount(), as a second merge key.
df3 = pd.merge(
    df1, df2,
    left_on=['unique_id', df1.groupby('unique_id').cumcount()],
    right_on=['unique_id', df2.groupby('unique_id').cumcount()],
    how='outer')
This delivers the desired result.
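For reference, a minimal runnable sketch of this approach, using the sample data from the question (review text shortened):

```python
import pandas as pd

df1 = pd.DataFrame({
    'unique_id': [1, 1, 2, 3, 4],
    'pos_review': ['Great, would recommend', 'Really cool, you should go',
                   'I had a great time', 'Good way to spend your night',
                   'I might go again'],
    'pos_rating': [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    'unique_id': [1, 2, 2, 3, 3],
    'neg_review': ['Really boring', "I'll never try this again",
                   'I would not recommend', 'Could have been better',
                   'No one should ever go'],
    'neg_rating': [4, 2, 3, 4, 1],
})

# the cumulative counter pairs the n-th positive review of each id
# with the n-th negative review of the same id
df3 = pd.merge(
    df1, df2,
    left_on=['unique_id', df1.groupby('unique_id').cumcount()],
    right_on=['unique_id', df2.groupby('unique_id').cumcount()],
    how='outer')
```

With the data above this yields seven rows: one per pairing, with NaNs wherever one side ran out of reviews for that id.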
CodePudding user response:
Change inner to outer, and create the sub-key with cumcount:
df1['key'] = df1.groupby('unique_id').cumcount()
df2['key'] = df2.groupby('unique_id').cumcount()
df3 = df1.merge(df2,on = ['unique_id','key'],how='outer').sort_values('unique_id')
Out[134]:
   unique_id                                pos_review  pos_rating  key                     neg_review  neg_rating
0          1                "Great, would recommend...         8.0    0               "Really boring...         4.0
1          1            "Really cool, you should go...         7.0    1                            NaN         NaN
2          2                "I had a great time, you..         9.0    0   "I'll never try this again...         2.0
5          2                                       NaN         NaN    1       "I would not recommend..         3.0
3          3          "Good way to spend your night...         8.0    0      "Could have been better...         4.0
6          3                                       NaN         NaN    1       "No one should ever go...         1.0
4          4  "I might go again for how good it was...         9.0    0                            NaN         NaN
# you can also drop the key column with df3 = df3.drop(['key'], axis=1)
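Since the question also asks about replacing the NaNs with "No Review", a short follow-up sketch (the df3 here is a small hypothetical stand-in for the merged result above):

```python
import pandas as pd
import numpy as np

# stand-in for the merged result (only a few rows, for illustration)
df3 = pd.DataFrame({
    'unique_id': [1, 1, 2],
    'pos_review': ['Great, would recommend', 'Really cool, you should go', np.nan],
    'key': [0, 1, 0],
    'neg_review': ['Really boring', np.nan, 'I would not recommend'],
})

# the helper key is no longer needed after the merge
df3 = df3.drop(columns=['key'])

# fill only the review columns, so any numeric rating columns stay numeric
df3[['pos_review', 'neg_review']] = df3[['pos_review', 'neg_review']].fillna('No Review')
```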
CodePudding user response:
Updated after @HenryEcker pointed out that append is deprecated.
I would use pd.concat instead of DataFrame.merge, since 'unique_id' is not actually unique in the sense of a table key.
df3 = pd.concat([df1, df2], ignore_index=True)
Maybe merge is confusing your sense of what the output table should be. I think your example of the ideal df3 needs additional rows with NaNs. E.g. for unique_id = 1 you should have three rows:
- two with NaNs in the negative columns
- one with NaNs in the positive columns
I'm not sure why you would assign a negative review to only one row of unique_id = 1 but not to the others. Best to keep all rows and use NaNs wherever appropriate.
Then if you want to aggregate, use DataFrame.groupby. E.g. for the mean rating (numeric_only=True keeps the text columns out of the aggregation, which recent pandas no longer drops silently):
grouped_mean = df3.groupby('unique_id').mean(numeric_only=True)
Note that this gives you a new dataframe with a mean for the negative ratings and a mean for the positive ratings, since they are in different columns in df3.
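A runnable sketch of this concat-then-aggregate approach, using the ratings from the question (review text shortened to placeholders):

```python
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 1, 2, 3, 4],
                    'pos_review': ['a', 'b', 'c', 'd', 'e'],
                    'pos_rating': [8, 7, 9, 8, 9]})
df2 = pd.DataFrame({'unique_id': [1, 2, 2, 3, 3],
                    'neg_review': ['f', 'g', 'h', 'i', 'j'],
                    'neg_rating': [4, 2, 3, 4, 1]})

# stack the rows: pos_* columns are NaN on df2's rows and vice versa
df3 = pd.concat([df1, df2], ignore_index=True)

# one mean per rating column, computed over the non-NaN values of each id;
# numeric_only=True excludes the review text columns
grouped_mean = df3.groupby('unique_id').mean(numeric_only=True)
```

For example, unique_id 1 averages its two positive ratings (8 and 7) to 7.5, while unique_id 4, which has no negative reviews at all, gets NaN for neg_rating.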