I have two dataframes that contain unique IDs, a review column (positive in one, negative in the other) and a rating column (again, one positive and one negative):
df1:
unique_id | pos_review | pos_rating |
---|---|---|
1 | "Great, would recommend... | 8 |
1 | "Really cool, you should go... | 7 |
2 | "I had a great time, you.. | 9 |
3 | "Good way to spend your night... | 8 |
4 | "I might go again for how good it was... | 9 |
df2:
unique_id | neg_review | neg_rating |
---|---|---|
1 | "Really boring... | 4 |
2 | "I'll never try this again... | 2 |
2 | "I would not recommend.. | 3 |
3 | "Could have been better... | 4 |
3 | "No one should ever go... | 1 |
I'm trying to combine the two so that the unique IDs line up; if one dataframe has more reviews for an ID than the other, the missing review should become a NaN value, which I'd later replace with "No Review". So ideally I'd end up with:
df3:
unique_id | pos_review | pos_rating | neg_review | neg_rating |
---|---|---|---|---|
1 | "Great, would recommend... | 8 | "Really boring... | 4 |
1 | "Really cool, you should go... | 7 | NaN | NaN |
2 | "I had a great time, you.. | 9 | "I'll never try this again... | 2 |
2 | NaN | NaN | "I would not recommend.. | 3 |
3 | "Good way to spend your night... | 8 | "Could have been better... | 4 |
3 | NaN | NaN | "No one should ever go... | 1 |
4 | "I might go again for how good it was... | 9 | NaN | NaN |
I've tried using df3 = df1.merge(df2, on='unique_id', how='inner'), but that just repeats a review from my df1 for every review in my df2, like this (looking at unique_id 2 below):
unique_id | pos_review | pos_rating | neg_review | neg_rating |
---|---|---|---|---|
1 | "Great, would recommend... | 8 | "Really boring... | 4 |
1 | "Really cool, you should go... | 7 | NaN | NaN |
2 | "I had a great time, you.. | 9 | "I'll never try this again... | 2 |
2 | "I had a great time, you.. | 9 | "I would not recommend.. | 3 |
3 | "Good way to spend your night... | 8 | "Could have been better... | 4 |
3 | NaN | NaN | "No one should ever go... | 1 |
4 | "I might go again for how good it was... | 9 | NaN | NaN |
Any ideas on how I can get the above df3?
CodePudding user response:
You need a cumulative counter, built with groupby().cumcount(), as a second merge key.
df3 = pd.merge(
    df1, df2,
    left_on=['unique_id', df1.groupby('unique_id').cumcount()],
    right_on=['unique_id', df2.groupby('unique_id').cumcount()],
    how='outer')
This delivers the desired result.
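For reference, a minimal runnable sketch of this approach, using the sample data from the question (review text shortened):

```python
import pandas as pd

df1 = pd.DataFrame({
    'unique_id': [1, 1, 2, 3, 4],
    'pos_review': ['Great, would recommend', 'Really cool, you should go',
                   'I had a great time', 'Good way to spend your night',
                   'I might go again'],
    'pos_rating': [8, 7, 9, 8, 9],
})
df2 = pd.DataFrame({
    'unique_id': [1, 2, 2, 3, 3],
    'neg_review': ['Really boring', "I'll never try this again",
                   'I would not recommend', 'Could have been better',
                   'No one should ever go'],
    'neg_rating': [4, 2, 3, 4, 1],
})

# the cumulative counter pairs the n-th positive review of each id
# with the n-th negative review of the same id
df3 = pd.merge(
    df1, df2,
    left_on=['unique_id', df1.groupby('unique_id').cumcount()],
    right_on=['unique_id', df2.groupby('unique_id').cumcount()],
    how='outer')
```

With the data above this yields seven rows: one per pairing, with NaNs wherever one side ran out of reviews for that id.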
CodePudding user response:
Change inner to outer, and create the sub-key with cumcount:
df1['key'] = df1.groupby('unique_id').cumcount()
df2['key'] = df2.groupby('unique_id').cumcount()
df3 = df1.merge(df2,on = ['unique_id','key'],how='outer').sort_values('unique_id')
Out[134]:
   unique_id                                pos_review  pos_rating  key                     neg_review  neg_rating
0          1                "Great, would recommend...         8.0    0               "Really boring...         4.0
1          1            "Really cool, you should go...         7.0    1                            NaN         NaN
2          2                "I had a great time, you..         9.0    0   "I'll never try this again...         2.0
5          2                                       NaN         NaN    1       "I would not recommend..         3.0
3          3          "Good way to spend your night...         8.0    0      "Could have been better...         4.0
6          3                                       NaN         NaN    1       "No one should ever go...         1.0
4          4  "I might go again for how good it was...         9.0    0                            NaN         NaN
# you can also drop the key column with df3 = df3.drop(['key'], axis=1)
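Since the question also asks about replacing the NaNs with "No Review", a short follow-up sketch (the df3 here is a small hypothetical stand-in for the merged result above):

```python
import pandas as pd
import numpy as np

# stand-in for the merged result (only a few rows, for illustration)
df3 = pd.DataFrame({
    'unique_id': [1, 1, 2],
    'pos_review': ['Great, would recommend', 'Really cool, you should go', np.nan],
    'key': [0, 1, 0],
    'neg_review': ['Really boring', np.nan, 'I would not recommend'],
})

# the helper key is no longer needed after the merge
df3 = df3.drop(columns=['key'])

# fill only the review columns, so any numeric rating columns stay numeric
df3[['pos_review', 'neg_review']] = df3[['pos_review', 'neg_review']].fillna('No Review')
```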
CodePudding user response:
Updated after @HenryEcker pointed out that append is deprecated.
I would use pd.concat instead of DataFrame.merge, since 'unique_id' is not actually unique in the sense of a table key.
df3 = pd.concat([df1, df2], ignore_index=True)
Maybe merge is confusing your sense of what the output table should be. I think your example of the ideal df3 needs additional rows with NaNs. E.g. for unique_id = 1 you should have three rows:
- two with NaNs in the negative columns
- one with NaNs in the positive columns
I'm not sure why you would assign a negative review to only one row of unique_id = 1 but not to the others. Best to keep all rows and use NaNs wherever appropriate.
Then if you want to aggregate, use DataFrame.groupby. E.g. for the mean rating (numeric_only=True keeps the text columns out of the aggregation, which recent pandas no longer drops silently):
grouped_mean = df3.groupby('unique_id').mean(numeric_only=True)
Note that this gives you a new dataframe with a mean for the negative ratings and a mean for the positive ratings, since they are in different columns in df3.
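A runnable sketch of this concat-then-aggregate approach, using the ratings from the question (review text shortened to placeholders):

```python
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 1, 2, 3, 4],
                    'pos_review': ['a', 'b', 'c', 'd', 'e'],
                    'pos_rating': [8, 7, 9, 8, 9]})
df2 = pd.DataFrame({'unique_id': [1, 2, 2, 3, 3],
                    'neg_review': ['f', 'g', 'h', 'i', 'j'],
                    'neg_rating': [4, 2, 3, 4, 1]})

# stack the rows: pos_* columns are NaN on df2's rows and vice versa
df3 = pd.concat([df1, df2], ignore_index=True)

# one mean per rating column, computed over the non-NaN values of each id;
# numeric_only=True excludes the review text columns
grouped_mean = df3.groupby('unique_id').mean(numeric_only=True)
```

For example, unique_id 1 averages its two positive ratings (8 and 7) to 7.5, while unique_id 4, which has no negative reviews at all, gets NaN for neg_rating.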