I have two data frame and I want to update one column of df_source
based on the condition in both data frames:
df_source = pd.Dataframe({'Sentiment':['neg', 'neg','pos'], 'text': ['hello ', '12where', 'here [null]'], 'pred': ['neu', 'neg', 'pos')})
df2 = pd.Dataframe({'Sentiment':['pos', 'neg','pos', 'neu'], 'text': ['hello ', '12 where I', 'hello g* ', 'here [null]'], 'pred': ['neu', 'neg', 'neu', 'neu')})
I want to update the column of Sentiment
in df_source
based on this condition: if the text in both data frame were exactly the same and the pred column was the same then replace the sentiment in df_source with the sentiment in df2
So the output would be like this (as only one sample meets both condition "hello "):
Sentiment. text. pred
pos hello neu
neg 12where neg
pos here [null] pos
What I have done:
df_source['Sentiment'] = df.where(((df['text']== df_source['text']) & (df['pred'] == dfs['pred'])) , df['Sentiment'])
It should work but this raises error (ValueError: Can only compare identically-labeled Series objects
).
CodePudding user response:
First merge on the two columns and suffix.
df_source = df_source.merge(df2, how ='left', on =['text', 'pred'], suffixes=('_x', ''))
Replace the NaNs where there was no match using combine_first and then drop the extra merge column
df_source =df_source.assign(Sentiment= df_source['Sentiment'].combine_first(df_source.Sentiment_x) ).drop('Sentiment_x',1)
text pred Sentiment
0 hello neu pos
1 12where neg neg
2 here [null] pos pos