I have two data frames, and I want to compare one column from each. The two data frames have different numbers of rows. I tried to do this in two ways, but both raise ValueError: Can only compare identically-labeled Series objects.
dftrain
text
Hello
How are you?
I'm doing fine..
Agent please ...
Hiiiiii
dftest
text
hello
How are you?
Im doing fine
Agent please
So the result would be:
text
How are you?
I did this:
comparison_column = np.where(dftest["text"] == dftrain["text"], True, False)
but it seems this only works when both data frames have the same number of rows.
I found this link, which is close to what I need but still different.
CodePudding user response:
You can use apply on the smaller DataFrame (dftest) and check whether each value is among the unique() values of the larger DataFrame (dftrain), like below:
>>> import pandas as pd
>>> dftrain = pd.DataFrame({'col1': ['text', 'Hello', 'How are you?', 'Hello', 'Hello', 'Hello']})
>>> dftest = pd.DataFrame({'col2': ['text', 'hello', 'How are you?', 'hello']})
>>> dftest.loc[dftest['col2'].apply(lambda x : x in dftrain.col1.unique()), 'col2']
0 text
2 How are you?
Name: col2, dtype: object
>>> dftest.loc[dftest['col2'].apply(lambda x : x in dftrain.col1.unique()), 'col2'].tolist()
['text', 'How are you?']
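As an alternative to apply with a lambda, the same membership check can probably be written with Series.isin, which tests each value against a set of values and does not require the two columns to have the same length or index. The sketch below is only an illustration: it rebuilds the sample data from the question under the column name text, and the names mask and common are made up for the example. The match is exact, so "Hello" and "hello" are treated as different strings.

```python
import pandas as pd

# Sample data rebuilt from the question (column name "text" is assumed).
dftrain = pd.DataFrame(
    {"text": ["Hello", "How are you?", "I'm doing fine..", "Agent please ...", "Hiiiiii"]}
)
dftest = pd.DataFrame(
    {"text": ["hello", "How are you?", "Im doing fine", "Agent please"]}
)

# Boolean mask: True where a dftest value also appears somewhere in dftrain.
# isin compares by value, so differing lengths and indexes are not a problem.
mask = dftest["text"].isin(dftrain["text"])

# Keep only the rows of dftest whose text also occurs in dftrain.
common = dftest.loc[mask, "text"]

print(common.tolist())  # ['How are you?']
```

If the comparison should ignore case or punctuation, both columns would need to be normalized first (for example with .str.lower()) before calling isin.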