How to compare two pandas DataFrames on one string column when the number of rows differs

I have two DataFrames and want to compare one column from each. The two DataFrames have different numbers of rows. I tried two approaches, but both raise ValueError: Can only compare identically-labeled Series objects.

dftrain

text
Hello
How are you?
I'm doing fine..
Agent please ...
Hiiiiii

dftest

text
hello
How are you?
Im doing fine
Agent please

So the result would be:

text
How are you?

I tried this: comparison_column = np.where(dftest["text"] == dftrain["text"], True, False), but it seems to work only when both DataFrames have the same number of rows.

I found this link, which is close to what I need but still different.

CodePudding user response:

You can apply a function to the smaller DataFrame (dftest here) and check whether each value appears among the unique() values of the larger DataFrame (dftrain here), like below:

>>> import pandas as pd
>>> dftrain = pd.DataFrame({'col1': ['text', 'Hello', 'How are you?', 'Hello', 'Hello', 'Hello']})
>>> dftest = pd.DataFrame({'col2': ['text', 'hello', 'How are you?', 'hello']})
>>> dftest.loc[dftest['col2'].apply(lambda x: x in dftrain.col1.unique()), 'col2']
0            text
2    How are you?
Name: col2, dtype: object
>>> dftest.loc[dftest['col2'].apply(lambda x: x in dftrain.col1.unique()), 'col2'].tolist()
['text', 'How are you?']
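As a possible alternative to apply with a lambda, Series.isin can do the same membership test in one vectorized call. The == operator compares two Series label by label and raises when their indexes differ, whereas isin only asks whether each value occurs anywhere in the other column, so the row counts need not match. The sketch below is not part of the original answer; it uses the data from the question and assumes both columns are named text, and the names mask and common are only illustrative.

```python
import pandas as pd

# Data copied from the question; both columns are assumed to be named "text".
dftrain = pd.DataFrame(
    {"text": ["Hello", "How are you?", "I'm doing fine..", "Agent please ...", "Hiiiiii"]}
)
dftest = pd.DataFrame(
    {"text": ["hello", "How are you?", "Im doing fine", "Agent please"]}
)

# isin() checks each dftest value for membership in dftrain["text"],
# so the two DataFrames may have different lengths and index labels.
mask = dftest["text"].isin(dftrain["text"])

# Keep only the rows of dftest whose text also occurs in dftrain.
common = dftest.loc[mask, "text"]

print(common.tolist())  # ['How are you?']
```

The match is exact and case-sensitive, which is why hello and Hello are not treated as equal here.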