Home > Blockchain >  Pandas search a dataframe for substrings using the values from a separate dataframe
Pandas search a dataframe for substrings using the values from a separate dataframe

Time:05-25

I've gone around with this but can't seem to get it right. Basically I have an excel spreadsheet with multiple sheets, for this python program I only care about two of the sheets, and more specifically one column from each sheet. I want to take all of the values from one datframe/column (A) and see if a second dataframe/column (B) contains a substring of that value from column A. Ultimately I just want a csv output of the cases where a value from column A does not have a substring match in column B.

So I've got this so far

Reading in the excel file and creating two dataframes each using the column I'm interested in:

df_A = pd.read_excel('test.xlsx',
        sheet_name='Sheet_1',
        usecols=['Column_A'])

df_B = pd.read_excel('test.xlsx',
        sheet_name='Sheet_2',
        usecols=['Column_B'])

Here are the contents of the dataframes:

Column_A
20220201_ABC_TEST-00012345_987654
20220201_ABC_TEST-00012346_987654
20220201_ABC_TEST-00012347_987654
20220201_ABC_TEST-00012351_987654
20220201_ABC_TEST-00012352_987654
20220201_ABC_TEST-00012353_987654

Column_B
TEST-00012345
TEST-00012346
TEST-00012347
TEST-00012348
TEST-00012349
TEST-00012350

And this is the part that I don't know how to do properly, that is take all the values from df_A and comparing them to all the values in df_B to find out which values from df_A don't have substring matches in df_B.

substring_matches = df_A.str.contains(df_B)
print(substring_matches)

This gives an error:

AttributeError: 'DataFrame' object has no attribute 'str'

So something I'm doing is not quite right here.

CodePudding user response:

Try this:

substring_matches = df_A['Column_A'].apply(lambda s1: df_B['Column B'].apply(lambda s2: s2 in s1).any())
  • Related