I've gone around with this but can't seem to get it right. Basically I have an excel spreadsheet with multiple sheets, for this python program I only care about two of the sheets, and more specifically one column from each sheet. I want to take all of the values from one datframe/column (A) and see if a second dataframe/column (B) contains a substring of that value from column A. Ultimately I just want a csv output of the cases where a value from column A does not have a substring match in column B.
So I've got this so far
Reading in the excel file and creating two dataframes each using the column I'm interested in:
df_A = pd.read_excel('test.xlsx',
sheet_name='Sheet_1',
usecols=['Column_A'])
df_B = pd.read_excel('test.xlsx',
sheet_name='Sheet_2',
usecols=['Column_B'])
Here are the contents of the dataframes:
Column_A
20220201_ABC_TEST-00012345_987654
20220201_ABC_TEST-00012346_987654
20220201_ABC_TEST-00012347_987654
20220201_ABC_TEST-00012351_987654
20220201_ABC_TEST-00012352_987654
20220201_ABC_TEST-00012353_987654
Column_B
TEST-00012345
TEST-00012346
TEST-00012347
TEST-00012348
TEST-00012349
TEST-00012350
And this is the part that I don't know how to do properly, that is take all the values from df_A and comparing them to all the values in df_B to find out which values from df_A don't have substring matches in df_B.
substring_matches = df_A.str.contains(df_B)
print(substring_matches)
This gives an error:
AttributeError: 'DataFrame' object has no attribute 'str'
So something I'm doing is not quite right here.
CodePudding user response:
Try this:
substring_matches = df_A['Column_A'].apply(lambda s1: df_B['Column B'].apply(lambda s2: s2 in s1).any())