With str.contains(), can I back out which comparators resulted in a 'False' result?-CodePudding

I have a Pandas dataframe, df1, that contains a column of names along with associated data columns. I have another, df2, with a column of names derived from a different source. There are formatting differences between the two. In this example df2 drops suffixes and uses nicknames instead of formal names (but I can't know what other format disparities might exist in the dataset):

df1 = pd.DataFrame([('Jordan Smith Jr.', 25, 180),
                   ('Andrew Johnson', 34, 200),
                   ('Anthony Tipton III', 42, 175),
                   ('Chris Black', 24, 160)],
                   columns=('name', 'age', 'weight'))

df2 = pd.DataFrame([('Jordan Smith'),
                   ('Drew Johnson'),
                   ('Anthony Tipton')],
                   columns=('name'))

I want to filter and return rows within df1 for people that exist in df2. df1 is inclusive of df2, so the only boolean comparison failures will be due to formatting differences. I want to collect the set of names in df2 that don't compare favorably so that I can inspect them and handle accordingly. I can filter df1 like this:

df1 = df1[df1['name'].str.contains('|'.join(df2['name'])]

This handles the suffix disparity and reduces the number of comparison failures in a way I'm comfortable with, but is there an easy way to collect the members of df2 that did not compare favorably? I can do it iteratively with a loop, but I feel like there's likely a more elegant way to do it.

---------------EDIT------------------

The code above returns half of what I want and I'm not looking for help with it (unless there's a better and completely different way to approach this):

>>>df1
[('Jordan Smith Jr.', 25, 180),
 ('Anthony Tipton III', 42, 175)]

I'm looking for a way to back out the names within df2 that did not resolve as 'True' for any 'name' in df1 when this check is evaluated:

df1['name'].str.contains('|'.join(df2['name'])

I can accomplish this with something like:

for name in df2:
    temp = df1['name'].str.contains(name)
    if temp.any():
        *append name to collection list/dataframe/etc.*

but I feel like there's a better way.

CodePudding user response：

IIUC, you want to identify names in df2 that never matched in df1?

You can use set operations on extractall:

set(df2['name']).difference(df1['name'].str.extractall(f"({'|'.join(df2['name'])})")[0])

Output: {'Drew Johnson'}

Intermediate, with all matched terms:

df1['name'].str.extractall(f"({'|'.join(df2['name'])})")[0]

   match
0  0          Jordan Smith
2  0        Anthony Tipton
Name: 0, dtype: object

CodePudding user response：

Since none of the names have a full match in both DataFrames, I will add one to illustrate:

In[1]: df1 = pd.DataFrame([('Jordan Smith Jr.', 25, 180),
                           ('Andrew Johnson', 34, 200),
                           ('Anthony Tipton III', 42, 175),
                           ('Chris Black', 24, 160)],
                          columns=('name', 'age', 'weight'))

In[2]: df2 = pd.DataFrame([('Jordan Smith'),
                           ('Drew Johnson'),
                           ('Anthony Tipton'),
                           ('Chris Black')],  # <-- added one here
                          columns=['name'])  # had to put it into square brackets to work

Then you may find those names that fully intersect by finding .intersection() between two sets of names:

In[1]: name_intersection = list(set(df1['name']).intersection(set(df2['name'])))
In[2]: name_intersection

Out[2]: ['Chris Black']  # shows fully intersecting names only

Then you may filter out matching names from df1 by using ~ before the condition:

In[1]: df_unmatched = df1[~df1.name.isin(name_intersection)]
In[2]: df_unmatched

Out[2]:     name                 age  weight
        0   Jordan Smith Jr.     25   180
        1   Andrew Johnson       34   200
        2   Anthony Tipton III   42   175

This would work if your question was to find non-identical names. You can further use your way to filter out names, but without additional examples I can't provide you a general way, hence 'Andrew' is 'Drew', but there could be 'Joseph' that is 'Joe', or some other pattern.