Look in df list, return Boolean if all list elements have a substring-CodePudding

I have a dataframe with a string column that contains a sequence of author names and their affiliations.

Address
'Smith, Jane (University of X); Doe, Betty (Institute of Y)'
'Walter, Bill (Z University); Albertson, John (Z University); Chen, Hilary (University of X)'
'Note, Joe (University of X); Cal, Stephanie (University of X)'

I want to create a new column with a Boolean TRUE/FALSE that tests if all authors are from University X. Note there can be any number of authors in the string.

Desired output:

T/F
FALSE
FALSE
TRUE

I think I can split the Address column using

df['Address_split'] = df['Address'].str.split(';', expand=False)

which then creates the list of names in the cell.

Address_split
['Smith, Jane (University of X)', 'Doe, Betty (University of Y)']

I even think I can use the all() function to test for a Boolean for one cell at a time.

all([("University X" in i) for i in df['Address_split'][2]]) returns TRUE

But I am struggling to think through how I can do this on each cell's list individually. I think I need some combination of map and/or apply.

CodePudding user response：

You can split but expand so you can stack into one big Series. Then you can use extract to get the name and location.

Then your check is that all the values are 'University of X' which can be done with an equality comparison all within a groupby. Since the grouping is based on the original index you can simply assign the result back to the original DataFrame

s = (df['Address'].str.split(';', expand=True).stack()
       .str.extract('(.*)\s\((.*)\)')
       .rename(columns={0: 'name', 1: 'location'}))
#                 name         location
#0 0       Smith, Jane  University of X
#  1        Doe, Betty   Institute of Y
#1 0      Walter, Bill     Z University
#  1   Albertson, John     Z University
#  2      Chen, Hilary  University of X
#2 0         Note, Joe  University of X
#  1    Cal, Stephanie  University of X

df['T/F'] = s['location'].eq('University of X').groupby(level=0).all()

print(df)                                                                                                                         
                                             Address    T/F
0  Smith, Jane (University of X); Doe, Betty (Ins...  False
1  Walter, Bill (Z University); Albertson, John (...  False
2  Note, Joe (University of X); Cal, Stephanie (U...   True

CodePudding user response：

You can use str.extractall to extract all the universities in parentheses and check if matches with University of X.

df['T/F'] = df['Address'].str.extractall(r"\(([^)]*)\)").eq('University of X').groupby(level=0).all()

                                             Address    T/F
0  'Smith, Jane (University of X); Doe, Betty (In...  False
1  'Walter, Bill (Z University); Albertson, John ...  False
2  'Note, Joe (University of X); Cal, Stephanie (...   True

CodePudding user response：

Here are some other options:

u = 'University of X'

df['Address'].str.count(u).eq(df['Address'].str.count(';') 1)

df['Address'].str.findall('([\w ] )(?=\))').map(lambda x: set(x) == {u})

Output:

0    False
1    False
2     True