I have a dataframe with a string column that contains a sequence of author names and their affiliations.
Address |
---|
'Smith, Jane (University of X); Doe, Betty (Institute of Y)' |
'Walter, Bill (Z University); Albertson, John (Z University); Chen, Hilary (University of X)' |
'Note, Joe (University of X); Cal, Stephanie (University of X)' |
I want to create a new column with a Boolean TRUE/FALSE that tests if all authors are from University X. Note there can be any number of authors in the string.
Desired output:
T/F |
---|
FALSE |
FALSE |
TRUE |
I think I can split the Address
column using
df['Address_split'] = df['Address'].str.split(';', expand=False)
which then creates the list of names in the cell.
Address_split |
---|
['Smith, Jane (University of X)', 'Doe, Betty (University of Y)'] |
I even think I can use the all()
function to test for a Boolean for one cell at a time.
all([("University X" in i) for i in df['Address_split'][2]])
returns TRUE
But I am struggling to think through how I can do this on each cell's list individually. I think I need some combination of map
and/or apply
.
CodePudding user response:
You can split
but expand
so you can stack
into one big Series. Then you can use extract
to get the name and location.
Then your check is that all the values are 'University of X'
which can be done with an equality comparison all
within a groupby
. Since the grouping is based on the original index you can simply assign the result back to the original DataFrame
s = (df['Address'].str.split(';', expand=True).stack()
.str.extract('(.*)\s\((.*)\)')
.rename(columns={0: 'name', 1: 'location'}))
# name location
#0 0 Smith, Jane University of X
# 1 Doe, Betty Institute of Y
#1 0 Walter, Bill Z University
# 1 Albertson, John Z University
# 2 Chen, Hilary University of X
#2 0 Note, Joe University of X
# 1 Cal, Stephanie University of X
df['T/F'] = s['location'].eq('University of X').groupby(level=0).all()
print(df)
Address T/F
0 Smith, Jane (University of X); Doe, Betty (Ins... False
1 Walter, Bill (Z University); Albertson, John (... False
2 Note, Joe (University of X); Cal, Stephanie (U... True
CodePudding user response:
You can use str.extractall
to extract all the universities in parentheses and check if matches with University of X
.
df['T/F'] = df['Address'].str.extractall(r"\(([^)]*)\)").eq('University of X').groupby(level=0).all()
Address T/F
0 'Smith, Jane (University of X); Doe, Betty (In... False
1 'Walter, Bill (Z University); Albertson, John ... False
2 'Note, Joe (University of X); Cal, Stephanie (... True
CodePudding user response:
Here are some other options:
u = 'University of X'
df['Address'].str.count(u).eq(df['Address'].str.count(';') 1)
or
df['Address'].str.findall('([\w ] )(?=\))').map(lambda x: set(x) == {u})
Output:
0 False
1 False
2 True