I have a DataFrame containing a column of names, and I want to extract the last name into a new column. However, I am running into a problem.
Here is a toy example of my dataframe:
Candidate_Name Party State District Office Year Img_URL
961 Heather Mizeur D Maryland 1 House 2022 https://images.ctfassets.net/00vgtve3ank7/3v1O...
962 Heidi Campbell D Tennessee 5 House 2022 https://images.ctfassets.net/00vgtve3ank7/BbSQ...
963 Helen Brady R Massachusetts 9 House 2020 https://images.ctfassets.net/00vgtve3ank7/6WmS...
964 Henry Cuellar D Texas 28 House 2022 https://images.ctfassets.net/00vgtve3ank7/4GGP...
965 Henry Cuellar D Texas 28 House 2020 https://images.ctfassets.net/00vgtve3ank7/3xNd...
966 Henry Cuellar D Texas 28 House 2018 https://images.ctfassets.net/00vgtve3ank7/uCK7...
967 Henry Martin D Missouri 6 House 2022 https://images.ctfassets.net/00vgtve3ank7/5rfd...
968 Henry Robert Martin D Missouri 6 House 2018 https://images.ctfassets.net/00vgtve3ank7/MvL8...
969 Herb Jones D Virginia 1 House 2022 https://images.ctfassets.net/00vgtve3ank7/47Uy...
970 Herman West Jr. R Georgia 2 House 2018 https://images.ctfassets.net/00vgtve3ank7/534y...
971 Hilary Turner D West Virginia 3 House 2020 https://images.ctfassets.net/00vgtve3ank7/3ZIN...
972 Hillary O'Connor Mueri D Ohio 14 House 2020 https://images.ctfassets.net/00vgtve3ank7/5i5w...
973 Hillary Scholten D Michigan 3 House 2022 https://images.ctfassets.net/00vgtve3ank7/47KO...
974 Hillary Scholten D Michigan 3 House 2020 https://images.ctfassets.net/00vgtve3ank7/3g47...
975 Hiral Tipirneni D Arizona 8 House 2018 https://images.ctfassets.net/00vgtve3ank7/3e9V...
976 Hiral Tipirneni D Arizona 6 House 2020 https://images.ctfassets.net/00vgtve3ank7/1APF...
977 Holden Hoggatt R Louisiana 3 House 2022 https://images.ctfassets.net/00vgtve3ank7/4tQP...
978 Homer Markel D Illinois 12 House 2022 https://images.ctfassets.net/00vgtve3ank7/3XXY...
979 Hosea Cleveland D South Carolina 3 House 2020 https://images.ctfassets.net/00vgtve3ank7/FZKi...
980 Hung Cao R Virginia 10 House 2022 https://images.ctfassets.net/00vgtve3ank7/4Aql...
981 Ian Todd D Minnesota 6 House 2018 https://images.ctfassets.net/00vgtve3ank7/3WkL...
982 Ike McCorkle D Colorado 4 House 2022 https://images.ctfassets.net/00vgtve3ank7/d7UB...
983 Ilhan Omar D Minnesota 5 House 2022 https://images.ctfassets.net/00vgtve3ank7/3TS6...
984 Ilhan Omar D Minnesota 5 House 2020 https://images.ctfassets.net/00vgtve3ank7/4EDC...
985 Ilhan Omar D Minnesota 5 House 2018 https://images.ctfassets.net/00vgtve3ank7/4n9U...
986 Irene Armendariz-Jackson R Texas 16 House 2022 https://images.ctfassets.net/00vgtve3ank7/2nbG...
987 Irene Armendariz-Jackson R Texas 16 House 2020 https://images.ctfassets.net/00vgtve3ank7/6wKS...
988 Iro Omere D Texas 4 House 2022 https://images.ctfassets.net/00vgtve3ank7/gewL...
989 Isaac McCorkle D Colorado 4 House 2020 https://images.ctfassets.net/00vgtve3ank7/4sUc...
990 J. Michael Galbraith D Ohio 5 House 2018 https://images.ctfassets.net/00vgtve3ank7/27nQ...
I wrote a function that I've called names:
def names(df):
    """
    Description: Function to extract last names of candidate
    Parameters: df: pandas dataFrame object
    Depends on pandas and re
    """
    if df["Candidate_Name"].eq('Jr.').all():
        lastName = df["Candidate_Name"].str.split(' ').str.get(-4)
    else:
        lastName = df["Candidate_Name"].str.split(' ').str.get(-2)
    return lastName
The goal of this function is to grab the last name from the Candidate_Name column. However, some candidates go by Jr., which adds a complication that I tried to handle with an if-else statement. Something is going wrong, though.
When I run the following:
df["Last_Name"] = names(df)
I am getting this:
Candidate_Name Party State District Office Year Img_URL Last_Name
961 Heather Mizeur D Maryland 1 House 2022 https://images.ctfassets.net/00vgtve3ank7/3v1O... Mizeur
962 Heidi Campbell D Tennessee 5 House 2022 https://images.ctfassets.net/00vgtve3ank7/BbSQ... Campbell
963 Helen Brady R Massachusetts 9 House 2020 https://images.ctfassets.net/00vgtve3ank7/6WmS... Brady
964 Henry Cuellar D Texas 28 House 2022 https://images.ctfassets.net/00vgtve3ank7/4GGP... Cuellar
965 Henry Cuellar D Texas 28 House 2020 https://images.ctfassets.net/00vgtve3ank7/3xNd... Cuellar
966 Henry Cuellar D Texas 28 House 2018 https://images.ctfassets.net/00vgtve3ank7/uCK7... Cuellar
967 Henry Martin D Missouri 6 House 2022 https://images.ctfassets.net/00vgtve3ank7/5rfd... Martin
968 Henry Robert Martin D Missouri 6 House 2018 https://images.ctfassets.net/00vgtve3ank7/MvL8... Martin
969 Herb Jones D Virginia 1 House 2022 https://images.ctfassets.net/00vgtve3ank7/47Uy... Jones
970 Herman West Jr. R Georgia 2 House 2018 https://images.ctfassets.net/00vgtve3ank7/534y... Jr.
971 Hilary Turner D West Virginia 3 House 2020 https://images.ctfassets.net/00vgtve3ank7/3ZIN... Turner
972 Hillary O'Connor Mueri D Ohio 14 House 2020 https://images.ctfassets.net/00vgtve3ank7/5i5w... Mueri
973 Hillary Scholten D Michigan 3 House 2022 https://images.ctfassets.net/00vgtve3ank7/47KO... Scholten
974 Hillary Scholten D Michigan 3 House 2020 https://images.ctfassets.net/00vgtve3ank7/3g47... Scholten
975 Hiral Tipirneni D Arizona 8 House 2018 https://images.ctfassets.net/00vgtve3ank7/3e9V... Tipirneni
976 Hiral Tipirneni D Arizona 6 House 2020 https://images.ctfassets.net/00vgtve3ank7/1APF... Tipirneni
977 Holden Hoggatt R Louisiana 3 House 2022 https://images.ctfassets.net/00vgtve3ank7/4tQP... Hoggatt
978 Homer Markel D Illinois 12 House 2022 https://images.ctfassets.net/00vgtve3ank7/3XXY... Markel
979 Hosea Cleveland D South Carolina 3 House 2020 https://images.ctfassets.net/00vgtve3ank7/FZKi... Cleveland
980 Hung Cao R Virginia 10 House 2022 https://images.ctfassets.net/00vgtve3ank7/4Aql... Cao
981 Ian Todd D Minnesota 6 House 2018 https://images.ctfassets.net/00vgtve3ank7/3WkL... Todd
982 Ike McCorkle D Colorado 4 House 2022 https://images.ctfassets.net/00vgtve3ank7/d7UB... McCorkle
983 Ilhan Omar D Minnesota 5 House 2022 https://images.ctfassets.net/00vgtve3ank7/3TS6... Omar
984 Ilhan Omar D Minnesota 5 House 2020 https://images.ctfassets.net/00vgtve3ank7/4EDC... Omar
985 Ilhan Omar D Minnesota 5 House 2018 https://images.ctfassets.net/00vgtve3ank7/4n9U... Omar
986 Irene Armendariz-Jackson R Texas 16 House 2022 https://images.ctfassets.net/00vgtve3ank7/2nbG... Armendariz-Jackson
987 Irene Armendariz-Jackson R Texas 16 House 2020 https://images.ctfassets.net/00vgtve3ank7/6wKS... Armendariz-Jackson
988 Iro Omere D Texas 4 House 2022 https://images.ctfassets.net/00vgtve3ank7/gewL... Omere
989 Isaac McCorkle D Colorado 4 House 2020 https://images.ctfassets.net/00vgtve3ank7/4sUc... McCorkle
990 J. Michael Galbraith D Ohio 5 House 2018 https://images.ctfassets.net/00vgtve3ank7/27nQ... Galbraith
So it is obviously not ignoring the element that contains Jr. in it... (see row 970 for example).
What did I do incorrectly here? I've played around with different values for the str.get() function, but it still keeps giving me that. I also don't want to do it from the other direction because some folks have middle initials or go by their middle name (see row 990 for an example). So what is not working here? Why is the if-else statement not catching it?
CodePudding user response:
Perhaps you can use a regular expression to extract the last name (without Jr., Sr., etc.):
df["Last_Name"] = df["Candidate_Name"].str.extract(r"([^\s]+)\s*(?=Jr\.|Sr\.|$)")
print(df[["Candidate_Name", "Last_Name"]])
Prints:
961 Heather Mizeur Mizeur
962 Heidi Campbell Campbell
963 Helen Brady Brady
964 Henry Cuellar Cuellar
965 Henry Cuellar Cuellar
966 Henry Cuellar Cuellar
967 Henry Martin Martin
968 Henry Robert Martin Martin
969 Herb Jones Jones
970 Herman West Jr. West
971 Hilary Turner Turner
972 Hillary O'Connor Mueri Mueri
973 Hillary Scholten Scholten
974 Hillary Scholten Scholten
975 Hiral Tipirneni Tipirneni
976 Hiral Tipirneni Tipirneni
977 Holden Hoggatt Hoggatt
978 Homer Markel Markel
979 Hosea Cleveland Cleveland
980 Hung Cao Cao
981 Ian Todd Todd
982 Ike McCorkle McCorkle
983 Ilhan Omar Omar
984 Ilhan Omar Omar
985 Ilhan Omar Omar
986 Irene Armendariz-Jackson Armendariz-Jackson
987 Irene Armendariz-Jackson Armendariz-Jackson
988 Iro Omere Omere
989 Isaac McCorkle McCorkle
990 J. Michael Galbraith Galbraith
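For anyone who wants to try the pattern in isolation, here is a minimal sketch on a handful of names copied from the question. The `sample` frame and the `pattern` variable are made up for illustration, and `\S+` is used as an equivalent spelling of `[^\s]+`. The lookahead only knows about the `Jr.` and `Sr.` suffixes, so other suffixes (for example `III`) would need to be added to it.

```python
import pandas as pd

# Hypothetical mini-sample built from names shown in the question.
sample = pd.DataFrame(
    {
        "Candidate_Name": [
            "Heather Mizeur",
            "Henry Robert Martin",
            "Herman West Jr.",
            "J. Michael Galbraith",
            "Irene Armendariz-Jackson",
        ]
    }
)

# Capture a run of non-whitespace characters that is followed by
# optional whitespace and then either "Jr.", "Sr.", or the end of the string.
pattern = r"(\S+)\s*(?=Jr\.|Sr\.|$)"

# expand=False returns a Series instead of a one-column DataFrame.
sample["Last_Name"] = sample["Candidate_Name"].str.extract(pattern, expand=False)
print(sample)
```

The suffix is never captured because it sits inside the lookahead, which checks the following text without consuming it.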
CodePudding user response:
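You are using:
if df["Candidate_Name"].eq('Jr.').all()
This checks whether every value in df['Candidate_Name'] is exactly equal to 'Jr.', which is never the case. On top of that, .all() reduces the whole column to a single boolean, so the if-else picks one branch for the entire column instead of deciding row by row. You should vectorize the check and test each name for the suffix with str.contains(). Consider using this: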
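import numpy as np

# Use the second-to-last token when the name contains "Jr.", otherwise the last token.
df["Last_Name"] = np.where(df['Candidate_Name'].str.contains(r'Jr\.', case=False, regex=True),
                           df["Candidate_Name"].str.split().str[-2],
                           df['Candidate_Name'].str.split().str[-1])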
This is more efficient and, based on your data, returns the expected output.
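To see why the original condition never fires, it may help to inspect each piece on a tiny Series. The sketch below is illustrative only: the variable names are invented, the three names are copied from the question, and it uses `str.endswith` rather than `str.contains` as one possible way to build the per-row mask.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample taken from the question's data.
names = pd.Series(
    ["Herb Jones", "Herman West Jr.", "Hilary Turner"], name="Candidate_Name"
)

# .eq("Jr.") compares each WHOLE string to "Jr.", so every entry is False.
whole_match = names.eq("Jr.")

# .all() collapses the column to one boolean, so a plain `if`
# can only ever choose a single branch for all rows.
single_flag = bool(whole_match.all())

# A per-row mask instead: True only where the name ends with the suffix.
has_suffix = names.str.endswith("Jr.")

# Split once, then pick the second-to-last or last token row by row.
tokens = names.str.split()
last_name = pd.Series(
    np.where(has_suffix, tokens.str[-2], tokens.str[-1]), index=names.index
)

print(single_flag)
print(last_name.tolist())
```

Here `single_flag` is `False`, which is why the `else` branch ran for every row in the question, while the mask-based version picks the token before the suffix only for the row that has one.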