str.get() not grabbing correct element after str.split

Time:11-14

I have a DataFrame containing a column of names, and I want to extract the last name into a new column. However, I am running into a problem.

Here is a toy example of my dataframe:

                        Candidate_Name Party           State District Office  Year                                            Img_URL
961                    Heather Mizeur      D        Maryland        1  House  2022  https://images.ctfassets.net/00vgtve3ank7/3v1O...
962                    Heidi Campbell      D       Tennessee        5  House  2022  https://images.ctfassets.net/00vgtve3ank7/BbSQ...
963                       Helen Brady      R   Massachusetts        9  House  2020  https://images.ctfassets.net/00vgtve3ank7/6WmS...
964                     Henry Cuellar      D           Texas       28  House  2022  https://images.ctfassets.net/00vgtve3ank7/4GGP...
965                     Henry Cuellar      D           Texas       28  House  2020  https://images.ctfassets.net/00vgtve3ank7/3xNd...
966                     Henry Cuellar      D           Texas       28  House  2018  https://images.ctfassets.net/00vgtve3ank7/uCK7...
967                      Henry Martin      D        Missouri        6  House  2022  https://images.ctfassets.net/00vgtve3ank7/5rfd...
968               Henry Robert Martin      D        Missouri        6  House  2018  https://images.ctfassets.net/00vgtve3ank7/MvL8...
969                        Herb Jones      D        Virginia        1  House  2022  https://images.ctfassets.net/00vgtve3ank7/47Uy...
970                   Herman West Jr.      R         Georgia        2  House  2018  https://images.ctfassets.net/00vgtve3ank7/534y...
971                     Hilary Turner      D   West Virginia        3  House  2020  https://images.ctfassets.net/00vgtve3ank7/3ZIN...
972            Hillary O'Connor Mueri      D            Ohio       14  House  2020  https://images.ctfassets.net/00vgtve3ank7/5i5w...
973                  Hillary Scholten      D        Michigan        3  House  2022  https://images.ctfassets.net/00vgtve3ank7/47KO...
974                  Hillary Scholten      D        Michigan        3  House  2020  https://images.ctfassets.net/00vgtve3ank7/3g47...
975                   Hiral Tipirneni      D         Arizona        8  House  2018  https://images.ctfassets.net/00vgtve3ank7/3e9V...
976                   Hiral Tipirneni      D         Arizona        6  House  2020  https://images.ctfassets.net/00vgtve3ank7/1APF...
977                    Holden Hoggatt      R       Louisiana        3  House  2022  https://images.ctfassets.net/00vgtve3ank7/4tQP...
978                      Homer Markel      D        Illinois       12  House  2022  https://images.ctfassets.net/00vgtve3ank7/3XXY...
979                   Hosea Cleveland      D  South Carolina        3  House  2020  https://images.ctfassets.net/00vgtve3ank7/FZKi...
980                          Hung Cao      R        Virginia       10  House  2022  https://images.ctfassets.net/00vgtve3ank7/4Aql...
981                          Ian Todd      D       Minnesota        6  House  2018  https://images.ctfassets.net/00vgtve3ank7/3WkL...
982                      Ike McCorkle      D        Colorado        4  House  2022  https://images.ctfassets.net/00vgtve3ank7/d7UB...
983                        Ilhan Omar      D       Minnesota        5  House  2022  https://images.ctfassets.net/00vgtve3ank7/3TS6...
984                        Ilhan Omar      D       Minnesota        5  House  2020  https://images.ctfassets.net/00vgtve3ank7/4EDC...
985                        Ilhan Omar      D       Minnesota        5  House  2018  https://images.ctfassets.net/00vgtve3ank7/4n9U...
986          Irene Armendariz-Jackson      R           Texas       16  House  2022  https://images.ctfassets.net/00vgtve3ank7/2nbG...
987          Irene Armendariz-Jackson      R           Texas       16  House  2020  https://images.ctfassets.net/00vgtve3ank7/6wKS...
988                         Iro Omere      D           Texas        4  House  2022  https://images.ctfassets.net/00vgtve3ank7/gewL...
989                    Isaac McCorkle      D        Colorado        4  House  2020  https://images.ctfassets.net/00vgtve3ank7/4sUc...
990              J. Michael Galbraith      D            Ohio        5  House  2018  https://images.ctfassets.net/00vgtve3ank7/27nQ...

I wrote a function that I've called names:

def names(df):
    """
    Description: Function to extract last names of candidate

    Parameters: df: pandas dataFrame object

    Depends on pandas and re
    """
    if df["Candidate_Name"].eq('Jr.').all():
        lastName = df["Candidate_Name"].str.split(' ').str.get(-4)
    else:
        lastName = df["Candidate_Name"].str.split(' ').str.get(-2)
    return lastName

The goal of this function is to grab the last name from the Candidate_Name column. However, some folks go by Jr., which adds a complication that I tried to handle with an if-else statement. Something is going wrong, though.

Because when I run the following:

df["Last_Name"] = names(df)

I am getting this:

                        Candidate_Name Party           State District Office  Year                                            Img_URL           Last_Name
961                    Heather Mizeur      D        Maryland        1  House  2022  https://images.ctfassets.net/00vgtve3ank7/3v1O...              Mizeur
962                    Heidi Campbell      D       Tennessee        5  House  2022  https://images.ctfassets.net/00vgtve3ank7/BbSQ...            Campbell
963                       Helen Brady      R   Massachusetts        9  House  2020  https://images.ctfassets.net/00vgtve3ank7/6WmS...               Brady
964                     Henry Cuellar      D           Texas       28  House  2022  https://images.ctfassets.net/00vgtve3ank7/4GGP...             Cuellar
965                     Henry Cuellar      D           Texas       28  House  2020  https://images.ctfassets.net/00vgtve3ank7/3xNd...             Cuellar
966                     Henry Cuellar      D           Texas       28  House  2018  https://images.ctfassets.net/00vgtve3ank7/uCK7...             Cuellar
967                      Henry Martin      D        Missouri        6  House  2022  https://images.ctfassets.net/00vgtve3ank7/5rfd...              Martin
968               Henry Robert Martin      D        Missouri        6  House  2018  https://images.ctfassets.net/00vgtve3ank7/MvL8...              Martin
969                        Herb Jones      D        Virginia        1  House  2022  https://images.ctfassets.net/00vgtve3ank7/47Uy...               Jones
970                   Herman West Jr.      R         Georgia        2  House  2018  https://images.ctfassets.net/00vgtve3ank7/534y...                 Jr.
971                     Hilary Turner      D   West Virginia        3  House  2020  https://images.ctfassets.net/00vgtve3ank7/3ZIN...              Turner
972            Hillary O'Connor Mueri      D            Ohio       14  House  2020  https://images.ctfassets.net/00vgtve3ank7/5i5w...               Mueri
973                  Hillary Scholten      D        Michigan        3  House  2022  https://images.ctfassets.net/00vgtve3ank7/47KO...            Scholten
974                  Hillary Scholten      D        Michigan        3  House  2020  https://images.ctfassets.net/00vgtve3ank7/3g47...            Scholten
975                   Hiral Tipirneni      D         Arizona        8  House  2018  https://images.ctfassets.net/00vgtve3ank7/3e9V...           Tipirneni
976                   Hiral Tipirneni      D         Arizona        6  House  2020  https://images.ctfassets.net/00vgtve3ank7/1APF...           Tipirneni
977                    Holden Hoggatt      R       Louisiana        3  House  2022  https://images.ctfassets.net/00vgtve3ank7/4tQP...             Hoggatt
978                      Homer Markel      D        Illinois       12  House  2022  https://images.ctfassets.net/00vgtve3ank7/3XXY...              Markel
979                   Hosea Cleveland      D  South Carolina        3  House  2020  https://images.ctfassets.net/00vgtve3ank7/FZKi...           Cleveland
980                          Hung Cao      R        Virginia       10  House  2022  https://images.ctfassets.net/00vgtve3ank7/4Aql...                 Cao
981                          Ian Todd      D       Minnesota        6  House  2018  https://images.ctfassets.net/00vgtve3ank7/3WkL...                Todd
982                      Ike McCorkle      D        Colorado        4  House  2022  https://images.ctfassets.net/00vgtve3ank7/d7UB...            McCorkle
983                        Ilhan Omar      D       Minnesota        5  House  2022  https://images.ctfassets.net/00vgtve3ank7/3TS6...                Omar
984                        Ilhan Omar      D       Minnesota        5  House  2020  https://images.ctfassets.net/00vgtve3ank7/4EDC...                Omar
985                        Ilhan Omar      D       Minnesota        5  House  2018  https://images.ctfassets.net/00vgtve3ank7/4n9U...                Omar
986          Irene Armendariz-Jackson      R           Texas       16  House  2022  https://images.ctfassets.net/00vgtve3ank7/2nbG...  Armendariz-Jackson
987          Irene Armendariz-Jackson      R           Texas       16  House  2020  https://images.ctfassets.net/00vgtve3ank7/6wKS...  Armendariz-Jackson
988                         Iro Omere      D           Texas        4  House  2022  https://images.ctfassets.net/00vgtve3ank7/gewL...               Omere
989                    Isaac McCorkle      D        Colorado        4  House  2020  https://images.ctfassets.net/00vgtve3ank7/4sUc...            McCorkle
990              J. Michael Galbraith      D            Ohio        5  House  2018  https://images.ctfassets.net/00vgtve3ank7/27nQ...           Galbraith

So it is obviously not ignoring the element that contains Jr. in it... (see row 970 for example).

What did I do incorrectly here? I've played around with different values for the str.get() function, but it still keeps giving me that. I also don't want to do it from the other direction because some folks have middle initials or go by their middle initial (see row 990 for an example). So what is not working here? Why is the if-else statement not catching it?

CodePudding user response:

Perhaps you can use a regular expression to extract the last name (without Jr., Sr. etc.):

df["Last_Name"] = df["Candidate_Name"].str.extract(r"([^\s]+)\s*(?=Jr\.|Sr\.|$)")

print(df[["Candidate_Name", "Last_Name"]])

Prints:

961            Heather Mizeur              Mizeur
962            Heidi Campbell            Campbell
963               Helen Brady               Brady
964             Henry Cuellar             Cuellar
965             Henry Cuellar             Cuellar
966             Henry Cuellar             Cuellar
967              Henry Martin              Martin
968       Henry Robert Martin              Martin
969                Herb Jones               Jones
970           Herman West Jr.                West
971             Hilary Turner              Turner
972    Hillary O'Connor Mueri               Mueri
973          Hillary Scholten            Scholten
974          Hillary Scholten            Scholten
975           Hiral Tipirneni           Tipirneni
976           Hiral Tipirneni           Tipirneni
977            Holden Hoggatt             Hoggatt
978              Homer Markel              Markel
979           Hosea Cleveland           Cleveland
980                  Hung Cao                 Cao
981                  Ian Todd                Todd
982              Ike McCorkle            McCorkle
983                Ilhan Omar                Omar
984                Ilhan Omar                Omar
985                Ilhan Omar                Omar
986  Irene Armendariz-Jackson  Armendariz-Jackson
987  Irene Armendariz-Jackson  Armendariz-Jackson
988                 Iro Omere               Omere
989            Isaac McCorkle            McCorkle
990      J. Michael Galbraith           Galbraith
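
If it helps to see what the pattern is doing, here is a small sketch that runs the same idea through the standard re module, outside pandas. It assumes the pattern is ([^\s]+)\s*(?=Jr\.|Sr\.|$); the helper name last_name and the sample list are made up for illustration.

```python
import re

# Capture a run of non-space characters that is followed (after optional
# spaces) by "Jr.", "Sr.", or the end of the string.
PATTERN = re.compile(r"([^\s]+)\s*(?=Jr\.|Sr\.|$)")


def last_name(full_name):
    """Hypothetical helper: return the first captured token, or None."""
    match = PATTERN.search(full_name)
    return match.group(1) if match else None


# A few names taken from the question's sample data.
samples = [
    "Heather Mizeur",
    "Herman West Jr.",
    "Hillary O'Connor Mueri",
    "Irene Armendariz-Jackson",
    "J. Michael Galbraith",
]
results = {name: last_name(name) for name in samples}

for name, surname in results.items():
    print(f"{name!r} -> {surname!r}")
```

Earlier tokens such as "Herman" are skipped because they are followed by another word rather than by a suffix or the end of the string. Note that the pattern only knows about "Jr." and "Sr." written with a period; other suffixes (for example "III") would still come back as the last name.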

CodePudding user response:

You are using:

if df["Candidate_Name"].eq('Jr.').all()

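This checks whether every value in Candidate_Name is exactly equal to 'Jr.', which is never the case. Also, .all() reduces the comparison to a single boolean, so the if-else picks one branch for the whole column instead of deciding row by row. You should vectorize the check with str.contains() instead. Consider using this: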

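import numpy as np

# Rows whose name contains "Jr." take the second-to-last token; all others take the last.
df["Last_Name"] = np.where(df["Candidate_Name"].str.contains("Jr.", case=False, regex=False),
                           df["Candidate_Name"].str.split().str[-2],
                           df["Candidate_Name"].str.split().str[-1])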

This is more efficient and, based on your data, returns the expected output.
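
To make the difference concrete, here is a rough sketch on a three-name Series (the sample names come from the question; the variable names are only for illustration). It shows that .eq('Jr.').all() collapses to one boolean, and then builds a per-row mask instead. The suffix pattern used here, covering "Jr." and "Sr." at the end of the name, is an assumption about what counts as a suffix in your data.

```python
import pandas as pd

names = pd.Series(["Heather Mizeur", "Herman West Jr.", "J. Michael Galbraith"])

# .eq("Jr.") compares each WHOLE string with "Jr.", so every entry is False.
whole_match = names.eq("Jr.")

# .all() then collapses the column to a single boolean, so an if/else on it
# can only ever choose one branch for all rows.
single_flag = bool(whole_match.all())

# A per-row decision needs a boolean mask with one entry per name.
# Assumption: a suffix is "Jr." or "Sr." at the very end of the name.
has_suffix = names.str.contains(r"\b(?:Jr|Sr)\.$", regex=True)

tokens = names.str.split()
# Keep the last token where there is no suffix, otherwise the one before it.
last_name = tokens.str[-1].where(~has_suffix, tokens.str[-2])

print(single_flag)
print(last_name.tolist())
```

The same mask could be passed to np.where; Series.where is used here only to keep the result as a pandas Series with the original index.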
