Efficient way to replace values in column if part of those values are in predefined lists in pandas-CodePudding

So I've actually solved this, but the way I'm doing it may not be the most efficient.

For a column in my database - Industry - I want to replace values. If a value contains the word "tech", "technology" or something similar, I want to replace that value with just the word "technology".

I've followed a basic algorithm below using apply which basically loops through a predefined list (e.g. science) and checks whether any of the values are present in the current Industry cell, and replaces if they are.

It then does the same for the next list. I only have two lists so far, but I'll likely have over a dozen once I'm finished.

def industry_convert(row):
    
    science = ["research", "science", "scientific", "scientist", "academia", "education", "academic"]
    tech = ["technology", "tech", "software"]

    for v in science:
        if v.lower() in row.Industry.lower():
            row.Industry = "Research, Science, & Education"
            
    for v in tech:
        if v.lower() in row.Industry.lower():
            row.Industry = "Technology"
            
    return row

df = df.apply(industry_convert, axis = 1)

I'm just wondering if this is the best way to do this, or if there is a more pythonic or pandas way of doing it?

EDIT:

This is what some of the Industry column looks like:

Industry
Research Scientist
Science: Education
Tech
Technical Assistance
Technology
Medical
Hospitality

This what it would look like after applying the code:

Industry            
Research, Science, & Education
Research, Science, & Education
Technology
Technology
Technology
Medical
Hospitality

CodePudding user response：

Personally, I would use str.contains and .loc to assign new values.

this will work a number of times faster than looping over each row individually to check. (Which is an anti pattern with regards to the pandas API)

science = ["research", "science", "scientific", "scientist", "academia", "education", "academic"]
tech = ["technology", "tech", "software"]

df.loc[df['Industry'].str.contains(f"{'|'.join(science)}",regex=True,case=False),
                         'industry_new'] = "Research, Science, & Education"

df.loc[df['Industry'].str.contains(f"{'|'.join(tech)}",regex=True,case=False),
                         'industry_new'] = "Technology"


df['industry_new'] = df['industry_new'].fillna(df['Industry'])

print(df)

               Industry                    industry_new
0    Research Scientist  Research, Science, & Education
1    Science: Education  Research, Science, & Education
2                  Tech                      Technology
3  Technical Assistance                      Technology
4            Technology                      Technology
5               Medical                         Medical
6           Hospitality                     Hospitality

CodePudding user response：

I updated the for loop, in your function:

science = list(map(lambda x:x.lower(),["research", "science", "scientific", "scientist", "academia", "education", "academic"]))
tech = list(map(lambda x:x.lower(),["technology", "tech", "software"]))
def industry_convert(row):
    global science,tech
    


  
     if row.Industry.lower() in science:
          row.Industry = "Research, Science, & Education"
            
    
     if row.Industry.lower() in science:
          row.Industry = "Technology"
            
    return row

df = df.apply(industry_convert, axis = 1)

I computed the lists to lower only once, so that it is not recomputed and the computation of the for loop is saved.