So I've actually solved this, but the way I'm doing it may not be the most efficient.
For a column in my database - Industry
- I want to replace values. If a value contains the word "tech", "technology" or something similar, I want to replace that value with just the word "technology".
I've followed a basic algorithm below using apply
which basically loops through a predefined list (e.g. science
) and checks whether any of the values are present in the current Industry
cell, and replaces if they are.
It then does the same for the next list. I only have two lists so far, but I'll likely have over a dozen once I'm finished.
def industry_convert(row):
science = ["research", "science", "scientific", "scientist", "academia", "education", "academic"]
tech = ["technology", "tech", "software"]
for v in science:
if v.lower() in row.Industry.lower():
row.Industry = "Research, Science, & Education"
for v in tech:
if v.lower() in row.Industry.lower():
row.Industry = "Technology"
return row
df = df.apply(industry_convert, axis = 1)
I'm just wondering if this is the best way to do this, or if there is a more pythonic
or pandas
way of doing it?
EDIT:
This is what some of the Industry column looks like:
Industry
Research Scientist
Science: Education
Tech
Technical Assistance
Technology
Medical
Hospitality
This what it would look like after applying the code:
Industry
Research, Science, & Education
Research, Science, & Education
Technology
Technology
Technology
Medical
Hospitality
CodePudding user response:
Personally, I would use str.contains
and .loc
to assign new values.
this will work a number of times faster than looping over each row individually to check. (Which is an anti pattern with regards to the pandas API)
science = ["research", "science", "scientific", "scientist", "academia", "education", "academic"]
tech = ["technology", "tech", "software"]
df.loc[df['Industry'].str.contains(f"{'|'.join(science)}",regex=True,case=False),
'industry_new'] = "Research, Science, & Education"
df.loc[df['Industry'].str.contains(f"{'|'.join(tech)}",regex=True,case=False),
'industry_new'] = "Technology"
df['industry_new'] = df['industry_new'].fillna(df['Industry'])
print(df)
Industry industry_new
0 Research Scientist Research, Science, & Education
1 Science: Education Research, Science, & Education
2 Tech Technology
3 Technical Assistance Technology
4 Technology Technology
5 Medical Medical
6 Hospitality Hospitality
CodePudding user response:
I updated the for loop, in your function:
science = list(map(lambda x:x.lower(),["research", "science", "scientific", "scientist", "academia", "education", "academic"]))
tech = list(map(lambda x:x.lower(),["technology", "tech", "software"]))
def industry_convert(row):
global science,tech
if row.Industry.lower() in science:
row.Industry = "Research, Science, & Education"
if row.Industry.lower() in science:
row.Industry = "Technology"
return row
df = df.apply(industry_convert, axis = 1)
I computed the lists to lower only once, so that it is not recomputed and the computation of the for loop is saved.