Home > front end >  Python Compute new variable based on substring
Python Compute new variable based on substring

Time:05-03

I have a dataframe with one column containing text, and I would like to create a new variable if a substring exists on a column

original df

team
1New York MetsMets1
2Seattle MarinersMariners2
3Los Angeles AngelsAngels3

new df

team                       code  
1New York MetsMets1        NYM 
2Seattle MarinersMariners2 SEA
3Los Angeles AngelsAngels3 ANA

So basically if the column contains the text 'Mets' then code column will be NYM, if it contains 'Mariners' then code equals SEA, etc. The reason for this is because the Team column can contain extraneous characters or extra information that I want to ignore, so I need some type of rule such that:

if exists in team(Mariners) then code = 'SEA'
if exists in team(New York Mets) then code = 'Mets'

Thanks!

CodePudding user response:

You should try something like this :

df.loc[df['team'].str.contains('Mariners'),'code']='SEA'

but with a loop depending of the number of different words. You should do some preprocessing, maybe put everything in lower case to be sure you are not missing mariners instead of Mariners

CodePudding user response:

Create dictonary for lookup values, get substrings by Series.str.extract and then mapping by Series.map:

d= {'Mariners':'SEA','New York Mets':'Mets'}
df['code'] = df['team'].str.extract(f'({"|".join(d)})', expand=False).map(d)

print (df)
                         team  code
0         1New York MetsMets1  Mets
1  2Seattle MarinersMariners2   SEA
2  3Los Angeles AngelsAngels3   NaN
  • Related