I have a dataframe with one column containing text, and I would like to create a new variable if a substring exists on a column
original df
team
1New York MetsMets1
2Seattle MarinersMariners2
3Los Angeles AngelsAngels3
new df
team code
1New York MetsMets1 NYM
2Seattle MarinersMariners2 SEA
3Los Angeles AngelsAngels3 ANA
So basically if the column contains the text 'Mets' then code column will be NYM, if it contains 'Mariners' then code equals SEA, etc. The reason for this is because the Team column can contain extraneous characters or extra information that I want to ignore, so I need some type of rule such that:
if exists in team(Mariners) then code = 'SEA'
if exists in team(New York Mets) then code = 'Mets'
Thanks!
CodePudding user response:
You should try something like this :
df.loc[df['team'].str.contains('Mariners'),'code']='SEA'
but with a loop depending of the number of different words. You should do some preprocessing, maybe put everything in lower case to be sure you are not missing mariners instead of Mariners
CodePudding user response:
Create dictonary for lookup values, get substrings by Series.str.extract
and then mapping by Series.map
:
d= {'Mariners':'SEA','New York Mets':'Mets'}
df['code'] = df['team'].str.extract(f'({"|".join(d)})', expand=False).map(d)
print (df)
team code
0 1New York MetsMets1 Mets
1 2Seattle MarinersMariners2 SEA
2 3Los Angeles AngelsAngels3 NaN