I've got a data frame with several instances of ' *Winner*'
after the person's name. I'd like to remove this so just their name remains. A simplified version of this is below:
Data = {'YEAR': [2020, 2020, 2020],
'NAME': [ 'Tom *Winner*', 'Dick *Winner*', 'Harry *Winner*']}
df = pd.DataFrame(Data)
df['NAME'] = df['NAME'].str.replace(' [*]Winner[*]', '')
df
Which results in this:
YEAR NAME
0 2020 Tom
1 2020 Dick *Winner*
2 2020 Harry
So my question is what is different about Dick's ' *Winner*'
tag vs. the other two? Why is this one not being removed? Is there a better way to accomplish this?
As an extra confusing note, I just copied my code above and re-ran it (assuming that's how people here might test it) and it does remove Dick's ' *Winner*'
tag. So something about my original data is different and is stripped out when pasted here. Not sure how to help reproduce it this being the case.
Edit: Per request from @user17242583, here is the result of df.to_dict
{'NAME': {0: 'Tom *Winner*', 1: 'Dick\xa0*Winner*', 2: 'Harry *Winner*'},
'YEAR': {0: 2020, 1: 2020, 2: 2020}}
CodePudding user response:
Try this:
df['NAME'] = df['NAME'].str.replace('\s*\*Winner\*', '', regex=True)
Output:
>>> df
YEAR NAME
0 2020 Tom
1 2020 Dick
2 2020 Harry
The reason that works is because your data has a \xa0
, which is a thin, non-breaking space. In your original code, you have a space hardcoded, but my code uses \s*
which means "0 or more of any whitespace characters," which includes \xa0
(char code 160).