I have a data set with a column that looks like this:
Badge Number
1
3
23 / gold
22 / silver
483
I need only the numbers. Here's my code:
import pandas as pd

df = pd.read_excel('badges.xlsx')
df['Badge Number'] = df['Badge Number'].str.extract(r'(\d+)')
print(df)
I was expecting an output like:
Badge Number
1
3
23
22
483
but I got:
Badge Number
NaN
NaN
23
22
NaN
Just to test, I dumped the dataframe to a .csv and read it back with pd.read_csv(). That gave me just the numbers, as I need (though of course that's not a solution)
I also tried
df['Badge Number'] = np.where(df['Badge Number'].str.isnumeric(), df['Badge Number'], df['Badge Number'].str.extract(r'(\d+)'))
but that just gave me all 1s. I know I am trying things I don't even remotely understand, but am hoping there's a straightforward solution.
CodePudding user response:
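That's almost certainly because read_excel loaded the purely numeric cells as integers, not strings. The .str methods return NaN for any element that isn't a string, so only the "23 / gold" style rows survive. It's also why the CSV round trip worked: read_csv parsed the whole column as text. Try filling the missing values with the original numbers:
df['Badge Number'] = df['Badge Number'].str.extract(r'(\d+)')[0].fillna(df['Badge Number'])  # append .astype(int) for an integer column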
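To check this without the spreadsheet, here is a minimal sketch that builds the column by hand. The DataFrame is an assumption about what read_excel produces for this file (Python ints mixed with strings); the values are the ones from the question.

```python
import pandas as pd

# Assumed stand-in for badges.xlsx: numeric cells as ints, the rest as strings.
df = pd.DataFrame({'Badge Number': [1, 3, '23 / gold', '22 / silver', 483]})

# .str.extract yields NaN for every element that is not a string.
extracted = df['Badge Number'].str.extract(r'(\d+)')[0]
print(extracted.isna().tolist())

# Fall back to the original value wherever extraction gave NaN, then cast.
fixed = extracted.fillna(df['Badge Number']).astype(int)
print(fixed.tolist())
```

The first print marks exactly the rows that were already integers, which matches the NaN pattern in the question.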
CodePudding user response:
Convert the column to strings first, and pass expand=False so that extract returns a Series instead of a DataFrame:
df['Badge Number'] = df['Badge Number'].astype(str).str.extract(r'(\d+)', expand=False)
df
Out[412]:
Badge Number
0 1
1 3
2 23
3 22
4 483
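Note that the values above are still strings. If an integer column is wanted, one possible sketch is to cast after extracting. The DataFrame here is a hand-built, assumed stand-in for badges.xlsx using the values from the question.

```python
import pandas as pd

# Assumed stand-in for badges.xlsx: ints mixed with strings.
df = pd.DataFrame({'Badge Number': [1, 3, '23 / gold', '22 / silver', 483]})

# Make every element a string, pull out the first run of digits, cast to int.
digits = df['Badge Number'].astype(str).str.extract(r'(\d+)', expand=False)
df['Badge Number'] = digits.astype(int)
print(df)
```

The cast would fail on a row with no digits at all (extract returns NaN there), so this assumes every badge entry contains a number.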