I have a DataFrame with lat/long values whose types should be float. However, some rows contain values like -74.128815°, with a '°' character at the end of the string.
id | Lat | Long |
---|---|---|
1 | 4.807 | -75.684 |
2 | 4.5405 | -75.6658 |
3 | -74.128815° | |
4 | 5.35002 | -72.4002 |
5 | 4.6774° | -75.693 |
I want to keep all float values as they are, but strip the '°' from the values that contain it (and then convert them to float), so that in the end I have this:
id | Lat | Long |
---|---|---|
1 | 4.807 | -75.684 |
2 | 4.5405 | -75.6658 |
3 | -74.128815 | |
4 | 5.35002 | -72.4002 |
5 | 4.6774 | -75.693 |
The DataFrame is named df. I have tried
df[df['Lat'].str.contains('°')]
which raises ValueError: Cannot mask with non-boolean array containing NA / NaN values.
I have also tried df['Lat'] = np.where(df['Lat'].str.contains('°'), df['Lat'][:-1], df['Lat'])
which raises ValueError: operands could not be broadcast together
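For what it's worth, both errors come from the mixed types in the column rather than from the '°' itself: .str.contains returns NaN for the rows that are already floats, so the result cannot be used as a boolean mask unless na=False is passed, and df['Lat'][:-1] drops the last row of the Series rather than the last character of each string (.str[:-1] does the latter). A rough sketch reproducing this, with the sample frame reconstructed from the tables above and the blank cell assumed to be NaN:
import numpy as np
import pandas as pd

# reconstructed sample data (assumption: the blank cell is NaN)
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Lat': [4.807, 4.5405, '-74.128815°', 5.35002, '4.6774°'],
    'Long': [-75.684, -75.6658, np.nan, -72.4002, -75.693],
})

# na=False turns the NaN produced for the float rows into False,
# so the result is a valid boolean mask
mask = df['Lat'].str.contains('°', na=False)

# .str[:-1] drops the last character of each string (NaN for float rows),
# unlike df['Lat'][:-1], which drops the last row of the Series
df['Lat'] = pd.to_numeric(np.where(mask, df['Lat'].str[:-1], df['Lat']))
The answers below reach the same result more directly.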
CodePudding user response:
You can replace the invalid character with an empty string and then just use pd.to_numeric:
import pandas as pd

degree_sign = u'\N{DEGREE SIGN}'  # or degree_sign = "°"
# strip the degree sign from both columns, then coerce them to numeric
df[['Lat', 'Long']].replace(degree_sign, '', regex=True)\
    .apply(pd.to_numeric, errors='coerce')
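Note that the expression above only returns the cleaned columns; if the result should be written back into df, a small usage sketch (assuming the columns are named Lat and Long as in the question):
df[['Lat', 'Long']] = df[['Lat', 'Long']].replace(degree_sign, '', regex=True)\
    .apply(pd.to_numeric, errors='coerce')
print(df.dtypes)  # Lat and Long should now be float64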
CodePudding user response:
Using regex, replace the trailing non-digit character(s) ('\D+$') in the columns, then convert to float:
df[['Lat', 'Long']].replace(r'\D+$', '', regex=True).astype(float)
       Lat        Long
0  4.80700  -75.684000
1  4.54050  -75.665800
2      NaN  -74.128815
3  5.35002  -72.400200
4  4.67740  -75.693000
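A practical difference between the two answers: astype(float) raises if any cell still contains text that the replacement did not remove, while pd.to_numeric(errors='coerce') turns such cells into NaN. A tiny illustration (the leftover value here is made up):
import pandas as pd

s = pd.Series(['4.807', 'unknown'])   # 'unknown' is a hypothetical leftover non-numeric value
# s.astype(float)                     # raises ValueError: could not convert string to float
pd.to_numeric(s, errors='coerce')     # gives [4.807, NaN] instead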