I need to remove "km" from distance
column:
[Distance]
0 114 km
1 114 km
2 9.1 km
3 33.1 km
4 182 km
5 93.2 km
6 40.4 km
7 0.0
8 0.0
9 43.4 km
Name: distance, dtype: object
Has to look like this:
[Distance]
0 114
1 114
2 9.1
3 33.1
4 182
5 93.2
6 40.4
7
8
9 43.4
CodePudding user response:
Assuming the trailing substring to remove is always km
, you can use:
df['distance'] = df['distance'].str.replace(r'\s*km$', '', regex=True)
A more generic method would be to extract the number:
df['distance'] = df['distance'].str.extract(r'(\d (?:\.\d )?)')
If you only want the number when there is "km":
df['distance'] = df['distance'].str.extract(r'(\d (?:\.\d )?)\s*km')
And to convert to numeric/NaN:
df['distance'] = pd.to_numeric(df['distance'].str.extract(r'(\d (?:\.\d )?)\s*km', expand=False), errors='coerce')
summary
df['distance1'] = df['distance'].str.replace(r'\s*km$', '', regex=True)
df['distance2'] = df['distance'].str.extract(r'(\d (?:\.\d )?)')
df['distance3'] = df['distance'].str.extract(r'(\d (?:\.\d )?)\s*km')
df['distance4'] = pd.to_numeric(df['distance'].str.extract(r'(\d (?:\.\d )?)\s*km', expand=False), errors='coerce')
print(df.dtypes)
print(df)
Output:
distance object
distance1 object
distance2 object
distance3 object
distance4 float64
dtype: object
distance distance1 distance2 distance3 distance4
0 114 km 114 114 114 114.0
1 114 km 114 114 114 114.0
2 9.1 km 9.1 9.1 9.1 9.1
3 33.1 km 33.1 33.1 33.1 33.1
4 182 km 182 182 182 182.0
5 93.2 km 93.2 93.2 93.2 93.2
6 40.4 km 40.4 40.4 40.4 40.4
7 0.0 0.0 0.0 NaN NaN
8 0.0 0.0 0.0 NaN NaN
9 43.4 km 43.4 43.4 43.4 43.4
CodePudding user response:
Here is another way to simply drop the observations with 0 and remove the 'km':
df['distance'] = df['distance'].str.replace(r'\D ', '').astype('float')
# r'\D ' removes any character that is not a digit
df['distance'] = df['distance'].replace(0, np.nan)
df['distance'].dropna(inplace=True)