How do I remove specific words from columns?-CodePudding

I need to remove "km" from distance column:

[Distance]
0     114 km
1     114 km
2     9.1 km
3    33.1 km
4     182 km
5    93.2 km
6    40.4 km
7        0.0
8        0.0
9    43.4 km
Name: distance, dtype: object

Has to look like this:

CodePudding user response：

Assuming the trailing substring to remove is always km, you can use:

df['distance'] = df['distance'].str.replace(r'\s*km$', '', regex=True)

A more generic method would be to extract the number:

df['distance'] = df['distance'].str.extract(r'(\d (?:\.\d )?)')

If you only want the number when there is "km":

df['distance'] = df['distance'].str.extract(r'(\d (?:\.\d )?)\s*km')

And to convert to numeric/NaN:

df['distance'] = pd.to_numeric(df['distance'].str.extract(r'(\d (?:\.\d )?)\s*km', expand=False), errors='coerce')

summary

df['distance1'] = df['distance'].str.replace(r'\s*km$', '', regex=True)

df['distance2'] = df['distance'].str.extract(r'(\d (?:\.\d )?)')

df['distance3'] = df['distance'].str.extract(r'(\d (?:\.\d )?)\s*km')

df['distance4'] = pd.to_numeric(df['distance'].str.extract(r'(\d (?:\.\d )?)\s*km', expand=False), errors='coerce')

print(df.dtypes)

print(df)

Output:

distance      object
distance1     object
distance2     object
distance3     object
distance4    float64
dtype: object

  distance distance1 distance2 distance3  distance4
0   114 km       114       114       114      114.0
1   114 km       114       114       114      114.0
2   9.1 km       9.1       9.1       9.1        9.1
3  33.1 km      33.1      33.1      33.1       33.1
4   182 km       182       182       182      182.0
5  93.2 km      93.2      93.2      93.2       93.2
6  40.4 km      40.4      40.4      40.4       40.4
7      0.0       0.0       0.0       NaN        NaN
8      0.0       0.0       0.0       NaN        NaN
9  43.4 km      43.4      43.4      43.4       43.4

CodePudding user response：

Here is another way to simply drop the observations with 0 and remove the 'km':

df['distance'] = df['distance'].str.replace(r'\D ', '').astype('float')
# r'\D ' removes any character that is not a digit
df['distance'] = df['distance'].replace(0, np.nan)
df['distance'].dropna(inplace=True)