I have this simplified DataFrame where I want to add a new column Distance_km. In this new column all values should be in kilometres and converted to float dtype.
import pandas as pd

d = {'Point': ['a', 'b', 'c', 'd'], 'Distance': ['3km', '400m', '1.1km', '200m']}
dist = pd.DataFrame(data=d)
dist
  Point Distance
0     a      3km
1     b     400m
2     c    1.1km
3     d     200m

Point       object
Distance    object
dtype: object
How can I get this output?
  Point Distance  Distance_km
0     a      3km          3.0
1     b     400m          0.4
2     c    1.1km          1.1
3     d     200m          0.2

Point           object
Distance        object
Distance_km    float64
dtype: object
Thanks in advance!
CodePudding user response:
Try:
# A "Weight" column holding the factor that converts each row to km:
# 1e-3 for rows in "m", 1 for rows in "km"
dist["Weight"] = 1e-3
dist.loc[dist["Distance"].str.contains("km"), "Weight"] = 1
# Extract the numeric part of the string and convert it to float
dist["NumericPart"] = dist["Distance"].str.extract(r"([0-9.]+)\w+", expand=False).astype(float)
# Combine the numeric parts with their unit weights by multiplication
dist["Distance_km"] = dist["NumericPart"] * dist["Weight"]
You will get:
  Point Distance  Weight  NumericPart  Distance_km
0     a      3km   1.000          3.0          3.0
1     b     400m   0.001        400.0          0.4
2     c    1.1km   1.000          1.1          1.1
3     d     200m   0.001        200.0          0.2
BTW: You may prefer to use this instead of the dist.loc line above, to guarantee that "km" is indeed at the end of the string, just in case. Note that the end-of-string anchor is "$", not "^":
dist.loc[dist["Distance"].str.contains("km$", regex=True), "Weight"] = 1
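The number and the unit can also be captured in a single str.extract call with two groups, which avoids the separate "Weight" step. This is only a minimal sketch, assuming "km" and "m" are the only units present; the names parts and factors are made up for illustration:

```python
import pandas as pd

dist = pd.DataFrame({'Point': ['a', 'b', 'c', 'd'],
                     'Distance': ['3km', '400m', '1.1km', '200m']})

# Capture the number and the unit in one pass; "$" anchors the unit at the end
parts = dist["Distance"].str.extract(r"^([0-9.]+)\s*(km|m)$")

# Hypothetical lookup table: factor that converts each unit to kilometres
factors = {"km": 1.0, "m": 1e-3}

# Column 0 holds the numeric strings, column 1 the unit strings
dist["Distance_km"] = parts[0].astype(float) * parts[1].map(factors)
print(dist)
```

Rows that do not match the pattern end up as NaN in Distance_km, so unexpected units are easy to spot afterwards.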
CodePudding user response:
You could use the pandas apply method to pass the values of your distance column to a function that converts them to a standardized unit. From the documentation:
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
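To make the axis argument concrete, here is a small sketch on a made-up frame (df, col_sums and row_sums are illustrative names, not part of the question's data):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})

# axis=0: the function receives each column as a Series (indexed by row labels)
col_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series (indexed by column names)
row_sums = df.apply(lambda row: row['x'] + row['y'], axis=1)

print(col_sums)
print(row_sums)
```

With axis=1 each row arrives as a Series, so a single cell can be looked up by column name, e.g. row['Distance'].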
First create the function that will transform the data (apply can also take a lambda):
import re
import pandas as pd

def convert_to_km(distance):
    '''
    distance can be a string with km or m as units
    e.g. 300km, 1.1km, 200m, 4.5m
    '''
    # split the string into value and unit, e.g. '300' and 'km'
    split_dist = re.match(r'([\d.]+)?([a-zA-Z]+)', distance)
    value = split_dist.group(1)  # 300
    unit = split_dist.group(2)   # km
    if unit == 'km':
        return float(value)
    if unit == 'm':
        # convert metres to kilometres
        return round(float(value) / 1000, 2)

d = {'Point': ['a', 'b', 'c', 'd'], 'Distance': ['3km', '400m', '1.1km', '200m']}
dist = pd.DataFrame(data=d)
You can then apply this function to your distance column
dist['Distance_km'] = dist.apply(lambda row: convert_to_km(row['Distance']), axis=1)
dist
The output will be
  Point Distance  Distance_km
0     a      3km          3.0
1     b     400m          0.4
2     c    1.1km          1.1
3     d     200m          0.2
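Since only one column is involved, the function can also be applied to that column directly with Series.apply, which avoids building a row object for every row. The following is a sketch of that variant, not the answer's original code; returning NaN for unparseable input is an assumption made here for illustration:

```python
import re
import pandas as pd

def convert_to_km(distance):
    """Convert strings such as '3km' or '400m' to kilometres as a float."""
    match = re.fullmatch(r'([\d.]+)\s*(km|m)', distance)
    if match is None:
        # Assumption: unparseable input becomes NaN instead of raising
        return float('nan')
    value, unit = match.groups()
    return float(value) if unit == 'km' else float(value) / 1000

dist = pd.DataFrame({'Point': ['a', 'b', 'c', 'd'],
                     'Distance': ['3km', '400m', '1.1km', '200m']})

# Series.apply passes each cell of the column to the function
dist['Distance_km'] = dist['Distance'].apply(convert_to_km)
print(dist)
```

Unlike the version with round(..., 2), this keeps full precision, so a value like '4.5m' becomes 0.0045 rather than 0.0.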