I have a DataFrame which contains measurements; e.g., weight, height, etc. However, sometimes the measurement column contains units together with values. Therefore, I would like to separate measurement values and units if they are together in DataFrame. Eg., In the below DataFrame, df, the height value and unit of the first entry are in respective columns. However, the value column of the 2nd and 3rd entries of height contains both value and unit together. In that case, I would like to move height units "m" and "cm" from the value column to the unit column.
measurement name | value | unit |
---|---|---|
height | 160.0 | cm |
height | 1.5 m | |
height | 155cm |
The output DataFrame should look like the below -
measurement name | value | unit |
---|---|---|
height | 160.0 | cm |
height | 1.5 | m |
height | 155.0 | cm |
May I know how I separate values and units into their respective column in DataFrame efficiency in Python?
CodePudding user response:
Use Series.str.extract
with regex for get numeric values with .
from start of string by ^
, optionaly space separator by \s*
and non numeric values in end of strings by \D
with $
and pass to DataFrame.update
for replace only extracted values:
df.update(df['value'].str.extract(r'^(?P<value>\d \.*\d*)\s*(?P<unit>\D )$'))
print (df)
measurement name value unit
0 height 160.0 cm
1 height 1.5 m
2 height 155 cm
CodePudding user response:
updating other answer to include the unit:
df['unit'] = df['value'].apply(lambda x: ''.join(l for l in x if l.isalpha()))
df.update(df['value'].str.extract(r'^(?P<value>\d \.*\d*)\s*(?P<unit>\D )$'))
print(df)
will output:
name value unit
0 height 160.0
1 height 1.5 m
2 height 155 cm