I have a dataframe with a column named 'height' and I want to convert its values to float. The default unit is meters, but some values are in an incorrect format or in feet and inches. It looks like
height
0 16
1 7
2 7
3 6 m
4 2.40
5 5'8"
6 3m
7 6,9
8 9;6;3
9 Unknown
10 4.66
11 Bilinmiyor
12 11' 4"
dtype: object
Basically, I need to convert values in feet/inches to meters, convert values like Bilinmiyor and Unknown to NaN, remove unit suffixes like " m" and "m", replace the comma in decimal numbers with ".", and keep the largest number for a value like 9;6;3. The final dtype should be float or int.
I am new to Python, so I don't really know how to use advanced techniques yet. I was trying to achieve the task using
def to_num(a):
    try:
        return float(pd.to_numeric(a, errors = 'raise'))
    except ValueError:
        return a

df['height'] = to_num(df['height'])
but it didn't work. I was wondering if I should use iteration but it seems very complicated to iterate through all cells in this column, because the dataset has more than 2 million rows.
CodePudding user response:
I feel you mate, I had the same kind of problems. Thankfully, this is not that hard:
import pandas as pd

df = pd.DataFrame({'height': [16, 7, '6m', '2.4', '3,5', 'Asdf', '9;6;3']})
df['height'] = df['height'].astype(str)  # force type str
df['height'] = df['height'].str.replace(',', '.', regex=False)  # , -> .
df['height'] = df['height'].str.replace('[A-Za-z]', '', regex=True)  # remove all letters (regex)
df['height'] = df['height'].str.split(';').apply(max)  # pick largest value from 9;6;3 (string comparison, fine for single digits)
df['height'] = pd.to_numeric(df['height'], errors='coerce')  # force float, anything unparseable becomes NaN
And you get
height
0 16.0
1 7.0
2 6.0
3 2.4
4 3.5
5 NaN
6 9.0
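One caveat with the split(';').apply(max) step: max compares the pieces as strings, so a value like 10;9 would give 9. If your lists can contain multi-digit numbers, something along these lines might be safer. This is only a sketch; largest_in_list is a made-up helper name, not a pandas function, and it assumes the pieces are plain numbers separated by ';'.

```python
import pandas as pd

def largest_in_list(s: pd.Series, sep: str = ";") -> pd.Series:
    """Return the numerically largest piece of each sep-separated string (NaN if nothing parses)."""
    pieces = s.str.split(sep, expand=True)                  # one column per piece
    numbers = pieces.apply(pd.to_numeric, errors="coerce")  # convert each column to float
    return numbers.max(axis=1)                              # row-wise max, NaN pieces are skipped

example = pd.Series(["16", "9;6;3", "10;9", "", "2.4"])
largest = largest_in_list(example)
print(largest)
```

Here 10;9 comes out as 10.0 because the comparison happens after the conversion to numbers, and the empty string ends up as NaN.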
Now if you want to convert your feet and inches to meters (I'm assuming the default is meters), you'll need to add a level of complexity:
import pandas as pd

def feet_to_m(s):
    # Convert strings like 11' 4", 4' or 15" to meters; return anything else unchanged
    if '\'' in s or '\"' in s:
        if '\'' in s:
            feet = float(s.split('\'')[0])
        else:
            feet = 0
        if '\"' in s:
            if '\'' in s:
                inch = float(s.split('\'')[1].split('\"')[0])
            else:
                inch = float(s.split('\"')[0])
        else:
            inch = 0
        return (feet * 12 + inch) * 0.0254  # total inches -> meters
    else:
        return s

df = pd.DataFrame({'height': [16, 7, '6m', '2.4', '3,5', 'Asdf', '9;6;3', "11' 4\"", "4'", "15\""]})
df['height'] = df['height'].astype(str)  # force type str
df['height'] = df['height'].str.replace(',', '.', regex=False)  # , -> .
df['height'] = df['height'].str.replace('[A-Za-z]', '', regex=True)  # remove all letters
df['height'] = df['height'].str.split(';').apply(max)  # pick largest value from 9;6;3
df['height'] = df['height'].apply(feet_to_m)  # feet/inches -> meters, other values untouched
df['height'] = pd.to_numeric(df['height'], errors='coerce')  # force float
to get
height
0 16.0000
1 7.0000
2 6.0000
3 2.4000
4 3.5000
5 NaN
6 9.0000
7 3.4544
8 1.2192
9 0.3810
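Since you mentioned more than 2 million rows: apply(feet_to_m) calls a Python function once per row. A possible alternative is to pull feet and inches out with a regex via str.extract and do the arithmetic on whole columns. Treat this as a sketch under assumptions: imperial_to_m is a hypothetical helper name, and the pattern only covers the shapes shown in the question (11' 4", 4', 15", 5'8"). I haven't timed it against the apply version.

```python
import pandas as pd

# Optional feet part followed by ', optional inch part followed by "
PATTERN = r"""^\s*(?:(?P<feet>\d+(?:\.\d+)?)\s*')?\s*(?:(?P<inch>\d+(?:\.\d+)?)\s*")?\s*$"""

def imperial_to_m(s: pd.Series) -> pd.Series:
    """Convert feet/inch strings to meters; NaN where neither ' nor " is present."""
    parts = s.str.extract(PATTERN)            # DataFrame with columns 'feet' and 'inch'
    feet = pd.to_numeric(parts["feet"])
    inch = pd.to_numeric(parts["inch"])
    is_imperial = feet.notna() | inch.notna()
    meters = (feet.fillna(0) * 12 + inch.fillna(0)) * 0.0254  # total inches -> meters
    return meters.where(is_imperial)          # keep NaN for non-imperial rows

heights = pd.Series(["16", "2.4", "11' 4\"", "4'", "15\"", "5'8\""])
metric = pd.to_numeric(heights, errors="coerce")     # plain numbers, NaN elsewhere
result = metric.fillna(imperial_to_m(heights))       # fill the gaps with converted values
print(result)
```

The plain numbers are parsed first and only the rows that failed are filled from the converted feet/inch values, so values already in meters are left alone.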
hope this helps