Recently I was working on a Data cleaning assignment, where I used age_of_marriage dataset. I started to clean data, but in the dataset there is a "height" column which is of Object type. It is in the format of feet and inch.
I want to extract 'foot' and 'inch' from the data and convert it into 'cm' using the formula. I have the formula ready for the conversion but I am not able to extract it. Also I want to convert it into Int datatype before applying the formula. I am stuck on this mode.
-------- 2 height 2449 non-null object --------
I am trying to extract it using String manipulation, but not able to do it. Can anybody help.
height |
---|
5'3" |
5'4" |
I have attached a github link to access the dataset. text
import numpy as np
import pandas as pd
from collections import Counter
agemrg = pd.read_csv('age_of_marriage_data.csv')
for height in range(len(height_list)):
BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
foot_int = int(BrideGroomHeight[0])
inch_int = int(BrideGroomHeight[2:4])
print(foot_int)
print(inch_int)
if height in ['nan']:
continue
output -
5
4
5
7
5
7
5
0
5
5
5
5
5
2
5
5
5
5
5
1
5
3
5
9
5
10
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12772/2525694838.py in <module>
1 for height in range(len(height_list)):
----> 2 BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
3 foot_int = int(BrideGroomHeight[0])
4 inch_int = int(BrideGroomHeight[2:4])
5 print(foot_int)
AttributeError: 'float' object has no attribute 'rstrip'
There are some nan values, due to which I am not able to perform this operation.
CodePudding user response:
You can use str.extract
:
df['height2'] = df['height'].str.extract(r'''(?P<ft>\d*)'(?P<in>\d )"''') \
.astype(float).mul([30.48, 2.54]).sum(axis=1)
Or str.split
and str.strip
:
df['height3'] = df['height'].str.rstrip('"').str.split("'", expand=True) \
.astype(float).mul([30.48, 2.54]).sum(axis=1)
Output:
>>> df.filter(like='height')
height height2 height3
0 5'4" 162.56 162.56
1 5'7" 170.18 170.18
2 5'7" 170.18 170.18
3 5'0" 152.40 152.40
4 5'5" 165.10 165.10
... ... ... ...
2562 5'3" 160.02 160.02
2563 5'11" 180.34 180.34
2564 5'3" 160.02 160.02
2565 4'11" 149.86 149.86
2566 5'2" 157.48 157.48
[2567 rows x 3 columns]
CodePudding user response:
You can use .split()
to get the feet and inches portion. If you are certain you only have to deal with a few NaN rows, then a simple version could be:
df['height_feet'] = df['height'].dropna().apply(lambda x: str(x).split("'")[0])
df['height_inches'] = df['height'].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])
df[['height', 'height_feet', 'height_inches']]
Basically, the feet portion is the first piece in the split, and the inches portion is the last piece in the split but without the last character.
Output:
>>> print(df[['height', 'height_feet', 'height_inches']])
height height_feet height_inches
0 5'4" 5 4
1 5'7" 5 7
2 5'7" 5 7
3 5'0" 5 0
4 5'5" 5 5
... ... ... ...
2562 5'3" 5 3
2563 5'11" 5 11
2564 5'3" 5 3
2565 4'11" 4 11
2566 5'2" 5 2
[2567 rows x 3 columns]