How to extract numbers from a DataFrame column in python?-CodePudding

Recently I was working on a Data cleaning assignment, where I used age_of_marriage dataset. I started to clean data, but in the dataset there is a "height" column which is of Object type. It is in the format of feet and inch. Dataset Image

I want to extract 'foot' and 'inch' from the data and convert it into 'cm' using the formula. I have the formula ready for the conversion but I am not able to extract it. Also I want to convert it into Int datatype before applying the formula. I am stuck on this mode.

-------- 2 height 2449 non-null object --------

I am trying to extract it using String manipulation, but not able to do it. Can anybody help.

height
5'3"
5'4"

I have attached a github link to access the dataset. text

import numpy as np
import pandas as pd
from collections import Counter

agemrg = pd.read_csv('age_of_marriage_data.csv')

for height in range(len(height_list)):
    BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
    foot_int = int(BrideGroomHeight[0])
    inch_int = int(BrideGroomHeight[2:4])
    print(foot_int)
    print(inch_int)
    
    if height in ['nan']:
        continue

output - 
5
4
5
7
5
7
5
0
5
5
5
5
5
2
5
5
5
5
5
1
5
3
5
9
5
10
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12772/2525694838.py in <module>
      1 for height in range(len(height_list)):
----> 2     BrideGroomHeight = height_list[height].rstrip(height_list[height][-1])
      3     foot_int = int(BrideGroomHeight[0])
      4     inch_int = int(BrideGroomHeight[2:4])
      5     print(foot_int)

AttributeError: 'float' object has no attribute 'rstrip'

There are some nan values, due to which I am not able to perform this operation.

CodePudding user response：

You can use str.extract:

df['height2'] = df['height'].str.extract(r'''(?P<ft>\d*)'(?P<in>\d )"''') \
                            .astype(float).mul([30.48, 2.54]).sum(axis=1)

Or str.split and str.strip:

df['height3'] = df['height'].str.rstrip('"').str.split("'", expand=True) \
                            .astype(float).mul([30.48, 2.54]).sum(axis=1)

Output:

>>> df.filter(like='height')
     height  height2  height3
0      5'4"   162.56   162.56
1      5'7"   170.18   170.18
2      5'7"   170.18   170.18
3      5'0"   152.40   152.40
4      5'5"   165.10   165.10
...     ...      ...      ...
2562   5'3"   160.02   160.02
2563  5'11"   180.34   180.34
2564   5'3"   160.02   160.02
2565  4'11"   149.86   149.86
2566   5'2"   157.48   157.48

[2567 rows x 3 columns]

CodePudding user response：

You can use .split() to get the feet and inches portion. If you are certain you only have to deal with a few NaN rows, then a simple version could be:

df['height_feet'] = df['height'].dropna().apply(lambda x: str(x).split("'")[0])
df['height_inches'] = df['height'].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])
df[['height', 'height_feet', 'height_inches']]

Basically, the feet portion is the first piece in the split, and the inches portion is the last piece in the split but without the last character.

Output:

>>> print(df[['height', 'height_feet', 'height_inches']])
     height height_feet height_inches
0      5'4"           5             4
1      5'7"           5             7
2      5'7"           5             7
3      5'0"           5             0
4      5'5"           5             5
...     ...         ...           ...
2562   5'3"           5             3
2563  5'11"           5            11
2564   5'3"           5             3
2565  4'11"           4            11
2566   5'2"           5             2

[2567 rows x 3 columns]