Calculate number of digits of elements in df column

Time:08-11

I have a DataFrame that looks like this:

df =    Year.of.birth    Year
12         56            2017     
17         63            2019
27         1962          2018
36         0             2019

So, Year.of.birth can take values with four, two or one digit. My end goal is to calculate the age in every row. This is my attempt:

from math import log10
import pandas as pd

df.index = pd.Index(range(0, len(df), 1))
df.loc[df['Year.of.birth'] == 0, 'Year.of.birth'] = 2000
for i in range(len(df)):
    if int(log10(df['Year.of.birth'].iloc[i])) + 1 != 4:  # Checks number of digits
        if df['Year.of.birth'].iloc[i] > (df['Year'].iloc[i] - 2000):
            # If born in the 20th century
            df.loc[i, 'Year.of.birth'] += 1900
        else:
            df.loc[i, 'Year.of.birth'] += 2000
df['age'] = df.loc[:, 'Year'] - df.loc[:, 'Year.of.birth']

This is rather slow for a large dataset, but it works. I have also tried something like this:

# Get the century digits
df.loc[df['Year.of.birth'] == 0, 'Year.of.birth'] = 2000
df.loc[df['Year.of.birth'] < df.loc[:, 'Year'] - 2000, 'Year.of.birth'] += 2000
df.loc[df['Year.of.birth'] >= df.loc[:, 'Year'] - 2000, 'Year.of.birth'] += 1900

The problem with this solution is that it cannot handle rows where Year.of.birth already has century digits.

So my question is, is there any way to calculate the number of digits for elements in a df column, without using a for loop? A list comprehension would be great, but there are a lot of if-conditions so maybe it is too complicated.

I am very grateful for any tips on improving my code or this question.

Thanks in advance!

Edit: Came up with a solution. The obvious problem is with the data sampling: some samples have century digits and some do not. We can never be absolutely certain whether a person was born in 1916 or 2016 if the value is recorded as 16.

But a solution is:

df.loc[df['Year.of.birth'] < 100, 'Year.of.birth'] += 1900
df.loc[df['Year.of.birth'] < df.loc[:, 'Year'] - 100, 'Year.of.birth'] += 100

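The first line adds 1900 to Year.of.birth if the value is less than 100, so values that already have century digits are left untouched. The second line adds another 100 whenever the resulting age would exceed 100 years, which moves those births into the 2000s.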
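
The two-step correction described above can be sketched as a self-contained example on the sample data from the question. This is only an illustration under stated assumptions: the mask names `short` and `too_old` are made up here, and the second step is restricted to values that originally lacked century digits, which is slightly stricter than the two lines above.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {"Year.of.birth": [56, 63, 1962, 0], "Year": [2017, 2019, 2018, 2019]},
    index=[12, 17, 27, 36],
)

birth = df["Year.of.birth"].copy()

# Values below 100 carry no century digits; assume the 1900s first.
short = birth < 100
birth.loc[short] += 1900

# Assumption: nobody in the data is older than 100 years, so any
# short value that would imply such an age is moved to the 2000s.
too_old = short & (birth < df["Year"] - 100)
birth.loc[too_old] += 100

df["Year.of.birth"] = birth
df["age"] = df["Year"] - df["Year.of.birth"]
print(df)
```

With the sample rows, 56 and 63 become 1956 and 1963, 1962 is left alone, and 0 becomes 2000. The ambiguity mentioned in the edit remains: a value such as 16 is resolved purely by the age-below-100 assumption.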

Answer:

You can create an extra column containing the length of Year.of.birth.

df.loc[:, 'length'] = df['Year.of.birth'].astype(str).str.len()
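
As a rough illustration of counting digits without a loop, the string-length approach can be compared with a vectorized version of the `log10` idea from the question. This sketch assumes non-negative integer values; the names `by_str`, `by_log`, and `safe` are invented for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([56, 63, 1962, 0], name="Year.of.birth")

# String-based digit count, as in the answer above
by_str = s.astype(str).str.len()

# Arithmetic digit count: floor(log10(x)) + 1.
# log10(0) is undefined, so 0 is replaced by 1 (one digit) first.
safe = s.where(s > 0, 1)
by_log = np.floor(np.log10(safe)).astype(int) + 1

print(by_str.tolist())
print(by_log.tolist())
```

The string version is probably the safer default, since it avoids floating-point rounding and needs no special case for 0.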

Using this column you can separate your actions, for example:

df.loc[(df['Year.of.birth'] < df.Year - 2000) & (df.length == 2), 'Year.of.birth'] += 2000
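
A fuller sketch of this idea, covering both centuries, might look like the following. It is an assumption-laden example rather than a definitive fix: it treats every value with one or two digits as lacking century digits (so that 0 is handled too), uses `<=` so that someone born in the survey year gets age 0, and builds both masks before modifying the column so the second update does not see already-corrected values.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {"Year.of.birth": [56, 63, 1962, 0], "Year": [2017, 2019, 2018, 2019]},
    index=[12, 17, 27, 36],
)

df["length"] = df["Year.of.birth"].astype(str).str.len()

# Build both masks up front, before any value is changed.
short = df["length"] <= 2
in_2000s = short & (df["Year.of.birth"] <= df["Year"] - 2000)
in_1900s = short & ~in_2000s

df.loc[in_2000s, "Year.of.birth"] += 2000
df.loc[in_1900s, "Year.of.birth"] += 1900

df["age"] = df["Year"] - df["Year.of.birth"]
print(df)
```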

Hope this helps!
