I have a Dataframe which looks like this:
df =
    Year.of.birth  Year
12             56  2017
17             63  2019
27           1962  2018
36              0  2019
So, Year.of.birth can take values with four, two or one digit. My end goal is to calculate the age in every row. This is my attempt:
import pandas as pd
from math import log10

df.index = pd.Index(range(0,len(df),1))
df.loc[df['Year.of.birth'] == 0, 'Year.of.birth'] = 2000  # also avoids log10(0)
for i in range(len(df)):
    if int(log10(df['Year.of.birth'].iloc[i])) + 1 != 4: #Checks number of digits
        if df['Year.of.birth'].iloc[i] > (df['Year'].iloc[i]-2000):
            #If born in 20th century
            df.loc[i, 'Year.of.birth'] += 1900
        else:
            df.loc[i, 'Year.of.birth'] += 2000
df['age'] = df.loc[:,'Year'] - df.loc[:,'Year.of.birth']
This is rather slow for a large dataset, but it works. I have also tried something like this:
#Get the century digits
df.loc[df['Year.of.birth'] == 0, 'Year.of.birth'] = 2000
df.loc[df['Year.of.birth'] < df.loc[:,'Year'] - 2000, 'Year.of.birth'] += 2000
df.loc[df['Year.of.birth'] >= df.loc[:,'Year'] - 2000, 'Year.of.birth'] += 1900
The problem with this solution is that it cannot handle rows where Year.of.birth already has century digits.
So my question is: is there a way to count the digits of each element in a DataFrame column without using a for loop? A list comprehension would be great, but there are many if-conditions, so it may get too complicated.
I am very grateful for any tips on improving my code or this question.
Thanks in advance!
Edit: I came up with a solution. The underlying problem is the data itself: some samples have century digits and some do not, so we can never be absolutely certain whether a person was born in 1916 or 2016 if the value is recorded as 16.
But a solution is:
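df.loc[df['Year.of.birth'] < 100, 'Year.of.birth'] += 1900
df.loc[df['Year.of.birth'] < df.loc[:,'Year'] - 100, 'Year.of.birth'] += 100
The first line adds 1900 to Year.of.birth if the value is less than 100, so values that already have century digits pass through unchanged. The second line adds another 100 wherever the resulting age would exceed 100 years, which moves those births into the 2000s.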
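For reference, here is a minimal, self-contained sketch of the same two-step idea applied to the sample data above. It uses `Series.where` instead of in-place `.loc` updates so the original column is left untouched; the variable name `birth` is only illustrative, and the "nobody is older than 100" rule is the same assumption as in the edit, not something the data can confirm.

```python
import pandas as pd

# Sample data from the question (index values are arbitrary row labels).
df = pd.DataFrame(
    {"Year.of.birth": [56, 63, 1962, 0], "Year": [2017, 2019, 2018, 2019]},
    index=[12, 17, 27, 36],
)

birth = df["Year.of.birth"].copy()

# Step 1: values below 100 have no century digits, so assume 19xx first.
# Series.where keeps entries where the condition is True and replaces the rest.
birth = birth.where(birth >= 100, birth + 1900)

# Step 2 (assumption): anyone who would be older than 100 is moved to 20xx.
birth = birth.where(birth >= df["Year"] - 100, birth + 100)

df["age"] = df["Year"] - birth
print(df)
```

Both steps are column-wide operations, so no Python-level loop over rows is needed.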
Answer:
You can create an extra column containing the length of Year.of.birth.
df.loc[:, 'length'] = df['Year.of.birth'].astype(str).str.len()
Using this column you can separate your actions, for example:
df.loc[(df['Year.of.birth'] < df.Year - 2000) & (df.length == 2), 'Year.of.birth'] += 2000
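As a rough sketch of how the length column could cover both centuries at once, the snippet below combines it with `numpy.where`. The century rule is the one from the loop in the question (a short value larger than the two-digit sample year is treated as 19xx, otherwise 20xx); the names `short`, `century` and `offset` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {"Year.of.birth": [56, 63, 1962, 0], "Year": [2017, 2019, 2018, 2019]},
    index=[12, 17, 27, 36],
)

# Number of digits per row, computed without a Python-level loop.
df["length"] = df["Year.of.birth"].astype(str).str.len()

# Rows with one or two digits are missing their century.
short = df["length"] <= 2

# Assumed rule from the question: compare with the two-digit sample year.
century = np.where(df["Year.of.birth"] > df["Year"] - 2000, 1900, 2000)

# Add the century only to short rows; four-digit rows get an offset of 0.
offset = np.where(short, century, 0)
df["Year.of.birth"] = df["Year.of.birth"] + offset

df["age"] = df["Year"] - df["Year.of.birth"]
print(df)
```

Note that a value of 0 counts as one digit here and ends up as 2000, which matches what the question does with it.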
Hope this helps!