Error during calculation of age based on Polish PESEL in Python Pandas?-CodePudding

I have Data Frame in Python Pandas like below with str values:

NR
--------
910517196
921122192
020612567

And I try to calculate age based on values in column "NR" using below code:

ABT_DATE = pd.Timestamp(year=2021, month=6, day=30)
df['age'] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format = '%y%m%d')) / np.timedelta64(1, 'Y')
df["age"] = df.age.astype("int")

Logic of above code is: take 6 first numbers from df from column "NR" and calculate age based on that, because for example: 910517196 (first 6 numbers) is 1991-05-17.

Nevertheless, when I try to use my code I have error like below:

ValueError: unconverted data remains: 20

My DataFrame has over 400k rows so it is difficult to check all rows, but I am sure I do not have NaN and years months and days are in the correct intervals.

As you can see on below sample this code is correct and should works, why it works on small sample code and does not work on my over 400k rows Data Frame?

df = pd.DataFrame({"NR" : ["95050611475", "00112575862"]})
df['age'] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format = '%y%m%d')) / np.timedelta64(1, 'Y')
df["age"] = df.age.astype("int")
df

How can I repair my big Data Frame to be able to use my code in Python Pandas ?

CodePudding user response：

You probably have some bad formatted rows. To find them, I suggest you to use to_datetime with errors='coerce' as parameter. All unconverted values are set to NaN. So you can a boolean mask m to find bad values.

df = pd.DataFrame({"NR" : ["95050611475", "00112575862", "badformat"]})

m = pd.to_datetime(df.NR.str[:6], format='%y%m%d', errors='coerce').isna()
print(df[m])

# Output:
          NR
2  badformat