How to convert the first 6 digits of a column to a date based on a century assumption in Python Pandas?

I have a Pandas DataFrame in Python like the one below:

VAL
--------
99050605188
00102255789
20042388956
02111505667

The values are stored as strings (str).

The first 6 digits encode a date in YYMMDD order, for example:

99050605188 --> 1999-05-06
00102255789 --> 2000-10-22
20042388956 --> 1920-04-23

Be aware that:

if the value in column "VAL" starts with 0, the year is 20xx, for example 001203... --> 2000-12-03, 021115... --> 2002-11-15
if the value in column "VAL" starts with 9, 8, 7, 6, 5, 4, 3, 2 or 1, the year is 19xx, for example 200423... --> 1920-04-23

As a result, I need something like the table below (with the "date" column in str format):

VAL           date
---------------------------
99050605188 | 1999-05-06
00102255789 | 2000-10-22
20042388956 | 1920-04-23
02111505667 | 2002-11-15

How can I do that in Python Pandas?

CodePudding user response:

You can use np.where() to check for the first digit of column VAL to determine the century. Then, use pd.to_datetime() to convert the date, as follows:

import numpy as np
import pandas as pd

# 20xx if the first digit is '0', else 19xx. Prepend '20'/'19' to yymmdd
date_lst = np.where(df['VAL'].str[0] == '0', '20' + df['VAL'].str[:6], '19' + df['VAL'].str[:6])

# Parse the YYYYmmdd strings into datetimes
df['date'] = pd.to_datetime(date_lst, format='%Y%m%d')

Result:

print(df)

           VAL       date
0  99050605188 1999-05-06
1  00102255789 2000-10-22
2  20042388956 1920-04-23
3  02111505667 2002-11-15
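
The question asks for the "date" column in str format, whereas pd.to_datetime() returns datetime64 values. A minimal sketch, assuming the sample data from the question, that formats the parsed dates back to strings with .dt.strftime() (the names prefix and parsed are illustrative only):

```python
import pandas as pd

df = pd.DataFrame(
    {'VAL': ['99050605188', '00102255789', '20042388956', '02111505667']})

# Century prefix: '20' when the first digit is '0', otherwise '19'
prefix = df['VAL'].str[0].eq('0').map({True: '20', False: '19'})

# Parse YYYYmmdd, then render each date as a 'YYYY-mm-dd' string
parsed = pd.to_datetime(prefix + df['VAL'].str[:6], format='%Y%m%d')
df['date'] = parsed.dt.strftime('%Y-%m-%d')
print(df)
```

After this step the "date" column holds plain strings such as '1999-05-06', so date arithmetic would require parsing it again.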

CodePudding user response:

You can parse the first 6 characters of the string using the format %y%m%d and then adjust the year to match your rule.

Demo:

from datetime import datetime
import pandas as pd

df = pd.DataFrame(
    {'val': ['99050605188', '00102255789', '20042388956', '02111505667']})

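date_list = []
for s in df['val']:
    # %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999
    date = datetime.strptime(s[:6], '%y%m%d')
    # Only values starting with '0' belong to 20xx; shift the rest back a century
    if s[0] != '0' and date.year > 2000:
        date = date.replace(year=date.year - 100)
    date_list.append(date.date())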

result = df.assign(date=pd.Series(date_list))
print(result)

Output:

           val        date
0  99050605188  1999-05-06
1  00102255789  2000-10-22
2  20042388956  1920-04-23
3  02111505667  2002-11-15
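
The century correction in the loop is needed because %y applies a fixed pivot: Python documents that two-digit years 69-99 map to 1969-1999 and 0-68 map to 2000-2068. A short illustration, with sample strings chosen here only for demonstration:

```python
from datetime import datetime

# Two-digit years on either side of the 68/69 pivot
samples = ['990506', '690101', '680101', '200423']
parsed = [datetime.strptime(s, '%y%m%d').date() for s in samples]

for s, d in zip(samples, parsed):
    print(s, '->', d)
```

This is why a value such as 200423 first parses as 2020-04-23 and has to be moved back by 100 years to satisfy the rule from the question.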

Update based on the following request from the OP:

Could you also update the answer to handle the case where val is NaN, returning 1900-01-01 in column "date" in that situation?

from datetime import datetime
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'val': ['99050605188', '00102255789', '20042388956', '02111505667', np.nan]})

# NaN -> '19000101'; otherwise prepend the century and keep YYYYmmdd (the first 8 characters)
date_list = ['19000101' if pd.isnull(s) else ('20' + s if s[0] == '0' else '19' + s)[:8] for s in df['val']]
result = df.assign(date=pd.Series(pd.to_datetime(date_list, format='%Y%m%d')))
print(result)

Output:

           val       date
0  99050605188 1999-05-06
1  00102255789 2000-10-22
2  20042388956 1920-04-23
3  02111505667 2002-11-15
4          NaN 1900-01-01
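
For larger frames, the same NaN handling can probably be done without a Python-level loop. A sketch, assuming the same column name val and the 1900-01-01 default requested above; the names valid, prefix, and parsed are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'val': ['99050605188', '00102255789', '20042388956', '02111505667', np.nan]})

# Work only on the rows that actually hold a value
valid = df.loc[df['val'].notna(), 'val']

# Century prefix: '20' when the first digit is '0', otherwise '19'
prefix = valid.str[0].eq('0').map({True: '20', False: '19'})
parsed = pd.to_datetime(prefix + valid.str[:6], format='%Y%m%d')

# Rows that were NaN come back as NaT after reindexing; fill them with the default
df['date'] = parsed.reindex(df.index).fillna(pd.Timestamp('1900-01-01'))
print(df)
```

Keeping the missing rows out of the string concatenation avoids relying on how NaN behaves when combined with strings.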