I have a pandas DataFrame in Python like the one below:
VAL
--------
99050605188
00102255789
20042388956
02111505667
The values are stored as strings.
The first 6 digits encode a date in YYMMDD form, for example:
- 99050605188 --> 1999-05-06
- 00102255789 --> 2000-10-22
- 20042388956 --> 1920-04-23
Note that:
- if the value in column "VAL" starts with 0, the year is 20xx, for example 001203... --> 2000-12-03 and 021115... --> 2002-11-15
- if the value in column "VAL" starts with any digit from 1 to 9, the year is 19xx, for example 200423... --> 1920-04-23
As a result I need something like the following (column "date" in str format):
VAL         | date
---------------------------
99050605188 | 1999-05-06
00102255789 | 2000-10-22
20042388956 | 1920-04-23
02111505667 | 2002-11-15
How can I do that in pandas?
CodePudding user response:
You can use np.where() to check the first digit of column VAL and determine the century. Then use pd.to_datetime() to convert the date, as follows:
import numpy as np
import pandas as pd

# 20xx if the first digit is 0, else 19xx: prepend 20/19 to yymmdd
date_lst = np.where(df['VAL'].str[0] == '0', '20' + df['VAL'].str[:6], '19' + df['VAL'].str[:6])
# parse the resulting YYYYmmdd strings
df['date'] = pd.to_datetime(date_lst, format='%Y%m%d')
Result:
print(df)
           VAL       date
0  99050605188 1999-05-06
1  00102255789 2000-10-22
2  20042388956 1920-04-23
3  02111505667 2002-11-15
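Since the question asks for the "date" column in str format, here is a minimal sketch of the same np.where() idea that formats the result back to strings. It assumes the sample data from the question; the names first6 and yyyymmdd are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'VAL': ['99050605188', '00102255789', '20042388956', '02111505667']})

# First six characters are yymmdd
first6 = df['VAL'].str[:6]
# Prepend '20' when VAL starts with '0', otherwise '19'
yyyymmdd = np.where(df['VAL'].str[0] == '0', '20' + first6, '19' + first6)
# Parse as YYYYmmdd, then format back to 'YYYY-MM-DD' strings
df['date'] = pd.to_datetime(yyyymmdd, format='%Y%m%d').strftime('%Y-%m-%d')
print(df)
```

Going through pd.to_datetime() before formatting also validates the values, so an impossible date such as month 13 raises an error instead of silently passing through.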
CodePudding user response:
You can parse the first 6 characters of the string using the format %y%m%d and then change the year as per your requirement.
Demo:
from datetime import datetime
import pandas as pd

df = pd.DataFrame(
    {'val': ['99050605188', '00102255789', '20042388956', '02111505667']})

date_list = []
for s in df['val']:
    # %y maps 00-68 to 2000-2068 and 69-99 to 1969-1999
    date = datetime.strptime(s[:6], '%y%m%d')
    # values not starting with '0' belong to 19xx, so shift them back a century
    if s[0] != '0' and date.year > 2000:
        date = date.replace(year=date.year - 100)
    date_list.append(date.date())

result = df.assign(date=pd.Series(date_list))
print(result)
Output:
           val        date
0  99050605188  1999-05-06
1  00102255789  2000-10-22
2  20042388956  1920-04-23
3  02111505667  2002-11-15
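The loop above can also be wrapped in a small function and applied with Series.map(). This is only a sketch: val_to_date is a hypothetical helper name, not something from pandas, and it assumes every value is a non-null string with a valid yymmdd prefix.

```python
from datetime import date, datetime

import pandas as pd


def val_to_date(val: str) -> date:
    """Hypothetical helper: apply the 0 -> 20xx, 1-9 -> 19xx rule to one value."""
    parsed = datetime.strptime(val[:6], '%y%m%d')
    # %y maps 00-68 to 20xx; shift back a century unless the value starts with '0'
    if val[0] != '0' and parsed.year >= 2000:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed.date()


df = pd.DataFrame({'val': ['99050605188', '00102255789', '20042388956', '02111505667']})
# isoformat() yields 'YYYY-MM-DD' strings, matching the str output the question asks for
df['date'] = df['val'].map(lambda v: val_to_date(v).isoformat())
print(df)
```

Keeping the rule in one function makes it easy to check single values in isolation, for example val_to_date('20042388956').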
Update based on the following request from the OP:
could you make update also in terms of situation when val is NaN and in this situation return 1900-01-01 in column "date" ?
from datetime import datetime
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'val': ['99050605188', '00102255789', '20042388956', '02111505667', np.nan]})

# NaN -> '19000101'; otherwise prepend the century and keep the first 8 characters (YYYYmmdd)
date_list = ['19000101' if pd.isnull(s) else ('20' + s if s[0] == '0' else '19' + s)[:8] for s in df['val']]
result = df.assign(date=pd.Series(pd.to_datetime(date_list, format='%Y%m%d')))
print(result)
Output:
           val       date
0  99050605188 1999-05-06
1  00102255789 2000-10-22
2  20042388956 1920-04-23
3  02111505667 2002-11-15
4          NaN 1900-01-01
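A vectorized variant of the NaN handling is also possible. The sketch below relies on missing values propagating through the string operations and being parsed as NaT, which is then filled with the 1900-01-01 default; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'val': ['99050605188', '00102255789', '20042388956', '02111505667', np.nan]})

# String slicing and concatenation leave NaN rows as NaN
first6 = df['val'].str[:6]
yyyymmdd = np.where(df['val'].str[0] == '0', '20' + first6, '19' + first6)
# Missing entries are parsed as NaT
dates = pd.Series(pd.to_datetime(yyyymmdd, format='%Y%m%d'), index=df.index)
# Replace NaT with the requested default date
df['date'] = dates.fillna(pd.Timestamp('1900-01-01'))
print(df)
```

This avoids the Python-level loop, which may matter for large frames, at the cost of being a little less explicit about what happens to missing values.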