I have a DataFrame with multiple columns, each holding the start and end dates of a task in different formats. All I want is a function that can handle multiple date formats and convert each range into the corresponding number of years/months.
df_1 :
Name Task_1
Robert 'Nov 2022 - Dec 2022'
billy '06/2021 - 06/2022'
stuart '2021 - 2022'
Expected output :
Name Task_1 Time Required
Robert 'Nov 2022 - Dec 2022' 1 month
billy '06/2021 - 06/2022' 12 months
stuart '2021 - 2022' 12 months
`def years_in_range(date_range):
    date_range = pd.to_datetime(date_range, format='%B %Y to %B %Y')
    delta = date_range.iloc[1] - date_range.iloc[0]
    years = delta / timedelta(days=365)
    return years`
I used the function above to get the time required in years, but I am not getting the expected output.
CodePudding user response:
You can split your column into two parts, convert each part with to_datetime, and subtract the two to get the difference in months (as a month offset or an integer):
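import pandas as pd

# split 'start - end' into [start, end]
s = df['Task_1'].str.split(r'\s*-\s*')
# note: on pandas >= 2.0, mixed formats in one column may need
# pd.to_datetime(..., format='mixed')
df['Time Required'] = (pd
    .to_datetime(s.str[1]).dt.to_period('M')
    .sub(pd.to_datetime(s.str[0]).dt.to_period('M'))
    .apply(lambda x: x.n)  # optional: turns the month offsets into plain integers
)
Output (without the optional .apply step):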
Name Task_1 Time Required
0 Robert Nov 2022 - Dec 2022 <MonthEnd>
1 billy 06/2021 - 06/2022 <12 * MonthEnds>
2 stuart 2021 - 2022 <12 * MonthEnds>
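Or, as plain integers computed from the year and month components:
s = df['Task_1'].str.split(r'\s*-\s*')
start = pd.to_datetime(s.str[0])
end = pd.to_datetime(s.str[1])
# whole months between the two dates
df['Time Required'] = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)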
Output:
Name Task_1 Time Required
0 Robert Nov 2022 - Dec 2022 1
1 billy 06/2021 - 06/2022 12
2 stuart 2021 - 2022 12
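If you would rather have the standalone function the question asks for, here is a minimal sketch using only the standard library. It assumes the three layouts shown in the question (month name plus year, MM/YYYY, and a bare year), that the two dates are separated by a hyphen, and that month names are in English. The names KNOWN_FORMATS, parse_month and months_in_range are made up for this example and are not part of pandas.

```python
import re
from datetime import datetime

# Assumed formats, tried in order; extend this tuple for other layouts.
KNOWN_FORMATS = ("%b %Y", "%B %Y", "%m/%Y", "%Y")


def parse_month(text):
    """Parse one date string with the first format in KNOWN_FORMATS that matches."""
    cleaned = text.strip().strip("'")
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def months_in_range(date_range):
    """Return the whole number of months between the two ends of 'start - end'."""
    # Assumes the dates themselves contain no hyphen, so the first one is the separator.
    start_text, end_text = re.split(r"\s*-\s*", date_range.strip().strip("'"), maxsplit=1)
    start, end = parse_month(start_text), parse_month(end_text)
    return (end.year - start.year) * 12 + (end.month - start.month)
```

With the DataFrame from the question it could be applied row by row, for example df['Time Required'] = df['Task_1'].apply(months_in_range). Listing the formats explicitly is slower than a vectorised to_datetime call, but it fails loudly on a layout you did not anticipate instead of guessing.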