I have data that includes columns with date ranges:
col_1 col_2
'may 2021 - 2023' 'nov 2020 - feb 2021'
'jan 2022 - 2023' 'sep 2021- 2023'
With the code below I can create the required output, but I would like a function that takes a DataFrame as input and produces the expected output:
s = df['col_1'].str.split(r'\s*-\s*')
df['year_1'] = (pd
    .to_datetime(s.str[1])
    .sub(pd.to_datetime(s.str[0])))
t = df['col_2'].str.split(r'\s*-\s*')
df['year_2'] = (pd
    .to_datetime(t.str[1])
    .sub(pd.to_datetime(t.str[0])))
To produce the output below, I have to rerun the code for each column and change the variable names each time. That is why I need a function. Please note that there can be more columns, so the code should work for any number of them.
Expected Output
col_1 Year_1 col_2 Year_2
'may 2021 - 2023' 610 days 'sep 2017-dec 2017' 91 days
'jan 2022 - 2023' 365 days 'sep 2021- 2023' 487 days
CodePudding user response:
You can use:
def compute_days(sr):
    parts = sr.str.strip("'").str.split('-', expand=True)
    start = pd.to_datetime(parts[0])
    end = pd.to_datetime(parts[1])
    return end - start
days = df.apply(compute_days).rename(columns=lambda x: f"Year_{x.split('_')[1]}")
out = pd.concat([df, days], axis=1)
Output:
col_1 col_2 Year_1 Year_2
0 'may 2021 - 2023' 'nov 2020 - feb 2021' 610 days 92 days
1 'jan 2022 - 2023' 'sep 2021- 2023' 365 days 487 days
2 '03/2017 - 08/2021' '2022 - 2023' 1614 days 365 days
3 '' '' NaT NaT
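If you want the new columns placed next to their source columns, as in the expected output of the question, a variant along these lines might help. It is only a sketch under a few assumptions: the names parse_partial_date and add_duration_columns are made up for this example, every range uses a hyphen as the separator (so dates that themselves contain hyphens, such as 2021-05-01, would not work), and each half is parsed on its own. The per-value parsing is there because, as far as I know, pandas 2.x infers one format per Series and can reject a column that mixes values like 'feb 2021' and '2023'.

```python
import pandas as pd


def parse_partial_date(text):
    """Parse one string such as 'may 2021' or '2023'; return NaT if it cannot be parsed."""
    if not isinstance(text, str) or not text.strip():
        return pd.NaT
    # Parsing one value at a time avoids format clashes between rows.
    return pd.to_datetime(text.strip(), errors="coerce")


def add_duration_columns(df, columns=None):
    """Return a new frame with a 'Year_<suffix>' timedelta column after each range column."""
    columns = list(df.columns) if columns is None else list(columns)
    pieces = []
    for col in df.columns:
        pieces.append(df[col])
        if col not in columns:
            continue
        # Drop surrounding quotes and spaces, then split on the first hyphen only.
        cleaned = df[col].fillna("").astype(str).str.strip("'\" ")
        parts = cleaned.str.split("-", n=1, expand=True).reindex(columns=[0, 1])
        start = pd.to_datetime(parts[0].map(parse_partial_date))
        end = pd.to_datetime(parts[1].map(parse_partial_date))
        suffix = col.split("_")[-1]
        pieces.append((end - start).rename(f"Year_{suffix}"))
    return pd.concat(pieces, axis=1)


df = pd.DataFrame({
    "col_1": ["may 2021 - 2023", "jan 2022 - 2023"],
    "col_2": ["nov 2020 - feb 2021", "sep 2021- 2023"],
})
result = add_duration_columns(df)
print(result)
```

The result should have the columns in the order col_1, Year_1, col_2, Year_2. Passing columns=['col_1'] would limit the calculation to that column and leave the others untouched.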