I want to perform a calculation whereby for every end date of letter need to be minus with the Start_Date,
then divide with 365 to get duration in years. Then the duration of each letter column need to be use as a 'power of' of the value of their letter column, respectively. Then the result of each letter need to be sum up to get the total.
I've tried using below coding where I'm using parsing and I managed to get the answer.
import pandas as pd
dataset = [['01-01-2015', 234, '25-05-2017', 633, '03-06-2016', 935, '30-10-2019', 673, '16-12-2020', 825, '06-07-2019'],
['01-01-2015', 664, '25-05-2017', 663, '03-06-2016', 665, '30-10-2019', 663, '16-12-2020', 665, '06-07-2019']]
ds = pd.DataFrame(dataset, columns = ['Start_Date', 'A', 'End_Date_A', 'B', 'End_Date_B', 'C', 'End_Date_C',
'D', 'End_Date_D', 'E', 'End_Date_E'])
Start_Date A End_Date_A B End_Date_B C End_Date_C D End_Date_D E End_Date_E
0 01-01-2015 234 25-05-2017 633 03-06-2016 935 30-10-2019 673 16-12-2020 825 06-07-2019
1 01-01-2015 664 25-05-2017 663 03-06-2016 665 30-10-2019 663 16-12-2020 665 06-07-2019
from dateutil import parser
import math
letters = ["A", "B", "C", "D", "E"]
total = 0
for i in ds.index:
for letter in letters:
start_date = parser.parse(ds["Start_Date"][i])
end_date = parser.parse(ds["End_Date_" letter][i])
years = (end_date - start_date).days / 365
power = math.pow(int(ds[letter][i]), years)
total = power
ds['Overall'] = total
However, it shows the same result for each of the rows.
Start_Date A End_Date_A B End_Date_B C End_Date_C D End_Date_D E End_Date_E Overall
0 01-01-2015 234 25-05-2017 633 03-06-2016 935 30-10-2019 673 16-12-2020 825 06-07-2019 1.388585e 17
1 01-01-2015 664 25-05-2017 663 03-06-2016 665 30-10-2019 663 16-12-2020 665 06-07-2019 1.388585e 17
Is there any other tips to perform this and get the total based on each row values?
CodePudding user response:
There is no need to use for-loop here, we can use a vectorized pandas based approach:
letters = pd.Index(['A', 'B', 'C', 'D', 'E'])
start = pd.to_datetime(ds['Start_Date'], dayfirst=True)
dates = ds['End_Date_' letters].apply(pd.to_datetime, dayfirst=True)
years = dates.sub(start, axis=0).astype('timedelta64[D]').div(365)
ds['Overall'] = ds[letters].pow(years.values).sum(1)
Result
print(ds)
Start_Date A End_Date_A B End_Date_B C End_Date_C D End_Date_D E End_Date_E Overall
0 01-01-2015 234 25-05-2017 633 03-06-2016 935 30-10-2019 673 16-12-2020 825 06-07-2019 7.261803e 16
1 01-01-2015 664 25-05-2017 663 03-06-2016 665 30-10-2019 663 16-12-2020 665 06-07-2019 6.624869e 16
Or if you would still like to use you existing code, here is a simple fix:
for i in ds.index:
total = 0 # Moved inside outer for-loop
for letter in letters:
start_date = parser.parse(ds["Start_Date"][i])
end_date = parser.parse(ds["End_Date_" letter][i])
years = (end_date - start_date).days / 365
power = math.pow(int(ds[letter][i]), years)
total = power
ds.loc[i, 'Overall'] = total # Notice the change here