Home > OS >  Parsing in Dataframe for each Row
Parsing in Dataframe for each Row

Time:03-17

I want to perform a calculation whereby for every end date of letter need to be minus with the Start_Date, then divide with 365 to get duration in years. Then the duration of each letter column need to be use as a 'power of' of the value of their letter column, respectively. Then the result of each letter need to be sum up to get the total.

I've tried using below coding where I'm using parsing and I managed to get the answer.

import pandas as pd

dataset = [['01-01-2015', 234, '25-05-2017', 633, '03-06-2016', 935, '30-10-2019', 673, '16-12-2020', 825, '06-07-2019'],
          ['01-01-2015', 664, '25-05-2017', 663, '03-06-2016', 665, '30-10-2019', 663, '16-12-2020', 665, '06-07-2019']]
ds = pd.DataFrame(dataset, columns = ['Start_Date', 'A', 'End_Date_A', 'B', 'End_Date_B', 'C', 'End_Date_C', 
                                      'D', 'End_Date_D', 'E', 'End_Date_E'])


    Start_Date   A  End_Date_A   B  End_Date_B   C  End_Date_C   D  End_Date_D   E  End_Date_E
0   01-01-2015  234 25-05-2017  633 03-06-2016  935 30-10-2019  673 16-12-2020  825 06-07-2019
1   01-01-2015  664 25-05-2017  663 03-06-2016  665 30-10-2019  663 16-12-2020  665 06-07-2019

from dateutil import parser
import math

letters = ["A", "B", "C", "D", "E"]
total = 0

for i in ds.index:
    for letter in letters:
        start_date = parser.parse(ds["Start_Date"][i])
        end_date = parser.parse(ds["End_Date_"   letter][i])
        years = (end_date - start_date).days / 365
        power = math.pow(int(ds[letter][i]), years)
        total = power

ds['Overall'] = total

However, it shows the same result for each of the rows.

Start_Date  A   End_Date_A  B   End_Date_B  C   End_Date_C  D   End_Date_D  E   End_Date_E  Overall
0   01-01-2015  234 25-05-2017  633 03-06-2016  935 30-10-2019  673 16-12-2020  825 06-07-2019  1.388585e 17
1   01-01-2015  664 25-05-2017  663 03-06-2016  665 30-10-2019  663 16-12-2020  665 06-07-2019  1.388585e 17

Is there any other tips to perform this and get the total based on each row values?

CodePudding user response:

There is no need to use for-loop here, we can use a vectorized pandas based approach:

letters = pd.Index(['A', 'B', 'C', 'D', 'E'])

start = pd.to_datetime(ds['Start_Date'], dayfirst=True)
dates = ds['End_Date_'   letters].apply(pd.to_datetime, dayfirst=True)
years = dates.sub(start, axis=0).astype('timedelta64[D]').div(365)

ds['Overall'] = ds[letters].pow(years.values).sum(1)

Result

print(ds)

   Start_Date    A  End_Date_A    B  End_Date_B    C  End_Date_C    D  End_Date_D    E  End_Date_E       Overall
0  01-01-2015  234  25-05-2017  633  03-06-2016  935  30-10-2019  673  16-12-2020  825  06-07-2019  7.261803e 16
1  01-01-2015  664  25-05-2017  663  03-06-2016  665  30-10-2019  663  16-12-2020  665  06-07-2019  6.624869e 16

Or if you would still like to use you existing code, here is a simple fix:

for i in ds.index:
    total = 0 # Moved inside outer for-loop
    for letter in letters:
        start_date = parser.parse(ds["Start_Date"][i])
        end_date = parser.parse(ds["End_Date_"   letter][i])
        years = (end_date - start_date).days / 365
        power = math.pow(int(ds[letter][i]), years)
        total = power

    ds.loc[i, 'Overall'] = total # Notice the change here
  • Related