Python: need efficient way to set column value based on another column value


New Python user here, so I appreciate any ideas for best practices in addition to the issue I'm seeking advice on.

I have code that works for a small number of records, but it takes too long when I run it on a large dataframe. I've done a lot of reading on this issue, and there are several similar posts on SO, but none gave me enough clues to resolve it. First, let me show the input data:

ID bh_start_dt_list bh_$_amt_list bh_cnt orig_mth_$ group_rank cal_yr dur_mth bh_start_index
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 1 2005 8 -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 2 2005 9 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 3 2005 10 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 4 2005 11 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 5 2005 12 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 6 2006 1 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 7 2006 2 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 8 2006 3 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 9 2006 4 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 10 2006 5 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 11 2006 6 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 12 2006 7 0
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 13 2006 8 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 14 2006 9 NaN
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 15 2006 10 NaN

bh_cnt is the count of list items in bh_start_dt_list and bh_$_amt_list

bh_start_dt_list corresponds item-for-item to bh_$_amt_list

group_rank was assigned using this code:

      df['group_rank'] = df.groupby(['ID'])['ID'].rank(method="first", ascending=True)

Thus, when the ID changes, the ranking starts over at 1.
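As an aside, the same per-ID running counter can be produced without rank(). This sketch (with a minimal, made-up frame) uses groupby().cumcount(), which counts rows within each group directly and is typically faster than rank(method="first") on large data:

```python
import pandas as pd

# hypothetical minimal frame: only the ID column matters here
df = pd.DataFrame({'ID': [123, 123, 123, 456, 456]})

# cumcount() numbers rows 0, 1, 2, ... within each ID group;
# adding 1 reproduces rank(method="first", ascending=True)
df['group_rank'] = df.groupby('ID').cumcount() + 1
```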

bh_start_index was assigned using this code:

      def bh_idx_df(x):
          return next((i for i, y in enumerate(x['bh_start_dt_list'])
                       if int(str(y)[0:4]) == x['cal_yr'] and
                          int(str(y)[5:7]) == x['dur_mth']), None)
      df['bh_start_index'] = df.apply(bh_idx_df, axis=1)

So, essentially, if the dur_mth month and cal_yr year match a date in the bh_start_dt_list list, I ultimately want to populate a new column with the correspondingly indexed value from bh_$_amt_list. I thought that identifying the actual index first was prudent. As shown in the data above, the first date match was found in July of 2006 (bh_start_index == 0). I have separate code to assign bh_start_index = -1 where group_rank == 1, so that the first-rank row can be populated with the default $ amount from the orig_mth_$ column.

This code all works fine so far, I think. I haven't yet tested the generator expression in the function [next((i for i, y in enumerate(x['bh_start_dt_list'])] against all the data. In my experience, Python loops don't perform well on large datasets, and I'll be running millions of records through this code. I'm trying to write code that works first and then refine it. Please let me know if you think this loop is part of the inefficiency problem.
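For what it's worth, here is one way the per-row list scan could be replaced with column-level operations, sketched on a tiny made-up frame (the column names match the question; the sample values are hypothetical). The idea is to explode the date lists into their own rows, compare years/months in bulk, then take the first matching position per original row:

```python
import pandas as pd

df = pd.DataFrame({
    'bh_start_dt_list': [[pd.Timestamp('2006-07-26'), pd.Timestamp('2016-01-12')]] * 2,
    'cal_yr': [2006, 2006],
    'dur_mth': [6, 7],
})

# one row per date, keeping the original row label as the index
dates = pd.to_datetime(df['bh_start_dt_list'].explode())
# position of each date within its original list
pos = dates.groupby(level=0).cumcount()
# bulk year/month comparison against the owning row's cal_yr/dur_mth
match = (dates.dt.year.values == df.loc[dates.index, 'cal_yr'].values) & \
        (dates.dt.month.values == df.loc[dates.index, 'dur_mth'].values)
# first matching position per original row; rows with no match get NaN
df['bh_start_index'] = pos[match].groupby(level=0).first()
```

Row 0 has no date in June 2006, so it stays NaN; row 1 matches 2006-07-26 at position 0.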

Finally, the NaN values in the bh_start_index column need to be populated too. Once each of those has a value, it will be easy to reference that index to populate a new column with a $ amount from the bh_$_amt_list. Here is the code I was using to do that:

def bh_adj_idx_df(x):
    for i in range(-1, int(x['bh_cnt'])):
        #  1st BH record
        if i == -1:
            if x['group_rank'] <= list(DF.loc[DF['bh_start_index'] == 0]['group_rank'])[0]:
                return i
        #  Last BH record
        elif i == x['bh_cnt'] - 1:
            if x['group_rank'] > list(DF.loc[DF['bh_start_index'] == i]['group_rank'])[0]:
                return i
        #  All other BH records
        elif list(DF.loc[DF['bh_start_index'] == i]['group_rank'])[0] < x['group_rank'] \
                <= list(DF.loc[DF['bh_start_index'] == i + 1]['group_rank'])[0]:
            return i
df['bh_adj_index'] = df.apply(bh_adj_idx_df, axis=1)

I wrote the dataframe name DF in caps to highlight what I believe is a problem. The function bh_adj_idx_df(x) should reference x throughout its code, but when I use only that alias I get this error:

 KeyError: 'False: boolean label can not be used without a boolean index'

What does this mean, and why does using the literal DF dataframe name fix this error? I assume that mixing the two dataframe references in the function contributes to the inefficiency.

Here is the desired but inefficient output data from that code:

ID bh_start_dt_list bh_$_amt_list bh_cnt orig_mth_$ group_rank cal_yr dur_mth bh_start_index bh_adj_index
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 1 2005 8 -1 -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 2 2005 9 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 3 2005 10 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 4 2005 11 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 5 2005 12 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 6 2006 1 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 7 2006 2 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 8 2006 3 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 9 2006 4 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 10 2006 5 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 11 2006 6 NaN -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 12 2006 7 0 -1
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 13 2006 8 NaN 0
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 14 2006 9 NaN 0
123 [2006-07-26, 2016-01-12, 2016-02-22] [16.72, 15.48, 16.72] 3 15.48 15 2006 10 NaN 0

I want the default (-1) to populate up through the row where the first date from bh_start_dt_list is identified. The following row then gets the corresponding index into bh_$_amt_list, until a new date is identified, and so on. Here is the code to assign the dollar amount based on the index value:

def ep_be_df(x):
    if x['bh_adj_index'] == -1:
        return x['orig_mth_$']
    else:
        return x['bh_$_amt_list'][int(x['bh_adj_index'])]
df['the_correct_amt'] = df.apply(ep_be_df, axis=1)
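If I've read the fill rule correctly ("an index found on a row takes effect on the next row"), both the fill and the amount lookup can be done without row-wise apply: shift bh_start_index down one row within each ID, forward-fill, default the leading NaNs to -1, then pick the amount. A sketch on a made-up 5-row frame:

```python
import numpy as np
import pandas as pd

# hypothetical frame with the columns the question describes
df = pd.DataFrame({
    'ID': [123] * 5,
    'bh_start_index': [-1, np.nan, 0, np.nan, np.nan],
    'orig_mth_$': [15.48] * 5,
    'bh_$_amt_list': [[16.72, 15.48, 16.72]] * 5,
})

# within each ID: shift so a found index applies from the *next* row,
# then forward-fill the NaN gaps
df['bh_adj_index'] = (df.groupby('ID')['bh_start_index']
                        .transform(lambda s: s.shift().ffill()))
# first row of each group has nothing above it -> default -1
df['bh_adj_index'] = df['bh_adj_index'].fillna(-1)

# amount: orig_mth_$ for the default, else the indexed list element
df['the_correct_amt'] = np.where(
    df['bh_adj_index'] == -1,
    df['orig_mth_$'],
    [amts[int(i)] if i >= 0 else None
     for amts, i in zip(df['bh_$_amt_list'], df['bh_adj_index'])],
)
```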

If you read this far, congratulations and I thank you.

Can anyone suggest any numpy approaches for what I'm trying to do here? I understand numpy.where is much more efficient.

Can anyone see a way to do this without looping thru the date list?

Regards,

Tim

CodePudding user response:

Long question - it might be easier to answer if it were simplified into a more concise example. For now I'll discuss it more broadly, hoping to help you to the next step toward solving this.

Pandas and numpy work well on tabular data, which your data example doesn't quite look like to me. The goal is to avoid nested looping over the data in Python syntax, which is slow. I'd reformat the input as tidy data - one observation per row, one variable per column, one single value per cell - after which the efficient vectorized numpy operations become much easier to apply. The looping through the list can then be handled by pandas/numpy instead of Python.

In your example, we'd denormalize the data into something more tabular. Specifically, I'd unstack the lists nested within the cells into separate observations/rows, which results in repeated data across the other cells.

Once we have this new denormalized input format without the lists, I believe it will be easier to see how to use numpy to solve this - which feels very doable to me, but is obscured by the current input data format.

Example of a 'tidy' version of the input data:

ID bh_start_dt_list bh_$_amt_list bh_cnt orig_mth_$ group_rank cal_yr dur_mth bh_start_index
123 26/07/2006 16.72 3 15.48 1 2005 8 -1
123 12/01/2016 15.48 3 15.48 1 2005 8 -1
123 22/02/2016 16.72 3 15.48 1 2005 8 -1
123 26/07/2006 16.72 3 15.48 2 2005 9 NaN
123 12/01/2016 15.48 3 15.48 2 2005 9 NaN
123 22/02/2016 16.72 3 15.48 2 2005 9 NaN
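The unstacking above can be done directly in pandas. A minimal sketch with made-up values (passing a list of columns to explode requires pandas 1.3+; the two lists must be equal length per row so the pairs stay aligned):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [123],
    'bh_start_dt_list': [['2006-07-26', '2016-01-12', '2016-02-22']],
    'bh_$_amt_list': [[16.72, 15.48, 16.72]],
})

# explode both list columns in lockstep: one (date, amount) pair per row,
# with the scalar columns (ID here) repeated for each new row
tidy = df.explode(['bh_start_dt_list', 'bh_$_amt_list'])
```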

CodePudding user response:

Many thanks to ivanp for helping me rethink how to do this.

The data is now restructured so that dates and amounts are no longer stored as lists in a column, so there is no need to loop over a list searching for the matching date. The correct value was attached to the original data using a merge on ID, year, and month. This had the added benefit of putting the corresponding amount in the correct row without having to mess around with tracking indexes.

I still need to find a way to fill in the NaN values between the rows that got populated with an amount. I'll submit a new, much shorter question, showing the actual data so far and the desired output.
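For reference, the merge described above looks roughly like this (column and frame names are illustrative; `lookup` is the tidied one-row-per-date data, `monthly` is the original month-by-month frame):

```python
import pandas as pd

# hypothetical tidy lookup: one row per (ID, year, month) with its amount
lookup = pd.DataFrame({
    'ID': [123, 123],
    'cal_yr': [2006, 2016],
    'dur_mth': [7, 1],
    'bh_amt': [16.72, 15.48],
})

monthly = pd.DataFrame({
    'ID': [123] * 3,
    'cal_yr': [2006, 2006, 2006],
    'dur_mth': [6, 7, 8],
})

# a left merge keeps every monthly row; months with no matching start
# date get NaN in bh_amt - these are the gaps still to be filled
merged = monthly.merge(lookup, on=['ID', 'cal_yr', 'dur_mth'], how='left')
```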
