Home > database >  Calculate sum based on multiple rows from list column for each row in pandas dataframe
Calculate sum based on multiple rows from list column for each row in pandas dataframe

Time:02-25

I have a dataframe that looks something like this:

df = pd.DataFrame({'id': range(5), 'col_to_sum': np.random.rand(5), 'list_col': [[], [1], [1,2,3], [2], [3,1]]})
    
    id  col_to_sum  list_col
0   0   0.557736    []
1   1   0.147333    [1]
2   2   0.538681    [1, 2, 3]
3   3   0.040329    [2]
4   4   0.984439    [3, 1]

In reality I have more columns and ~30000 rows but the extra columns are irrelevant for this. Note that all the list elements are from the id column and that the id column is not necessarily the same as the index.

I want to make a new column that for each row sums the values in col_to_sum corresponding to the ids in list_col. In this example that would be:

    id  col_to_sum  list_col    sum
0   0   0.557736    []          0.000000
1   1   0.147333    [1]         0.147333
2   2   0.538681    [1, 2, 3]   0.726343
3   3   0.040329    [2]         0.538681
4   4   0.984439    [3, 1]      0.187662

I have found a way to do this but it requires looping through the entire dataframe and is quite slow on the larger df with ~30000 rows (~6 min). The way I found was

df['sum'] = 0

for i in range(len(df)):
    mask = df['id'].isin(df['list_col'].iloc[i])
    df.loc[i, 'sum'] = df.loc[mask, 'col_to_sum'].sum()

Ideally I would want a vectorized way to do this but I haven't been able to do it. Any help is greatly appreciated.

CodePudding user response:

I'm using non-random values in this demo because they're easier to reproduce and check.

I'm also using an id-column ([0, 1, 3, 2, 4]) that is not identical to the index.

Setup:

>>> df = pd.DataFrame({'id': [0, 1, 3, 2, 4], 'col_to_sum': [1, 2, 3, 4, 5], 'list_col': [[], [1], [1, 2, 3], [2], [3, 1]]})
>>> df
   id  col_to_sum   list_col
0   0           1         []
1   1           2        [1]
2   3           3  [1, 2, 3]
3   2           4        [2]
4   4           5     [3, 1]

Solution:

df = df.set_index('id')
df['sum'] = df['list_col'].apply(lambda l: df.loc[l, 'col_to_sum'].sum())
df = df.reset_index()

Output:

>>> df
   id  col_to_sum   list_col  sum
0   0           1         []    0
1   1           2        [1]    2
2   3           3  [1, 2, 3]    9
3   2           4        [2]    4
4   4           5     [3, 1]    5

CodePudding user response:

You can use a lambda function that will let you use the list_col and find the iloc of the corresponding list_col to summarize

df['sum_col'] = df['list_col'].apply(lambda x : df['col_to_sum'].iloc[x].sum())
  • Related