Pandas dataframes: Create new column with a formula that uses all values of column X until each row

Time:02-03

Often (e.g. for time series) I need to use all the values in a column up to the current row.

For instance, if my dataframe has 100 rows, I want to create a new column where the value in each row is a sum, average, product, or any other formula applied to all the rows up to that row, excluding the ones after it:

Row 20 = formula(all_values_until_row_20)
Row 21 = formula(all_values_until_row_21)
etc

I think the easiest way to ask this question would be: How to implement the cumsum() function for a new column in pandas without using that specific method?

CodePudding user response:

One approach, if you cannot use cumsum, is to introduce a new column (or index) holding each row's position, and then apply a lambda function that uses all rows whose position is less than or equal to the current row's.

import pandas as pd

df = pd.DataFrame({'x': range(20, 30), 'y': range(40, 50)}).set_index('x')

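# Helper column holding each row's position (0, 1, 2, ...)
df['Id'] = range(len(df))
# For each row, sum 'y' over all rows whose position is <= the current row's
df['Sum'] = df.apply(lambda row: df.loc[df['Id'] <= row['Id'], 'y'].sum(), axis=1)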
print(df)
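
If the goal is "any formula over all rows so far", a possibly simpler route is pandas' expanding window, which avoids the helper column. The following is only a sketch on the same toy frame; the `spread` function is a made-up stand-in for "any other formula", and the column names `running_sum` and `running_spread` are my own choices.

```python
import pandas as pd

# Same toy frame as above: index x = 20..29, column y = 40..49.
df = pd.DataFrame({'x': range(20, 30), 'y': range(40, 50)}).set_index('x')


def spread(values):
    # Hypothetical custom formula: max minus min of all values seen so far.
    return values.max() - values.min()


# expanding() gives a window from the first row up to and including the current row.
df['running_sum'] = df['y'].expanding().sum()
# raw=True passes each window to the function as a NumPy array.
df['running_spread'] = df['y'].expanding().apply(spread, raw=True)

print(df)
```

Like the lambda version, `expanding().apply` with a Python function re-evaluates the formula on every window, so it can be slow on large frames; the built-in aggregations such as `expanding().sum()` should be considerably faster.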

CodePudding user response:

Since there is no sample data, I will assume a dataframe with at least one column of numeric data and no NaN values. I would start as below for cumulative sums and averages.

cumulative sum:

df['cum_sum'] = df['existing_col'].cumsum()

cumulative average (assuming index_col holds the 1-based row number, i.e. 1, 2, 3, ...):

df['cum_avg'] = df['existing_col'].cumsum() / df['index_col']

or

df['cum_avg'] = df['existing_col'].expanding().mean()
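
To make the two cumulative-average variants concrete, here is a small sketch on made-up data. The values are hypothetical, and instead of a separate index_col it builds the 1-based row count with `np.arange`, which is an assumption about what that column was meant to contain.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; 'existing_col' is the placeholder name used above.
df = pd.DataFrame({'existing_col': [2.0, 4.0, 6.0, 8.0]})

# 1-based row position, so row k is divided by k (the number of rows seen so far).
row_count = np.arange(1, len(df) + 1)

df['cum_sum'] = df['existing_col'].cumsum()
df['cum_avg_manual'] = df['cum_sum'] / row_count
df['cum_avg'] = df['existing_col'].expanding().mean()
df['cum_prod'] = df['existing_col'].cumprod()

print(df)
```

With no NaN values both averages agree. If the column does contain NaN, the two can differ, because the manual version still divides by the full row count.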

If you can provide a sample DataFrame, you will likely get better help.
