Often (e.g. for time series) I need to use all the values in a column up to and including the current row.
For instance, if my dataframe has 100 rows, I want to create a new column where the value in each row is a function (sum, average, product, or any other formula) of all the rows up to that point, excluding the rows that come after it:
Row 20 = formula(all_values_until_row_20)
Row 21 = formula(all_values_until_row_21)
etc
I think the easiest way to ask this question would be: how can I implement the cumsum() function for a new column in pandas without using that specific method?
CodePudding user response:
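One approach, if you cannot use cumsum, is to add a helper column (or index) holding each row's position, and then apply a lambda function that uses all rows whose helper value is less than or equal to the current row's. Note that this re-filters the whole dataframe for every row, so the cost grows quadratically with the number of rows.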
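import pandas as pd
df = pd.DataFrame({'x': range(20, 30), 'y': range(40, 50)}).set_index('x')
# Helper column: 0-based position of each row
df['Id'] = range(0, len(df.index))
# For each row, sum 'y' over all rows up to and including the current one
df['Sum'] = df.apply(lambda x: df[df['Id']<=x['Id']]['y'].sum(), axis=1)
print(df)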
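The same idea can be generalised to any formula, not only a sum. The following is a minimal sketch, not a built-in pandas feature: `cumulative_apply` is a hypothetical helper name, and it uses positional slicing with `iloc` instead of the extra `Id` column. Like the lambda above, it recomputes every prefix from scratch, so it is only reasonable for small frames.

```python
import pandas as pd

def cumulative_apply(series, func):
    """Hypothetical helper: apply func to rows 0..i of series, for every row i."""
    # Each prefix is recomputed from scratch, so this is O(n^2) overall.
    values = [func(series.iloc[: i + 1]) for i in range(len(series))]
    return pd.Series(values, index=series.index)

df = pd.DataFrame({'x': range(20, 30), 'y': range(40, 50)}).set_index('x')

# Any function that reduces a Series to a single value can be passed in.
df['Sum'] = cumulative_apply(df['y'], lambda s: s.sum())
df['Product'] = cumulative_apply(df['y'], lambda s: s.prod())
print(df)
```

Because `func` receives the whole prefix as a Series, it can be any reduction (median, custom weighting, and so on), at the price of the quadratic cost.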
CodePudding user response:
Since there is no sample data, I will assume a dataframe with at least one column of numeric data and no NaN values. I would start as below for the cumulative sum and average.
cumulative sum:
df['cum_sum'] = df['existing_col'].cumsum()
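cumulative average (this assumes index_col is a column holding the 1-based row count 1, 2, 3, ..., so each running sum is divided by the number of rows seen so far):
df['cum_avg'] = df['existing_col'].cumsum() / df['index_col']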
or
df['cum_avg'] = df['existing_col'].expanding().mean()
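As a rough illustration of how `expanding()` covers the "any other formula" case from the question, here is a small sketch. The sample values and the `cum_range` column are made up for the example; `existing_col` is the placeholder column name used above.

```python
import pandas as pd

# Made-up sample data, since the question did not include any.
df = pd.DataFrame({'existing_col': [2.0, 4.0, 6.0, 8.0]})

# Built-in expanding aggregations: each row sees all rows up to and including itself.
df['cum_sum'] = df['existing_col'].expanding().sum()
df['cum_avg'] = df['existing_col'].expanding().mean()

# There is a dedicated cumulative method for products.
df['cum_prod'] = df['existing_col'].cumprod()

# Custom formula via expanding().apply(): here, max minus min of the values so far.
df['cum_range'] = df['existing_col'].expanding().apply(lambda s: s.max() - s.min())
print(df)
```

The function passed to `expanding().apply()` has to return a single number for each window, and it will generally be slower than the built-in aggregations.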
If you can provide a sample DataFrame, I believe you will get better help.