I have a pandas dataframe
. I want to generate a new variable (column) based on multiple column inputs where the year index is greater than a certain value.
The below illustrates what I want to do, but I want to simplify it into a function as the calculation in reality is more complex than the below illustration and the variable names are longer.
Ideally the function would split the calculation into intermediate temporary values (not saved to the df
) and across multiple lines to make it easier to read. For example, could define: Share = (df['B'] df['C']) / (df['B'] df['C'] df['D'])
and then X = A Share * E
.
I have used apply
previously to apply a function to a dataframe
, but that example only used a single variable as input and no where clause, and I do not know how to extend the example.
How would I simply the below calculation to generate X
based on A
, B
, C
, D
, and E
, where year
>= 2020?
import numpy as np
import pandas as pd
np.random.seed(2981)
df = pd.DataFrame({
'year' : [2018, 2019, 2020, 2021,2018, 2019, 2020, 2021,2018, 2019, 2020, 2021],
'id' : ['ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','GHI','GHI','GHI','GHI'],
'A': np.random.choice(range(100),12),
'B': np.random.choice(range(100),12),
'C': np.random.choice(range(100),12),
'D': np.random.choice(range(100),12),
'E': np.random.choice(range(100),12),
})
df = df.set_index('year')
df['X'] = np.where( df.index >= 2020, df['A'] (df['B'] df['C']) / (df['B'] df['C'] df['D']) * df['E'] , np.nan )
CodePudding user response:
First, you should only use apply if necessary. Vectorized functions will be much faster, and the way you have it written now in the np.where statement makes use of these. If you really want to make your code more readable (at the (probably small) expense of time and memory) you could make an intermediate column and then use it in the np.where statement.
df["Share"] = ( df.B df.C ) / ( df.B df.C df.D )
df["X"] = ( df.A df.Share * df.E ).where( df.index >= 2020 )
To answer your question, however, you can create a custom function and then apply it to your DataFrame.
def my_func( year,a,b,c,d,e ):
#This function can be longer and do more things
return np.nan if year < 2020 else a ( ( (b c) / (b c d) ) * e )
df['X'] = df.apply( lambda x: my_func( x.name, x.A, x.B, x.C, x.D, x.E ), axis = 1 )
Note that to access then index of a row when using apply with axis = 1
you need to use the name attribute.
Also, since applying a function is relatively slow, it may be worth creating columns that take care of some of the intermediate steps (such as summing several columns, etc.) so that that doesn't need to be done in each iteration.
Check out this answer for more examples of applying a custom function.