Home > database >  Pandas dataframe column wise calculation
Pandas dataframe column wise calculation

Time:03-26

I have below dataframe columns:

Index(['Location' 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
       'Apr-2022_x', 'May-2022_x', 'Jun-2022_x', 'Jul-2022_x', 'Aug-2022_x',
       'Sep-2022_x', 'Oct-2022_x', 'Nov-2022_x', 'Dec-2022_x', 'Jan-2023_x',
       'Feb-2023_x', 'Mar-2023_x', 'Apr-2023_x', 'May-2023_x', 'Jun-2023_x',
       'Jul-2023_x', 'Aug-2023_x', 'Sep-2023_x', 'Oct-2023_x', 'Nov-2023_x',
       'Dec-2023_x', 'Jan-2024_x', 'Feb-2024_x', 'Mar-2024_x', 'Apr-2024_x',
       'May-2024_x', 'Jun-2024_x', 'Jul-2024_x', 'Aug-2024_x', 'Sep-2024_x',
       'Oct-2024_x', 'Nov-2024_x', 'Dec-2024_x', 
       'sum_val',
       'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
       'Mar-2022_y', 'Apr-2022_y', 'May-2022_y', 'Jun-2022_y', 'Jul-2022_y',
       'Aug-2022_y', 'Sep-2022_y', 'Oct-2022_y', 'Nov-2022_y', 'Dec-2022_y',
       'Jan-2023_y', 'Feb-2023_y', 'Mar-2023_y', 'Apr-2023_y', 'May-2023_y',
       'Jun-2023_y', 'Jul-2023_y', 'Aug-2023_y', 'Sep-2023_y', 'Oct-2023_y',
       'Nov-2023_y', 'Dec-2023_y', 'Jan-2024_y', 'Feb-2024_y', 'Mar-2024_y',
       'Apr-2024_y', 'May-2024_y', 'Jun-2024_y', 'Jul-2024_y', 'Aug-2024_y',
       'Sep-2024_y', 'Oct-2024_y', 'Nov-2024_y', 'Dec-2024_y'],
      dtype='object')

Sample dataframe with reduced columns looks like this:

df:
Location    Dec-2021_x  Jan-2022_x  sum_val Dec-2021_y  Jan-2022_y
A           212         315         1000    12          13      
B           312         612         1100    13          17      
C           242         712         1010    15          15      
D           215         382         1001    16          17      
E           252         319         1110    17          18      

I have to create a resultant dataframe which will be in the below format:

Index(['Location' 'Dec-2021', 'Jan-2022', 'Feb-2022', 'Mar-2022',
       'Apr-2022', 'May-2022', 'Jun-2022', 'Jul-2022', 'Aug-2022',
       'Sep-2022', 'Oct-2022', 'Nov-2022', 'Dec-2022', 'Jan-2023',
       'Feb-2023', 'Mar-2023', 'Apr-2023', 'May-2023', 'Jun-2023',
       'Jul-2023', 'Aug-2023', 'Sep-2023', 'Oct-2023', 'Nov-2023',
       'Dec-2023', 'Jan-2024', 'Feb-2024', 'Mar-2024', 'Apr-2024',
       'May-2024', 'Jun-2024', 'Jul-2024', 'Aug-2024', 'Sep-2024',
       'Oct-2024', 'Nov-2024', 'Dec-2024'
      dtype='object')

The way we do this is using the formula:

'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y' (these are all numeric columns)

and a similar way for all the months. There are 36 months to be precise. Is there any way to do this in a loop manner for each column in the month-year combination? There are around 65000 rows here so do not want to overwhelm the system.

CodePudding user response:

Use:

#sample data
np.random.seed(2022)
c = ['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
       'Apr-2022_x','sum_val', 'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
       'Mar-2022_y', 'Apr-2022_y']
df = (pd.DataFrame(np.random.randint(10, size=(5, len(c))), columns=c)
        .assign(Location=list('abcde')))
print (df)
          Location  Dec-2021_x  Jan-2022_x  Feb-2022_x  Mar-2022_x  Apr-2022_x  \
0        a           1           1           0           7           8   
1        b           8           0           3           6           8   
2        c           1           7           5           5           4   
3        d           0           7           5           5           8   
4        e           8           0           3           9           5   

   sum_val  Dec-2021_y  Jan-2022_y  Feb-2022_y  Mar-2022_y  Apr-2022_y  
0        2           8           0           5           9           1  
1        0           1           2           0           5           7  
2        8           2           3           1           0           4  
3        2           4           0           9           4           9  
4        2           1           7           2           1           7 

#remove unnecessary columns
df1 = df.drop(['sum_val'], axis=1)
#add columns names for not necessary remove - if need in ouput
df1 = df1.set_index('Location')
#split columns names by last _
df1.columns = df1.columns.str.rsplit('_', n=1, expand=True)

#seelct x and y Dataframes by second level and multiple
df2 = (df1.xs('x', axis=1, level=1).mul(df['sum_val'].to_numpy(), axis= 0) * 
       df1.xs('y', axis=1, level=1))
print (df2)
          Dec-2021  Jan-2022  Feb-2022  Mar-2022  Apr-2022
Location                                                  
a               16         0         0       126        16
b                0         0         0         0         0
c               16       168        40         0       128
d                0         0        90        40       144
e               16         0        12        18        70
  • Related