Home > database >  Sum columns in pandas based on the names of the columns
Sum columns in pandas based on the names of the columns

Time:12-16

I have a dataframe with the population by age in several cities:

City             Age_25    Age_26   Age_27  Age_28  Age_29  Age_30  

New York         11312     3646    4242     4344    4242    6464
London           6446      2534    3343     63475   34433   34434 
Paris            5242      34343   6667     132     323     3434
Hong Kong        354       979     878      6776    7676    898
Buenos Aires     4244      7687    78       8676    786     9798

I want to create a new dataframe with the sum of the columns based on ranges of three years. That is, people from 25 to 27 and people from 28 to 30. Like this:

City          Age_25_27  Age_28_30 

New York         19200   15050
London           12323   132342
Paris            46252   3889
Hong Kong        2211    15350
Buenos Aires     12009   19260 

In this example I gave a range of three year but in mine real database it has to be 5 five and with 100 ages.

How could I do that? I've saw some related answers but neither work very well in my case.

CodePudding user response:

Try this:

age_columns = df.filter(like='Age_').columns
n = age_columns.str.split('_').str[-1].astype(int)

df['Age_25-27'] = df[age_columns[(n >= 25) & (n <= 27)]].sum(axis=1)
df['Age_28-30'] = df[age_columns[(n >= 28) & (n <= 30)]].sum(axis=1)

Output:

>>> df
         City  Age_25  Age_26  Age_27  Age_28  Age_29  Age_30  Age_25-27  Age_28-30
New      York   11312    3646    4242    4344    4242  6464.0      19200    15050.0
London   6446    2534    3343   63475   34433   34434     NaN      69352    68867.0
Paris    5242   34343    6667     132     323    3434     NaN      41142     3757.0
Hong     Kong     354     979     878    6776    7676   898.0       2211    15350.0
Buenos  Aires    4244    7687      78    8676     786  9798.0      12009    19260.0

CodePudding user response:

You can use groupby:

In [1]: import pandas as pd
   ...: import numpy as np

In [2]: d = {
   ...:     'City': ['New York', 'London', 'Paris', 'Hong Kong', 'Buenos Aires'],
   ...:     'Age_25': [11312, 6446, 5242, 354, 4244],
   ...:     'Age_26': [3646, 2534, 34343, 979, 7687],
   ...:     'Age_27': [4242, 3343, 6667, 878, 78],
   ...:     'Age_28': [4344, 63475, 132, 6776, 8676],
   ...:     'Age_29': [4242, 34433, 323, 7676, 786],
   ...:     'Age_30': [6464, 34434, 3434, 898, 9798]
   ...: }
   ...: 
   ...: df = pd.DataFrame(data=d)
   ...: df = df.set_index('City')
   ...: df
Out[2]: 
              Age_25  Age_26  Age_27  Age_28  Age_29  Age_30
City                                                        
New York       11312    3646    4242    4344    4242    6464
London          6446    2534    3343   63475   34433   34434
Paris           5242   34343    6667     132     323    3434
Hong Kong        354     979     878    6776    7676     898
Buenos Aires    4244    7687      78    8676     786    9798

In [3]: n_cols = 3  # change to 5 for actual dataset
   ...: sum_every_n_cols_df = df.groupby((np.arange(len(df.columns)) // n_cols)   1, axis=1).sum()
   ...: sum_every_n_cols_df
Out[3]: 
                  1       2
City                       
New York      19200   15050
London        12323  132342
Paris         46252    3889
Hong Kong      2211   15350
Buenos Aires  12009   19260

CodePudding user response:

You can extract the columns of the dataframe and put them in a list. Use

col_list = df.columns

But ultimately, I think what you'd want to do is more of a while loop with your inputs (band of 5 and up to 100 ages) as static values that you iterate over.

band = 5
start = 20
max_age = 120

i = start
while i < max_age:
  age_start = i
  age_end = i
  sum_cols = []
  col_name = 'age_'   str(age_start)   '_to_'   str(age_end)  
  for i in range(age_start,age_end):
    age_adder = 'age_'   str(i)
    df[col_name]  = df[age_adder]
  i  = band
  • Related