Assign column adding columns in pandas dynamically (method chaining in python)-CodePudding

I want to create a new column named total which adds all the year columns (everything in these columns are integers). I want to do it dynamically because as each year passes there will be a new column (for example 2024).

How can I do this in Python using method chaining and the assign operator?

id name     2018 2019 2020 2021 2022  type
1  John      0    1    0    0    2    A
2  Bill      1    5    4    0    0    B
3  Tom       0    0    2    0    5    B
4  Mary      0    1    1    0    0    A

Expected Output:

id name     2018 2019 2020 2021 2022 type total
1  John      0    1    0    0    2    A   3
2  Bill      1    5    4    0    0    B   10
3  Tom       0    0    2    0    5    B   7
4  Mary      0    1    1    0    0    A   2

I have this solution but I don't like it, is there a more eloquent way of writing this code?

Temporary Solution:

(
     df
     .assign(Total = lambda x: x['2018']   x['2019']   x['2020']   x['2021'] x['2022'])
)

CodePudding user response：

Assuming that the only columns with int values will be the year columns, you can just do df['Total'] = df.sum(axis = 1):

columns = ['name', '2018', '2019', '2020', '2021', '2022', 'type']
values = [['John',0,1,0,0,2,'A'], ['Bill',1,5,4,0,0,'B'], ['Tom',0,0,2,0,5,'B'], ['Mary',0,1,1,0,0,'A']]

df = pd.DataFrame(values, columns = columns)

df['Total'] = df.sum(axis = 1)
df

Output:

name       2018    2019    2020   2021    2022    type    Total
0   John    0       1       0      0       2       A        3
1   Bill    1       5       4      0       0       B        10
2   Tom     0       0       2      0       5       B        7
3   Mary    0       1       1      0       0       A        2

CodePudding user response：

Use DataFrame.loc for select columns between 2018 and 2022:

df = df.assign(Total = df.loc[:, '2018':'2022'].sum(axis=1))
print (df)
   id  name  2018  2019  2020  2021  2022 type  Total
0   1  John     0     1     0     0     2    A      3
1   2  Bill     1     5     4     0     0    B     10
2   3   Tom     0     0     2     0     5    B      7
3   4  Mary     0     1     1     0     0    A      2

Another idea is filter columns names by mask - here convert to numeric and test if column names are larger like 2000:

df = df.assign(Total=df.loc[:, pd.to_numeric(df.columns,errors='coerce')>2000].sum(axis=1))
print (df)
   id  name  2018  2019  2020  2021  2022 type  Total
0   1  John     0     1     0     0     2    A      3
1   2  Bill     1     5     4     0     0    B     10
2   3   Tom     0     0     2     0     5    B      7
3   4  Mary     0     1     1     0     0    A      2

CodePudding user response：

You could use assign with filter with a generic regex for 4-digits year:

out = df.assign(Total=lambda d: d.filter(regex='^\d{4}$').sum(axis=1))

Or with drop and the columns to ignore:

out = df.assign(Total=lambda d: d.drop(columns=['id', 'name', 'type', 'Total'],
                                       errors='ignore').sum(axis=1))

Output:

   id  name  2018  2019  2020  2021  2022 type  Total
0   1  John     0     1     0     0     2    A      3
1   2  Bill     1     5     4     0     0    B     10
2   3   Tom     0     0     2     0     5    B      7
3   4  Mary     0     1     1     0     0    A      2

Sorting columns and changing type of all integer columns:

out = (df.assign(Total=lambda d: d.filter(regex='^\d{4}$').sum(axis=1)
         .sort_index(key=lambda x: x.astype(str).str.fullmatch('\d{4}'), axis=1)
         .pipe(lambda d: d.astype({c: float for c in out.select_dtypes('number')}))
       )

Output:

    id  name type  total  2018  2019  2020  2021  2022
0  1.0  John    A    3.0   0.0   1.0   0.0   0.0   2.0
1  2.0  Bill    B   10.0   1.0   5.0   4.0   0.0   0.0
2  3.0   Tom    B    7.0   0.0   0.0   2.0   0.0   5.0
3  4.0  Mary    A    2.0   0.0   1.0   1.0   0.0   0.0