I have a dataset with some columns (not important to the calculations) and then many columns that share the same name prefix. I want to count, per row, how many of those columns contain something other than a NaN value. The set looks something like this:
id | something | number1 | number2 | number3 | number4 |
---|---|---|---|---|---|
1 | 105 | 200 | NaN | NaN | 50 |
2 | 300 | 2 | 1 | 1 | 33 |
3 | 20 | 1 | NaN | NaN | NaN |
So I want to create a new column that contains the count of the number columns that have a value. The final dataset would look like this:
id | something | number1 | number2 | number3 | number4 | sum_columns |
---|---|---|---|---|---|---|
1 | 105 | 200 | NaN | NaN | 50 | 2 |
2 | 300 | 2 | 1 | 1 | 33 | 4 |
3 | 20 | 1 | NaN | NaN | NaN | 1 |
I know I can select the columns that start with a specific name, something like this:
df[df.columns[pd.Series(df.columns).str.startswith('number')]]
but I can't figure out how to add the condition that the value has to be something other than NaN, or how to apply it to every row. I think it could be done with a lambda, but I haven't succeeded yet.
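Something along these lines is what I was trying with apply and a lambda (just a rough sketch; the name number_cols is mine):
# rough sketch: for each row, count the non-NaN values among the 'number' columns
number_cols = [c for c in df.columns if c.startswith('number')]
df['sum_columns'] = df.apply(lambda row: row[number_cols].notna().sum(), axis=1)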
CodePudding user response:
# filter the columns whose label contains 'number' and count the non-NaN values per row
df['sum_columns'] = df.filter(like='number').count(axis=1)
df
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
PS: The NaN counts in your first and second DFs don't match, so I used the second DF in the solution.
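For a self-contained check, here is a minimal sketch (my own reconstruction of the sample data from the question) showing the one-liner on a fresh DataFrame; note that filter(like='number') matches any label containing 'number', while filter(regex='^number') would enforce a strict prefix:
import numpy as np
import pandas as pd

# sample data as shown in the question
df = pd.DataFrame({'id': [1, 2, 3],
                   'something': [105, 300, 20],
                   'number1': [200, 2, 1],
                   'number2': [np.nan, 1, np.nan],
                   'number3': [np.nan, 1, np.nan],
                   'number4': [50, 33, np.nan]})

# count() skips NaN by default, so this gives the per-row count of filled 'number' columns
df['sum_columns'] = df.filter(like='number').count(axis=1)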
CodePudding user response:
Indeed, df[df.columns[df.columns.str.startswith('number')]] gives you the DataFrame restricted to the columns starting with 'number'. Now we only need to count, per row, the values that are not NaN. This can be done like so:
df['sum_columns'] = df[df.columns[df.columns.str.startswith('number')]].notnull().sum(axis=1)
Output:
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
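As a side note (my own variant, not part of the answer above), the same boolean mask works directly with .loc, which reads a bit shorter:
# equivalent selection via a boolean column mask, then a per-row non-NaN count
mask = df.columns.str.startswith('number')
df['sum_columns'] = df.loc[:, mask].notna().sum(axis=1)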
CodePudding user response:
import pandas as pd
import numpy as np

data = {'something': [105, 300, 20],
        'number1': [200, 2, 1],
        'number2': [np.nan, 1, np.nan],
        'number3': [np.nan, 1, np.nan],
        'number4': [50, 33, np.nan]}
df = pd.DataFrame(data)

# select the columns whose name starts with 'number', then count non-NaN values per row
tmp = df[df.columns[pd.Series(df.columns).str.startswith('number')]]
df['sum_columns'] = tmp.notnull().sum(axis=1)
df
Output:
something number1 number2 number3 number4 sum_columns
0 105 200 NaN NaN 50.0 2
1 300 2 1.0 1.0 33.0 4
2 20 1 NaN NaN NaN 1
CodePudding user response:
One can use pandas.DataFrame.iloc to select the desired columns by their position, and .count(axis=1) to count the non-NaN values per row, as follows:
df['sum_columns'] = df.iloc[:, 2:].count(axis=1)
[Out]:
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
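One caveat worth adding (my note, not the answer's): iloc is purely positional, so 2: only works while every column from position 2 onwards is a 'number' column. If that layout might change, the start position can be looked up by name, for example:
# look up the position of the first 'number' column instead of hard-coding 2;
# this still assumes the 'number' columns are contiguous, run to the end of the frame,
# and that 'sum_columns' has not been added yet
start = df.columns.get_loc('number1')
df['sum_columns'] = df.iloc[:, start:].count(axis=1)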