I have a dataset with some columns (not important to the calculations) and then many columns that share the same name prefix. I want to count, per row, how many of those columns contain something other than a NaN value. The set looks something like this:
id | something | number1 | number2 | number3 | number4 |
---|---|---|---|---|---|
1 | 105 | 200 | NaN | NaN | 50 |
2 | 300 | 2 | 1 | 1 | 33 |
3 | 20 | 1 | NaN | NaN | NaN |
So I want to create a new column that contains the count of the number columns that have a value. The final dataset would look like this:
id | something | number1 | number2 | number3 | number4 | sum_columns |
---|---|---|---|---|---|---|
1 | 105 | 200 | NaN | NaN | 50 | 2 |
2 | 300 | 2 | 1 | 1 | 33 | 4 |
3 | 20 | 1 | NaN | NaN | NaN | 1 |
I know I can select the columns that start with a specific name, something like this:
df[df.columns[pd.Series(df.columns).str.startswith('number')]]
but I can't figure out how to add the condition that the value has to be something other than NaN, or how to apply it to every row. I think it could be done with a lambda, but I haven't succeeded yet.
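Something along these lines is what I was trying with apply and a lambda (just a rough sketch; the name number_cols is mine):
# rough sketch: for each row, count the non-NaN values among the 'number' columns
number_cols = [c for c in df.columns if c.startswith('number')]
df['sum_columns'] = df.apply(lambda row: row[number_cols].notna().sum(), axis=1)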
CodePudding user response:
# filter the columns whose label contains 'number' and count the non-NaN values per row
df['sum_columns'] = df.filter(like='number').count(axis=1)
df
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
PS: The NaN counts in your first and second DFs don't match, so I used the second DF in the solution.
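For a self-contained check, here is a minimal sketch (my own reconstruction of the sample data from the question) showing the one-liner on a fresh DataFrame; note that filter(like='number') matches any label containing 'number', while filter(regex='^number') would enforce a strict prefix:
import numpy as np
import pandas as pd

# sample data as shown in the question
df = pd.DataFrame({'id': [1, 2, 3],
                   'something': [105, 300, 20],
                   'number1': [200, 2, 1],
                   'number2': [np.nan, 1, np.nan],
                   'number3': [np.nan, 1, np.nan],
                   'number4': [50, 33, np.nan]})

# count() skips NaN by default, so this gives the per-row count of filled 'number' columns
df['sum_columns'] = df.filter(like='number').count(axis=1)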
CodePudding user response:
Indeed, df[df.columns[df.columns.str.startswith('number')]] gives you the DataFrame restricted to the columns starting with 'number'. Now we only need to count, per row, the values that are not NaN. This can be done like so:
df['sum_columns'] = df[df.columns[df.columns.str.startswith('number')]].notnull().sum(axis=1)
Output:
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
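As a side note (my own variant, not part of the answer above), the same boolean mask works directly with .loc, which reads a bit shorter:
# equivalent selection via a boolean column mask, then a per-row non-NaN count
mask = df.columns.str.startswith('number')
df['sum_columns'] = df.loc[:, mask].notna().sum(axis=1)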
CodePudding user response:
import pandas as pd
import numpy as np

data = {'something': [105, 300, 20],
        'number1': [200, 2, 1],
        'number2': [np.nan, 1, np.nan],
        'number3': [np.nan, 1, np.nan],
        'number4': [50, 33, np.nan]}
df = pd.DataFrame(data)

# select the columns whose name starts with 'number', then count non-NaN values per row
tmp = df[df.columns[pd.Series(df.columns).str.startswith('number')]]
df['sum_columns'] = tmp.notnull().sum(axis=1)
df
Output:
something number1 number2 number3 number4 sum_columns
0 105 200 NaN NaN 50.0 2
1 300 2 1.0 1.0 33.0 4
2 20 1 NaN NaN NaN 1
CodePudding user response:
One can use pandas.DataFrame.iloc to select the desired columns by their position, and .count(axis=1) to count the non-NaN values per row, as follows:
df['sum_columns'] = df.iloc[:, 2:].count(axis=1)
[Out]:
id something number1 number2 number3 number4 sum_columns
0 1 105 200 NaN NaN 50.0 2
1 2 300 2 1.0 1.0 33.0 4
2 3 20 1 NaN NaN NaN 1
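One caveat worth adding (my note, not the answer's): iloc is purely positional, so 2: only works while every column from position 2 onwards is a 'number' column. If that layout might change, the start position can be looked up by name, for example:
# look up the position of the first 'number' column instead of hard-coding 2;
# this still assumes the 'number' columns are contiguous, run to the end of the frame,
# and that 'sum_columns' has not been added yet
start = df.columns.get_loc('number1')
df['sum_columns'] = df.iloc[:, start:].count(axis=1)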