I have a dataframe whose rows contain NaN values. The df contains the original columns, namely Heading 1, Heading 2, and Heading 3, plus extra columns called Unnamed: 1, Unnamed: 2, and Unnamed: 3, as shown:
Heading 1 | Heading 2 | Heading 3 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 |
---|---|---|---|---|---|
NaN | 34 | 24 | 45 | NaN | NaN |
NaN | NaN | 24 | 45 | 11 | NaN |
NaN | NaN | NaN | 45 | 45 | 33 |
4 | NaN | 24 | NaN | NaN | NaN |
NaN | NaN | 4 | NaN | NaN | NaN |
NaN | 34 | 24 | NaN | NaN | NaN |
22 | 34 | 24 | NaN | NaN | NaN |
NaN | 34 | NaN | 45 | NaN | NaN |
I want to iterate through each row and find the number of initial NaN values in the original columns (Heading 1, Heading 2, and Heading 3) and the number of non-NaN values in the extra columns (Unnamed: 1, Unnamed: 2, and Unnamed: 3). This should be computed for each row and returned in a dictionary where the key is the row index and the value is a two-element list: the first element is the count of initial NaN values in the original columns, and the second is the count of non-NaN values in the extra columns.
So the result for the above dataframe would be:
{0 : [1, 1],
1 : [2, 2],
2 : [3, 3],
3 : [0, 0],
4 : [2, 0],
5 : [1, 0],
6 : [0, 0],
7 : [1, 1]}
Notice how in row 3 and row 7 the original columns contain 1 and 2 NaNs respectively, but only the initial NaNs are counted, not the in-between ones!
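For reference, the example dataframe can be reconstructed like this (a minimal sketch; the values are copied from the table above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Heading 1':  [np.nan, np.nan, np.nan, 4, np.nan, np.nan, 22, np.nan],
    'Heading 2':  [34, np.nan, np.nan, np.nan, np.nan, 34, 34, 34],
    'Heading 3':  [24, 24, np.nan, 24, 4, 24, 24, np.nan],
    'Unnamed: 1': [45, 45, 45, np.nan, np.nan, np.nan, np.nan, 45],
    'Unnamed: 2': [np.nan, 11, 45, np.nan, np.nan, np.nan, np.nan, np.nan],
    'Unnamed: 3': [np.nan, np.nan, 33, np.nan, np.nan, np.nan, np.nan, np.nan],
})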
Thank you!
CodePudding user response:
You can use:
m = df.columns.str.startswith('Unnamed')  # True for the extra "Unnamed" columns

out = (df
       .groupby(m, axis=1)  # two column groups: original (False) and unnamed (True)
       .apply(lambda g: (g.notna() if g.name else g.isna())  # non-NaN mask for unnamed, NaN mask for original
              .cummin(axis=1).sum(axis=1))  # cummin keeps only the leading run; sum counts it
       .set_axis(['named', 'unnamed'], axis=1)
       )
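The key trick is cummin on a boolean mask: it keeps True only for the leading run, so summing counts the initial NaNs in the named group and the initial non-NaNs in the unnamed group. A tiny demonstration of the idea:

import pandas as pd

s = pd.Series([True, True, False, True])
print(s.cummin().sum())  # 2 -- only the leading run of True survives cummin

For the unnamed group this counts leading non-NaN values; in the sample data every non-NaN value in the extra columns sits at the start of its row, so this matches the plain non-NaN count the question asks for. Also note that the axis=1 argument to groupby is deprecated in recent pandas releases (2.1+), so this snippet targets older versions.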
Output:
   named  unnamed
0      1        1
1      2        2
2      3        3
3      0        0
4      2        0
5      1        0
6      0        0
7      1        1
As a dictionary:
out.T.to_dict('list')
Output:
{0: [1, 1],
1: [2, 2],
2: [3, 3],
3: [0, 0],
4: [2, 0],
5: [1, 0],
6: [0, 0],
7: [1, 1]}
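The transpose matters here: to_dict('list') maps each column label to a list of that column's values, so out is transposed first to turn the row indices into keys. For illustration, an explicit equivalent (a sketch, not part of the answer above):

result = {idx: [int(row['named']), int(row['unnamed'])]
          for idx, row in out.iterrows()}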
CodePudding user response:
First
split the dataframe into the original and extra columns (using iloc, filter, and so on):
df1 = df.iloc[:, :3]  # original columns: Heading 1..3
df2 = df.iloc[:, 3:]  # extra columns: Unnamed: 1..3
Second
count the initial NaNs in df1 and the non-null values in df2:
s1 = df1.apply(lambda x: (x.notnull().cumsum() == 0).sum(), axis=1)  # leading NaNs per row
s2 = df2.notnull().sum(axis=1)  # non-NaN count per row
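The cumsum trick works because the cumulative sum of the non-null mask stays at 0 until the first non-null value appears, so comparing against 0 flags exactly the leading NaNs. A minimal sketch for a single row:

import numpy as np
import pandas as pd

row = pd.Series([np.nan, np.nan, 4])
print((row.notnull().cumsum() == 0).sum())  # 2 -- two leading NaNs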
Last
concat the two series and convert to a dictionary:
pd.concat([s1, s2], axis=1).T.to_dict('list')
Result:
{0: [1, 1],
1: [2, 2],
2: [3, 3],
3: [0, 0],
4: [2, 0],
5: [1, 0],
6: [0, 0],
7: [1, 1]}
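Note that pd.concat([s1, s2], axis=1) produces the integer column labels 0 and 1, and the list order in the dictionary follows that column order, so the initial-NaN count always comes first. If you want labeled intermediate output, the keys argument of concat names the columns (a cosmetic tweak, not part of the original answer):

summary = pd.concat([s1, s2], axis=1, keys=['initial_nan', 'extra_notna'])
summary.T.to_dict('list')  # same dictionary as above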