How to get only the initial NaN values and leading non NaN values from a pandas dataframe?


I have a dataframe whose rows contain NaN values. The df has the original columns Heading 1, Heading 2, and Heading 3, plus extra columns called Unnamed: 1, Unnamed: 2, and Unnamed: 3, as shown:

Heading 1  Heading 2  Heading 3  Unnamed: 1  Unnamed: 2  Unnamed: 3
      NaN         34         24          45         NaN         NaN
      NaN        NaN         24          45          11         NaN
      NaN        NaN        NaN          45          45          33
        4        NaN         24         NaN         NaN         NaN
      NaN        NaN          4         NaN         NaN         NaN
      NaN         34         24         NaN         NaN         NaN
       22         34         24         NaN         NaN         NaN
      NaN         34        NaN          45         NaN         NaN
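For anyone who wants to reproduce this, the sample dataframe can be built like so (values transcribed from the table above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Heading 1':  [np.nan, np.nan, np.nan,      4, np.nan, np.nan,     22, np.nan],
    'Heading 2':  [    34, np.nan, np.nan, np.nan, np.nan,     34,     34,     34],
    'Heading 3':  [    24,     24, np.nan,     24,      4,     24,     24, np.nan],
    'Unnamed: 1': [    45,     45,     45, np.nan, np.nan, np.nan, np.nan,     45],
    'Unnamed: 2': [np.nan,     11,     45, np.nan, np.nan, np.nan, np.nan, np.nan],
    'Unnamed: 3': [np.nan, np.nan,     33, np.nan, np.nan, np.nan, np.nan, np.nan],
})
```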

I want to iterate through each row and find the number of initial NaN values in the original columns (Heading 1, Heading 2, and Heading 3) and the number of non-NaN values in the extra columns (Unnamed: 1, Unnamed: 2, and Unnamed: 3). This should be computed for every row and returned in a dictionary whose keys are the row indices and whose values are two-element lists: the count of initial NaNs in the original columns, followed by the count of non-NaN values in the extra columns.

So the result for the above dataframe would be:

{0 : [1, 1], 
1 : [2, 2], 
2 : [3, 3], 
3 : [0, 0], 
4 : [2, 0], 
5 : [1, 0],
6 : [0, 0],
7 : [1, 1]}

Notice how in rows 3 and 7 the original columns contain 1 and 2 NaN values respectively, but only the initial NaNs are counted, not the in-between ones!

Thank you!

CodePudding user response:

You can use:

m = df.columns.str.startswith('Unnamed')  # True for the extra 'Unnamed' columns

out = (df
    .groupby(m, axis=1)  # two column groups: named (False) and unnamed (True)
    .apply(lambda g: (g.notna() if g.name else g.isna())  # values to count per group
                     .cummin(axis=1)                      # True only for the leading run
                     .sum(axis=1))
    .set_axis(['named', 'unnamed'], axis=1)
)
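The key step is `cummin` on a boolean frame: the cumulative minimum stays `True` only until the first `False`, so summing it counts just the leading run. A minimal illustration on a single row (row 7 of the sample):

```python
import numpy as np
import pandas as pd

row = pd.Series([np.nan, 34.0, np.nan])

mask = row.isna()           # [True, False, True]
leading = mask.cummin()     # [True, False, False]: True only for the initial run
count = int(leading.sum())  # 1 initial NaN; the later NaN is not counted
```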

Output:

   named  unnamed
0      1        1
1      2        2
2      3        3
3      0        0
4      2        0
5      1        0
6      0        0
7      1        1

As a dictionary:

out.T.to_dict('list')

Output:

{0: [1, 1],
 1: [2, 2],
 2: [3, 3],
 3: [0, 0],
 4: [2, 0],
 5: [1, 0],
 6: [0, 0],
 7: [1, 1]}
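One caveat: `DataFrame.groupby(..., axis=1)` is deprecated in recent pandas releases, so the code above may warn or break on newer versions. A sketch of the same cummin idea that avoids it by splitting the columns on the boolean mask instead (shown here on a small made-up frame of the same shape):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Heading 1':  [np.nan, 4.0],
    'Heading 2':  [34.0, np.nan],
    'Unnamed: 1': [45.0, np.nan],
    'Unnamed: 2': [np.nan, np.nan],
})

m = df.columns.str.startswith('Unnamed')
out = pd.DataFrame({
    'named':   df.loc[:, ~m].isna().cummin(axis=1).sum(axis=1),   # leading NaNs
    'unnamed': df.loc[:, m].notna().cummin(axis=1).sum(axis=1),   # leading non-NaNs
})
```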

CodePudding user response:

First

split the dataframe into the original and the extra columns (with iloc, filter, and so on):

df1 = df.iloc[:, :3]  # Heading 1, Heading 2, Heading 3
df2 = df.iloc[:, 3:]  # Unnamed: 1, Unnamed: 2, Unnamed: 3

Second

count the initial NaNs in df1 and the non-null values in df2:

s1 = df1.apply(lambda x: (x.notnull().cumsum() == 0).sum(), axis=1)  # leading NaNs per row
s2 = df2.notnull().sum(axis=1)                                       # non-NaN count per row

Last

concatenate the two Series and convert to a dictionary:

pd.concat([s1, s2], axis=1).T.to_dict('list')

Result:

{0: [1, 1],
 1: [2, 2],
 2: [3, 3],
 3: [0, 0],
 4: [2, 0],
 5: [1, 0],
 6: [0, 0],
 7: [1, 1]}
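End to end, the steps above can be run as a self-contained script (sample data transcribed from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Heading 1':  [np.nan, np.nan, np.nan,      4, np.nan, np.nan,     22, np.nan],
    'Heading 2':  [    34, np.nan, np.nan, np.nan, np.nan,     34,     34,     34],
    'Heading 3':  [    24,     24, np.nan,     24,      4,     24,     24, np.nan],
    'Unnamed: 1': [    45,     45,     45, np.nan, np.nan, np.nan, np.nan,     45],
    'Unnamed: 2': [np.nan,     11,     45, np.nan, np.nan, np.nan, np.nan, np.nan],
    'Unnamed: 3': [np.nan, np.nan,     33, np.nan, np.nan, np.nan, np.nan, np.nan],
})

df1 = df.iloc[:, :3]  # original columns
df2 = df.iloc[:, 3:]  # extra 'Unnamed' columns

s1 = df1.apply(lambda x: (x.notnull().cumsum() == 0).sum(), axis=1)  # leading NaNs
s2 = df2.notnull().sum(axis=1)                                       # non-NaN counts

result = pd.concat([s1, s2], axis=1).T.to_dict('list')
```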