I have a dataframe whose rows contain NaN values. The df contains the original columns, namely Heading 1, Heading 2, and Heading 3, plus extra columns called Unnamed: 1, Unnamed: 2, and Unnamed: 3, as shown:
Heading 1 | Heading 2 | Heading 3 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 |
---|---|---|---|---|---|
NaN | 34 | 24 | 45 | NaN | NaN |
NaN | NaN | 24 | 45 | 11 | NaN |
NaN | NaN | NaN | 45 | 45 | 33 |
4 | NaN | 24 | NaN | NaN | NaN |
NaN | NaN | 4 | NaN | NaN | NaN |
NaN | 34 | 24 | NaN | NaN | NaN |
22 | 34 | 24 | NaN | NaN | NaN |
NaN | 34 | NaN | 45 | NaN | NaN |
I want to iterate through each row and find the number of initial NaN values in the original columns (Heading 1, Heading 2, and Heading 3) and the number of non-NaN values in the extra columns (Unnamed: 1, Unnamed: 2, and Unnamed: 3). This should be computed for each row and returned in a dictionary where the key is the row index and the value is a two-element list: the first element is the count of initial NaN values in the original columns, and the second is the count of non-NaN values in the extra columns.
So the result for the above dataframe would be:
{0 : [1, 1],
1 : [2, 2],
2 : [3, 3],
3 : [0, 0],
4 : [2, 0],
5 : [1, 0],
6 : [0, 0],
7 : [1, 1]}
Notice how in row 3 and row 7 the original columns contain 1 and 2 NaNs respectively, but only the initial NaNs are counted, not the in-between ones!
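For reference, the example dataframe can be reconstructed like this (a minimal sketch; the values are copied from the table above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Heading 1':  [np.nan, np.nan, np.nan, 4, np.nan, np.nan, 22, np.nan],
    'Heading 2':  [34, np.nan, np.nan, np.nan, np.nan, 34, 34, 34],
    'Heading 3':  [24, 24, np.nan, 24, 4, 24, 24, np.nan],
    'Unnamed: 1': [45, 45, 45, np.nan, np.nan, np.nan, np.nan, 45],
    'Unnamed: 2': [np.nan, 11, 45, np.nan, np.nan, np.nan, np.nan, np.nan],
    'Unnamed: 3': [np.nan, np.nan, 33, np.nan, np.nan, np.nan, np.nan, np.nan],
})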
Thank you!
CodePudding user response:
You can use:
m = df.columns.str.startswith('Unnamed')  # True for the extra "Unnamed" columns

out = (df
       .groupby(m, axis=1)  # two column groups: original (False) and unnamed (True)
       .apply(lambda g: (g.notna() if g.name else g.isna())  # non-NaN mask for unnamed, NaN mask for original
              .cummin(axis=1).sum(axis=1))  # cummin keeps only the leading run; sum counts it
       .set_axis(['named', 'unnamed'], axis=1)
       )
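The key trick is cummin on a boolean mask: it keeps True only for the leading run, so summing counts the initial NaNs in the named group and the initial non-NaNs in the unnamed group. A tiny demonstration of the idea:

import pandas as pd

s = pd.Series([True, True, False, True])
print(s.cummin().sum())  # 2 -- only the leading run of True survives cummin

For the unnamed group this counts leading non-NaN values; in the sample data every non-NaN value in the extra columns sits at the start of its row, so this matches the plain non-NaN count the question asks for. Also note that the axis=1 argument to groupby is deprecated in recent pandas releases (2.1+), so this snippet targets older versions.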
Output:
   named  unnamed
0      1        1
1      2        2
2      3        3
3      0        0
4      2        0
5      1        0
6      0        0
7      1        1
As a dictionary:
out.T.to_dict('list')
Output:
{0: [1, 1],
1: [2, 2],
2: [3, 3],
3: [0, 0],
4: [2, 0],
5: [1, 0],
6: [0, 0],
7: [1, 1]}
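The transpose matters here: to_dict('list') maps each column label to a list of that column's values, so out is transposed first to turn the row indices into keys. For illustration, an explicit equivalent (a sketch, not part of the answer above):

result = {idx: [int(row['named']), int(row['unnamed'])]
          for idx, row in out.iterrows()}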
CodePudding user response:
First
split the dataframe into the original and extra columns (using iloc, filter, and so on):
df1 = df.iloc[:, :3]  # original columns: Heading 1..3
df2 = df.iloc[:, 3:]  # extra columns: Unnamed: 1..3
Second
count the initial NaNs in df1 and the non-null values in df2:
s1 = df1.apply(lambda x: (x.notnull().cumsum() == 0).sum(), axis=1)  # leading NaNs per row
s2 = df2.notnull().sum(axis=1)  # non-NaN count per row
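The cumsum trick works because the cumulative sum of the non-null mask stays at 0 until the first non-null value appears, so comparing against 0 flags exactly the leading NaNs. A minimal sketch for a single row:

import numpy as np
import pandas as pd

row = pd.Series([np.nan, np.nan, 4])
print((row.notnull().cumsum() == 0).sum())  # 2 -- two leading NaNs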
Last
concat the two series and convert to a dictionary:
pd.concat([s1, s2], axis=1).T.to_dict('list')
Result:
{0: [1, 1],
1: [2, 2],
2: [3, 3],
3: [0, 0],
4: [2, 0],
5: [1, 0],
6: [0, 0],
7: [1, 1]}
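Note that pd.concat([s1, s2], axis=1) produces the integer column labels 0 and 1, and the list order in the dictionary follows that column order, so the initial-NaN count always comes first. If you want labeled intermediate output, the keys argument of concat names the columns (a cosmetic tweak, not part of the original answer):

summary = pd.concat([s1, s2], axis=1, keys=['initial_nan', 'extra_notna'])
summary.T.to_dict('list')  # same dictionary as above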