Get the amount of leading NaN and trailing non NaN values in pandas dataframe-CodePudding

I have a dataframe where the rows contain NaN values. The df contains original columns namely Heading 1 and Heading 2 and extra columns called Unnamed: 1 and Unnamed: 2 as shown:

Heading 1	Heading 2	Heading 3	Unnamed: 1	Unnamed: 2	Unnamed: 3
NaN	34	24	45	NaN	NaN
NaN	NaN	24	45	11	NaN
NaN	NaN	NaN	45	45	33
4	NaN	24	NaN	NaN	NaN
NaN	NaN	4	NaN	NaN	NaN
NaN	34	24	NaN	NaN	NaN
22	34	24	NaN	NaN	NaN
NaN	34	NaN	45	NaN	NaN

I want to iterate through each row and find out the amount of leading NaN values in original columns (Heading 1 and Heading 2) and the amount of non NaN values in the extra columns (Unnamed: 1 and Unnamed: 2). For each and every row this should be calculated and returned in a dictionary where the key is the index of the row and the value for that key is a list containing the amount of leading NaN values in original columns (Heading 1 and Heading 2) and the second element of the list would the amount of non NaN values in the extra columns (Unnamed: 1 and Unnamed: 2).

So the result for the above dataframe would be:

{0 : [1, 1], 
1 : [2, 2], 
2 : [3, 3], 
3 : [0, 0], 
4 : [2, 0], 
5 : [1, 0],
6 : [0, 0],
7 : [1, 1]}

Notice how in row 3 and row 7 the original columns contain 1 and 2 NaN respectively but only the leading NaN's are counted and not the in between ones!

Thank you!

CodePudding user response：

To iterate through each row in a DataFrame and count the number of NaN values in the original columns and the number of non-NaN values in the extra columns, you can do the following:

import pandas as pd

# Define the dataframe
df = pd.DataFrame(
    {
        "Heading 1": [np.nan, np.nan, 5, 5, np.nan, np.nan],
        "Heading 2": [34, np.nan, np.nan, 7, np.nan, np.nan],
        "Unnamed: 1": [24, 44, np.nan, np.nan, 13, np.nan],
        "Unnamed: 2": [np.nan, np.nan, np.nan, np.nan, 77, 18]
    }
)

# Define the original columns and the extra columns
original_cols = ["Heading 1", "Heading 2"]
extra_cols = ["Unnamed: 1", "Unnamed: 2"]

# Create a dictionary to store the counts
counts = {}

# Iterate through each row in the DataFrame
for index, row in df.iterrows():
    # Count the number of NaN values in the original columns
    original_nan_count = sum(row[col].isna() for col in original_cols)
    
    # Count the number of non-NaN values in the extra columns
    extra_non_nan_count = sum(not row[col].isna() for col in extra_cols)
    
    # Add the counts to the dictionary
    counts[index] = [original_nan_count, extra_non_nan_count]

# Print the dictionary of counts
print(counts)

This will iterate through each row in the DataFrame, count the number of NaN values in the original columns and the number of non-NaN values in the extra columns, and store the counts in a dictionary where the keys are the row indexes and the values are lists containing the counts. The resulting dictionary will look like this:

{0: [1, 1],
 1: [2, 1],
 2: [1, 0],
 3: [0, 0],
 4: [2, 2],
 5: [2, 1]}

CodePudding user response：

As an alternative:

df['Count'] = df[['Heading 1', 'Heading 2']].apply(lambda x: sum(x.isnull()), axis=1)
df['Count2'] = df[['Unnamed: 1', 'Unnamed: 2']].apply(lambda x: sum(x.notnull()), axis=1)
df['total']=df[['Count','Count2']].values.tolist()

output=dict(zip(df.index, df.total))
'''
{0: [1, 1], 1: [2, 1], 2: [1, 0], 3: [0, 0], 4: [2, 2], 5: [2, 1]}
'''

mask=list(map(list, zip(df[['Heading 1', 'Heading 2']].isnull().sum(axis=1), df[['Unnamed: 1', 'Unnamed: 2']].notnull().sum(axis=1))))
output=dict(zip(df.index,mask))
#{0: [1, 1], 1: [2, 1], 2: [1, 0], 3: [0, 0], 4: [2, 2], 5: [2, 1]}

CodePudding user response：

The .isna() (in Cyzanfar's answer) raises an exception for me:

AttributeError: 'numpy.float64' object has no attribute 'isna'

You could instead try the following:

counts = {}

for index, row in df.iterrows():
# Count the number of NaN values in the original columns
num_nan_orig = np.sum(np.isnan(row[['Heading 1', 'Heading 2']]))

# Count the number of non-NaN values in the extra columns
num_non_nan_extra = np.sum(~np.isnan(row[['Unnamed: 1', 'Unnamed: 2']]))

counts[index] = [num_nan_orig, num_non_nan_extra]
print(counts)

Outputs the following:

# {0: [1, 1], 1: [2, 1], 2: [1, 0], 3: [0, 0], 4: [2, 2], 5: [2, 1]}

^{The ~ operator (third last line in the code) is the bitwise negation operator in Python, which inverts the boolean value of its operand. In this case, it is inverts the boolean values produced by np.isnan() method, so that the non-NaN values can be counted.}

Heading 1	Heading 2	Heading 3	Unnamed: 1	Unnamed: 2	Unnamed: 3
NaN	34	24	45	NaN	NaN
NaN	NaN	24	45	11	NaN
NaN	NaN	NaN	45	45	33
4	NaN	24	NaN	NaN	NaN
NaN	NaN	4	NaN	NaN	NaN
NaN	34	24	NaN	NaN	NaN
22	34	24	NaN	NaN	NaN
NaN	34	NaN	45	NaN	NaN

Heading 1	Heading 2	Heading 3	Unnamed: 1	Unnamed: 2	Unnamed: 3
NaN	34	24	45	NaN	NaN
NaN	NaN	24	45	11	NaN
NaN	NaN	NaN	45	45	33
4	NaN	24	NaN	NaN	NaN
NaN	NaN	4	NaN	NaN	NaN
NaN	34	24	NaN	NaN	NaN
22	34	24	NaN	NaN	NaN
NaN	34	NaN	45	NaN	NaN

Heading 1	Heading 2	Heading 3	Unnamed: 1	Unnamed: 2	Unnamed: 3
NaN	34	24	45	NaN	NaN
NaN	NaN	24	45	11	NaN
NaN	NaN	NaN	45	45	33
4	NaN	24	NaN	NaN	NaN
NaN	NaN	4	NaN	NaN	NaN
NaN	34	24	NaN	NaN	NaN
22	34	24	NaN	NaN	NaN
NaN	34	NaN	45	NaN	NaN