How to iterate a pandas column and update contents

I have a dataset from here:

I am trying to iterate through the gender (sex) column and replace the unknown value (?) with some sensible value. The value will be a calculated value of either "M" or "F" - depending upon some other algorithm that is not important to the question.

I am new to Pandas and for some reason this is proving more difficult than I ever could imagine.

What is the best way to iterate over the column and test each value for NaN?

Because there are many unknown values, I first replaced ? with np.nan:

# Replace with NaN so that many of the pandas functions will work.
ht_df = ht_df.replace('?', np.nan)

This let me update all the numeric missing values very nicely with the mean value (not important to this question except to explain why I replaced everything with NaN):

# Replace the NaNs in each numeric column with that column's own mean
ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
ht_df["T3"] = ht_df["T3"].fillna(mean["T3"])
ht_df["TT4"] = ht_df["TT4"].fillna(mean["TT4"])
ht_df["FTI"] = ht_df["FTI"].fillna(mean["FTI"])

But now I am left with iterating down the "sex" column to replace its missing values, and I cannot find a clean way to do it.

I used the following code to help me understand what is going on. I have only included a sample of the output.

for item in ht_df["sex"]:
   print(f"{item} {type(item)}")

Output:

F <class 'str'>
F <class 'str'>
... <snip> ...
F <class 'str'>
F <class 'str'>
M <class 'str'>
F <class 'str'>
nan <class 'float'>
F <class 'str'>

The nan is a float, which makes sense. But I am unable to test for it like this:

for item in ht_df["sex"]:
   if item == np.nan:
      print(f"{item} is NaN\n")
   print(f"{item} {type(item)}")

The if condition is never triggered.

How can I test the value for NaN as I iterate over it and then update that cell with a new value?

A full test code is here:

import pandas as pd
import numpy as np
import ssl


def main():
    ssl._create_default_https_context = ssl._create_unverified_context

    url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv'
    fullht_df = pd.read_csv(url)

    print(fullht_df.head(n=100))

    # Get the first 500 rows from the dataset and use that for the rest of the assignment.
    ht_df = fullht_df.head(n=500)

    # Display the dataset's dimension
    print(f"Working dataset dimension is: {ht_df.shape}\n")

    # Cells with missing data have a '?' in them.
    # First replace ? with np.nan so we can utilise some other nice pandas DataFrame methods. We can use a global replace because, upon dataset inspection, the unknown ('?') only exists in the numeric columns.
    # Convert the value columns from text to numeric.
    # Calculate the median value for the numeric-data columns
    # Replace the NaN values with a reasonable value. For this exercise we have chosen the mean for the column
    # Recalculate the median value for the numeric-data columns

    # Prepare the data so it is calculable
    ht_df = ht_df.replace('?', np.nan)                                                        # Replace with NaN so many of the pandas functions will work.
    ht_df[["TSH","T3","TT4","FTI"]] = ht_df[["TSH","T3","TT4","FTI"]].apply(pd.to_numeric)    # CSV loads as text. Convert the cells to numeric

    # Calculate the Mean and Median prior to replacing missing values
    mean = ht_df[["TSH","T3","TT4","FTI"]].mean(skipna=True)
    median = ht_df[["TSH","T3","TT4","FTI"]].median(skipna=True)

    # Replace the NaNs in each numeric column with that column's own mean
    ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
    ht_df["T3"] = ht_df["T3"].fillna(mean["T3"])
    ht_df["TT4"] = ht_df["TT4"].fillna(mean["TT4"])
    ht_df["FTI"] = ht_df["FTI"].fillna(mean["FTI"])

    # Replace the M/F missing values with the most frequently occurring gender provided "pregnant" is false. Otherwise set the value to F.
    print("@@@@@@@@@@@@@@")
    for item in ht_df["sex"]:
        if item == np.nan:
            print(f"{item} is NaN\n")
        print(f"{item} {type(item)}")
    print("@@@@@@@@@@@@@@")

if __name__ == "__main__":
    main()

CodePudding user response：

I'm not sure why you want to use iteration to print out each item. But if all you want is a printout of which rows have np.nan in the 'sex' column, then:

print(ht_df["sex"].isna())

will show True for every row where 'sex' is np.nan and False for every other row.

If you want to just see the dataframe elements with those rows, you can try something like:

print(ht_df.loc[ht_df["sex"].isna(), ["sex", "pregnant", "TSH"]])

which prints only the rows where the sex column is np.nan, restricted to those three columns (which I picked arbitrarily; you can use any list of columns you like).

Lastly, if you want to create a formula to guess whether an np.nan row should be M or F, I'd usually create a "sex_predict" column, fill that using whatever algorithm you are using, and then use fillna:

ht_df["sex"] = ht_df["sex"].fillna(ht_df["sex_predict"])
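As a rough sketch of how the "sex_predict" column could be filled without a loop, the following applies the rule described in the question (most frequent gender unless "pregnant" is true, in which case F). The toy data and the 't'/'f' coding of the "pregnant" column are assumptions made for illustration, not taken from the real dataset:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; the 't'/'f' coding of
# "pregnant" is an assumption.
df = pd.DataFrame({
    "sex":      ["F", "M", np.nan, "M", np.nan, "M"],
    "pregnant": ["f", "f", "t",    "f", "f",    "f"],
})

# Most frequent known value in the column (mode() ignores NaN by default).
most_common = df["sex"].mode().iloc[0]

# Illustrative rule: "F" when pregnant, otherwise the most common value.
df["sex_predict"] = np.where(df["pregnant"] == "t", "F", most_common)

# Only the missing cells take the predicted value; known values are kept.
df["sex"] = df["sex"].fillna(df["sex_predict"])
df = df.drop(columns="sex_predict")

print(df)
```

Because fillna only touches the missing cells, the prediction can be computed for every row without overwriting values that were already known.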

CodePudding user response：

You can't check item == np.NaN: NaN never compares equal to anything, not even to itself, so the condition is always False. Use pd.isna(item) instead:

for item in ht_df["sex"]:
    if pd.isna(item):
        print(f"{item} is NaN\n")
    print(f"{item} {type(item)}")

Output:

...
M <class 'str'>
F <class 'str'>
nan is NaN

nan <class 'float'>
F <class 'str'>
...
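To make the comparison behaviour concrete, here is a small sketch on a made-up Series (not the real dataset). It shows that NaN is unequal even to itself, and that pd.isna works both on single values and on a whole column:

```python
import numpy as np
import pandas as pd

nan = float("nan")

# NaN is never equal to anything, including itself (IEEE 754 semantics).
eq_result = (nan == nan)
ne_result = (nan != nan)
print(eq_result, ne_result)

# pd.isna works on scalars of any type...
scalar_checks = [pd.isna(nan), pd.isna("F"), pd.isna(None)]
print(scalar_checks)

# ...and on a whole Series at once, returning a boolean mask.
sex = pd.Series(["F", "M", np.nan, "F"])   # toy data
mask = sex.isna()
missing_positions = mask[mask].index.tolist()
print(missing_positions)
```

The boolean mask is usually the more useful form, since it can be passed straight to .loc to select or update the missing rows.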

CodePudding user response：

Thanks, everyone (especially @Tom and @Corralien, as they were both correct), for the answers. They were all very illuminating as to how I should be working with data and pandas. I combined the two into my solution below:

tmp_col = "sex-predict"
ht_df[tmp_col] = ht_df["sex"]
for (index, row_series) in ht_df.iterrows():
    if pd.isna(row_series["sex"]):
        ht_df.at[index, tmp_col] = calc_gender()   # Calculate the value for the missing value.

# Copy over any NaN values in the sex column using the value from the temporary column
ht_df["sex"] = ht_df["sex"].fillna(ht_df[tmp_col])
ht_df = ht_df.drop([tmp_col], axis=1)       # Drop the temporary column
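For comparison, the same idea can probably be written without iterrows or a temporary column by selecting the missing rows with a boolean mask. This is only a sketch: the calc_gender below is a hypothetical stand-in that takes the row as an argument (the real function is not shown in this thread), and the toy data and 't'/'f' coding are assumptions:

```python
import numpy as np
import pandas as pd

def calc_gender(row):
    # Hypothetical stand-in for the real algorithm, which is not shown here.
    return "F" if row["pregnant"] == "t" else "M"

# Toy data; the 't'/'f' coding of "pregnant" is an assumption.
df = pd.DataFrame({
    "sex":      ["F", np.nan, "M", np.nan],
    "pregnant": ["f", "t",    "f", "f"],
})

# Boolean mask of the rows whose "sex" value is missing.
missing = df["sex"].isna()

# Call calc_gender only for those rows and write the results back in place.
df.loc[missing, "sex"] = df.loc[missing].apply(calc_gender, axis=1)

print(df)
```

The assignment aligns on the index, so each computed value lands in the row it was calculated from.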