I am trying to iterate through the gender (sex) column and replace the unknown value (?) with some sensible value. The value will be a calculated value of either "M" or "F" - depending upon some other algorithm that is not important to the question.
I am new to Pandas and for some reason this is proving more difficult than I ever could imagine.
What is the best way to iterate over the column series and test each value for NaN?
Because there are many unknown values, I first replaced ? with np.NaN:
# Replace with NaN so many of the Pandas functions will work.
ht_df = ht_df.replace('?', np.NaN)
This let me update all the numeric missing values very nicely with the mean value (not important to this question except to explain why I replaced everything with NaN):
# Replace the NaN's of the numeric columns with the mean
ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
ht_df["T3"] = ht_df["T3"].fillna(mean["T3"])
ht_df["TT4"] = ht_df["TT4"].fillna(mean["TT4"])
ht_df["FTI"] = ht_df["FTI"].fillna(mean["FTI"])
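For what it's worth, the per-column fills can probably be collapsed into one call, since DataFrame.fillna accepts a Series of per-column values. A minimal sketch on made-up data (the toy frame, its numbers, and the name cols are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ht_df; the numbers are made up for illustration.
df = pd.DataFrame({
    "TSH": [1.0, np.nan, 3.0],
    "T3": [2.0, 4.0, np.nan],
})
cols = ["TSH", "T3"]

# mean() skips NaN by default and returns one value per column.
# fillna() with a Series matches those values to columns by name.
df[cols] = df[cols].fillna(df[cols].mean())
print(df)
```

Each column is filled from its own mean, which avoids repeating the column name on both sides of every assignment.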
But now I am left with iterating down the "sex" column to replace its missing values, and I cannot iterate over it nicely.
I used the following code to help me understand what is going on. I have only included a sample of the output.
for item in ht_df["sex"]:
    print(f"{item} {type(item)}")
Output:
F <class 'str'>
F <class 'str'>
... <snip> ...
F <class 'str'>
F <class 'str'>
M <class 'str'>
F <class 'str'>
nan <class 'float'>
F <class 'str'>
The nan is a float, which makes sense. But I am unable to test for it like this:
for item in ht_df["sex"]:
    if item == np.NaN:
        print(f"{item} is NaN\n")
    print(f"{item} {type(item)}")
The if condition is never triggered.
How can I test the value for NaN as I iterate over it and then update that cell with a new value?
A full test code is here:
import pandas as pd
import numpy as np
import ssl


def main():
    # Skip certificate verification so the CSV download does not fail on SSL errors.
    ssl._create_default_https_context = ssl._create_unverified_context
    url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv'
    fullht_df = pd.read_csv(url)
    print(fullht_df.head(n=100))

    # Get the first 500 rows from the dataset and use that for the rest of the assignment.
    ht_df = fullht_df.head(n=500)

    # Display the dataset's dimension
    print(f"Working dataset dimension is: {ht_df.shape}\n")

    # Cells with missing data have a '?' in them.
    # First replace ? with np.NaN so we can utilise some other nice Pandas dataframe methods. We can use a global replace because, upon dataset inspection, the unknown ('?') only exists in the numeric columns.
    # Convert the value columns from text to numeric.
    # Calculate the median value for the numeric-data columns
    # Replace the NaN values with a reasonable value. For this exercise we have chosen the mean for the column
    # Recalculate the median value for the numeric-data columns

    # Prepare the data so it is calculable
    ht_df = ht_df.replace('?', np.NaN)  # Replace with NaN so many of the Pandas functions will work.
    ht_df[["TSH","T3","TT4","FTI"]] = ht_df[["TSH","T3","TT4","FTI"]].apply(pd.to_numeric)  # CSV loads as text. Convert the cells to numeric

    # Calculate the Mean and Median prior to replacing missing values
    mean = ht_df[["TSH","T3","TT4","FTI"]].mean(skipna=True)
    median = ht_df[["TSH","T3","TT4","FTI"]].median(skipna=True)

    # Replace the NaN's of the numeric columns with the mean
    ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
    ht_df["T3"] = ht_df["T3"].fillna(mean["T3"])
    ht_df["TT4"] = ht_df["TT4"].fillna(mean["TT4"])
    ht_df["FTI"] = ht_df["FTI"].fillna(mean["FTI"])

    # Replace the M/F missing values with the most frequently occurring gender provided "pregnant" is false. Otherwise set the value to F.
    print("@@@@@@@@@@@@@@")
    for item in ht_df["sex"]:
        if item == np.NaN:
            print(f"{item} is NaN\n")
        print(f"{item} {type(item)}")
    print("@@@@@@@@@@@@@@")


if __name__ == "__main__":
    main()
CodePudding user response:
I'm not sure why you want to use iteration to print out each item. But if all you want is a printout of the rows where the 'sex' column is np.nan, then:
print(ht_df["sex"].isna())
will show True or False for every row, with True wherever the value is np.nan.
If you want to just see the dataframe elements with those rows, you can try something like:
print(ht_df.loc[ht_df["sex"].isna(), ["sex", "pregnant", "TSH"]])
which would print all the rows that are np.nan in the sex column, along with the values of those three columns (which I picked arbitrarily; you can substitute any list of columns you like).
Lastly, if you want to create a formula to guess whether a np.nan row should be M or F, I'd usually create a "sex_predict" column, fill that using whatever algo you are using, and then use fillna:
ht_df["sex"] = ht_df["sex"].fillna(ht_df["sex_predict"])
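As a rough sketch of that last idea, here is the rule described in the question's code comment (most frequent sex unless pregnant, otherwise F) applied to a toy frame. The data, the boolean encoding of "pregnant", and the name most_common are my assumptions; the real column may be encoded differently (for example as strings), so adapt the condition accordingly.

```python
import numpy as np
import pandas as pd

# Toy data standing in for ht_df; values are made up for illustration.
df = pd.DataFrame({
    "sex": ["F", "M", np.nan, "M", np.nan, "M"],
    "pregnant": [False, False, True, False, False, False],
})

# Most frequent known value in the column (mode() ignores NaN by default).
most_common = df["sex"].mode()[0]

# Prediction for every row: "F" if pregnant, otherwise the most common value.
df["sex_predict"] = np.where(df["pregnant"], "F", most_common)

# Only the NaN cells take the predicted value; known values are untouched.
df["sex"] = df["sex"].fillna(df["sex_predict"])
df = df.drop(columns="sex_predict")
print(df)
```

The prediction is computed for every row, but fillna only uses it where "sex" is missing, so no explicit loop is needed.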
CodePudding user response:
You can't check item == np.NaN; you have to use pd.isna(item):
for item in ht_df["sex"]:
    if pd.isna(item):
        print(f"{item} is NaN\n")
    print(f"{item} {type(item)}")
Output:
...
M <class 'str'>
F <class 'str'>
nan is NaN
nan <class 'float'>
F <class 'str'>
...
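As for why the == test never fires: NaN is defined to compare unequal to every value, including itself, so an equality check against it is always False. A small self-contained check, using a made-up Series in place of ht_df["sex"] (the variable names here are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ht_df["sex"]; the values are made up.
sex = pd.Series(["F", "M", np.nan, "F"])

# NaN never compares equal to anything, not even to itself.
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)

# Equality therefore finds nothing, while pd.isna finds the missing cell.
found_with_eq = [item for item in sex if item == np.nan]
found_with_isna = [item for item in sex if pd.isna(item)]
print(len(found_with_eq), len(found_with_isna))
```

The lowercase np.nan is used here because it is the spelling that works across NumPy versions.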
CodePudding user response:
Thanks everyone (esp. @Tom and @Corralien, as they were both correct) for the answers. They were all very illuminating as to how I should be working with data and pandas. I combined the two into my solution below:
tmp_col = "sex-predict"
ht_df[tmp_col] = ht_df["sex"]
for (index, row_series) in ht_df.iterrows():
    if pd.isna(row_series["sex"]):
        ht_df.at[index, tmp_col] = calc_gender()  # Calculate the value for the missing value.
# Copy over any NaN values in the sex column using the value from the temporary column
ht_df["sex"] = ht_df["sex"].fillna(ht_df[tmp_col])
ht_df = ht_df.drop([tmp_col], axis=1)  # Drop the temporary column
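A possible variation that avoids both iterrows and the temporary column is to select the missing rows with a boolean mask and assign to them directly. This is only a sketch: the toy frame is made up, and this calc_gender takes the row as an argument and uses a placeholder rule, which is an assumption rather than the real algorithm.

```python
import numpy as np
import pandas as pd


def calc_gender(row):
    # Placeholder rule, purely for illustration; swap in the real algorithm.
    return "F" if row["pregnant"] else "M"


# Toy data standing in for ht_df; values are made up.
df = pd.DataFrame({
    "sex": ["F", np.nan, "M", np.nan],
    "pregnant": [False, True, False, False],
})

missing = df["sex"].isna()
if missing.any():
    # apply(axis=1) calls calc_gender once per selected row; the result is
    # aligned back to the original rows by index.
    df.loc[missing, "sex"] = df.loc[missing].apply(calc_gender, axis=1)
print(df)
```

Only the rows where "sex" is missing are passed to calc_gender, so the known values are never recomputed.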