Function I have created:
# Create a function that identifies blank values
def GPID_blank(df, variable):
    df = df.loc[df['GPID'] == variable]
    return df
Test:
variable = ''
test = GPID_blank(df, variable)
test
Goal: Create a function that filters any dataframe on its 'GPID' column to show all of the rows where GPID is missing.
I have tried running variable = 'NaN' and still no luck. However, I know the function works: if I pass a real value such as "OH82CD85", it filters my dataset accordingly.
So why doesn't it return the blank cells when variable = 'NaN'? I know that in my dataset there are 5 rows where GPID is missing.
Example df:
df = pd.DataFrame({'Client': ['A','B','C'], 'GPID':['BRUNS2','OH82CD85','']})
  Client      GPID
0      A    BRUNS2
1      B  OH82CD85
2      C
Sample of GPID column:
0 OH82CD85
1 BW07TI20
2 OW36HW81
3 PE56TA73
4 CT46SX81
5 OD79AU80
6 GF46DB60
7 OL07ST01
8 VP38SM57
9 AH90AE61
10 PG86KO78
11 NaN
12 NaN
13 SO21GR72
14 DY85IN90
15 KW80CV02
16 CM15QP83
17 VC38FP82
18 DA36RX05
19 DD74HD38
CodePudding user response:
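You can't use == with NaN, because NaN != NaN.
Instead, you can modify your function a little to check whether the parameter is NaN using pd.isna() (np.isnan() also works for numeric input, but it raises a TypeError when given a string such as "OH82CD85"):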
def GPID_blank(df, variable):
    if pd.isna(variable):
        return df.loc[df['GPID'].isna()]
    else:
        return df.loc[df['GPID'] == variable]
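To make the NaN != NaN point concrete, here is a minimal sketch. The three-row Series is a hypothetical sample, not the asker's real data; it only shows that an equality test against NaN matches nothing, while isna() flags the missing row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two real IDs and one missing value.
gpid = pd.Series(['BRUNS2', 'OH82CD85', np.nan], name='GPID')

# NaN never compares equal to anything, including itself.
nan_equals_itself = (np.nan == np.nan)

# So an equality comparison against NaN is False on every row.
eq_mask = gpid == np.nan

# isna() is the supported way to flag missing values.
na_mask = gpid.isna()

print(nan_equals_itself)   # False
print(eq_mask.tolist())    # [False, False, False]
print(na_mask.tolist())    # [False, False, True]
```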
CodePudding user response:
It's not working because with variable = 'NaN' you're looking for a string whose content is the text 'NaN', not for missing values.
You can try:
import pandas as pd

def GPID_blank(df):
    # filtered dataframe with NaN values in GPID column
    blanks = df[df['GPID'].isnull()].copy()
    return blanks

filtered_df = GPID_blank(df)
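Note that the example df in the question stores the blank as an empty string '', while the real column sample shows NaN. If both can occur, a variant along these lines might help. This is only a sketch: the function name gpid_blank_or_missing and the four-row frame are made up for illustration, and treating whitespace-only strings as blank is an assumption.

```python
import numpy as np
import pandas as pd

def gpid_blank_or_missing(df):
    """Return rows whose GPID is NaN/None or an empty/whitespace-only string."""
    gpid = df['GPID']
    # fillna('') turns missing values into empty strings, so one
    # string comparison covers both the NaN and the '' case.
    mask = gpid.fillna('').astype(str).str.strip().eq('')
    return df.loc[mask].copy()

# Hypothetical data: one empty string and one real NaN.
df = pd.DataFrame({
    'Client': ['A', 'B', 'C', 'D'],
    'GPID': ['BRUNS2', 'OH82CD85', '', np.nan],
})

blanks = gpid_blank_or_missing(df)
print(blanks['Client'].tolist())  # ['C', 'D']
```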
CodePudding user response:
You can't really match NaN values with an equality comparison the way you match an ordinary value. Also, in your example dataframe, '' is not NaN but an empty str, so it can be matched with ==.
Instead, you need to check whether you are filtering for NaN, and filter differently in that case:
def GPID_blank(df, variable):
    if pd.isna(variable):
        df = df.loc[df['GPID'].isna()]
    else:
        df = df.loc[df['GPID'] == variable]
    return df
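As a quick check of how this kind of function behaves with different arguments, the sketch below calls it with a real NaN, an empty string, and the text 'NaN'. The four-row frame is hypothetical, and the function is restated (with early returns) only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def GPID_blank(df, variable):
    # Missing values need isna(); everything else can use ==.
    if pd.isna(variable):
        return df.loc[df['GPID'].isna()]
    return df.loc[df['GPID'] == variable]

# Hypothetical data: one empty string and one real NaN.
df = pd.DataFrame({
    'Client': ['A', 'B', 'C', 'D'],
    'GPID': ['BRUNS2', 'OH82CD85', '', np.nan],
})

missing = GPID_blank(df, np.nan)   # real missing value: row D
empty = GPID_blank(df, '')         # empty string: row C
text_nan = GPID_blank(df, 'NaN')   # the string 'NaN': no rows

print(missing['Client'].tolist())   # ['D']
print(empty['Client'].tolist())     # ['C']
print(text_nan['Client'].tolist())  # []
```

Passing np.nan (or None) rather than the string 'NaN' is what triggers the isna() branch.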