I have a panel data set that generally looks like this (it is much larger).
df['fips'] = ['01001', '01001', '01001', '01003', '01003', '01003', '01005', '01005', '01005']
df['year'] = [2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003]
df['var1'] = [1, 2, 3, 4, 5, 6, 7, 8, 9]
df['var2'] = [1, 2, 3, 4, np.nan, 6, 7, 8, 9]
print(df)
Resulting dataframe:
fips year var1 var2
0 01001 2001 1 1.0
1 01001 2002 2 2.0
2 01001 2003 3 3.0
3 01003 2001 4 4.0
4 01003 2002 5 NaN
5 01003 2003 6 6.0
6 01005 2001 7 7.0
7 01005 2002 8 8.0
8 01005 2003 9 9.0
What I am hoping to do is remove the entire 'fips' panel that contains a nan value so it would look like this:
fips year var1 var2
0 01001 2001 1 1.0
1 01001 2002 2 2.0
2 01001 2003 3 3.0
3 01005 2001 7 7.0
4 01005 2002 8 8.0
5 01005 2003 9 9.0
I have tried creating a Multi-Index on 'fips' and 'year' and using dropna, but I am only able to pull out the row that actually contains the na value. Thanks in advance for any help!
CodePudding user response:
You could check to make sure none of the var2
are null within the group
df.groupby('fips').filter(lambda x: ~x['var2'].isnull().any())
If you wanted to check both var cols you could try:
null_fips = df.loc[df[['var1','var2']].isnull().any(1)]['fips'].unique()
df.loc[~df['fips'].isin(null_fips)]
CodePudding user response:
Here is an approach:
def remove_panel(df: pd.DataFrame, pnlColumn: str) -> pd.DataFrame:
# remove all references to entries in pnlColumn where at least one is np.nan
dfo = pd.DataFrame(columns = df.columns.to_list())
pnls = list(df[pnlColumn].unique())
for pnl in pnls:
if len( np.argwhere(np.isnan(df[df[pnlColumn] == pnl]['var2'].to_numpy())) ) == 0:
dfo = pd.concat([dfo, df[df[pnlColumn] == pnl]])
return dfo
Given your initial dataframe running
remove_panel(df, 'fips')
produces:
fips year var1 var2
0 01001 2001 1 1.0
1 01001 2002 2 2.0
2 01001 2003 3 3.0
6 01005 2001 7 7.0
7 01005 2002 8 8.0
8 01005 2003 9 9.0