Home > Blockchain >  How to remove entire panel if one value in panel is missing using python pandas?
How to remove entire panel if one value in panel is missing using python pandas?

Time:09-16

I have a panel data set that generally looks like this (it is much larger).

df['fips'] = ['01001', '01001', '01001', '01003', '01003', '01003', '01005', '01005', '01005']

df['year'] = [2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003]

df['var1'] = [1, 2, 3, 4, 5, 6, 7, 8, 9]

df['var2'] = [1, 2, 3, 4, np.nan, 6, 7, 8, 9]

print(df)

Resulting dataframe:

    fips  year  var1  var2
0  01001  2001     1   1.0
1  01001  2002     2   2.0
2  01001  2003     3   3.0
3  01003  2001     4   4.0
4  01003  2002     5   NaN
5  01003  2003     6   6.0
6  01005  2001     7   7.0
7  01005  2002     8   8.0
8  01005  2003     9   9.0

What I am hoping to do is remove the entire 'fips' panel that contains a nan value so it would look like this:

    fips  year  var1  var2
0  01001  2001     1   1.0
1  01001  2002     2   2.0
2  01001  2003     3   3.0
3  01005  2001     7   7.0
4  01005  2002     8   8.0
5  01005  2003     9   9.0

I have tried creating a Multi-Index on 'fips' and 'year' and using dropna, but I am only able to pull out the row that actually contains the na value. Thanks in advance for any help!

CodePudding user response:

You could check to make sure none of the var2 are null within the group

df.groupby('fips').filter(lambda x: ~x['var2'].isnull().any())

If you wanted to check both var cols you could try:

null_fips = df.loc[df[['var1','var2']].isnull().any(1)]['fips'].unique()
df.loc[~df['fips'].isin(null_fips)]

CodePudding user response:

Here is an approach:

def remove_panel(df: pd.DataFrame, pnlColumn: str) -> pd.DataFrame:
    # remove all references to entries in pnlColumn where at least one is np.nan
    dfo = pd.DataFrame(columns = df.columns.to_list())
    pnls = list(df[pnlColumn].unique())
    for pnl in pnls:                                                     
        if len( np.argwhere(np.isnan(df[df[pnlColumn] == pnl]['var2'].to_numpy())) ) == 0:
            dfo = pd.concat([dfo, df[df[pnlColumn] == pnl]])
    return dfo

Given your initial dataframe running

remove_panel(df, 'fips')  

produces:

fips    year    var1    var2
0   01001   2001    1   1.0
1   01001   2002    2   2.0
2   01001   2003    3   3.0
6   01005   2001    7   7.0
7   01005   2002    8   8.0
8   01005   2003    9   9.0
  • Related