Python to return dataframe values masked before NaN-CodePudding

I am trying to work on a requirement where I am computing the Zscore and want to compare with individual values in the rows. If Zscore>1 , I mark them as NaN for those specific values. I am marking it as NaN, so that I could fill those values by appropriate techniques and then I also want to return a dataframe which would tell me what were the original NaN values.

I have the below code:

s={'2014':[1,1,2,2],'2015':[12,22,33,44],'2016':[55,66,77,88],'2017':[2,3,4,5]}
p=pd.DataFrame(data=s)

     2014 2015 2016 2017
   0    1   12  55   2
   1    1   22  66   3
   2    2   33  77   4
   3    2   44  88   5

I have computed zscore as -

df_zscore = (p - p.mean())/p.std()

       2014       2015        2016       2017
0   -0.866025   -1.139879   -1.161895   -1.161895
1   -0.866025   -0.416146   -0.387298   -0.387298
2   0.866025    0.379960    0.387298    0.387298
3   0.866025    1.176065    1.161895    1.161895


out = p.mask(df_zscore > 1)

If Zscore>1, then the output comes like this-

       2014       2015        2016       2017
0      1          12          55         2
1      1          22          66         3
2      2          33          77         4
3      2          NaN         NaN       NaN

(They are marked as NaN, since Zscore was >1)

I want to write a code, which would basically do a tracking of values, i.e. values that were not NaN, but became NaN due to Zscore limit, so ideally I should get the below dataframe.

Desired O/P-

   2015       2016       2017
3    44       88          5

How would I be able to do this?

I know since my dataset is small, I can easily figure this out by just looking at the data, but what about when the dataset is huge, that's when I really need to see it.

CodePudding user response：

You can filter rows and columns by compare for greater like 1 and then pass DataFrame.any for test at least one match per rows and columns and filter in DataFrame.loc:

m = df_zscore.gt(1)
out = p.loc[m.any(axis=1), m.any()]
print (out)
   2015  2016  2017
3    44    88     5

But if match multiple rows and columns solution filter all rows and columns where is at least one match:

s={'2014':[10,1,2,2],'2015':[12,22,33,44],'2016':[55,66,77,88],'2017':[2,3,4,5]}
p=pd.DataFrame(data=s)

df_zscore = (p - p.mean())/p.std()

print (p.mask(df_zscore > 1))
   2014  2015  2016  2017
0   NaN  12.0  55.0   2.0
1   1.0  22.0  66.0   3.0
2   2.0  33.0  77.0   4.0
3   2.0   NaN   NaN   NaN

m = df_zscore.gt(1)
out = p.loc[m.any(axis=1), m.any(axis=0)]
print (out)
   2014  2015  2016  2017
0    10    12    55     2
3     2    44    88     5

So if need only matched values is necessary different format:

m = df_zscore.gt(1)
out = p.stack()[m.stack()].rename_axis(['idx','year']).reset_index(name='val')
print (out)
   idx  year  val
0    0  2014   10
1    3  2015   44
2    3  2016   88
3    3  2017    5

Or non matched values convert to missing values:

m = df_zscore.gt(1)
out = p.loc[m.any(axis=1), m.any(axis=0)].where(m)
print (out)
   2014  2015  2016  2017
0  10.0   NaN   NaN   NaN
3   NaN  44.0  88.0   5.0