I am trying to work on a requirement where I am computing the Zscore and want to compare with individual values in the rows. If Zscore>1 , I mark them as NaN
for those specific values. I am marking it as NaN
, so that I could fill those values by appropriate techniques and then I also want to return a dataframe which would tell me what were the original NaN values.
I have the below code:
s={'2014':[1,1,2,2],'2015':[12,22,33,44],'2016':[55,66,77,88],'2017':[2,3,4,5]}
p=pd.DataFrame(data=s)
2014 2015 2016 2017
0 1 12 55 2
1 1 22 66 3
2 2 33 77 4
3 2 44 88 5
I have computed zscore as -
df_zscore = (p - p.mean())/p.std()
2014 2015 2016 2017
0 -0.866025 -1.139879 -1.161895 -1.161895
1 -0.866025 -0.416146 -0.387298 -0.387298
2 0.866025 0.379960 0.387298 0.387298
3 0.866025 1.176065 1.161895 1.161895
out = p.mask(df_zscore > 1)
If Zscore>1, then the output comes like this-
2014 2015 2016 2017
0 1 12 55 2
1 1 22 66 3
2 2 33 77 4
3 2 NaN NaN NaN
(They are marked as NaN, since Zscore was >1)
I want to write a code, which would basically do a tracking of values, i.e. values that were not NaN, but became NaN due to Zscore limit, so ideally I should get the below dataframe.
Desired O/P-
2015 2016 2017
3 44 88 5
How would I be able to do this?
I know since my dataset is small, I can easily figure this out by just looking at the data, but what about when the dataset is huge, that's when I really need to see it.
CodePudding user response:
You can filter rows and columns by compare for greater like 1
and then pass DataFrame.any
for test at least one match per rows and columns and filter in DataFrame.loc
:
m = df_zscore.gt(1)
out = p.loc[m.any(axis=1), m.any()]
print (out)
2015 2016 2017
3 44 88 5
But if match multiple rows and columns solution filter all rows and columns where is at least one match:
s={'2014':[10,1,2,2],'2015':[12,22,33,44],'2016':[55,66,77,88],'2017':[2,3,4,5]}
p=pd.DataFrame(data=s)
df_zscore = (p - p.mean())/p.std()
print (p.mask(df_zscore > 1))
2014 2015 2016 2017
0 NaN 12.0 55.0 2.0
1 1.0 22.0 66.0 3.0
2 2.0 33.0 77.0 4.0
3 2.0 NaN NaN NaN
m = df_zscore.gt(1)
out = p.loc[m.any(axis=1), m.any(axis=0)]
print (out)
2014 2015 2016 2017
0 10 12 55 2
3 2 44 88 5
So if need only matched values is necessary different format:
m = df_zscore.gt(1)
out = p.stack()[m.stack()].rename_axis(['idx','year']).reset_index(name='val')
print (out)
idx year val
0 0 2014 10
1 3 2015 44
2 3 2016 88
3 3 2017 5
Or non matched values convert to missing values:
m = df_zscore.gt(1)
out = p.loc[m.any(axis=1), m.any(axis=0)].where(m)
print (out)
2014 2015 2016 2017
0 10.0 NaN NaN NaN
3 NaN 44.0 88.0 5.0