I would like to clean data based on the range from different data frame.
Range data looks like
labs_range= pd.DataFrame({"Lab":['Albumin','Sodium','Potassium'],'low': [2,2,3],
'high':[4,5,4]})
Lab low high
0 Albumin 2 4
1 Sodium 2 5
2 Potassium 3 4
and data I would like to clean looks like :
labs = pd.DataFrame({"Albumin":[1,1,2,1,2,3,4,5],'Calcium':[1,1,2,1,2,3,4,5],'Sodium':[1,1,2,1,2,3,4,5]})
Albumin Calcium Sodium
0 1 1 1
1 1 1 1
2 2 2 2
3 1 1 1
4 2 2 2
5 3 3 3
6 4 4 4
7 5 5 5
I can do it by using a loop but it would take too long if we have data that is big.
for i in labs_range['Lab']:
try:
labs[i] = np.where((labs[i]<labs_range.loc[labs_range['Lab']==str(i),'low'].values[0]) | (labs[i]>labs_range.loc[labs_range['Lab']==str(i),'high'].values[0]),np.nan,labs[i])
except:
print(i)
The result that we get from this loops is correct but would like it to be more efficient.
Albumin Calcium Sodium
0 nan 1 nan
1 nan 1 nan
2 2.00000 2 2.00000
3 nan 1 nan
4 2.00000 2 2.00000
5 3.00000 3 3.00000
6 4.00000 4 4.00000
7 nan 5 5.00000
CodePudding user response:
You can use pd.isin
labs = pd.DataFrame({"Albumin":[1,1,2,1,2,3,4,5],'Calcium':[1,1,2,1,2,3,4,5],'Sodium':[1,1,2,1,2,3,4,5]})
Albumin Calcium Sodium
0 1 1 1
1 1 1 1
2 2 2 2
3 1 1 1
4 2 2 2
5 3 3 3
6 4 4 4
7 5 5 5
cond = labs.isin({'Albumin': range(2, 5),
'Calcium': range(2, 6),
'Sodimu': range(3, 5)})
labs.where(cond, np.nan)
Albumin Calcium Sodium
0 NaN NaN NaN
1 NaN NaN NaN
2 2.0 2.0 NaN
3 NaN NaN NaN
4 2.0 2.0 NaN
5 3.0 3.0 3.0
6 4.0 4.0 4.0
7 NaN 5.0 NaN
CodePudding user response:
Since you have column dose not intersection we need find the overlap columns 1st
colmatch = labs.columns.intersection(labs_range.Lab)
labs_range = labs_range.set_index('Lab').loc[colmatch]
labs[colmatch] = labs.where(labs.sub(labs_range.low).ge(0) & labs.sub(labs_range.high).le(0))[colmatch]
labs
Out[246]:
Albumin Calcium Sodium
0 NaN 1 NaN
1 NaN 1 NaN
2 2.0 2 2.0
3 NaN 1 NaN
4 2.0 2 2.0
5 3.0 3 3.0
6 4.0 4 4.0
7 NaN 5 5.0