The following is the dataset I'm working on
As you can see there are some missing values (NaN) which need to be replaced, on certain conditions:
If Solar.R < 50 then the missing value of Ozone needs to be replaced by the value = 30.166667
If Solar.R < 100 then the missing value of Ozone needs to be replaced by the value = 21.181818
If Solar.R < 150 then the missing value of Ozone needs to be replaced by the value = 53. 13043
If Solar.R < 200 then the missing value of Ozone needs to be replaced by the value = 59. 840000
If Solar.R < 250 then the missing value of Ozone needs to be replaced by the value = 59. 840000
If Solar.R < 300 then the missing value of Ozone needs to be replaced by the value = 50. 115385
If Solar.R < 350 then the missing value of Ozone needs to be replaced by the value = 26. 571429
Is there any way to do this using pandas and if-else? I've tried using loc() but it resulted in the non - NaN values getting modified too.
PS: This is the code using loc()
while (s['Ozone'].isna() == True):
s.loc[(s['Solar.R'] < 50), 'Ozone'] = '30.166667'
s.loc[(s['Solar.R'] < 100), 'Ozone'] = '21.181818'
s.loc[(s['Solar.R'] < 150), 'Ozone'] = '53.13043'
s.loc[(s['Solar.R'] < 200), 'Ozone'] = '59.840000'
s.loc[(s['Solar.R'] < 250), 'Ozone'] = '59.840000'
s.loc[(s['Solar.R'] < 300), 'Ozone'] = '50.115385'
s.loc[(s['Solar.R'] < 350), 'Ozone'] = '26.571429'
CodePudding user response:
The function .fillna(value)
can be used. It only changes the NaN
values in a dataframe and not other values. Here is an example for your specific problem:
import pandas as pd
import numpy as np
#example dataset with values for each interval
example = {'Solar.R' : [25, 25, 87, 87, 134, 134, 187, 187, 234, 234, 267, 267, 345, 345],
'Ozone' : [1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan]}
df = pd.DataFrame(example)
#list of pairs of the cutoff and the respective values
#!!! needs to be sorted from smallest cutoff to largest
cut_off_values = [(50, 30.166667), (100, 21.181818), (150, 53.13043),
(200, 59.840000), (250, 59.840000), (300, 50.115385),
(350, 26.571429)]
#iterate the list of pairs and change only the nan values
for pair in cut_off_values:
df[df['Solar.R'] < pair[0]] = df[df['Solar.R'] < pair[0]].fillna(pair[1])
print(df.to_string())
Output:
Solar.R Ozone
0 25 1.000000
1 25 30.166667
2 87 1.000000
3 87 21.181818
4 134 1.000000
5 134 53.130430
6 187 1.000000
7 187 59.840000
8 234 1.000000
9 234 59.840000
10 267 1.000000
11 267 50.115385
12 345 1.000000
13 345 26.571429
CodePudding user response:
Try:
common = df['col_2'].isnull()
all_conditions = [(df['Solar.R'] < 50) & (common),
(df['Solar.R'] > 50) & (df['Solar.R'] < 100) & (common),
(df['Solar.R'] > 100) & (df['Solar.R'] < 150) & (common),
(df['Solar.R'] > 150) & (df['Solar.R'] < 250) & (common),
(df['Solar.R'] > 250) & (df['Solar.R'] < 300) & (common),
(df['Solar.R'] > 300) & (df['Solar.R'] < 350) & (common)]
fill_with = ['30.166667', '21.181818', '53.13043', '59.840000', '50.115385', '26.571429']
df['col_2'] = np.select(all_conditions, fill_with, default=df['col_2'])
CodePudding user response:
As your conditions are linear, you can use floordiv
to select the right values for Ozone
column and mask
to hide other values:
values = [30.166667, 21.181818, 53.13043, 59.840000,
59.840000, 50.115385, 26.571429]
s['Ozone'] = s.mask(~s['Solar.R'].between(0, 350))['Solar.R'] \
.sub(1).floordiv(50).map(pd.Series(values))
print(s)
# Output:
Solar.R Ozone
0 50.0 30.166667
1 NaN NaN
2 450.0 NaN
3 98.0 21.181818
4 348.0 26.571429
5 302.0 26.571429
6 348.0 26.571429
7 279.0 50.115385
8 8.0 30.166667
9 80.0 21.181818
10 140.0 53.130430
11 239.0 59.840000
12 227.0 59.840000
13 93.0 21.181818
14 305.0 26.571429
15 80.0 21.181818
16 104.0 53.130430
17 180.0 59.840000
18 179.0 59.840000
19 59.0 21.181818