I want to replace missing values based on some conditions in a pandas dataframe

Time:10-03

The following is the dataset I'm working on

Dataset

As you can see there are some missing values (NaN) which need to be replaced, on certain conditions:

  1. If Solar.R < 50 then the missing value of Ozone needs to be replaced by the value = 30.166667

  2. If Solar.R < 100 then the missing value of Ozone needs to be replaced by the value = 21.181818

  3. If Solar.R < 150 then the missing value of Ozone needs to be replaced by the value = 53.13043

  4. If Solar.R < 200 then the missing value of Ozone needs to be replaced by the value = 59.840000

  5. If Solar.R < 250 then the missing value of Ozone needs to be replaced by the value = 59.840000

  6. If Solar.R < 300 then the missing value of Ozone needs to be replaced by the value = 50.115385

  7. If Solar.R < 350 then the missing value of Ozone needs to be replaced by the value = 26.571429

Is there any way to do this using pandas and if-else? I've tried using loc(), but it resulted in the non-NaN values getting modified too.

PS: This is the code using loc()

while (s['Ozone'].isna() == True):
    s.loc[(s['Solar.R'] < 50), 'Ozone'] = '30.166667'
    s.loc[(s['Solar.R'] < 100), 'Ozone'] = '21.181818'
    s.loc[(s['Solar.R'] < 150), 'Ozone'] = '53.13043'
    s.loc[(s['Solar.R'] < 200), 'Ozone'] = '59.840000'
    s.loc[(s['Solar.R'] < 250), 'Ozone'] = '59.840000'
    s.loc[(s['Solar.R'] < 300), 'Ozone'] = '50.115385'
    s.loc[(s['Solar.R'] < 350), 'Ozone'] = '26.571429'
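
The masks in the loop above test only Solar.R, so every matching row is overwritten whether or not Ozone is missing, and using a whole Series as a `while` condition would typically raise a ValueError about an ambiguous truth value. A minimal sketch of the same loc() idea with an added isna() check follows. The sample frame is hypothetical, since the real dataset is not shown, and floats are assigned instead of strings so the column stays numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real dataset is not shown in the question
s = pd.DataFrame({
    "Solar.R": [25, 25, 120, 120, 340],
    "Ozone": [np.nan, 41.0, np.nan, 36.0, np.nan],
})

# (cutoff, fill value) pairs from the question, smallest cutoff first
rules = [(50, 30.166667), (100, 21.181818), (150, 53.13043),
         (200, 59.840000), (250, 59.840000), (300, 50.115385),
         (350, 26.571429)]

for cutoff, value in rules:
    # Recompute the mask on every pass: rows filled earlier are no longer
    # NaN, so a larger cutoff cannot overwrite them
    missing_below = s["Ozone"].isna() & (s["Solar.R"] < cutoff)
    s.loc[missing_below, "Ozone"] = value

print(s)
```

The existing values 41.0 and 36.0 are left alone because their rows never satisfy the isna() part of the mask.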

CodePudding user response:

The function .fillna(value) can be used. It only changes the NaN values in a dataframe and not other values. Here is an example for your specific problem:

import pandas as pd
import numpy as np

#example dataset with values for each interval
example = {'Solar.R' : [25, 25, 87, 87, 134, 134, 187, 187, 234, 234, 267, 267, 345, 345],
           'Ozone' : [1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan, 1, np.nan]}
df = pd.DataFrame(example)

#list of pairs of the cutoff and the respective values
#!!! needs to be sorted from smallest cutoff to largest
cut_off_values = [(50, 30.166667), (100, 21.181818), (150, 53.13043),
                  (200, 59.840000), (250, 59.840000), (300, 50.115385), 
                  (350, 26.571429)]

#iterate the list of pairs and change only the nan values in 'Ozone'
for cutoff, value in cut_off_values:
    below = df['Solar.R'] < cutoff
    df.loc[below, 'Ozone'] = df.loc[below, 'Ozone'].fillna(value)

print(df.to_string())

Output:

    Solar.R      Ozone
0        25   1.000000
1        25  30.166667
2        87   1.000000
3        87  21.181818
4       134   1.000000
5       134  53.130430
6       187   1.000000
7       187  59.840000
8       234   1.000000
9       234  59.840000
10      267   1.000000
11      267  50.115385
12      345   1.000000
13      345  26.571429
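
The same result can probably be reached without a Python loop. The sketch below is an alternative, not the code from this answer: it assumes Solar.R is non-negative, bins it with pd.cut into half-open intervals [0, 50), [50, 100), and so on, and passes the looked-up value to fillna. The names `edges`, `fill_values`, `bin_index` and `replacement` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one row per interval, one out-of-range row (400),
# and one row whose Ozone value is already present
df = pd.DataFrame({
    "Solar.R": [25, 87, 134, 187, 234, 267, 345, 400, 25],
    "Ozone": [np.nan] * 8 + [12.0],
})

edges = [0, 50, 100, 150, 200, 250, 300, 350]
fill_values = [30.166667, 21.181818, 53.13043, 59.840000,
               59.840000, 50.115385, 26.571429]

# labels=False returns the integer position of the bin; values outside
# the edges become NaN. right=False makes each bin [low, high).
bin_index = pd.cut(df["Solar.R"], bins=edges, right=False, labels=False)

# Look up the fill value for each bin position (NaN stays NaN)
replacement = bin_index.map(pd.Series(fill_values))

# fillna touches only the rows where Ozone is missing
df["Ozone"] = df["Ozone"].fillna(replacement)
print(df)
```

Rows with Solar.R outside 0 to 350 get no replacement and stay NaN, which makes gaps in the rules visible instead of silently filling them.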

CodePudding user response:

Try:

# only rows where Ozone is missing should be filled
common = df['Ozone'].isnull()
# >= keeps boundary values such as Solar.R == 50 inside an interval
all_conditions = [(df['Solar.R'] < 50) & (common),
                  (df['Solar.R'] >= 50) & (df['Solar.R'] < 100) & (common),
                  (df['Solar.R'] >= 100) & (df['Solar.R'] < 150) & (common),
                  (df['Solar.R'] >= 150) & (df['Solar.R'] < 250) & (common),
                  (df['Solar.R'] >= 250) & (df['Solar.R'] < 300) & (common),
                  (df['Solar.R'] >= 300) & (df['Solar.R'] < 350) & (common)]

# floats rather than strings, so the column stays numeric
fill_with = [30.166667, 21.181818, 53.13043, 59.840000, 50.115385, 26.571429]
df['Ozone'] = np.select(all_conditions, fill_with, default=df['Ozone'])

CodePudding user response:

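As your intervals all have the same width (50), you can use floordiv to compute the position of the right value for each row and mask to hide Solar.R values outside the range. fillna then fills only the missing Ozone entries. Note that sub(1) makes each upper bound inclusive, so Solar.R == 50 falls in the first interval:

values = [30.166667, 21.181818, 53.13043, 59.840000,
          59.840000, 50.115385, 26.571429]
in_range = s['Solar.R'].between(0, 350)
s['Ozone'] = s['Ozone'].fillna(
    s['Solar.R'].mask(~in_range).sub(1).floordiv(50).map(pd.Series(values)))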
print(s)

# Output:
    Solar.R      Ozone
0      50.0  30.166667
1       NaN        NaN
2     450.0        NaN
3      98.0  21.181818
4     348.0  26.571429
5     302.0  26.571429
6     348.0  26.571429
7     279.0  50.115385
8       8.0  30.166667
9      80.0  21.181818
10    140.0  53.130430
11    239.0  59.840000
12    227.0  59.840000
13     93.0  21.181818
14    305.0  26.571429
15     80.0  21.181818
16    104.0  53.130430
17    180.0  59.840000
18    179.0  59.840000
19     59.0  21.181818
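
To see how the floordiv step above picks a position in `values`, the sketch below prints the index computed for a few boundary inputs. The sample inputs are illustrative. It also shows a variant without sub(1), which matches the strict `<` conditions from the question.

```python
import pandas as pd

# Illustrative Solar.R values sitting on or near interval boundaries
solar = pd.Series([8.0, 50.0, 51.0, 100.0, 348.0, 350.0])

# With sub(1), a value equal to a cutoff stays in the lower interval
inclusive_upper = solar.sub(1).floordiv(50)
print(inclusive_upper.tolist())  # [0.0, 0.0, 1.0, 1.0, 6.0, 6.0]

# Without sub(1), a value equal to a cutoff moves to the next interval,
# as in the question's "Solar.R < 50" style rules
strict_upper = solar.floordiv(50)
print(strict_upper.tolist())  # [0.0, 1.0, 1.0, 2.0, 6.0, 7.0]
```

In the strict variant, 350 maps to position 7, which has no entry in a seven-item list, so mapping it would give NaN.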