Home > database >  How to use for loop to label numeric data in one Data Frame if intervals for labels are defined in a
How to use for loop to label numeric data in one Data Frame if intervals for labels are defined in a

Time:06-19

I have the following data frame regarding the quality of air (each row has values of various air indicators measured at certain day) :

my_data = pd.DataFrame.from_dict({
   'so2': [ 5.6, 4.6, 6.1, 10.7, 2.6],
   'co': [ 9.1, 8.2, 9.1, 18.1, 3.7],
   'o3': [2.9, 11.0, 4.1, 2.1, 18.7],
   'pm25': [124.0, 114.0, 104.0, 190.0, 46.0],
   'no2': [29.8, 22.4, 26.1, 41.6, 11.5]
})

I would like to transform these values into categories - colours (from green to red, 4 categories). I have another Data Frame containing the intervals for each colour.

quality_standards_colours = pd.DataFrame.from_dict({
    'Category': ['green', 'yellow',' orange', 'red'],
    'so2_min': [0, 0.864, 1.824, 4.464],
    'co_min': [0, 13.5, 28.5, 37.5],
    'o3_min': [0, 0.165, 0.213, 0.258],
    'pm25_min': [0, 12.1, 35.5, 55.5],
    'no2_min': [0, 1.296, 2.424, 8.664]
})

I would like to get the following Data Frame:

my_categorized_data = pd.DataFrame.from_dict({
   'so2': ['so2_red', 'so2_red', 'so2_red ', 'so2_red', 'so2_orange'],
   'co': [ 'co_orange', 'co_red', 'co_green ', 'co_red', 'co_yellow'],
   'o3': [.....],
   'pm25': [....],
   'no2': [.....]
})

Currently I know following solution ( I need to apply it for each variable):

bins_so2 = [0, 0.864, 1.824, 4.464, np.inf] #this is the column so2_min from quality_standards_colours
names = ['so2_green', 'so2_yellow','so2_orange', 'so2_red']

my_data['so2'] = pd.cut(my_data['so2'], bins, labels=names)

The problem is that this solution is time consuming - I have to write these lines of code for each variable (and I have many). I would like to write loop to speed up this process.

CodePudding user response:

You can use merge_asof, which will cut according to the key:

# rename so you have the same keys
quality_standards_colours.columns = [x.split('_min')[0] for x in quality_standards_colours.columns]

out = pd.merge_asof(my_data.stack().reset_index(name='value').sort_values('value'),
              quality_standards_colours.melt('Category').sort_values('value'),
              on='value',left_by='level_1', right_by='variable'
              ).pivot('level_0', 'level_1', 'Category')

Output:

level_1      co  no2   o3     pm25      so2
level_0                                    
0         green  red  red      red      red
1         green  red  red      red      red
2         green  red  red      red      red
3        yellow  red  red      red      red
4         green  red  red   orange   orange
  • Related