RL reward function with unknown range

Time:07-17

For the sake of argument, let's say that I am trying to minimize a number of mathematical functions using Reinforcement Learning, where the minimum can essentially lie anywhere between -inf and inf. (I know that RL would probably not be the most suitable algorithm, but this is just an analogy.)

I want to set up the reward to reflect the "best minimum" found on each step. The problem is that any specific function can have a {min, max} range of, for example, {0, 100}, or {-1000, 9999999}, or {-99999, -10}, or {-9.000000001, -9.000000002} - any two conceivable values, really - and the ranges are not known beforehand. I am therefore unsure how to normalize the reward to lie in {-1, 1}, because such extreme and varied raw values obviously won't work directly as a reward.

I assume that some kind of relative improvement formula is needed, where the new value is compared to the old one. But this creates its own problems: something like (x_old - x_new) / x_old would treat a change from 1 to 0.5 as a 50% improvement, even though the true minimum of the function might just as well lie at -1000.
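One standard workaround (not from the question itself, just a common trick) is to drop the ratio and instead normalize each raw value against the running min/max observed so far. The bounds are running estimates, not the true unknown range, so early rewards are poorly calibrated and shift as the range expands - a sketch:

```python
def make_normalizer():
    """Map raw objective values into [-1, 1] using the running min/max.

    The bounds are online estimates of the unknown range, not the true
    range, so the mapping of a given value drifts as new extremes appear.
    """
    state = {"lo": float("inf"), "hi": float("-inf")}

    def normalize(x):
        state["lo"] = min(state["lo"], x)
        state["hi"] = max(state["hi"], x)
        span = state["hi"] - state["lo"]
        if span == 0.0:  # only one distinct value seen so far
            return 0.0
        # Linear map: running min -> -1, running max -> +1.
        return 2.0 * (x - state["lo"]) / span - 1.0

    return normalize


normalize = make_normalizer()
print(normalize(0))      # first value: 0.0 (no range yet)
print(normalize(100))    # new max: 1.0
print(normalize(50))     # midpoint of {0, 100}: 0.0
print(normalize(-1000))  # new min: -1.0, and the whole scale shifts
```

Unlike (x_old - x_new) / x_old, this stays bounded and sign-safe regardless of where the values lie, at the cost of non-stationary rewards whenever the observed range grows.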

Maybe there are simply too few constraints to sensibly construct a reward function, but I am sure that analogous problems have been encountered elsewhere?

Thanks in advance!

CodePudding user response:

Predicting a distribution from limited information is always complicated. One idea that might help: optimize two models. The first is rewarded for fulfilling the task, with rewards scaled using the values encountered so far. The second is optimized toward a policy that is rewarded every time it finds a state whose value lies outside the distribution encountered so far (e.g., if we have so far encountered values of z in the range {5, 42}, the model is rewarded for finding a state with z = 1 or z = 123). The range found by the latter model can then be used to scale the rewards for the former model. I hope this helps or inspires you to a good solution :)
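A minimal sketch of this two-part scheme, with hypothetical names (`RangeExplorer`, `exploration_reward`, `scaled_task_reward` are illustrative, not from any library): the explorer earns a bonus only when a value falls outside the range seen so far, and the expanding range is reused to scale the task reward into [-1, 1].

```python
class RangeExplorer:
    """Track the observed value range; pay a bonus for escaping it.

    The range {lo, hi} is an online estimate, as in the suggestion above;
    it only ever expands as more extreme values are found.
    """

    def __init__(self):
        self.lo = None
        self.hi = None

    def exploration_reward(self, z):
        # First observation just initializes the range; no bonus yet.
        if self.lo is None:
            self.lo, self.hi = z, z
            return 0.0
        if z < self.lo:
            self.lo = z
            return 1.0  # found a value below everything seen so far
        if z > self.hi:
            self.hi = z
            return 1.0  # found a value above everything seen so far
        return 0.0      # inside the known range: no exploration bonus

    def scaled_task_reward(self, z):
        # Scale z into [-1, 1] using the range the explorer has found.
        span = self.hi - self.lo
        if span == 0.0:
            return 0.0
        return 2.0 * (z - self.lo) / span - 1.0


explorer = RangeExplorer()
print(explorer.exploration_reward(5))    # 0.0 (initializes the range)
print(explorer.exploration_reward(42))   # 1.0 (new max)
print(explorer.exploration_reward(10))   # 0.0 (inside {5, 42})
print(explorer.exploration_reward(1))    # 1.0 (new min)
print(explorer.scaled_task_reward(1))    # -1.0 under the current range
```

In a real setup the two reward streams would train two separate policies (or one policy with a weighted sum); this sketch only shows the bookkeeping they would share.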
