Algorithm: how do I calculate a hit/miss cache ratio taking time into account-CodePudding

All right, I'm sure this is a duplicate of a question somewhere (I refuse to believe I'm the only idiot struggling with this), but I can't find it.

It seems so easy on paper, and yet here am I writing this.

So, I have written my own caching system. It's 99% done. I'm lacking the last piece of the puzzle : let the system decide wether it's "profitable" (time wise) to load from the cache or not.

I have 4 variables :

hit : counter of successful loads from cache
miss : counter of unsuccessful loads from cache
time_check : the average time it takes to load from the cache (hit and time_check are updated at the same time)
time_generate : the average time it takes to generate the page from scratch (miss and time_generate are updated at the same time)

From there, my problem sounds easy. I need to figure out a formula or a set of if-elses conditions to decide wether to attempt a load from the cache or not.

It's safe to assume the variables are already there because the algorithm will run only after at least 100 page loads.

One important thing to take into account, is that if the algorithm decides to load from the cache and it MISSES, the overall loading time will be time_check time_generate!

Aaaand... From there I don't know how I should do it. I'm stuck. My mind can't process this.

I need the system to take the right decision for loading content from the cache according the the probability of getting an actual time gain.

I can already say right from the start that if time_generate < time_check, we should not use the cache no matter what hit and miss are. But then, for the rest, I'm getting confused.

By any chance, does anyone reading this has a flash brain and is able to immediately understand what I'm looking for? :)

CodePudding user response：

I have 4 variables :

Your time_check is 2 pieces - checking if the data is in the cache, and then fetching the data from cache. If you check if the data is in the cache and it's not you end up with "(part of time_check) time_generate" and you have no information to deduce "part of time_check".

You need to fix that by splitting time_check into pieces - e.g. a time_check_if_cached and a time_fetch_from_cache.

There is a 5th variable you're overlooking - the time you spend trying to determine if you should check the cache or not. This should also include the time spent tracking information used for making the decision (e.g. incrementing hit and miss). Let's call this time_decision.

Now you have 3 cases:

you decide not to check the cache. It costs time_decision time_generate.
you decide to check the cache and find that the data is not cached. It costs time_decision time_check_if_cached time_generate.
you decide to check the cache and find that the data is cached. It costs time_decision time_check_if_cached time_fetch_from_cache.

Now we can determine some general rules:

a) if time_decision is greater than or equal to time_check_if_cached; then it's always better to skip time_decision and check the cache instead. This is the most likely scenario. The result is that you only have 2 cases - time_check_if_cached time_generate and time_check_if_cached time_fetch_from_cache.

b) if time_decision is only slightly less than time_check_if_cached, and there's even a small chance of cache hit; it's still "better on average" to always skip time_decision.

c) if time_decision is significantly less than time_check_if_cached; you have design flaws and need to fix/optimize time_check_if_cached.

d) if the hit rate (hit/(hit miss)) is so low (and time_check_if_cached is so expensive) that you think it might help to avoid time_check_if_cached and go directly to time_generate; then you have design flaws and need to fix/improve the hit rate.

e) if time_check_if_cached time_fetch_from_cache is greater than or equal to time_generate; you have design flaws and either need to fix/optimize time_check_if_cached time_fetch_from_cache or stop trying to cache something that's already so fast that caches can't help.

CodePudding user response：

I spent some serious amount of time thinking about this, and after numerous attempts I have finally found an answer that satisfies my needs. In case anyone needs it:

Skipping all the calculations and simplifications, the end formula is:

minimum hit rate % = time_check / time_generate

I guess you could somehow find it from the other replies here.

Next, from that base formula we need to distort time to penalize higher times. That is because, in statistics and probabilities, loosing 2.5s over 3.5s is identical to losing 250s over 350s. However, that is not how we "feel" time. We're way more willing to bet to loose 2.5s rather than 250s. If we're betting to loose 250s over 350s, we would expect a very high hit rate. And if we're betting to loose 2.5s over 3.5s: "I mean, it's only 2.5s so it's not that much of an issue, let's try it anyway".

I have tried a lot of options based on random attempts but the one that I kept in the end is:

minimum hit rate % = arctan(time_check/100) / arctan(time_generate/100)

I don't have any justification for it, it just works best for my needs. It is also adaptable by changing the denominator. /50 will make it stricter (higher hit rate required) and /150 will make it more flexible (less hit rate required). It suits all range of values and gives quite expectable results.

I guess it all depends on your particular configuration, here are the other time distorting options I have tried and their issues:

time_check² / time_generate² [too loose on high distanced values]

time_check ^ 0.1 / time_generate ^ 0.1 [too strict overall]

1.05 ^ time_check / 1.05 ^ time_generate [way too strict overall]

1.1 ^ time_check / 1.1 ^ time_generate [way too loose on high distanced values]

log2(time_check) / log2(time_generate) [not suited for values of 1 and below]