I have a table of "outcomes" of scores for a few hundred people on specific days. E.g.
Person | Date | Score |
---|---|---|
1 | 1/1/2021 | 10 |
2 | 1/1/2021 | 15 |
1 | 2/2/2022 | 20 |
3 | 2/2/2022 | 17 |
I will need to repeatedly compare each players' average score for a specific date range. E.g. get each player's average score between 1/1/2021 and 12/31/2021
.
I know that I could query their average using the AVG(score) aggregate function, like SELECT Person, AVG(Score) FROM outcomes WHERE date < ?;
However, since I have hundreds of players with possibly hundreds of outcomes, I am worried that repeatedly doing this query will be produce a lot of row reads. I am considering creating an "averages" table or view where there is an entry for each player on each unique date, but the Score is an average score for the outcomes before that date.
Something like:
Person | EndDate | AVG(Score) |
---|---|---|
1 | 1/2/2021 | 10 |
2 | 1/2/2021 | 15 |
3 | 1/2/2021 | 0 |
1 | 2/3/2022 | 15 |
2 | 2/3/2022 | 15 |
3 | 2/3/2022 | 17 |
I realize that this is essentially at least doubling the amount of storage required, because each outcome will also have the associated "average" entry.
How is this kind of problem often addressed in practice? At what point does creating an "averages" table make sense? When is using the AVG(x) function most appropriate? Should I just add an "average" column to my "outcomes" table?
I was able to implement my query using the aggregate AVG(x) function, but I am considered about the number of row reads that my database quickly started requiring.
CodePudding user response:
What you are describing is a form of denormalization. Storing the result of an aggregation instead of running the query every time you need it.
When to implement this? When running the query cannot be done fast enough to meet your performance goals.
Be cautious about adopting denormalization too soon. It comes with a cost.
The risk is that if your underlying data changes, but your denormalized copy is not updated, then the stored averages will be outdated. You have to decide whether it's acceptable to query outdated aggregate results from the denormalized table, and how often you want to update those stored results. There isn't one answer to this — it's up to your project requirements and your judgment.