Data aggregation in RDBMS-CodePudding

I have a Postgres DB, In which I am storing some analytical data in a table like a user id, city, country, source(referer), device_type(web, ios, android), etc.

I wanted to show insights based on the data in the above table i.e

all distinct cities, countries, or sources or device_type within a date range for a user
top cities, countries, or sources within a date range for a user
total requests within a date range for a user

up until now, the use case was limited to only top requests from any user in a given time range and since the data in the above table could be very large we pre-aggregated the data in a separate table day-wise per user

but now we have to aggregate based on cities, countries, sources, and devices. Creating separate tables for each field doesn't seem like the best possible solution.

Please let us know if there are any easier and more elegant solutions to our problem.

Also, we exploring the NoSQL database to store data as these fields might increase in the future but the data aggregation part is something we want to figure out first.

Thank you

CodePudding user response：

I'd split the data in a request table, a client table and a client stats table. NoSQL would be good for storing requests or raw data assuming you only need the data for analysis.

Storing the 'logs' in a seperate database from the clients and statistics isn't a bad idea anyway. You can optimize the logs database for write performance and the other one for read performance. To deal with data volume in the logs database is very easy because it's physically ordered by datetime, so you can easily partition and optimize for queries on the recent data only.

I've worked with SQL databases like this that hold many TB's but complete aggregate functions in seconds over the last few GB while a sum of an indexed column but without a where taking hours. NoSQL is harder to optimize but can be a less complex solution.

CodePudding user response：

this is a good use case for columnar storage solution. If your postgres instance is self hosted you can use a foreign data wrapper like cdw_store which stores your data in columnar format making your queries faster.

You can also look into solutions like Snowflake if data volumes are huge and you're moving into unstructured data territory