So I've been building a crypto analytics tool that requires ingesting a few billion rows of data across a handful of tables: sales, listings, NFTs, metadata, contract info, etc. I run machine learning on that data, outputting quite a bit of data for every contract every hour into another table. The data is extremely relational, and the end goal is to output large aggregates based on complex queries with many joins (think: take every asset in the 500 projects with the most sales in the last 3 days, join to their metadata attributes, join those attributes to their most recent ML results, then slice that data 20 different ways).
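To make the query shape concrete, here's a rough sketch of one of those aggregates (table and column names are simplified placeholders, not my real schema):

```sql
-- Illustrative only: placeholder tables/columns standing in for my schema.
WITH top_projects AS (
    SELECT contract_id
    FROM sales
    WHERE sold_at >= now() - interval '3 days'
    GROUP BY contract_id
    ORDER BY count(*) DESC
    LIMIT 500
)
SELECT a.contract_id,
       m.trait_type,
       m.trait_value,
       avg(latest_ml.predicted_price) AS avg_prediction,
       count(*)                       AS asset_count
FROM assets a
JOIN top_projects tp       ON tp.contract_id = a.contract_id
JOIN metadata_attributes m ON m.asset_id = a.asset_id
JOIN LATERAL (
    -- most recent ML output for the asset's contract
    SELECT r.predicted_price
    FROM ml_results r
    WHERE r.contract_id = a.contract_id
    ORDER BY r.run_at DESC
    LIMIT 1
) latest_ml ON true
GROUP BY a.contract_id, m.trait_type, m.trait_value;
```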
I currently have an 8 vCPU PostgreSQL RDS instance with 32 GB of RAM. With 5-6 materialized views on it, it runs at 75% average CPU utilization and takes 10-15 minutes to refresh some of the views, so anything that then queries those views takes an eternity to respond basically half of the time. I've looked into Cube, Timescale, Redshift, etc. The stream-based databases seem quite difficult to set up for highly relational historical data, and the relational databases don't have incrementally updated materialized views yet.
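To be concrete, each view is a full recompute on a schedule, something like the sketch below (view name is just a placeholder); even with CONCURRENTLY, which needs a unique index on the view but keeps it readable during the refresh, the whole aggregate gets rebuilt every time:

```sql
-- Placeholder view name, not my actual schema.
-- CONCURRENTLY requires a unique index on the materialized view; it keeps the
-- view readable while refreshing, but still recomputes the entire aggregate.
CREATE UNIQUE INDEX IF NOT EXISTS project_hourly_stats_key
    ON project_hourly_stats (contract_id, bucket_hour);

REFRESH MATERIALIZED VIEW CONCURRENTLY project_hourly_stats;
```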
What should I do? I'd be willing to pay up to $1000/mo to solve the problem in a scalable way (maybe 10-100 billion rows across a few tables, queryable through views that don't hold up reads and writes). Hopefully I can raise that number soon, but for now I'm fronting all the costs with no income from this yet.
CodePudding user response:
I'd look hard at Redshift. It's columnar and designed for large analytic queries. Its SQL is based on Postgres, so the porting effort should be manageable. Some aspects of your current solution may not fit well on Redshift, but it should be able to hit the cost and analytic performance goals you are stating.
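As a rough sketch of what a ported fact table might look like (table and column names are guesses based on the question, not a prescription):

```sql
-- Hypothetical schema: distribute the big fact table by contract so joins and
-- per-project aggregates stay node-local, and sort by time so a "last 3 days"
-- filter only scans recent blocks.
CREATE TABLE sales (
    sale_id     BIGINT,
    contract_id VARCHAR(64),
    asset_id    VARCHAR(64),
    price       DECIMAL(38, 18),
    sold_at     TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (contract_id)
SORTKEY (sold_at);
```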
CodePudding user response:
Unfortunately, PostgreSQL lacks the techniques and features needed to get good performance on this kind of problem:
- PostgreSQL does not have "in-memory" tables (which reduce contention by avoiding access to data files and the transaction log)
- PostgreSQL does not have natively compiled (C) stored procedures (which speed up execution of complex logic)
- PostgreSQL does not have "columnar" indexes (which reduce index size and access cost for big tables)
- PostgreSQL is very limited in parallelizing queries (only 4 operations support multithreading, and you have to configure it manually...)
- PostgreSQL does not have data compression for tables or indexes (which reduces access time without needing to decompress the data to read it)
- PostgreSQL does not have indexed views (which stay synchronized with the source data and never need a refresh)
- PostgreSQL does not have a "batch" processing mode that accesses data block by block instead of row by row through indexes
- and finally, PostgreSQL does not have a specialized analytics engine that automatically processes cubes for BI
But these techniques and features do exist in some other RDBMSs, such as Microsoft SQL Server (automatic parallelization of "heavy" queries, columnstore indexes, in-memory tables, natively compiled procedures, indexed views, compression, batch-mode processing, the SSAS engine for BI...).
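For example, a columnstore index and an indexed view look roughly like this in SQL Server (schema, table, and column names are illustrative):

```sql
-- Illustrative schema, not the poster's actual tables.

-- Columnstore: compresses a large fact table and enables batch-mode execution
-- for big scans and aggregates.
CREATE CLUSTERED COLUMNSTORE INDEX ccs_listings ON dbo.listings;
GO

-- Indexed view: materialized by its unique clustered index and kept in sync
-- with the base table automatically, so there is no refresh step at all.
CREATE VIEW dbo.sales_per_contract
WITH SCHEMABINDING
AS
SELECT contract_id,
       COUNT_BIG(*)          AS sale_count,
       SUM(ISNULL(price, 0)) AS total_volume
FROM dbo.sales
GROUP BY contract_id;
GO

CREATE UNIQUE CLUSTERED INDEX ucx_sales_per_contract
    ON dbo.sales_per_contract (contract_id);
```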
Of course you have to pay for a license for all of this, but you would have to pay much more for a supercomputer running PostgreSQL without getting the same performance.
As an example, I have run some performance tests on some very simple queries that demonstrate how slow PostgreSQL is compared to SQL Server:
PostGreSQL vs Microsoft SQL Server – Comparison part 2 : COUNT performances — this paper demonstrates that PostgreSQL is on average 90 times slower than SQL Server, and in the worst case (with columnstore indexes) 1500 times slower...
PostGreSQL vs Microsoft SQL Server – Comparison part 1 : DBA command performances — this paper demonstrates that PostgreSQL is about 20 times slower for basic DBA operations.
Ask yourself why Stack Overflow runs on Microsoft SQL Server and not on PostgreSQL.