What is the best way to store and query a large sparse matrix as a database?


The matrix has 50 million rows and 100k columns. The rows are mostly sparse: each row has fewer than 1k non-zero values. The columns are also very sparse. All values are 32-bit integers. The most common operations on this data are:

  • Get several rows by index.
  • Obtain row indices based on a query: either a simple combination such as col_a > val_a AND col_b > val_b OR col_c > val_c ..., or a more complex condition such as selecting the top 10 rows ranked by col_x + col_y ..., possibly with weights applied, like col_x*x + col_y*y .... We never need a condition that involves more than 10 different columns in a single query.
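
For illustration, here is a rough sketch of how I imagine these operations could look on a plain coordinate-list (row, column, value) table; the table name, column ids, and threshold values below are made up:

-- One table row per non-zero cell of the matrix.
CREATE TABLE matrix_cells (
    row_id integer NOT NULL,   -- 0 .. ~50 million
    col_id integer NOT NULL,   -- 0 .. ~100k
    value  integer NOT NULL,   -- 32-bit value
    PRIMARY KEY (row_id, col_id)
);
CREATE INDEX ON matrix_cells (col_id, value);

-- Operation 1: fetch several rows by index.
SELECT row_id, col_id, value
FROM matrix_cells
WHERE row_id IN (12, 345, 6789);

-- Operation 2 (simple form): indices of rows with col_17 > 100 AND col_42 > 5.
SELECT a.row_id
FROM matrix_cells a
JOIN matrix_cells b USING (row_id)
WHERE a.col_id = 17 AND a.value > 100
  AND b.col_id = 42 AND b.value > 5;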

Our frontend guys want me to put the sparse matrix in a database. I have barely worked with databases, but I am aware that PostgreSQL (though we are not required to use PostgreSQL) does not support more than 1.6k columns, and SQL Server does not support more than 1024 columns. So what is the best practice here? Is there a specialized database that can hold this many columns per table? If not, how should I handle this in a normal SQL/PostgreSQL database?

  • Should I split each row across different tables (1k columns per table, 100 tables)? How efficient would that be when we need many queries like the top N of col_x + col_y ... spanning different tables?
  • Maybe I could store each row as an array in a single column? But then I cannot use sparse columns and would have to rely on server-side compression instead. Again, how would I efficiently query the top N in this case?
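
For the second option, this is roughly what I have in mind in PostgreSQL, with the non-zero column indices and values kept as parallel arrays (all names here are hypothetical):

-- One table row per matrix row, non-zero cells as parallel arrays.
CREATE TABLE matrix_rows (
    row_id  integer PRIMARY KEY,
    col_idx integer[] NOT NULL,  -- indices of the non-zero columns
    col_val integer[] NOT NULL   -- values, aligned with col_idx
);

-- Fetching rows by index is easy:
SELECT col_idx, col_val FROM matrix_rows WHERE row_id IN (12, 345);

-- But even a filter on a single column has to unnest the arrays:
SELECT r.row_id
FROM matrix_rows r,
     unnest(r.col_idx, r.col_val) AS t(c, v)
WHERE t.c = 17 AND t.v > 100;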

My concern is that database indexes won't help much with our queries, except for the simplest ones like col_a > val_a AND .... For the top-N queries, in theory we could cache the top 100 rows for each column combination, but given the number of possible combinations that is not feasible.
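
For concreteness, the weighted top-N query I am worried about would look something like this on the coordinate-list sketch above (the column ids and weights are placeholders):

-- Weighted top 10 over two columns; missing cells count as 0.
SELECT row_id,
       SUM(CASE WHEN col_id = 17 THEN value * 0.7
                WHEN col_id = 42 THEN value * 0.3
                ELSE 0 END) AS score
FROM matrix_cells
WHERE col_id IN (17, 42)
GROUP BY row_id
ORDER BY score DESC
LIMIT 10;

As far as I can tell, an index on (col_id, value) only narrows the scan to those two columns; the aggregation and sort over every candidate row still has to happen.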

The matrix will be read-only most of the time in our application. Does it even make sense to use a database to hold it?

Update

After searching for "wide data" and "wide columns", I see that Cassandra and HBase can handle millions of columns, but they are NoSQL and probably hard to set up. Spark in theory has no column limit, but it seems efficient only for long and narrow data. So maybe there is a database out there that suits my particular needs.

CodePudding user response:

My research concluded that no database supports 100K columns.

One way to solve this would be to have 100 tables with 1K columns each, all of which are linked together with a common key.
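
A minimal sketch of what that vertical partitioning could look like (the table names and shared key are just illustrative):

-- 100 tables of 1000 columns each, sharing the same row key.
-- Only two tables and a handful of columns are shown.
CREATE TABLE matrix_part_000 (
    row_id integer PRIMARY KEY,
    c0 integer, c1 integer, /* ... */ c999 integer
);
CREATE TABLE matrix_part_001 (
    row_id integer PRIMARY KEY,
    c1000 integer, c1001 integer, /* ... */ c1999 integer
);

-- Any query touching columns from different chunks needs a join per table:
SELECT p0.row_id
FROM matrix_part_000 p0
JOIN matrix_part_001 p1 USING (row_id)
WHERE p0.c1 > 100 AND p1.c1001 > 5;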

That being said... if you do your own research outside Stack Overflow, you will likely find the same opinion I found, and hold myself: this is a really dumb idea.

CodePudding user response:

The question cannot be answered without knowing what the queries will be. It is impossible to pick a good data model without knowing the requirements.

If all you want to do is store and retrieve matrices, you could store them as JSON:

{ "1": { "42": 1234, "1001": 12 },
  "4": { "123": 13 } }

That would be a matrix with values only in rows 1 and 4, and only in a few columns.
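
In PostgreSQL, for example, that JSON could go into a jsonb column; a minimal sketch, with a made-up table name:

-- The whole sparse matrix (or a chunk of it) as a single jsonb value.
CREATE TABLE matrices (
    id   serial PRIMARY KEY,
    data jsonb NOT NULL
);

INSERT INTO matrices (data) VALUES
('{ "1": { "42": 1234, "1001": 12 },
   "4": { "123": 13 } }');

-- Retrieve row 1 of the matrix:
SELECT data -> '1' AS row_1 FROM matrices WHERE id = 1;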
