How can I design a table that stores subdomain metadata to avoid large partitions?-CodePudding

i try to design a table in cassandra, but im getting a lot of large partition messages. Any ideas how i could improve this "design" to prevent overloading and still can use a query like this:

select * from analytics where domain='test' and tld='com'

CREATE TABLE analytics (
    domain text,
    tld text,
    subdomain text,
    a text,
    PRIMARY KEY ((domain, tld), subdomain)
)

Also im loading this table with

update analytics set a='a' where domain='test' and tld='com' and subdomain='b';

Some partitions are over 1million rows

CodePudding user response：

How unique is a? You can include whatever makes the most sense and would give you smaller partitions, then you could create a secondary index on whichever column you leave out of the original PK and need to query. Remember that whatever you include in the PK, you'll need to use when you query records, so only include or add a column that would make sense to include in queries and would give you smaller partitions.

CodePudding user response：

I must be naïve but I'm very surprised to hear that some domains can have a million subdomains. In any case, I suspect that a significant majority of domains would have less than 100 subdomains so for the most part, your current table schema is going to be fine and you just need to deal with the really "large" domains.

This is a common problem for social apps and in Graph Theory it is known as the supernode problem -- a vertex with an incredibly high number of edges. In simpler terms, it's Barack Obama (the vertex or node) with over 133M followers (edges) on Twitter, or Cristiano Ronaldo with over 506M followers on Instagram.

For apps that run into the supernode problem, they typically work around it by handling the supernodes separately from the rest. In your case, you need to implement some logic in your app to detect the "super domains" and store them in a separate table.

A possible table design uses the first 2 characters of the subdomain as a bucket. For example with domain sub.domainsr.us, we use the prefix su for bucketing to make the partitions smaller:

CREATE TABLE subdomains_by_domain_tld_prefix (
    domain text,
    tld text,
    prefix text,
    subdomain text,
    a text,
    PRIMARY KEY ((domain, tld, prefix), subdomain)
)

This is just an example so the prefix doesn't have to be limited to just the first 2 characters. You can adjust it depending on the dataset.

Also if it makes it simpler for your app, you can choose to use this table for all domains. Cheers!