Will an index become less efficient as cardinality decreases for values rarely used?


Given a SQL Server customer table with millions of rows and an index on a last_name column, let's assume that over time, due to customer requests to be forgotten, we "anonymize" their data by replacing last_name (and of course any other relevant data) with a static value like "******". We do this instead of deleting the rows because there is related data we need to retain for auditing and other justifiable business reasons.
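For concreteness, the anonymization step would look something like this (the table, column, and parameter names here are illustrative, not our actual schema):

```sql
-- Illustrative sketch of the anonymization step; dbo.Customer and
-- @CustomerId are hypothetical names used only for this example.
UPDATE dbo.Customer
SET last_name  = '******',   -- static placeholder value
    first_name = '******'    -- along with any other relevant columns
WHERE customer_id = @CustomerId;
```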

Over time, if a significant percentage of rows ended up anonymized this way, would there be any potential negative impact on overall system performance? Assume that, aside from the edge case where someone actually queries for a last_name starting with an asterisk, the common case is queries that use this index to search for legitimate last names, e.g. where last_name starts with "H".

For instance, could the internal data structures of the index be affected such that this growing set of records sharing the "******" value creates unusually large pages or in-memory structures, causing I/O bottlenecks or other issues in certain situations, such as when the database server is under heavy load?

I get that low-cardinality indexes are not new or uncommon, but if we start with a high-cardinality index and introduce a growing "tumor" of duplicates of a single value, I'm wondering whether this might eventually become a concern.

I'm sure there are other and better ways to solve this, so I'm happy to hear them if you'd rather attack the deeper issue, but I'd still like to understand the potential impact on the index.

CodePudding user response:

B-tree indexes are balanced, and their overall structure (depth) depends only on the number of rows in the table, the length of the keys, and the fill percentage of the pages. Therefore, you won't see structural issues as a column's data distribution changes (assuming you're doing proper index maintenance).
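You can verify this on your own system, since index depth is directly observable. A minimal sketch, assuming the index is named IX_Customer_LastName on dbo.Customer:

```sql
-- Report the B-tree levels and page counts for the last_name index;
-- even with heavy skew, the depth should stay small (typically 3-4 levels).
SELECT i.name AS index_name,
       ips.index_level,
       ips.page_count,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Customer'), NULL, NULL, N'DETAILED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE i.name = N'IX_Customer_LastName'
ORDER BY ips.index_level;
```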

However, this skewed data distribution will cause issues with statistics.

Consider this query: SELECT ... FROM Customer WHERE LastName = @p. There is no single plan that is best for all possible values of @p: some values will return a few rows, others will return millions.
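The skew is visible directly in the statistics histogram the optimizer uses for cardinality estimation. Assuming the statistics object shares the name of the index, IX_Customer_LastName, the anonymized value will show up as a single step with an enormous EQ_ROWS count:

```sql
-- Show the histogram for the last_name statistics; a large block of
-- '******' rows appears as one step with a very high EQ_ROWS value.
DBCC SHOW_STATISTICS ('dbo.Customer', IX_Customer_LastName) WITH HISTOGRAM;
```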

A filtered index, e.g. CREATE INDEX ... ON Customer (LastName) WHERE LastName <> '******', partially addresses this issue. The index will only contain the interesting rows and hence will be smaller. Some query changes may be required to ensure this new index is actually used, for example repeating the filter predicate in the WHERE clause or adding OPTION (RECOMPILE), as in the sketch below.
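A sketch of the filtered index and the two query shapes, again assuming the placeholder value is '******' (index and column names are illustrative):

```sql
-- Filtered index that excludes the anonymized rows entirely.
CREATE NONCLUSTERED INDEX IX_Customer_LastName_Real
ON dbo.Customer (last_name)
WHERE last_name <> '******';

-- Option 1: repeat the filter predicate so the optimizer can match
-- the parameterized query to the filtered index.
SELECT customer_id, last_name
FROM dbo.Customer
WHERE last_name = @p
  AND last_name <> '******';

-- Option 2: recompile per execution so the plan is built for the
-- actual parameter value, at the cost of compiling on every run.
SELECT customer_id, last_name
FROM dbo.Customer
WHERE last_name = @p
OPTION (RECOMPILE);
```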

SQL Server 2022 (unreleased at the time of writing) will introduce "Parameter Sensitive Plan Optimization", which attempts to address this issue as well by caching multiple plans for the same parameterized query.
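Once it ships, PSP optimization is expected to be tied to the database compatibility level and a database-scoped configuration, roughly along these lines (treat the exact knobs as subject to change until release):

```sql
-- Hypothetical sketch: enable the SQL Server 2022 behavior via
-- compatibility level 160 and the database-scoped configuration.
ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 160;
ALTER DATABASE SCOPED CONFIGURATION
    SET PARAMETER_SENSITIVE_PLAN_OPTIMIZATION = ON;
```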
