How can I get the most common value on a SQL array?-CodePudding

I'm working with a food delivery dataset which contains the user_id and states where my clients have a registered address. Let's assume that the states that appers more often are the ones where the client lives and the others are places they visited or lived for a short period of time.

>>> ADDRESSES
  USER_ID  | STATE                            |
_______________________________________________
0   100    |   ["AL", "AL", "AL"]             |
1   101    |   ["LA", "LA", "MI"]             |
2   102    |   ["NY", "NY", "KY", "NY", "NY"] |
3   103    |   ["DE", "DE", "CO", "CO", "DE"] |
4   104    |   ["CA"]                         |
5   105    |   ["FL", "GA", "LA", "LA"]       |
6   106    |   ["ID", "ID", "ID", "DE"]       |
7   107    |   ["TX", "TX", "VT"]             |
8   108    |   ["RI"]                         |
9   109    |   ["TN", "TN", "OK", "VA"]       |

That being said, how can I get the most common state for each client so I can get the result bellow?

>>> ADDRESSES
  USER_ID  | STATE |
____________________
0   100    |   AL  |
1   101    |   LA  |
2   102    |   NY  |
3   103    |   DE  |
4   104    |   CA  |
5   105    |   LA  |
6   106    |   ID  |
7   107    |   TX  |
8   108    |   RI  |
9   109    |   TN  |

CodePudding user response：

Consider (BigQuery) below approach (assuming that state column is array of strings)

select user_id, 
  ( select st from t.state st
    group by st order by count(1) desc 
    limit 1 ) state
from your_table t

if applied to sample data in your question - output is

CodePudding user response：

There are no arrays in standard SQL. There might be such a thing in the DBMS you're using, but not in general.

What I see in

0   100    |   ["AL", "AL", "AL"]             |

is 2 columns: an integer, and a string. I'm guessing that's not really it; I'm guessing the tool you're using represents an "array" in your DBMS using Python syntax: [] to indicate an array, and "" to indicate strings. So an array 3 strings. But that's just a guess because you don't say.

The standard answer to your question is prefixed by don't do that. Since before SQL was invented, First Normal Form (1NF) proscribed repeating groups as a datatype. That is to say, the table is not in 1NF. To answer the question you asked with SQL as she is spoke, you first need to eliminate the repeating groups.

How? For that row, you need something like this:

0   100    |   1  | ["AL"             |
0   100    |   2  | ["AL"             |
0   100    |   3  | ["AL"             |

How you get there, I don't know. It depends on what you actually have, and what your DBMS offers. But, once you're there, the query almost writes itself:

select X.user_id, min(X.state) as state
from (
    select user_id, state, count(state) as N
    from table_1nf
    group by user_id, state ) as X
join 
    select user_id, max(N) as N
    from (
        select user_id, state, count(state) as N
        from table_1nf
       group by user_id, state ) as foo
) as T
on X.user_id = T.user_id and X.N = T.N
group by X.user_id

That query ensures that, in the event of a tie, you get only the first state having the highest count.