Find the most frequent value per group in a table column-CodePudding

I need to find most frequent value of object_of_search for each ethnicity. How can I achieve this? Subqueries in the SELECT clause and correlated subqueries are not allowed. Something similar to this:

mode() WITHIN GROUP (ORDER BY stopAndSearches.object_of_search) AS "Most frequent object of search"

But this does not aggregate and gives me many rows for each ethnicity and object_of_search:

 officer_defined_ethnicity | Sas for ethnicity |   Arrest rate    | Most frequent object of search
--------------------------- ------------------- ------------------ --------------------------------
 ethnicity2                |                 3 | 66.6666666666667 | Stolen goods
 ethnicity3                |                 2 |              100 | Fireworks
 ethnicity1                |                 5 |               60 | Firearms
 ethnicity3                |                 2 |              100 | Firearms
 ethnicity1                |                 5 |               60 | Cat
 ethnicity1                |                 5 |               60 | Dog
 ethnicity2                |                 3 | 66.6666666666667 | Firearms
 ethnicity1                |                 5 |               60 | Psychoactive substances
 ethnicity1                |                 5 |               60 | Fireworks

And should be something like this:

 officer_defined_ethnicity | Sas for ethnicity |   Arrest rate    | Most frequent object of search
--------------------------- ------------------- ------------------ --------------------------------
 ethnicity2                |                 3 | 66.6666666666667 | Stolen goods
 ethnicity3                |                 2 |              100 | Fireworks
 ethnicity1                |                 5 |               60 | Firearms

Table on fiddle.
Query:

SELECT DISTINCT
    stopAndSearches.officer_defined_ethnicity,
    count(stopAndSearches.sas_id) OVER(PARTITION BY stopAndSearches.officer_defined_ethnicity) AS "Sas for ethnicity",
    sum(case when stopAndSearches.outcome = 'Arrest' then 1 else 0 end)
       OVER (PARTITION BY stopAndSearches.officer_defined_ethnicity)::float /
       count(stopAndSearches.sas_id) OVER(PARTITION BY stopAndSearches.officer_defined_ethnicity)::float * 100 AS "Arrest rate",
    mode() WITHIN GROUP (ORDER BY stopAndSearches.object_of_search) AS "Most frequent object of search"
FROM stopAndSearches
GROUP BY stopAndSearches.sas_id, stopAndSearches.officer_defined_ethnicity;

Table:

CREATE TABLE IF NOT EXISTS stopAndSearches(
    "sas_id" bigserial PRIMARY KEY,
    "officer_defined_ethnicity" VARCHAR(255),
    "object_of_search" VARCHAR(255),
    "outcome" VARCHAR(255)
);

CodePudding user response：

Updated: Fiddle

This should address the specific "which object per ethnicity" question.

Note, this doesn't address ties in the count. That wasn't part of the question / request.

Adjust your SQL to include this logic, to provide that detail:

WITH cte AS (
        SELECT officer_defined_ethnicity
             , object_of_search
             , COUNT(*) AS n
             , ROW_NUMBER() OVER (PARTITION BY officer_defined_ethnicity ORDER BY COUNT(*) DESC) AS rn
          FROM stopAndSearches
         GROUP BY officer_defined_ethnicity, object_of_search
     )
SELECT * FROM cte
 WHERE rn = 1
;

Result:

officer_defined_ethnicity	object_of_search	n	rn
ethnicity1	Cat	1	1
ethnicity2	Stolen goods	2	1
ethnicity3	Fireworks	1	1

CodePudding user response：

SELECT DISTINCT ON (1)
       officer_defined_ethnicity, object_of_search, count(*) AS ct
FROM   stop_and_searches
GROUP  BY 1, 2
ORDER  BY 1, 3 DESC, 2;

Or more explicitly:

SELECT DISTINCT ON (officer_defined_ethnicity)
       officer_defined_ethnicity, object_of_search, count(*) AS ct
FROM   stop_and_searches
GROUP  BY officer_defined_ethnicity, object_of_search
ORDER  BY officer_defined_ethnicity, ct DESC, object_of_search;

 officer_defined_ethnicity | object_of_search | ct
--------------------------- ------------------ ----
 ethnicity1                | Cat              | 1
 ethnicity2                | Stolen goods     | 2
 ethnicity3                | Firearms         | 1

db<>fiddle here

Since DISTINCT ON is applied after GROUP BY we only need a single query level.

Aggregate to get counts per (officer_defined_ethnicity, object_of_search) with GROUP BY.
Pick the row with the highest count per officer_defined_ethnicity with DISTINCT ON.

I added object_of_search as third ORDER BY item to act as tiebreaker and produce a deterministic result:
In case of ties, pick the earliest object_of_search according to alphabetical sort order.
Adapt to your needs.

See:

Simpler and typically faster than a subquery with row_number():

Select first row in each GROUP BY group? - Benchmarks