Home > Software design >  SQL: how can I exclude certain lines from an aggregated result?
SQL: how can I exclude certain lines from an aggregated result?

Time:06-16

In the query I built, the result shows something like below:

SELECT name
      ,ARRAY_AGG(fruits ORDER BY time ASC) AS all_fruits
FROM table_fruits
name all_fruits
Person A Apple, Banana, Apple, Apple, Apple, Apple
Person B Apple, Apple, Apple, Banana, Apple, Banana
Person C Banana, Banana, Apple, Banana, Apple, Apple

I want to add one more column which shows the count of apples. However, I do not want to count apples that are followed by bananas. Therefore, the additional column should look like below.

name all_fruits count_of_apple
Person A Apple, Banana, Apple, Apple, Apple, Apple 4
Person B Apple, Apple, Apple, Banana, Apple, Banana 2
Person C Banana, Banana, Apple, Banana, Apple, Apple 2

How would I do this in SQL? The source includes time for when the fruit was eaten.

CodePudding user response:

You can check:

  • for each row you have in your parent table, what "fruits" value follows that row with the LEAD window function
  • if the row in check is the last one, it won't have a next value (it will be NULL), so the COALESCE function will replace this NULL value with the current "fruits" value
  • hence you can assign 1 to your new column when the current "fruits" value is "Apple" and your next value is not "Banana", inside a CASE statement
SELECT *,
       CASE WHEN fruits = 'Apple'
             AND COALESCE(LEAD(fruits) OVER(
                              PARTITION BY name 
                              ORDER     BY time), 
                          fruits)                  <> 'Banana'
            THEN 1 
       END AS apples_not_after_bananas
FROM table_fruits

After this step, you can use your own code and add

  • the GROUP BY clause you missed, to aggregate over the "name" field
  • the SUM aggregation function over the previously generated 1s when apples were not followed by bananas.
WITH cte AS (
    SELECT *,
           CASE WHEN fruits = 'Apple'
                 AND COALESCE(LEAD(fruits) OVER(
                                  PARTITION BY name 
                                  ORDER     BY time), 
                              fruits)                  <> 'Banana'
                THEN 1 
           END AS apples_not_after_bananas
    FROM table_fruits
)
SELECT name,
       ARRAY_AGG(fruits ORDER BY time ASC) AS all_fruits,
       SUM(apples_not_after_bananas)       AS count_of_apple
FROM cte
GROUP BY name

Check the demo here.


Edit: the banana came more than 1 day later

If you want to add this specific condition, or in general any conditions, you need to work inside the CASE statement, which currently has two conditions, one on the current fruit and one on the next fruit.

Checking whether the banana came more than 1 day later just means to add something like this:

           CASE WHEN fruits = 'Apple'
                 AND COALESCE(LEAD(fruits) OVER(
                                  PARTITION BY name 
                                  ORDER     BY time), 
                              fruits)                  <> 'Banana'
               --AND <if difference between the current next time value is greater than 1 day>
                THEN 1 
           END AS apples_not_after_bananas
  • Related