Home > Net >  BIGQUERY SQL code with GROUP BY and HAVING to filter out observations
BIGQUERY SQL code with GROUP BY and HAVING to filter out observations

Time:01-19

I'll first use R code to describe the problem and the desired output. My goal is to perform the same operations in BigQuery using SQL.

Suppose I have the following 3 tables:

library(data.table)
library(tidyverse)
library(dplyr)
student <- data.table(student_id = c(1, 3, 5, 7), age = c(21, 22, 21, 18))
class <- data.table(student_id = c(1, 1, 1, 3, 3, 5, 5, 5, 5, 5, 7),
                    class_id = c(100, 100, 101, 101, 120, 40, 30, 30, 40, 30, 90))
subject <- data.table(class_id = c(100, 101, 120, 40, 50, 30, 90),
                      subject_name = c("Math", "English", "Latin", "Dance", "Health",
                                       "Science", "History"))

student

> student
   student_id age
1:          1  21
2:          3  22
3:          5  21
4:          7  18

class

> class
    student_id class_id
 1:          1      100
 2:          1      100
 3:          1      101
 4:          3      101
 5:          3      120
 6:          5       40
 7:          5       30
 8:          5       30
 9:          5       40
10:          5       30
11:          7       90

subject

> subject
   class_id subject_name
1:      100         Math
2:      101      English
3:      120        Latin
4:       40        Dance
5:       50       Health
6:       30      Science
7:       90      History

I would like to merge these tables into a single table and then only keep the entries where a student is in the same class at least twice.

Here is the merged table:

student_class <- class[student, on = "student_id"]
student_class_subject <- subject[student_class, on = "class_id"]
> student_class_subject
    class_id subject_name student_id age
 1:      100         Math          1  21
 2:      100         Math          1  21
 3:      101      English          1  21
 4:      101      English          3  22
 5:      120        Latin          3  22
 6:       40        Dance          5  21
 7:       30      Science          5  21
 8:       30      Science          5  21
 9:       40        Dance          5  21
10:       30      Science          5  21
11:       90      History          7  18

Then I would like to keep the entries where a student is in the same class (i.e., same class_id at least twice). For example, student_id = 1 is in Math class twice but English once. So I will only keep the Math entry for student_id = 1. The desired output is this:

> student_class_subject %>% group_by(student_id, class_id) %>% filter(n() >= 2) %>% distinct()
  class_id subject_name student_id   age
     <dbl> <chr>             <dbl> <dbl>
1      100 Math                  1    21
2       40 Dance                 5    21
3       30 Science               5    21

Now I want to do the exact same thing that I did using the above R code but using BigQuery SQL code. Here's my attempt:

SELECT 
    student_id,
    age,
    class_id,
    subject_name
FROM
    student.student
JOIN
    class.class
ON
    student.student_id = class.student_id
JOIN
    subject.subject
ON
    class.class_id = subject.class_id
GROUP BY
    student.student_id, class.class_id
HAVING
    COUNT(class.class_id >= 2)

However, I wasn't able to get the desired output.

CodePudding user response:

You can join all tables, then extract only those records that have duplicates in the "class" table. You can get that using the ROW_NUMBER window function by partitioning on <student_id, class_id>. If this ranking value ever gets at least the value of 2, it will mean that you have a duplicate among those couples, hence record needed in your output.

SELECT subject.*,
       student.*
FROM       class     
INNER JOIN student ON class.student_id = student.student_id
INNER JOIN subject ON class.class_id = subject.class_id
QUALIFY ROW_NUMBER() OVER(PARTITION BY class.student_id, class.class_id 
                          ORDER     BY class.class_id                  ) = 2
  • Related