BIGQUERY SQL code with GROUP BY and HAVING to filter out observations-CodePudding

I'll first use R code to describe the problem and the desired output. My goal is to perform the same operations in BigQuery using SQL.

Suppose I have the following 3 tables:

library(data.table)
library(tidyverse)
library(dplyr)
student <- data.table(student_id = c(1, 3, 5, 7), age = c(21, 22, 21, 18))
class <- data.table(student_id = c(1, 1, 1, 3, 3, 5, 5, 5, 5, 5, 7),
                    class_id = c(100, 100, 101, 101, 120, 40, 30, 30, 40, 30, 90))
subject <- data.table(class_id = c(100, 101, 120, 40, 50, 30, 90),
                      subject_name = c("Math", "English", "Latin", "Dance", "Health",
                                       "Science", "History"))

student

> student
   student_id age
1:          1  21
2:          3  22
3:          5  21
4:          7  18

class

> class
    student_id class_id
 1:          1      100
 2:          1      100
 3:          1      101
 4:          3      101
 5:          3      120
 6:          5       40
 7:          5       30
 8:          5       30
 9:          5       40
10:          5       30
11:          7       90

subject

> subject
   class_id subject_name
1:      100         Math
2:      101      English
3:      120        Latin
4:       40        Dance
5:       50       Health
6:       30      Science
7:       90      History

I would like to merge these tables into a single table and then only keep the entries where a student is in the same class at least twice.

Here is the merged table:

student_class <- class[student, on = "student_id"]
student_class_subject <- subject[student_class, on = "class_id"]
> student_class_subject
    class_id subject_name student_id age
 1:      100         Math          1  21
 2:      100         Math          1  21
 3:      101      English          1  21
 4:      101      English          3  22
 5:      120        Latin          3  22
 6:       40        Dance          5  21
 7:       30      Science          5  21
 8:       30      Science          5  21
 9:       40        Dance          5  21
10:       30      Science          5  21
11:       90      History          7  18

Then I would like to keep the entries where a student is in the same class (i.e., same class_id at least twice). For example, student_id = 1 is in Math class twice but English once. So I will only keep the Math entry for student_id = 1. The desired output is this:

> student_class_subject %>% group_by(student_id, class_id) %>% filter(n() >= 2) %>% distinct()
  class_id subject_name student_id   age
     <dbl> <chr>             <dbl> <dbl>
1      100 Math                  1    21
2       40 Dance                 5    21
3       30 Science               5    21

Now I want to do the exact same thing that I did using the above R code but using BigQuery SQL code. Here's my attempt:

SELECT 
    student_id,
    age,
    class_id,
    subject_name
FROM
    student.student
JOIN
    class.class
ON
    student.student_id = class.student_id
JOIN
    subject.subject
ON
    class.class_id = subject.class_id
GROUP BY
    student.student_id, class.class_id
HAVING
    COUNT(class.class_id >= 2)

However, I wasn't able to get the desired output.

CodePudding user response：

You can join all tables, then extract only those records that have duplicates in the "class" table. You can get that using the ROW_NUMBER window function by partitioning on <student_id, class_id>. If this ranking value ever gets at least the value of 2, it will mean that you have a duplicate among those couples, hence record needed in your output.

SELECT subject.*,
       student.*
FROM       class     
INNER JOIN student ON class.student_id = student.student_id
INNER JOIN subject ON class.class_id = subject.class_id
QUALIFY ROW_NUMBER() OVER(PARTITION BY class.student_id, class.class_id 
                          ORDER     BY class.class_id                  ) = 2