I'll first use R code to describe the problem and the desired output. My goal is to perform the same operations in BigQuery using SQL.
Suppose I have the following 3 tables:
library(data.table)
library(tidyverse)
library(dplyr)
student <- data.table(student_id = c(1, 3, 5, 7), age = c(21, 22, 21, 18))
class <- data.table(student_id = c(1, 1, 1, 3, 3, 5, 5, 5, 5, 5, 7),
class_id = c(100, 100, 101, 101, 120, 40, 30, 30, 40, 30, 90))
subject <- data.table(class_id = c(100, 101, 120, 40, 50, 30, 90),
subject_name = c("Math", "English", "Latin", "Dance", "Health",
"Science", "History"))
student
> student
student_id age
1: 1 21
2: 3 22
3: 5 21
4: 7 18
class
> class
student_id class_id
1: 1 100
2: 1 100
3: 1 101
4: 3 101
5: 3 120
6: 5 40
7: 5 30
8: 5 30
9: 5 40
10: 5 30
11: 7 90
subject
> subject
class_id subject_name
1: 100 Math
2: 101 English
3: 120 Latin
4: 40 Dance
5: 50 Health
6: 30 Science
7: 90 History
I would like to merge these tables into a single table and then only keep the entries where a student is in the same class at least twice.
Here is the merged table:
student_class <- class[student, on = "student_id"]
student_class_subject <- subject[student_class, on = "class_id"]
> student_class_subject
class_id subject_name student_id age
1: 100 Math 1 21
2: 100 Math 1 21
3: 101 English 1 21
4: 101 English 3 22
5: 120 Latin 3 22
6: 40 Dance 5 21
7: 30 Science 5 21
8: 30 Science 5 21
9: 40 Dance 5 21
10: 30 Science 5 21
11: 90 History 7 18
Then I would like to keep the entries where a student is in the same class (i.e., same class_id
at least twice). For example, student_id = 1
is in Math
class twice but English
once. So I will only keep the Math
entry for student_id = 1
. The desired output is this:
> student_class_subject %>% group_by(student_id, class_id) %>% filter(n() >= 2) %>% distinct()
class_id subject_name student_id age
<dbl> <chr> <dbl> <dbl>
1 100 Math 1 21
2 40 Dance 5 21
3 30 Science 5 21
Now I want to do the exact same thing that I did using the above R code but using BigQuery SQL code. Here's my attempt:
SELECT
student_id,
age,
class_id,
subject_name
FROM
student.student
JOIN
class.class
ON
student.student_id = class.student_id
JOIN
subject.subject
ON
class.class_id = subject.class_id
GROUP BY
student.student_id, class.class_id
HAVING
COUNT(class.class_id >= 2)
However, I wasn't able to get the desired output.
CodePudding user response:
You can join all tables, then extract only those records that have duplicates in the "class" table. You can get that using the ROW_NUMBER
window function by partitioning on <student_id, class_id>. If this ranking value ever gets at least the value of 2, it will mean that you have a duplicate among those couples, hence record needed in your output.
SELECT subject.*,
student.*
FROM class
INNER JOIN student ON class.student_id = student.student_id
INNER JOIN subject ON class.class_id = subject.class_id
QUALIFY ROW_NUMBER() OVER(PARTITION BY class.student_id, class.class_id
ORDER BY class.class_id ) = 2