Replicate a table for each group in SQL Server, like R

Time:05-21

I am doing this operation in R:

library(data.table)
library(dplyr)


iris_groups <- iris %>% 
    select(Species) %>% 
    unique %>% 
    group_by(Species) %>% 
    group_split()

col_to_replicate <- mtcars %>% select(carb)

binder_fun <- function(x){
  cbind(x,col_to_replicate) %>% as.data.table
}

populated_data <- iris_groups %>% lapply(binder_fun) %>% rbindlist

As the example shows, I want to repeat a column (the carb column of mtcars here) once for each value in the left table (the Species values of iris here).

A clearer explanation for those who don't know R:

The first table is:

|Species    |
|:----------|
|setosa     |
|versicolor |
|virginica  |

The second table is:

| carb|
|----:|
|    4|
|    4|
|    1|
|    1|

If I duplicate the second table for each row of the first table, the result should look like this:

|Species    | carb|
|:----------|----:|
|setosa     |    4|
|setosa     |    4|
|setosa     |    1|
|setosa     |    1|
|versicolor |    4|
|versicolor |    4|
|versicolor |    1|
|versicolor |    1|
|virginica  |    4|
|virginica  |    4|
|virginica  |    1|
|virginica  |    1|

I am new to SQL and don't know how to do this.

Thanks in advance.

CodePudding user response:

First, I would suggest looking into the dbplyr package, which is meant to help with this kind of translation. If you connect to the database from R, you don't necessarily need to see the SQL at all, and show_query() will print the generated SQL if you want it. That said, dbplyr mainly translates tidyverse code, so you would need to change your R code to get it to work.

As @Lamu mentioned, you are looking for a CROSS JOIN. R does some automatic replication under the hood when your columns have different lengths, but in SQL you have to be more explicit. Assuming the tables already exist in the database:

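-- one row per distinct species, paired with every carb value
SELECT s.Species, m.carb
FROM (SELECT DISTINCT Species FROM iris) AS s
CROSS JOIN (
 SELECT carb FROM mtcars
) AS m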
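To make the semantics concrete, here is a minimal sketch in plain Python of what a cross join computes. The two lists are hard-coded from the question's example tables (an assumption for illustration, not read from a database): every value on the left is paired with every value on the right, so the result has 3 * 4 = 12 rows.

```python
from itertools import product

# Example values copied from the question (hard-coded for illustration).
species = ["setosa", "versicolor", "virginica"]  # distinct Species values
carb = [4, 4, 1, 1]                              # first four carb values

# Pair every species with every carb value, as a CROSS JOIN would.
rows = list(product(species, carb))

print(len(rows))  # number of species times number of carb values
for row in rows[:4]:
    print(row)
```

The row count is always the product of the two input sizes, which is why this operation becomes expensive on large tables.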

SQL dialects differ slightly depending on the database or query engine you use. This operation is also called a Cartesian join. Be aware that on large tables it can be very time-consuming, because the result has one row for every pair of input rows. SQL also lets you use constants in a query, so you can check the syntax with the following query, which should return the example result from above:

SELECT  
    *
FROM (
    VALUES
        ('setosa'),
        ('versicolor'),
        ('virginica')
) AS iris (Species)
CROSS JOIN (
    SELECT  
    *
    FROM (
        VALUES
            (4),
            (4),
            (1),
            (1)
    ) AS mtcars (carb)
) AS m
ORDER BY Species
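If no SQL Server instance is at hand, the general shape of the cross join can also be tried against an in-memory SQLite database from Python. This is only a sketch under stated assumptions: it uses SQLite's dialect rather than T-SQL, and the table contents are small made-up samples mirroring the question (with one duplicate species row to show why DISTINCT matters).

```python
import sqlite3

# In-memory SQLite database; a stand-in for the real server, not SQL Server itself.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE iris (Species TEXT)")
con.execute("CREATE TABLE mtcars (carb INTEGER)")

# Sample rows mirroring the question; 'setosa' appears twice on purpose.
con.executemany(
    "INSERT INTO iris VALUES (?)",
    [("setosa",), ("setosa",), ("versicolor",), ("virginica",)],
)
con.executemany("INSERT INTO mtcars VALUES (?)", [(4,), (4,), (1,), (1,)])

query = """
SELECT s.Species, m.carb
FROM (SELECT DISTINCT Species FROM iris) AS s
CROSS JOIN (SELECT carb FROM mtcars) AS m
ORDER BY s.Species
"""
result = con.execute(query).fetchall()
con.close()

print(len(result))  # 3 distinct species times 4 carb rows
```

Without an explicit ORDER BY on carb, the order of rows within each species is not guaranteed, so only the species ordering should be relied on here.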

CodePudding user response:

For the sake of completeness and because the question is tagged data.table: The data.table package has a convenient cross join function CJ().

data.table::CJ(Species = unique(iris$Species), carb = head(mtcars$carb, 4L))
       Species  carb
 1:     setosa     1
 2:     setosa     1
 3:     setosa     4
 4:     setosa     4
 5: versicolor     1
 6: versicolor     1
 7: versicolor     4
 8: versicolor     4
 9:  virginica     1
10:  virginica     1
11:  virginica     4
12:  virginica     4

Note that we start right away with the original datasets iris and mtcars. All intermediate steps to extract a vector of unique Species as well as the first 4 elements of carb are included in the code above.


As the OP asked for the equivalent SQL statement:

sqldf::sqldf("
SELECT Species, carb FROM (SELECT DISTINCT Species FROM iris)
  CROSS JOIN (SELECT carb FROM mtcars LIMIT 4)
")

which returns the same rows, although the row order may differ because CJ() sorts its result by default.

Again, we have started right away with the original datasets and use 2 subqueries for the intermediate steps.

Note that the LIMIT clause depends on the SQL dialect; SQL Server, for example, uses SELECT TOP (4) instead.
