dbplyr select not duplicate columns-CodePudding

I am working with the R programming language. I have the following table that is located on a server:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

head(testframe)

  age height height2 gender gender2
1  18   76.1    76.1      M       M
2  19   77.0    77.0      F       F
3  20   78.1    78.1      M       M
4  21   78.2    78.2      M       M
5  22   78.8    78.8      F       F
6  23   79.7    79.7      F       F

My goal is to remove columns from the above tables that are identical, even if they have different names. Thus, the end product would look like this:

  age height gender
1  18   76.1      M
2  19   77.0      F
3  20   78.1      M
4  21   78.2      M
5  22   78.8      F
6  23   79.7      F

My Problem: I want to remove these duplicate columns from the table directly on the server. I have been reading about the "dbplyr" library in R, and I am not sure if this is able to do this. I tried the two following methods:

Method 1:

library(RODBC)
library(sqldf)
library(dplyr)
library(dbplyr)
library(odbc)

con = odbcConnect("some name", uid = "some id", pwd = "abc")

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, testframe)



final_1 = testframe %>%
  dplyr::select(., which(as.list(.) %>%
                           duplicated %>%
                           `!`))

Method 2:

mtcars2 <- tbl(con, "testframe")


final_2 = mtcars2 %>%
    dplyr::select(., which(as.list(.) %>%
                               duplicated %>%
                               `!`))

In the above examples, the data ("testframe") is in my local environment, but I tried to use the copy_con statement in an attempt to see if these statements would work if the data was on a server.

My Question: The code from the above examples seemed to have worked - but I am not sure if the above code can only perform these statements on data in the local environment - or it can also perform these same statements on data located on a server. Can someone please comment on this?

Thanks!

CodePudding user response：

dbplyr works by translating dplyr commands into the corresponding database language, before passing these commands to the database. I strongly suspect it will struggle to translate which(as.list(.) %>% duplicated %>% '!') as you are asking it to translate the meaning of these instructions, not a literal command-for-command translation.

You can test dbplyr translation using simulated database connections. For example:

library(dplyr)
library(dbplyr)

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

remote_table = tbl_lazy(testframe , con = simulate_mssql()) # the simulation

remote_table %>%
  mutate(height3 = height2) %>%
  select(height, height2, height3) %>%
  show_query()

If this produces valid database code then you know that your dplyr commands are working as intended. But if this produces unexpected database commands (or R commands appear in the translation) then dbplyr has not been able to translate your instructions into the database language.

For this type of problem I might take an approach like the following:

all_column_names = colnames(remotetable)
num_cols = length(all_column_names)
is_dupe = rep(0, num_cols)

for(ii in 1:(num_cols - 1)){
  for(jj in 2:num_cols) {
    colA = all_column_names[ii]
    colB = all_column_names[jj]

    this_check = remotetable %>%
      mutate(compare = ifelse(!!sym(colA) == !!sym(colB), 1, 0)) %>%
      ungroup() %>%
      summarise(num = n(), matches = sum(compare)) %>%
      collect()

    if(this_check$num == this_check$matches)
      is_dupe[jj] = 1
  }
}

The idea is to iterate through all pairs of columns, and using basic dplyr commands check if every value is equal. If the number of equal values is equal to the total number of values then match the column as a duplicate. You can then use the output to select the non-duplicate columns.

Notes

!!sym(colA) takes the text stored inside that variable colA and turns it into a column name that dbplyr can handle.
collect() is the command to pull data from the database into local R memory. Only do this once you are confident that the data will fit in local memory.