tidyverse, plyr and dplyr

Time:02-12

I used to use dplyr everywhere, but have since included some plyr functions. To be honest, I do not know what the difference is or why things have changed. The same code produces different data frames depending on whether I have loaded plyr or tidyverse. What I want is a data frame named newborn_stat with one row per unique pid and a new c_pos column.

raw_file_contents <- data.frame(pid = c(1, 2, 2, 3, 3),
                                C_SYMP = c("Y", "N", "Y", "N", "N"))
newborn_stat <- raw_file_contents %>%
  group_by(pid) %>%
  summarise(c_pos = any(C_SYMP == "Y", na.rm = TRUE))

Instead I get a data frame, newborn_stat, with 1 row and c_pos equal to TRUE. If I prefix group_by and summarise with dplyr::, I think I get the right answer. Why does this happen? I have been using an Rmd notebook, so when I tried to rerun the earlier chunk containing this line, things didn't work.

CodePudding user response:

So I assumed this was due to the functions of dplyr and plyr working differently. That assumption was mostly correct: both dplyr and plyr have a summarize function, while dplyr has group_by but plyr doesn't. If you load plyr in a later chunk and then rerun the expression shown in the question, summarize resolves to the one in the plyr namespace, so you are really running dplyr::group_by followed by plyr::summarize in that expression. plyr::summarize ignores the grouping set by group_by, which is why the result collapses to a single row.
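A quick way to confirm this kind of masking in a notebook session is to ask R which package a bare function name currently resolves to. This is only a minimal sketch: it assumes both packages are installed and that plyr was attached after dplyr, as in the situation described above. The variable names summarise_pkg and group_by_pkg are made up for illustration.

```r
# Attach dplyr first and plyr second, mimicking a later notebook chunk
# that loads plyr (assumed load order, matching the question's symptoms).
suppressPackageStartupMessages({
  library(dplyr)
  library(plyr)
})

# The namespace a function was defined in tells us which package
# the bare name resolves to on the current search path.
summarise_pkg <- environmentName(environment(summarise))
group_by_pkg  <- environmentName(environment(group_by))

print(summarise_pkg)  # "plyr": the later-attached package masks dplyr's version
print(group_by_pkg)   # "dplyr": plyr has no group_by, so nothing masks it
```

Calling conflicts(detail = TRUE) gives a broader listing of every name that is defined in more than one attached package.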

I wish plyr used a different name for summarize, or that the function names were kept consistent throughout, if that makes sense.
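Since the names cannot be changed, one possible workaround is to pin the bare name to the dplyr version in the global environment, which R searches before any attached package. The sketch below is one option and not the only fix: qualifying each call with dplyr:: or attaching plyr before dplyr should work as well. It assumes both packages are installed and reuses the data from the question.

```r
suppressPackageStartupMessages({
  library(dplyr)
  library(plyr)   # attached last, so plyr::summarise would normally win
})

# Pin the bare name to dplyr's implementation. A binding in the global
# environment takes precedence over functions from attached packages.
summarise <- dplyr::summarise

raw_file_contents <- data.frame(pid = c(1, 2, 2, 3, 3),
                                C_SYMP = c("Y", "N", "Y", "N", "N"))

# The unqualified summarise() now respects the grouping from group_by().
newborn_stat <- raw_file_contents %>%
  group_by(pid) %>%
  summarise(c_pos = any(C_SYMP == "Y", na.rm = TRUE))

print(nrow(newborn_stat))  # 3, one row per pid
```

The pin only lasts for the current session, so in an Rmd notebook it would need to sit in a chunk that runs after every library() call.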

CodePudding user response:

Exactly as you explain in your answer, the issue is the difference in the behavior of plyr::summarize() and dplyr::summarize(). To make this clear I've just put the results of your example here.

library(plyr)
library(dplyr)

data.frame(pid = c(1, 2, 2, 3, 3),
           C_SYMP = c("Y", "N", "Y", "N", "N")) %>% 
  dplyr::group_by(pid) %>%
  plyr::summarise(c_pos = any(C_SYMP == "Y", na.rm = TRUE))
#>   c_pos
#> 1  TRUE

data.frame(pid = c(1, 2, 2, 3, 3),
           C_SYMP = c("Y", "N", "Y", "N", "N")) %>% 
  dplyr::group_by(pid) %>%
  dplyr::summarise(c_pos = any(C_SYMP == "Y", na.rm = TRUE))
#> # A tibble: 3 x 2
#>     pid c_pos
#>   <dbl> <lgl>
#> 1     1 TRUE 
#> 2     2 TRUE 
#> 3     3 FALSE

Created on 2022-02-11 by the reprex package (v2.0.1)
