I used to use dplyr everywhere, but I have also included some plyr functions. To be honest, I do not know what the difference between the two is or why things have changed. The same code produces different data frames depending on whether I have loaded plyr or tidyverse. What I want is a data frame named newborn_stat with one row per unique pid and a new c_pos column.
raw_file_contents <- data.frame(pid = c(1, 2, 2, 3, 3), C_SYMP = c("Y", "N", "Y", "N", "N"))
newborn_stat <- raw_file_contents %>%
  group_by(pid) %>%
  summarise(c_pos = any(C_SYMP == "Y", na.rm = TRUE))
Instead, I get a data frame, newborn_stat, with a single row in which c_pos equals TRUE. If I prefix group_by and summarise with dplyr::, I think I get the right answer. Why does this happen? I have been working in an Rmd notebook, so when I tried to rerun the earlier chunk containing this code, things didn't work.
CodePudding user response:
So I assumed this was due to the functions of dplyr and plyr working differently. That assumption was mostly correct: both dplyr and plyr have a summarize function (also spelled summarise), while dplyr has group_by but plyr doesn't. If you load plyr in a later chunk and then rerun the expression shown in the question, summarize resolves to the one from the plyr namespace, because the package attached last masks same-named functions from packages attached earlier. So that expression is actually running dplyr::group_by followed by plyr::summarize, and plyr::summarize ignores the grouping, which is why only one row comes back.
I wish they had used a different name for summarize in plyr, or kept the function names consistent throughout, if that makes sense.
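One way to confirm which package a bare summarise call resolves to is to ask R directly. The following is a minimal sketch, assuming both plyr and dplyr are installed and that plyr is attached after dplyr, as described above. The variable names providers and owner are only illustrative.

```r
# Assumption: plyr is attached after dplyr, reproducing the situation in the question.
suppressPackageStartupMessages({
  library(dplyr)
  library(plyr)
})

# All attached packages that provide an object named "summarise",
# listed in search-path order; a bare call uses the first one.
providers <- find("summarise")
print(providers)

# Name of the namespace that the bare function actually belongs to.
owner <- environmentName(environment(summarise))
print(owner)
```

If owner is "plyr", any unqualified summarise in later chunks is the plyr version, regardless of whether group_by came from dplyr.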
CodePudding user response:
Exactly as you explain in your answer, the issue is the difference in behavior between plyr::summarize() and dplyr::summarize(). To make this clear, I've put the results of your example here.
library(plyr)
library(dplyr)

data.frame(pid = c(1, 2, 2, 3, 3),
           C_SYMP = c("Y", "N", "Y", "N", "N")) %>%
  dplyr::group_by(pid) %>%
  plyr::summarise(c_pos = any(C_SYMP == "Y", na.rm = TRUE))
#>   c_pos
#> 1  TRUE

data.frame(pid = c(1, 2, 2, 3, 3),
           C_SYMP = c("Y", "N", "Y", "N", "N")) %>%
  dplyr::group_by(pid) %>%
  dplyr::summarise(c_pos = any(C_SYMP == "Y", na.rm = TRUE))
#> # A tibble: 3 x 2
#>     pid c_pos
#>   <dbl> <lgl>
#> 1     1 TRUE
#> 2     2 TRUE
#> 3     3 FALSE
Created on 2022-02-11 by the reprex package (v2.0.1)
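As a possible way to make the notebook robust to this, here is a hedged sketch that combines two common precautions: attaching plyr before dplyr so that dplyr's verbs win, and qualifying the verbs with dplyr:: so the result does not depend on load order at all. It reuses the data from the question and assumes both packages are installed.

```r
# Attach plyr first so that dplyr's functions mask plyr's, not the reverse.
suppressPackageStartupMessages({
  library(plyr)
  library(dplyr)
})

raw_file_contents <- data.frame(
  pid    = c(1, 2, 2, 3, 3),
  C_SYMP = c("Y", "N", "Y", "N", "N")
)

# Explicit dplyr:: prefixes keep this correct even if another chunk
# attaches plyr later in the same session.
newborn_stat <- raw_file_contents %>%
  dplyr::group_by(pid) %>%
  dplyr::summarise(c_pos = any(C_SYMP == "Y", na.rm = TRUE))

print(newborn_stat)
```

In an Rmd notebook all chunks share one session, so the explicit prefixes are the safer of the two precautions when chunks may be rerun out of order.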