I am trying to use the dplyr %>% pipeline to round and format numeric values so that they have 2 digits after the decimal point (e.g. 2.43, 1.05). However, the format() function behaves differently in the two examples below (df_summarize and df_groupby). The output of df_summarize is correct, but the output of df_groupby is not. What is the reason for this weird behavior? Is it because of the grouping, or something else?
Any suggestions are appreciated.
library(tidyverse)
df <- data.frame(cate = sample(c("A", "B"), size = 5, replace = TRUE),
                 v1 = runif(5, 1.001, 5.005),
                 v2 = runif(5, 1.001, 5.005))
df_summarize <- df %>%
  summarise(mean_v1 = mean(v1), mean_v2 = mean(v2)) %>%
  round(2) %>%
  format(nsmall = 2)
# mean_v1 mean_v2
# 1 3.94 2.08
df_groupby <- df %>%
  group_by(cate) %>%
  summarise(p1 = mean(v1), p2 = mean(v2)) %>%
  select(-cate) %>%
  ungroup() %>%
  round(2) %>%
  format(nsmall = 2)
# [1] "\033[38;5;246m# A tibble: 2 × 2\033[39m"
# [2] " p1 p2"
# [3] " \033[3m\033[38;5;246m<dbl>\033[39m\033[23m \033[3m\033[38;5;246m<dbl>\033[39m\033[23m"
# [4] "\033[38;5;250m1\033[39m 4.01 2.17"
# [5] "\033[38;5;250m2\033[39m 3.67 1.69"
CodePudding user response:
The reason is that format has a data.frame method that differs from the format.tbl method used for a tibble (format.tbl calls pillar::format_tbl, whose documentation describes 'x' as "Object to format or print"). In the first case, where there is no group_by, the data stays a data.frame, whereas with group_by it becomes a tibble, so format dispatches to the tibble method and returns the printed lines instead of a formatted data frame. Converting back with as.data.frame before format avoids the issue:
library(dplyr)
df %>%
group_by(cate) %>%
summarise(p1 = mean(v1), p2 = mean(v2)) %>%
select(-cate) %>%
round(2) %>%
as.data.frame %>% # add the `as.data.frame`
format(nsmall = 2)
-output
p1 p2
1 3.02 2.86
2 3.20 3.74
In the first case, check the structure with str:
> df %>%
summarise(mean_v1 = mean(v1),
mean_v2 = mean(v2)) %>%
round(2)%>%
str
'data.frame': 1 obs. of 2 variables:
$ mean_v1: num 3.09
$ mean_v2: num 3.21
whereas with group_by the result is a tibble:
> df %>%
group_by(cate) %>%
summarise(p1 = mean(v1), p2 = mean(v2)) %>%
select(-cate) %>%
round(2) %>%
str
tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
$ p1: num [1:2] 3.02 3.2
$ p2: num [1:2] 2.86 3.74
It is also mentioned in the documentation of ?group_by
A grouped data frame with class grouped_df, unless the combination of ... and add yields a empty set of grouping columns, in which case a tibble will be returned.
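The class difference can also be checked directly with class(), together with what format() returns for each object. This is a minimal sketch: the fixed cate column and set.seed(1) are assumptions added only to make it reproducible, and it assumes a tibble/pillar version in which format() on a tibble returns a character vector, as in the question's output.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  cate = c("A", "B", "A", "B", "A"),  # assumption: fixed groups instead of sample()
  v1 = runif(5, 1.001, 5.005),
  v2 = runif(5, 1.001, 5.005)
)

# summarise() on a plain data.frame keeps it a data.frame
ungrouped <- df %>%
  summarise(mean_v1 = mean(v1), mean_v2 = mean(v2)) %>%
  round(2)

# group_by() returns a grouped tibble, so summarise() returns a tibble
grouped <- df %>%
  group_by(cate) %>%
  summarise(p1 = mean(v1), p2 = mean(v2)) %>%
  select(-cate) %>%
  round(2)

class(ungrouped)
class(grouped)

# format() dispatches on the class, so the return types differ
fmt_df  <- format(ungrouped, nsmall = 2)  # data.frame with character columns
fmt_tbl <- format(grouped)                # character vector of printed lines
```

Here is.data.frame(fmt_df) is TRUE while fmt_tbl is a plain character vector, which is what shows up as the `[1] "..."` lines in the question.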
If we want to use format on a tibble, apply it column by column with across:
df %>%
group_by(cate) %>%
summarise(p1 = mean(v1), p2 = mean(v2)) %>%
select(-cate) %>%
round(2) %>%
mutate(across(everything(), ~ format(.x, nsmall = 2))) # lambda form; passing extra arguments through across() is deprecated in dplyr >= 1.1.0
-output
# A tibble: 2 × 2
p1 p2
<chr> <chr>
1 3.02 2.86
2 3.20 3.74
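If the goal is only fixed two-decimal strings, another option (not part of the answer above) is sprintf() inside across(), which rounds and pads each value in one step and keeps the result a tibble. A minimal sketch, using made-up values in place of the summarised means:

```r
library(dplyr)

# hypothetical values standing in for the summarised means
means <- tibble(p1 = c(3.018, 3.2), p2 = c(2.861, 3.74))

# "%.2f" gives exactly two digits after the decimal point for every value
formatted <- means %>%
  mutate(across(everything(), ~ sprintf("%.2f", .x)))

formatted$p1
formatted$p2
```

Unlike format(), sprintf() formats each value on its own and does not pad a column to a common width, so it may suit export better than aligned console display.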
CodePudding user response:
It's because the object passed to format() is a data frame in the case of df_summarize, but a tibble in the case of df_groupby.
df_summarize <- df %>%
summarise(mean_v1 = mean(v1),
mean_v2 = mean(v2)) %>%
round(2)
df_groupby <- df %>% group_by(cate) %>%
summarise(p1 = mean(v1), p2 = mean(v2)) %>%
select(-cate) %>% ungroup() %>%
round(2)
str(df_summarize)
'data.frame': 1 obs. of 2 variables:
$ mean_v1: num 3.34
$ mean_v2: num 2.74
str(df_groupby)
tibble [2 x 2] (S3: tbl_df/tbl/data.frame)
$ p1: num [1:2] 4.35 3.08
$ p2: num [1:2] 4.21 2.37
Using the group_by() function results in a tibble.
Let me convert df_summarize into a tibble and then apply format:
df_summarize <- df %>%
summarise(mean_v1 = mean(v1),
mean_v2 = mean(v2)) %>%
round(2) %>% as_tibble() %>% format(nsmall = 2) # as.tibble() is deprecated in favour of as_tibble()
df_summarize
mean_v1 mean_v2
1 3.34 2.74
Now let me convert df_groupby into a data frame and then apply format:
df_groupby <- df %>% group_by(cate) %>%
summarise(p1 = mean(v1), p2 = mean(v2)) %>%
select(-cate) %>% ungroup() %>%
round(2) %>% as.data.frame() %>% format(nsmall = 2)
p1 p2
1 4.35 4.21
2 3.08 2.37
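The conversion step can be wrapped in a small helper so that the same call works whether the input is a data frame or a tibble. This is only a sketch: format_fixed is a hypothetical name, and the input values are copied from the output above for illustration.

```r
library(dplyr)

# hypothetical helper: accepts a data.frame or a tibble with numeric columns
format_fixed <- function(data, digits = 2) {
  data <- as.data.frame(data)  # drop tibble classes so format.data.frame is used
  format(round(data, digits), nsmall = digits)
}

# illustrative input: a tibble holding the means shown above
out <- format_fixed(tibble(p1 = c(4.35, 3.08), p2 = c(4.21, 2.37)))
out
```

The result is a plain data frame whose columns are character vectors, so it prints the same way as the df_summarize output.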