Why does `format()` behave differently with `dplyr` in R?

Time:11-14

I am trying to use the dplyr %>% pipeline to round and format numeric values so that they have 2 digits after the decimal point (e.g. 2.43, 1.05). However, the format() function behaves differently in the two examples below (df_summarize and df_groupby). The output of df_summarize is correct, but the output of df_groupby is not. What is the reason for this weird behavior? Is it because of the grouping or something else?

Any suggestions are appreciated.

library(tidyverse)

df <- data.frame(cate = sample(c("A", "B"), size = 5, replace = TRUE),
                 v1 = runif(5, 1.001, 5.005), 
                 v2 = runif(5, 1.001, 5.005))

df_summarize <- df %>% summarise(mean_v1 = mean(v1),
                                 mean_v2 = mean(v2)) %>% 
  round(2) %>% format(nsmall = 2)

#   mean_v1 mean_v2
# 1    3.94    2.08

df_groupby <- df %>% group_by(cate) %>% 
  summarise(p1 = mean(v1), p2 = mean(v2)) %>% 
  select(-cate) %>% ungroup() %>% 
  round(2) %>% format(nsmall = 2)

# [1] "\033[38;5;246m# A tibble: 2 × 2\033[39m"                                                
# [2] "     p1    p2"                                                                          
# [3] "  \033[3m\033[38;5;246m<dbl>\033[39m\033[23m \033[3m\033[38;5;246m<dbl>\033[39m\033[23m"
# [4] "\033[38;5;250m1\033[39m  4.01  2.17"                                                    
# [5] "\033[38;5;250m2\033[39m  3.67  1.69" 

CodePudding user response:

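The reason is that format() has a data.frame method (format.data.frame) that differs from the method used for tibbles (format.tbl, which calls pillar::format_tbl, whose documentation describes 'x' as the "Object to format or print"). In the first case there is no group_by, so summarise returns a plain data.frame and format() formats each column. With group_by, the data are converted to a tibble, so format() returns the printed representation of the tibble as lines of text, which causes the issue. Converting back with as.data.frame before calling format() fixes it: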
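The difference in dispatch can be checked directly. This is a minimal sketch, assuming dplyr and tibble are installed; df_plain and df_tbl are made-up toy objects with fixed values, not the question's random data.

```r
library(dplyr)

# Hypothetical toy data with fixed values so the result is reproducible
df_plain <- data.frame(p1 = c(3.02, 3.2), p2 = c(2.86, 3.74))
df_tbl   <- tibble::as_tibble(df_plain)

# format() on a data.frame dispatches to format.data.frame():
# the result is still a data.frame, with every column turned into strings
out_plain <- format(df_plain, nsmall = 2)
class(out_plain)   # "data.frame"
out_plain$p1       # "3.02" "3.20"

# format() on a tibble dispatches to the tibble method:
# the result is a character vector holding the lines that print() would show
out_tbl <- format(df_tbl)
is.character(out_tbl)          # TRUE
inherits(out_tbl, "data.frame") # FALSE
```

So the same call gives a table of formatted values in one case and a vector of display lines in the other, depending only on the class of the input.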

library(dplyr)
df %>%
    group_by(cate) %>% 
    summarise(p1 = mean(v1), p2 = mean(v2)) %>% 
    select(-cate) %>% 
    round(2) %>% 
    as.data.frame %>% # add the `as.data.frame`
    format(nsmall = 2)

-output

    p1   p2
1 3.02 2.86
2 3.20 3.74

In the first case, check the structure with str():

> df %>%
    summarise(mean_v1 = mean(v1),
              mean_v2 = mean(v2)) %>% 
    round(2) %>%
    str
'data.frame':   1 obs. of  2 variables:
 $ mean_v1: num 3.09
 $ mean_v2: num 3.21

whereas with group_by

> df %>%
     group_by(cate) %>% 
     summarise(p1 = mean(v1), p2 = mean(v2)) %>% 
     select(-cate) %>% 
     round(2) %>%
     str
tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
 $ p1: num [1:2] 3.02 3.2
 $ p2: num [1:2] 2.86 3.74

This is also mentioned in the documentation of ?group_by:

A grouped data frame with class grouped_df, unless the combination of ... and add yields a empty set of grouping columns, in which case a tibble will be returned.
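The class at each stage of the pipeline can be inspected to see where the tibble appears. This is a small sketch, assuming dplyr is installed; df_demo is a hypothetical fixed data set used only for illustration.

```r
library(dplyr)

# Hypothetical toy data (fixed values instead of the question's random draws)
df_demo <- data.frame(cate = c("A", "A", "B"),
                      v1   = c(1.5, 2.5, 4),
                      v2   = c(2, 3, 5))

# summarise() on a plain data.frame keeps the data.frame class
cls_plain <- class(df_demo %>% summarise(p1 = mean(v1)))

# group_by() turns the input into a grouped tibble
cls_grouped <- class(df_demo %>% group_by(cate))

# summarise() drops the last grouping level, leaving an ungrouped tibble
cls_summary <- class(df_demo %>% group_by(cate) %>% summarise(p1 = mean(v1)))

cls_plain    # "data.frame"
cls_grouped  # "grouped_df" "tbl_df" "tbl" "data.frame"
cls_summary  # "tbl_df" "tbl" "data.frame"
```

Note that ungroup() removes only the grouping; the result is still a tibble, which is why adding ungroup() in the question did not change what format() returned.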


If we want to keep the tibble, apply format to each column with across instead of to the whole object

df %>%
    group_by(cate) %>% 
    summarise(p1 = mean(v1), p2 = mean(v2)) %>% 
    select(-cate) %>% 
    round(2) %>%
    mutate(across(everything(), ~ format(.x, nsmall = 2)))

-output

# A tibble: 2 × 2
  p1    p2   
  <chr> <chr>
1 3.02  2.86 
2 3.20  3.74 
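As an alternative that is not part of the answer above, sprintf() with a "%.2f" format rounds and pads to two decimals in a single step, so the separate round(2) call is not needed. A minimal sketch, assuming dplyr and tibble are installed; tbl_demo is a hypothetical tibble with fixed values.

```r
library(dplyr)

# Hypothetical tibble standing in for the summarised result
tbl_demo <- tibble::tibble(p1 = c(3.016, 3.2), p2 = c(2.861, 3.74))

# Format every column to exactly two decimals; the result stays a tibble
# whose columns are character vectors
res <- tbl_demo %>%
  mutate(across(everything(), ~ sprintf("%.2f", .x)))

res$p1  # "3.02" "3.20"
res$p2  # "2.86" "3.74"
```

One difference from format(): sprintf() formats each value independently, so it does not pad values to a common width.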

CodePudding user response:

It's because the object passed to format() is a data frame in the case of df_summarize, but a tibble in the case of df_groupby.

df_summarize <- df %>% 
  summarise(mean_v1 = mean(v1),
            mean_v2 = mean(v2)) %>% 
  round(2) 

df_groupby <- df %>% group_by(cate) %>% 
  summarise(p1 = mean(v1), p2 = mean(v2)) %>% 
  select(-cate) %>% ungroup() %>% 
  round(2)

str(df_summarize)
'data.frame':   1 obs. of  2 variables:
 $ mean_v1: num 3.34
 $ mean_v2: num 2.74

str(df_groupby)

tibble [2 x 2] (S3: tbl_df/tbl/data.frame)
 $ p1: num [1:2] 4.35 3.08
 $ p2: num [1:2] 4.21 2.37

Using group_by() followed by summarise() results in a tibble.

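Let me convert df_summarize into a tibble and then apply format. This reproduces the behavior seen with df_groupby.

df_summarize <- df %>% 
  summarise(mean_v1 = mean(v1),
            mean_v2 = mean(v2)) %>% 
  round(2) %>% as_tibble() %>% format(nsmall = 2)

df_summarize

# A character vector of printed lines (header, column names, column types, rows),
# as in the question's df_groupby output, not a data frame of formatted values.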

Now let me convert df_groupby into a data frame and then apply format. This gives the expected result.

df_groupby <- df %>% group_by(cate) %>% 
  summarise(p1 = mean(v1), p2 = mean(v2)) %>% 
  select(-cate) %>% ungroup() %>% 
  round(2) %>% as.data.frame() %>% format(nsmall = 2)

    p1   p2
1 4.35 4.21
2 3.08 2.37
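The conversion step can be wrapped in a small helper so that the result no longer depends on whether the input is a data frame or a tibble. This is only a sketch: format_2dp is a hypothetical name, it assumes all columns are numeric, and tbl_in uses made-up fixed values.

```r
library(dplyr)

# Hypothetical helper: coerce to a plain data.frame first so that
# format() always dispatches to format.data.frame()
format_2dp <- function(x) {
  x %>%
    as.data.frame() %>%
    round(2) %>%
    format(nsmall = 2)
}

# Works the same for a tibble and for a plain data.frame
tbl_in <- tibble::tibble(p1 = c(4.349, 3.08), p2 = c(4.21, 2.371))
out_from_tbl <- format_2dp(tbl_in)
out_from_df  <- format_2dp(as.data.frame(tbl_in))

out_from_tbl$p1  # "4.35" "3.08"
out_from_tbl$p2  # "4.21" "2.37"
```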