Save mutate output under dplyr

I'm computing frequencies by group with dplyr, but the output is not automatically saved as a data frame, and only the first 10 rows are shown. Does anyone know how to save the full result? I need all rows of the data for further analyses. Thanks!

library(dplyr)
data01 %>%
  group_by(Country, relsta) %>%
  summarize(Freq=n()) %>%
  mutate (married = Freq/sum(Freq))

Output

   Country relsta  Freq married
     <int> <chr>  <int>   <dbl>
 1       1 1         15  0.176 
 2       1 3          1  0.0118
 3       1 4         28  0.329 
 4       1 5          6  0.0706
 5       1 6         22  0.259 
 6       1 7          1  0.0118
 7       1 99        12  0.141 
 8       2 NA       273  1     
 9       3 NA       129  1     
10       4 2          9  0.0796
# ... with 115 more rows

CodePudding user response：

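summarize returns one row per group, whereas mutate keeps every row of the original data. Either way, the result has to be assigned to a name to be saved. To keep all rows, regroup by Country before computing the share; otherwise Freq/sum(Freq) is evaluated within each Country/relsta group and reduces to 1/n. Try:

library(dplyr)
data02 <- data01 %>%
  group_by(Country, relsta) %>%
  mutate(Freq = n()) %>%         # group size, repeated on every row
  group_by(Country) %>%
  mutate(married = Freq / n())   # share of the Country's rows with this relsta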
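If the one-row-per-group table from the question is what you actually need, the original summarize pipeline can simply be assigned to a name. Below is a minimal sketch; the toy data01 and the names freq_tbl and freq_df are made up for illustration and are not the asker's real data.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data01 (illustration only)
data01 <- data.frame(
  Country = c(1, 1, 1, 2, 2),
  relsta  = c("1", "1", "3", "2", "2")
)

# Same pipeline as in the question, but assigned so the result is kept
freq_tbl <- data01 %>%
  group_by(Country, relsta) %>%
  summarize(Freq = n(), .groups = "drop_last") %>%  # still grouped by Country
  mutate(married = Freq / sum(Freq)) %>%            # share within each Country
  ungroup()

# Plain data.frame holding every row, ready for further analysis
freq_df <- as.data.frame(freq_tbl)
freq_df
```

Here .groups = "drop_last" only makes the default regrouping explicit and silences the message that summarize otherwise prints.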

CodePudding user response：

dplyr returns tibbles, which print only the first 10 rows by default. The remaining rows are still there; they are just not displayed. Here is an example using iris:

library(dplyr)
res1 <- iris %>%
  group_by(Sepal.Length, Species) %>%
  summarize(Freq=n()) %>%
  mutate(foo = Freq/sum(Freq))

res1
#    Sepal.Length Species     Freq   foo
#           <dbl> <fct>      <int> <dbl>
#  1          4.3 setosa         1 1    
#  2          4.4 setosa         3 1    
#  3          4.5 setosa         1 1    
#  4          4.6 setosa         4 1    
#  5          4.7 setosa         2 1    
#  6          4.8 setosa         5 1    
#  7          4.9 setosa         4 0.667
#  8          4.9 versicolor     1 0.167
#  9          4.9 virginica      1 0.167
# 10          5   setosa         8 0.8  
# # … with 47 more rows

Notice the … with 47 more rows. You may also check the dimensions:

dim(res1)
# [1] 57  4

Also,

class(res1)
# [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

whereas:

class(iris)
# [1] "data.frame"

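To see all rows, convert the result with as.data.frame(). If the data is very large, rows still get omitted once the max.print limit is reached. That limit counts printed entries rather than rows; its default is 99999 in base R, while RStudio sets it to 1000. You can change it with e.g. options(max.print=3000).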

as.data.frame(res1)
#    Sepal.Length    Species Freq       foo
# 1           4.3     setosa    1 1.0000000
# 2           4.4     setosa    3 1.0000000
# 3           4.5     setosa    1 1.0000000
# [...]
# 55          7.6  virginica    1 1.0000000
# 56          7.7  virginica    4 1.0000000
# 57          7.9  virginica    1 1.0000000
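As an alternative to converting, a tibble can be printed in full by passing n = Inf to print(). The sketch below rebuilds res1 so that it runs on its own; the name res1_df is made up for illustration, and ungroup() is added on the assumption that the leftover grouping is not wanted in later steps.

```r
library(dplyr)

# Same summary as above, built from the built-in iris data
res1 <- iris %>%
  group_by(Sepal.Length, Species) %>%
  summarize(Freq = n(), .groups = "drop_last") %>%
  mutate(foo = Freq / sum(Freq))

# Print every row without changing the object's class
print(res1, n = Inf)

# Drop the leftover grouping and convert to a plain data.frame
res1_df <- as.data.frame(ungroup(res1))
class(res1_df)
nrow(res1_df)
```

Keeping the tibble and only changing how it is printed avoids carrying two copies of the same data around.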

You could also consider using base R. Since the following line already gives you the "Freq" column,

as.data.frame.table(with(iris, table(Sepal.Length, Species)))

you could do this:

res2 <- with(iris, table(Sepal.Length, Species)) |>
  as.data.frame.table() |>
  transform(foo=ave(Freq, Sepal.Length, FUN=\(x) x/sum(x))) |>
  subset(Freq > 0)
res2
#     Sepal.Length    Species Freq       foo
# 1            4.3     setosa    1 1.0000000
# 2            4.4     setosa    3 1.0000000
# 3            4.5     setosa    1 1.0000000
# [...]
# 103          7.6  virginica    1 1.0000000
# 104          7.7  virginica    4 1.0000000
# 105          7.9  virginica    1 1.0000000

Where:

dim(res2)
# [1] 57  4

class(res2)
# [1] "data.frame"
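A variation on the base R approach, sketched here as an alternative rather than as the method used above, lets prop.table() compute the within-Sepal.Length shares directly from the contingency table. The names tab, counts, props and res3 are made up for illustration.

```r
# Contingency table of Sepal.Length by Species
tab <- with(iris, table(Sepal.Length, Species))

# Long-format counts (gives the "Freq" column)
counts <- as.data.frame.table(tab)

# Row-wise proportions: margin = 1 normalises within each Sepal.Length
props <- as.data.frame.table(prop.table(tab, margin = 1), responseName = "foo")

# Both tables come from the same layout, so their rows line up
res3 <- subset(cbind(counts, foo = props$foo), Freq > 0)
dim(res3)
class(res3)
```

This version avoids the \(x) lambda and the native pipe, so it should also run on R versions older than 4.1.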

Note: the base R example requires R >= 4.1 for the native pipe |> and the \(x) lambda shorthand.