Mean of a variable after nesting the dataframe

Time:10-27

I am trying to find the mean of the variable disp in the mtcars dataset after nesting it by cyl. I can get the result after nest_by but not after group_nest. Please explain what rowwise is doing differently here.

library(pacman)
#> Warning: package 'pacman' was built under R version 4.2.1
p_load(tidyverse)
#working
mtcars %>% nest_by(cyl) %>% mutate(avg = mean(data$disp))
#> # A tibble: 3 × 3
#> # Rowwise:  cyl
#>     cyl                data   avg
#>   <dbl> <list<tibble[,10]>> <dbl>
#> 1     4           [11 × 10]  105.
#> 2     6            [7 × 10]  183.
#> 3     8           [14 × 10]  353.

#notworking
mtcars %>% group_nest(cyl) %>% 
  mutate(avg = mean(data$disp))
#> Error in `mutate()`:
#> ! Problem while computing `avg = mean(data$disp)`.
#> Caused by error:
#> ! Corrupt x: no names

#> Backtrace:
#>      ▆
#>   1. ├─mtcars %>% group_nest(cyl) %>% mutate(avg = mean(data$disp))
#>   2. ├─dplyr::mutate(., avg = mean(data$disp))
#>   3. ├─dplyr:::mutate.data.frame(., avg = mean(data$disp))
#>   4. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())
#>   5. │   ├─base::withCallingHandlers(...)
#>   6. │   └─mask$eval_all_mutate(quo)
#>   7. ├─base::mean(data$disp)
#>   8. ├─data$disp
#>   9. ├─vctrs:::`$.vctrs_list_of`(data, disp)
#>  10. └─base::.handleSimpleError(`<fn>`, "Corrupt x: no names", base::quote(NULL))
#>  11.   └─dplyr (local) h(simpleError(msg, call))
#>  12.     └─rlang::abort(...)

Created on 2022-10-26 with reprex v2.0.2

CodePudding user response:

rowwise changes the behavior of subsequent verbs: instead of operating on an entire column, they operate on the values of one row at a time.

This works because data inside mutate refers to a single dataframe (nest_by returns a rowwise dataframe):

library(dplyr)
library(purrr)

mtcars %>% nest_by(cyl) %>% mutate(avg = mean(data$disp))
#> # A tibble: 3 × 3
#> # Rowwise:  cyl
#>     cyl                data   avg
#>   <dbl> <list<tibble[,10]>> <dbl>
#> 1     4           [11 × 10]  105.
#> 2     6            [7 × 10]  183.
#> 3     8           [14 × 10]  353.

This will not work because data refers to the whole list of dataframes, and disp is not a name in that list:

mtcars %>% group_nest(cyl) %>%  mutate(avg = mean(data$disp))
#> Error in `mutate()`:
#> ! Problem while computing `avg = mean(data$disp)`.
#> Caused by error:
#> ! Corrupt x: no names

#> Backtrace:
#>      ▆
#>   1. ├─mtcars %>% group_nest(cyl) %>% mutate(avg = mean(data$disp))
#>   2. ├─dplyr::mutate(., avg = mean(data$disp))
#>   3. ├─dplyr:::mutate.data.frame(., avg = mean(data$disp))
#>   4. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())
#>   5. │   ├─base::withCallingHandlers(...)
#>   6. │   └─mask$eval_all_mutate(quo)
#>   7. ├─base::mean(data$disp)
#>   8. ├─data$disp
#>   9. ├─vctrs:::`$.vctrs_list_of`(data, disp)
#>  10. └─base::.handleSimpleError(`<fn>`, "Corrupt x: no names", base::quote(NULL))
#>  11.   └─dplyr (local) h(simpleError(msg, call))
#>  12.     └─rlang::abort(...)
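
To see why the lookup fails, it may help to inspect the list-column directly. The following is a small sketch; the variable name nested is only illustrative and does not appear in the question.

```r
library(dplyr)

# Illustrative name: `nested` is not part of the original question.
nested <- mtcars %>% group_nest(cyl)

# Without rowwise, `data` inside mutate() is the whole list-column:
# one tibble per cyl value, with no top-level element called `disp`.
is.list(nested$data)
length(nested$data)

# Each individual element is a tibble that does contain a disp column.
"disp" %in% names(nested$data[[1]])
mean(nested$data[[1]]$disp)
```

So data$disp asks the list for an element named disp, while the column actually lives one level down, inside each tibble.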

You can get an equivalent result by mapping over the list of dataframes, applying a function to each dataframe in the list:

mtcars %>% group_nest(cyl) %>% mutate(avg = map_dbl(data, ~ mean(.x$disp)))

#> # A tibble: 3 × 3
#>     cyl                data   avg
#>   <dbl> <list<tibble[,10]>> <dbl>
#> 1     4           [11 × 10]  105.
#> 2     6            [7 × 10]  183.
#> 3     8           [14 × 10]  353.

Created on 2022-10-26 with reprex v2.0.2

CodePudding user response:

We could use map to loop over the list, as there is no rowwise grouping with group_nest.

library(dplyr)
library(purrr)
mtcars %>%
  group_nest(cyl) %>%
  mutate(avg = map_dbl(data, ~ mean(.x$disp)))

-output

# A tibble: 3 × 3
    cyl                data   avg
  <dbl> <list<tibble[,10]>> <dbl>
1     4           [11 × 10]  105.
2     6            [7 × 10]  183.
3     8           [14 × 10]  353.

According to ?group_nest

The primary use case for group_nest() is with already grouped data frames, typically a result of group_by().

whereas according to ?nest_by

nest_by() is closely related to group_by(). However, instead of storing the group structure in the metadata, it is made explicit in the data, giving each group key a single row along with a list-column of data frames that contain all the other data.

CodePudding user response:

> library(pacman)
> p_load(tidyverse)
> # working
> mtcars %>% nest_by(cyl) %>% class()
[1] "rowwise_df" "tbl_df"     "tbl"        "data.frame"
> mtcars %>% nest_by(cyl) %>% mutate(avg = mean(data$disp))
# A tibble: 3 × 3
# Rowwise:  cyl
    cyl                data   avg
  <dbl> <list<tibble[,10]>> <dbl>
1     4           [11 × 10]  105.
2     6            [7 × 10]  183.
3     8           [14 × 10]  353.
> # not working
> mtcars %>% group_nest(cyl) %>% class()
[1] "tbl_df"     "tbl"        "data.frame"
> mtcars %>% group_nest(cyl) %>% mutate(avg = mean(data$disp))
Error in `mutate()`:
! Problem while computing `avg = mean(data$disp)`.
Caused by error:
! Corrupt x: no names
Run `rlang::last_error()` to see where the error occurred.

The nest_by call yields a rowwise_df, so mutate evaluates data$disp one row at a time, whereas group_nest yields a plain tbl_df, in which data is the whole list-column. Hence the difference.
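
Since the difference comes down to the rowwise class, another option (not used in the answers above) might be to add it explicitly with rowwise() after group_nest. This is a minimal sketch; the name result is illustrative, and the trailing ungroup() is only there to drop the rowwise class again afterwards.

```r
library(dplyr)

# Illustrative name: `result` is not part of the original question.
result <- mtcars %>%
  group_nest(cyl) %>%
  rowwise() %>%                     # data now refers to one tibble per row
  mutate(avg = mean(data$disp)) %>%
  ungroup()                         # back to a plain tibble

result
```

The avg column should match what nest_by and the map_dbl version produce, since all three compute the mean of disp within each cyl group.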
