Problem with custom function in summarize() after group_by() - results identical for all groups

Time:11-20

I have some clustered data and am trying to find the number of points and area of each cluster. I have written a function that calculates the area for the convex hull of all points in that cluster. However, it is not working properly when I try to pass it through dataframe %>% group_by() %>% summarize(). Instead of calculating the area for each cluster, it calculates the area as if all the points in the dataset belong to a single cluster and returns this as the area of every cluster.

Example dataset

x <- c(0, 0, 5, 5, 8, 8, 10, 10 )
y <- c(0, 5, 5, 0, 0, 5, 5, 0)
cluster <- c(1, 1, 1, 1, 2, 2, 2, 2)
dat.clustered <- data.frame(x, y, cluster)

## 4 points in each cluster

Cluster 1 is a 5x5 square and has an area of 25, and cluster 2 is a 2x5 rectangle with an area of 10. If all the points in the dataset are assumed to be a single cluster, they form a 5x10 rectangle with an area of 50 (this will be important later).
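
As a sanity check on these expected areas, here is a minimal base R sketch that computes the convex hull area with the shoelace formula instead of sp::Polygon(). The names hull_area and pts are hypothetical and used only for this illustration; this is not the area.calc() function used later.

```r
# Hypothetical helper: area of the convex hull of a set of points,
# computed with the shoelace formula (base R only, no sp).
hull_area <- function(x, y) {
  idx <- chull(x, y)            # indices of the hull vertices, in hull order
  hx <- x[idx]
  hy <- y[idx]
  nx <- c(hx[-1], hx[1])        # next vertex, wrapping around to the first
  ny <- c(hy[-1], hy[1])
  abs(sum(hx * ny - nx * hy)) / 2
}

pts <- data.frame(
  x = c(0, 0, 5, 5, 8, 8, 10, 10),
  y = c(0, 5, 5, 0, 0, 5, 5, 0),
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2)
)

area_1   <- with(pts[pts$cluster == 1, ], hull_area(x, y))  # 25
area_2   <- with(pts[pts$cluster == 2, ], hull_area(x, y))  # 10
area_all <- hull_area(pts$x, pts$y)                         # 50
```

These match the hand-computed values: 5 * 5 = 25, 2 * 5 = 10, and 10 * 5 = 50.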

Optionally, add a few points inside each cluster. This is unnecessary, but makes it more realistic for a clustering scenario and demonstrates that n() is working properly inside summarize() later on.

dat.clustered[nrow(dat.clustered) + 1,] <- c(2, 3, 1) # inside cluster 1
dat.clustered[nrow(dat.clustered) + 1,] <- c(4, 3, 1) # inside cluster 1
dat.clustered[nrow(dat.clustered) + 1,] <- c(9, 1, 2) # inside cluster 2

## there are now a total of 6 points in cluster 1 and 5 points in cluster 2

I have written a function to calculate the area of the convex hull of the points in the cluster. The function Polygon() is from the package sp. The function is based on code I found here.

area.calc <- function(data){
  area = chull(data) %>%
    c(., .[1]) %>%
    data[.,] %>%
    .[, 1:2] %>%
    sp::Polygon(., hole = F) %>%
    .@area
  return(area)
}

Demonstrating that the function correctly calculates the areas for each cluster.

area.calc(filter(dat.clustered, cluster == 1)) # 25
area.calc(filter(dat.clustered, cluster == 2)) # 10
area.calc(dat.clustered) # 50 # entire dataset as a single cluster

It works perfectly when handling the data from each cluster individually (or when treating the whole dataset as a single cluster), but it gives incorrect results when I use it inside group_by() %>% summarize().

clust.smy <- dat.clustered %>%
  group_by(cluster) %>%
  summarize(count = n(),
            area = area.calc(data = .))

This gives the following results. Note that it shows both clusters having the same area, and since the reported area is 50, my function is clearly returning the area of treating the whole dataset as a single large cluster. The count of points in each cluster from n() is correct.

# A tibble: 2 × 3
  cluster count  area
    <dbl> <int> <dbl>
1       1     6    50
2       2     5    50

The results should be

cluster count area
1 6 25
2 5 10

I found a couple similar questions here and here, but the problems in these questions were caused by referring to df$variable in summarize(), which isn't what I'm doing. That being said, the fact that my function is returning the area for all points treated as a single cluster makes me think something similar may be going on. This similar question was solved by using cur_data() inside summarize(), but when I try doing that instead of using the . placeholder, I get the following error.

clust.smy <- dat.clustered %>%
  group_by(cluster) %>%
  summarize(count = n(),
            area = area.calc(data = cur_data()))

Error

Error in `summarize()`:
! Problem while computing `area = area.calc(data = cur_data())`.
ℹ The error occurred in group 1: cluster = 1.
Caused by error:
! error in evaluating the argument 'obj' in selecting a method for function 'coordinates': Can't subset elements past the end.
ℹ Locations 4, 2, 3, and 4 don't exist.
ℹ There is only 1 element.
Run `rlang::last_error()` to see where the error occurred.

Alternatively, I can try this just using a bunch of pipes in summarize() and not use my area.calc() function at all.

clust.smy2 <- dat.clustered %>%
  group_by(cluster) %>%
  summarize(count = n(),
            area = (.) %>%
              chull() %>%
              c(., .[1]) %>%
              dat.clustered[.,] %>%
              .[, 1:2] %>%
              sp::Polygon(., hole = F) %>%
              .@area)

This produces the same incorrect results I had above.

# A tibble: 2 × 3
  cluster count  area
    <dbl> <int> <dbl>
1       1     6    50
2       2     5    50

Interestingly, if I pass cur_data() (instead of (.)) into this, the calculated areas are both the same, but this time they're both the correct area of cluster 1.

clust.smy2 <- dat.clustered %>%
  group_by(cluster) %>%
  summarize(count = n(),
            area = cur_data() %>%
              chull() %>%
              c(., .[1]) %>%
              dat.clustered[.,] %>%
              .[, 1:2] %>%
              sp::Polygon(., hole = F) %>%
              .@area)
> clust.smy2
# A tibble: 2 × 3
  cluster count  area
    <dbl> <int> <dbl>
1       1     6    25
2       2     5    25

There's clearly some funny business going on with how the grouping is getting passed through to summarize(). I have a feeling that I'm somehow misusing the . placeholder, but I'm relatively new to dplyr and to writing functions, and I haven't been able to figure out what is wrong or to find and adapt a solution. Any help is appreciated!

Session Info:

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.9 plyr_1.8.7 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3     rstudioapi_0.13  magrittr_2.0.3   tidyselect_1.1.2 munsell_0.5.0    lattice_0.20-45  colorspace_2.0-3 R6_2.5.1        
 [9] rlang_1.0.6      factoextra_1.0.7 fansi_1.0.3      tools_4.2.0      grid_4.2.0       gtable_0.3.0     utf8_1.2.2       cli_3.4.1       
[17] DBI_1.1.3        ellipsis_0.3.2   assertthat_0.2.1 tibble_3.1.7     lifecycle_1.0.3  crayon_1.5.1     purrr_0.3.4      ggplot2_3.4.0   
[25] vctrs_0.5.1      ggrepel_0.9.2    glue_1.6.2       sp_1.5-1         compiler_4.2.0   pillar_1.7.0     generics_0.1.2   scales_1.2.0    
[33] pkgconfig_2.0.3

Answer:

One approach would be to use tidyr::nest() to split your data frame into one sub-data-frame per cluster rather than grouping it. Your function takes a whole data frame each time, so nesting ensures that each call receives only the rows for its own cluster:

library(tidyverse)

dat.clustered %>%
  nest(data = -cluster) %>%
  summarise(
    cluster = cluster,
    n = map_int(data, nrow),
    area = map_dbl(data, area.calc)
  ) 
#> # A tibble: 2 × 3
#>   cluster     n  area
#>     <dbl> <int> <dbl>
#> 1       1     6    25
#> 2       2     5    10
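
To see why the original summarize() call returned 50 for both clusters, it may help to check what the . placeholder refers to inside summarize(). The sketch below assumes the magrittr %>% pipe re-exported by dplyr and uses the original 8-point dataset; the names dat and check are hypothetical. In the pipe, . is bound to whatever sits on the left-hand side (here the whole grouped data frame), whereas bare column names such as x are sliced to the current group.

```r
library(dplyr)

dat <- data.frame(
  x = c(0, 0, 5, 5, 8, 8, 10, 10),
  y = c(0, 5, 5, 0, 0, 5, 5, 0),
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2)
)

check <- dat %>%
  group_by(cluster) %>%
  summarize(
    rows_in_group = n(),        # evaluated once per group: 4 and 4
    rows_in_x     = length(x),  # column sliced to the current group: 4 and 4
    rows_in_dot   = nrow(.)     # `.` is the whole piped data frame: 8 and 8
  )
```

Because nrow(.) reports 8 for both clusters, any function that receives . inside summarize() sees every point in the dataset, which is consistent with the area of 50 in the question.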

Another alternative is to alter your area.calc function so that it accepts the x and y vectors separately:

area.calc <- function(x, y){
  data  <-  data.frame(x, y)
  area  <-  chull(x, y) %>%
    c(., .[1]) %>%
    data[.,] %>%
    .[, 1:2] %>%
    sp::Polygon(., hole = F) %>%
    .@area
  return(area)
}

dat.clustered %>%
  group_by(cluster) %>%
  summarise(n = n(),
            area = area.calc(x, y))
#> # A tibble: 2 × 3
#>   cluster     n  area
#>     <dbl> <int> <dbl>
#> 1       1     6    25
#> 2       2     5    10
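
As for the piped version in the question that returned 25 for both clusters, a likely explanation is an index mismatch: chull() returns row positions within the group it was given, but those positions were then used to subset the full dat.clustered. A minimal base R sketch of that mismatch, using the original 8-point dataset (the names dat, grp2, idx, wrong and right are hypothetical):

```r
dat <- data.frame(
  x = c(0, 0, 5, 5, 8, 8, 10, 10),
  y = c(0, 5, 5, 0, 0, 5, 5, 0),
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Points of cluster 2 only, as a per-group helper would supply them
grp2 <- dat[dat$cluster == 2, c("x", "y")]

# Hull indices are positions within grp2, so they fall in 1..4
idx <- chull(grp2$x, grp2$y)

wrong <- dat[idx, ]   # same positions in the full data: all cluster 1 rows
right <- grp2[idx, ]  # positions applied to the group itself: cluster 2 hull
```

Here wrong contains only cluster 1 points, so the polygon built from it is the 5x5 square regardless of which group is being summarized. Subsetting the same object that was passed to chull(), as the rewritten area.calc(x, y) does, avoids the problem.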