slice() on a long skinny list with multiple entries


I have a data frame in R which is structured like this:

year region age population_count cumulative_count* middle_value*
2001 Region x 0 10 10 50
2001 Region x 1 10 20 50
2001 Region x 2 10 30 50
2001 Region x 3 10 40 50
2001 Region x 4 10 50 50
2001 Region x 5 10 60 50
...
2020 Region y 1 10 20 50

For each year and region combination I have a distinct cumulative_count (derived from population_count by age) and a distinct middle_value (derived from the cumulative_count).

I want to extract from this the row for each region and year combination where the cumulative_count is closest to the middle_value (in the example above this would be age 4 in Region x, where cumulative_count = 50 and middle_value = 50).

I have tried slice from dplyr:

table %>% slice(which.min(abs(cumulative_count - middle_value)))

but this only returns the first matching row in the whole data frame, not one for each year and region combination.

Adding group_by(year, region) doesn't return all the possible year and region combinations either.

I feel I should be looping through the data frame for all possible year and region combinations and then slicing out the rows that meet the criteria.
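
Something like this (untested) is the kind of loop I have in mind, but it feels clunky:

# rough sketch of the loop I'm imagining (table is my data frame)
result <- data.frame()
for (yr in unique(table$year)) {
  for (rgn in unique(table$region)) {
    sub <- table[table$year == yr & table$region == rgn, ]
    result <- rbind(result, sub[which.min(abs(sub$cumulative_count - sub$middle_value)), ])
  }
}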

Any thoughts?

CodePudding user response:

You could de-mean the groups and look for where the difference is zero. You will probably have ties; depending on what you want, you could simply take the first one by subsetting with [1, ].

by(dat, dat[c('year', 'region')], \(x) 
   x[x$cumulative_count - mean(x$cumulative_count) == 0, ][1, ]) |>
  do.call(what=rbind)
#    year   region age population_count cumulative_count middle_value
# 2  2001 Region x   1               10               20           50
# 5  2002 Region x   1               10               20           50
# 8  2001 Region y   1               10               20           50
# 10 2002 Region y   0               10               30           50

Note: R >= 4.1 is used (for the lambda shorthand \(x) and the native pipe |>).
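
If you instead want the row whose cumulative_count is closest to middle_value (rather than to the group mean), a variant of the same by()/do.call() pattern could look like this (a sketch, only tested against the data below):

# per year/region group, keep the single row minimising |cumulative_count - middle_value|
by(dat, dat[c('year', 'region')], \(x)
   x[which.min(abs(x$cumulative_count - x$middle_value)), ]) |>
  do.call(what=rbind)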


Data:

dat <- structure(list(year = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 
2001L, 2001L, 2001L, 2002L, 2002L, 2002L), region = c("Region x", 
"Region x", "Region x", "Region x", "Region x", "Region x", "Region y", 
"Region y", "Region y", "Region y", "Region y", "Region y"), 
    age = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L), 
    population_count = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
    10L, 10L, 10L, 10L), cumulative_count = c(10L, 20L, 30L, 
    10L, 20L, 30L, 10L, 20L, 30L, 30L, 30L, 30L), middle_value = c(50L, 
    50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), class = "data.frame", row.names = c(NA, 
-12L))

CodePudding user response:

I suggest using rank(), as it ranks from low to high. If you rank on the absolute difference, the grouped ranks are by definition 1 for the smallest difference, so you can simply filter on that value. It also lets you control how ties are handled via ties.method.

include ties

library(dplyr)

dat %>%
  group_by(year, region) %>%
  filter(rank(abs(cumulative_count - middle_value), ties.method = "min") == 1)

# # A tibble: 6 x 6
# # Groups:   year, region [4]
#    year region     age population_count cumulative_count middle_value
#   <int> <chr>    <int>            <int>            <int>        <int>
# 1  2001 Region x     2               10               30           50
# 2  2002 Region x     2               10               30           50
# 3  2001 Region y     2               10               30           50
# 4  2002 Region y     0               10               30           50
# 5  2002 Region y     1               10               30           50
# 6  2002 Region y     2               10               30           50

show first one only

dat %>%
  group_by(year, region) %>%
  filter(rank(abs(cumulative_count - middle_value), ties.method = "first") == 1)

# # A tibble: 4 x 6
# # Groups:   year, region [4]
#    year region     age population_count cumulative_count middle_value
#   <int> <chr>    <int>            <int>            <int>        <int>
# 1  2001 Region x     2               10               30           50
# 2  2002 Region x     2               10               30           50
# 3  2001 Region y     2               10               30           50
# 4  2002 Region y     0               10               30           50

Other options are available via the full signature: rank(x, na.last = TRUE, ties.method = c("average", "first", "last", "random", "max", "min")).
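
A quick illustration of how the two ties.method settings used above differ (made-up values):

x <- c(20, 20, 10, 30)
rank(x, ties.method = "min")    # 2 2 1 4 -> tied values share the lowest of their ranks
rank(x, ties.method = "first")  # 2 3 1 4 -> ties broken by position, so every rank is unique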

using data.table instead of dplyr

library(data.table)
setDT(dat) # make dat a data.table

dat[, .SD[rank(abs(cumulative_count - middle_value), ties.method = "min") == 1], by = c("year", "region")]
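
If you only want the first matching row per group (the data.table counterpart of ties.method = "first" above), one possible variation is to collect the row indices with .I and which.min, then subset once:

# global row index of the smallest absolute difference within each year/region group
idx <- dat[, .I[which.min(abs(cumulative_count - middle_value))], by = c("year", "region")]$V1
dat[idx]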

data

dat <- structure(list(year = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 
2001L, 2001L, 2001L, 2002L, 2002L, 2002L), region = c("Region x", 
"Region x", "Region x", "Region x", "Region x", "Region x", "Region y", 
"Region y", "Region y", "Region y", "Region y", "Region y"), 
    age = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L), 
    population_count = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
    10L, 10L, 10L, 10L), cumulative_count = c(10L, 20L, 30L, 
    10L, 20L, 30L, 10L, 20L, 30L, 30L, 30L, 30L), middle_value = c(50L, 
    50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), class = "data.frame", row.names = c(NA, 
-12L))