I have a data frame in R that is structured like this:
year | region | age | population_count | cumulative_count* | middle_value* |
---|---|---|---|---|---|
2001 | Region x | 0 | 10 | 10 | 50 |
2001 | Region x | 1 | 10 | 20 | 50 |
2001 | Region x | 2 | 10 | 30 | 50 |
2001 | Region x | 3 | 10 | 40 | 50 |
2001 | Region x | 4 | 10 | 50 | 50 |
2001 | Region x | 5 | 10 | 60 | 50 |
... | ... | ... | ... | ... | ... |
2020 | Region y | 1 | 10 | 20 | 50 |
For each year and region combination I have a discrete cumulative_count (derived from population_count by age) and a middle_value (derived from the cumulative_count), each specific to that year and region combination.
I want to extract from this the row for each region and year combination where the cumulative_count is closest to the middle_value. (In the example above this would be age 4 in Region x, where cumulative_count = 50 and middle_value = 50.)
I have tried slice from dplyr:
slice(which.min(abs(table$cumulative_count - table$middle_value)))
but this only returns the single closest row across the whole data frame, not one row for every year and region combination.
group_by(year, region) doesn't return all the possible year and region combinations either.
I feel I should be looping through the data frame for all possible year and region combinations and then slicing out the rows that meet the criteria.
Any thoughts?
CodePudding user response:
You could de-mean the groups and look where the value is zero. You will probably have ties; depending on what you want, you could simply take the first one by subsetting with [1, ].
by(dat, dat[c('year', 'region')], \(x)
  x[x$cumulative_count - mean(x$cumulative_count) == 0, ][1, ]) |>
  do.call(what = rbind)
# year region age population_count cumulative_count middle_value
# 2 2001 Region x 1 10 20 50
# 5 2002 Region x 1 10 20 50
# 8 2001 Region y 1 10 20 50
# 10 2002 Region y 0 10 30 50
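If no row is exactly equal to the group mean, the same by() pattern can instead keep the row with the smallest absolute difference from middle_value, along the lines of the which.min idea in the question (just a sketch, not the output shown above):
# closest row per group: smallest |cumulative_count - middle_value|, first row on ties
by(dat, dat[c('year', 'region')], \(x)
  x[which.min(abs(x$cumulative_count - x$middle_value)), ]) |>
  do.call(what = rbind)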
Note: R >= 4.1 used.
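On R < 4.1 (no \(x) lambda shorthand and no native pipe), the equivalent would be:
# same logic without the R 4.1 shorthand
res <- by(dat, dat[c('year', 'region')], function(x)
  x[x$cumulative_count - mean(x$cumulative_count) == 0, ][1, ])
do.call(rbind, res)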
Data:
dat <- structure(list(year = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L,
2001L, 2001L, 2001L, 2002L, 2002L, 2002L), region = c("Region x",
"Region x", "Region x", "Region x", "Region x", "Region x", "Region y",
"Region y", "Region y", "Region y", "Region y", "Region y"),
age = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
population_count = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L), cumulative_count = c(10L, 20L, 30L,
10L, 20L, 30L, 10L, 20L, 30L, 30L, 30L, 30L), middle_value = c(50L,
50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), class = "data.frame", row.names = c(NA,
-12L))
CodePudding user response:
I suggest using rank, as it ranks from low to high. So if you rank on the absolute difference, your grouped ranks are by definition 1 for the smallest difference, and you can simply filter on that value. It also lets you control how ties are handled via ties.method.
include ties
library(dplyr)

dat %>%
  group_by(year, region) %>%
  filter(rank(abs(cumulative_count - middle_value), ties.method = "min") == 1)
# # A tibble: 6 x 6
# # Groups: year, region [4]
# year region age population_count cumulative_count middle_value
# <int> <chr> <int> <int> <int> <int>
# 1 2001 Region x 2 10 30 50
# 2 2002 Region x 2 10 30 50
# 3 2001 Region y 2 10 30 50
# 4 2002 Region y 0 10 30 50
# 5 2002 Region y 1 10 30 50
# 6 2002 Region y 2 10 30 50
show first one only
dat %>%
  group_by(year, region) %>%
  filter(rank(abs(cumulative_count - middle_value), ties.method = "first") == 1)
# # A tibble: 4 x 6
# # Groups: year, region [4]
# year region age population_count cumulative_count middle_value
# <int> <chr> <int> <int> <int> <int>
# 1 2001 Region x 2 10 30 50
# 2 2002 Region x 2 10 30 50
# 3 2001 Region y 2 10 30 50
# 4 2002 Region y 0 10 30 50
Other options for ties.method, per the rank() signature: rank(x, na.last = TRUE, ties.method = c("average", "first", "last", "random", "max", "min"))
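For instance, on a small vector you can see how the ties.method choices number equal values differently:
x <- c(2, 1, 1, 3)
rank(x, ties.method = "min")    # 3 1 1 4  (ties share the lowest rank)
rank(x, ties.method = "first")  # 3 1 2 4  (ties numbered by position)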
using data.table instead of dplyr
library(data.table)
setDT(dat) # make dat a data.table
dat[, .SD[rank(abs(cumulative_count - middle_value), ties.method = "min") == 1], by = c("year", "region")]
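Or, staying in dplyr, slice_min() (available in dplyr >= 1.0.0) is a shorthand for the same idea; it keeps ties by default:
dat %>%
  group_by(year, region) %>%
  slice_min(abs(cumulative_count - middle_value))  # add with_ties = FALSE to keep only the first row per group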
data
dat <- structure(list(year = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L,
2001L, 2001L, 2001L, 2002L, 2002L, 2002L), region = c("Region x",
"Region x", "Region x", "Region x", "Region x", "Region x", "Region y",
"Region y", "Region y", "Region y", "Region y", "Region y"),
age = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
population_count = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L), cumulative_count = c(10L, 20L, 30L,
10L, 20L, 30L, 10L, 20L, 30L, 30L, 30L, 30L), middle_value = c(50L,
50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), class = "data.frame", row.names = c(NA,
-12L))