I have a data frame in R that is structured like this:
year | region | age | population_count | cumulative_count* | middle_value* |
---|---|---|---|---|---|
2001 | Region x | 0 | 10 | 10 | 50 |
2001 | Region x | 1 | 10 | 20 | 50 |
2001 | Region x | 2 | 10 | 30 | 50 |
2001 | Region x | 3 | 10 | 40 | 50 |
2001 | Region x | 4 | 10 | 50 | 50 |
2001 | Region x | 5 | 10 | 60 | 50 |
... | ... | ... | ... | ... | ... |
2020 | Region y | 1 | 10 | 20 | 50 |
For each year and region combination I have a discrete cumulative_count (derived from population_count by age) and a middle_value (derived from the cumulative_count), each specific to that year and region combination.
I want to extract from this the row for each region and year combination where the cumulative_count is closest to the middle_value. (In the example above this would be age 4 in Region x, where cumulative_count = 50 and middle_value = 50.)
I have tried slice from dplyr:
slice(which.min(abs(table$cumulative_count - table$middle_value)))
but this only returns the single closest row across the whole data frame, not one row for every year and region combination.
group_by(year, region) doesn't return all the possible year and region combinations either.
I feel I should be looping through the data frame for all possible year and region combinations and then slicing out the rows that meet the criteria.
Any thoughts?
CodePudding user response:
You could de-mean the groups and look where the value is zero. You will probably have ties; depending on what you want, you could simply take the first one by subsetting with [1, ].
by(dat, dat[c('year', 'region')], \(x)
  x[x$cumulative_count - mean(x$cumulative_count) == 0, ][1, ]) |>
  do.call(what = rbind)
# year region age population_count cumulative_count middle_value
# 2 2001 Region x 1 10 20 50
# 5 2002 Region x 1 10 20 50
# 8 2001 Region y 1 10 20 50
# 10 2002 Region y 0 10 30 50
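If no row is exactly equal to the group mean, the same by() pattern can instead keep the row with the smallest absolute difference from middle_value, along the lines of the which.min idea in the question (just a sketch, not the output shown above):
# closest row per group: smallest |cumulative_count - middle_value|, first row on ties
by(dat, dat[c('year', 'region')], \(x)
  x[which.min(abs(x$cumulative_count - x$middle_value)), ]) |>
  do.call(what = rbind)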
Note: R >= 4.1 used.
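On R < 4.1 (no \(x) lambda shorthand and no native pipe), the equivalent would be:
# same logic without the R 4.1 shorthand
res <- by(dat, dat[c('year', 'region')], function(x)
  x[x$cumulative_count - mean(x$cumulative_count) == 0, ][1, ])
do.call(rbind, res)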
Data:
dat <- structure(list(year = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L,
2001L, 2001L, 2001L, 2002L, 2002L, 2002L), region = c("Region x",
"Region x", "Region x", "Region x", "Region x", "Region x", "Region y",
"Region y", "Region y", "Region y", "Region y", "Region y"),
age = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
population_count = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L), cumulative_count = c(10L, 20L, 30L,
10L, 20L, 30L, 10L, 20L, 30L, 30L, 30L, 30L), middle_value = c(50L,
50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), class = "data.frame", row.names = c(NA,
-12L))
CodePudding user response:
I suggest using rank, as it ranks from low to high. So if you rank on the absolute difference, your grouped ranks are by definition 1 for the smallest difference, and you can simply filter on that value. It also lets you control how ties are handled via ties.method.
include ties
library(dplyr)

dat %>%
  group_by(year, region) %>%
  filter(rank(abs(cumulative_count - middle_value), ties.method = "min") == 1)
# # A tibble: 6 x 6
# # Groups: year, region [4]
# year region age population_count cumulative_count middle_value
# <int> <chr> <int> <int> <int> <int>
# 1 2001 Region x 2 10 30 50
# 2 2002 Region x 2 10 30 50
# 3 2001 Region y 2 10 30 50
# 4 2002 Region y 0 10 30 50
# 5 2002 Region y 1 10 30 50
# 6 2002 Region y 2 10 30 50
show first one only
dat %>%
  group_by(year, region) %>%
  filter(rank(abs(cumulative_count - middle_value), ties.method = "first") == 1)
# # A tibble: 4 x 6
# # Groups: year, region [4]
# year region age population_count cumulative_count middle_value
# <int> <chr> <int> <int> <int> <int>
# 1 2001 Region x 2 10 30 50
# 2 2002 Region x 2 10 30 50
# 3 2001 Region y 2 10 30 50
# 4 2002 Region y 0 10 30 50
Other options for ties.method, per the rank() signature: rank(x, na.last = TRUE, ties.method = c("average", "first", "last", "random", "max", "min"))
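For instance, on a small vector you can see how the ties.method choices number equal values differently:
x <- c(2, 1, 1, 3)
rank(x, ties.method = "min")    # 3 1 1 4  (ties share the lowest rank)
rank(x, ties.method = "first")  # 3 1 2 4  (ties numbered by position)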
using data.table instead of dplyr
library(data.table)
setDT(dat) # make dat a data.table
dat[, .SD[rank(abs(cumulative_count - middle_value), ties.method = "min") == 1], by = c("year", "region")]
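Or, staying in dplyr, slice_min() (available in dplyr >= 1.0.0) is a shorthand for the same idea; it keeps ties by default:
dat %>%
  group_by(year, region) %>%
  slice_min(abs(cumulative_count - middle_value))  # add with_ties = FALSE to keep only the first row per group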
data
dat <- structure(list(year = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L,
2001L, 2001L, 2001L, 2002L, 2002L, 2002L), region = c("Region x",
"Region x", "Region x", "Region x", "Region x", "Region x", "Region y",
"Region y", "Region y", "Region y", "Region y", "Region y"),
age = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
population_count = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L), cumulative_count = c(10L, 20L, 30L,
10L, 20L, 30L, 10L, 20L, 30L, 30L, 30L, 30L), middle_value = c(50L,
50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), class = "data.frame", row.names = c(NA,
-12L))