I have two dataframes were I want to match age and height to the percentile they fall within (according to WHO guidelines). So if the ages in df_per and df_height match, find the percentile column in df_per that the height falls in from df_height. The percentiles (columns P3, P15, P50, P85, P97) contain heights (cm). I think I may need to create a min and max column for each percentile so there is a set range for the heights to fall between. So if the age is 0 days and the height 49 cm, the percentile would be P15 as it is > 47.217 and <49.148.
Age <- c(0,1,2,3,4,5,6)
P3 <- c(45.644,45.808,45.971,46.134,46.297,46.461,46.624)
P15 <- c(47.217,47.383,47.549,47.714,47.88,48.046,48.212)
P50 <- c(49.148,49.317,49.485,49.654,49.823,49.992,50.161)
P85 <- c(51.078,51.25,51.422,51.594,51.766,51.938,52.11)
P97 <- c(52.651,52.825,53,53.175,53.349,53.524,53.698)
df_per <- data.frame(Age, P3, P15, P50, P85, P97)
df_per
Age P3 P15 P50 P85 P97
1 0 45.644 47.217 49.148 51.078 52.651
2 1 45.808 47.383 49.317 51.250 52.825
3 2 45.971 47.549 49.485 51.422 53.000
4 3 46.134 47.714 49.654 51.594 53.175
5 4 46.297 47.880 49.823 51.766 53.349
6 5 46.461 48.046 49.992 51.938 53.524
7 6 46.624 48.212 50.161 52.110 53.698
Age <- c(0,1,2,2,4,6,6)
Height <- c(49,50.4,48.8,51.5,52.0,46.8,49)
df_height <- data.frame(Age,Height)
df_height
Age Height
1 0 49.0
2 1 50.4
3 2 48.8
4 2 51.5
5 4 52.0
6 6 46.8
7 6 49.0
Output that I'd like to get
Age Height Percentile
1 0 49.0 P15
2 1 50.4 P50
3 2 48.8 P15
4 2 51.5 P85
5 4 52.0 P85
6 6 46.8 P3
7 6 49.0 P15
Thank you in advance for any help!
CodePudding user response:
After pivoting to long, you can use fuzzyjoin
to do an overlap join:
# First, some data manipulation (pivot df_per to long)
library(dplyr)
df_per <- df_per %>%
pivot_longer(-Age, names_to = "percentile", values_to = "start") %>%
group_by(Age) %>%
mutate(end = lead(start, default = Inf))
# # A tibble: 35 × 4
# # Groups: Age [7]
# Age percentile start end
# <dbl> <chr> <dbl> <dbl>
# 1 0 P3 45.6 47.2
# 2 0 P15 47.2 49.1
# 3 0 P50 49.1 51.1
# 4 0 P85 51.1 52.7
# 5 0 P97 52.7 Inf
# 6 1 P3 45.8 47.4
# 7 1 P15 47.4 49.3
# 8 1 P50 49.3 51.2
# 9 1 P85 51.2 52.8
# # … with 26 more rows
#Then, the actual join
library(fuzzyjoin)
df_per %>%
fuzzy_inner_join(df_height, .,
by = c('Age', 'Height' = 'start', 'Height' = 'end'),
match_fun = list(`==`, `>=`, `<=`)) %>%
select(Age = Age.x, Height, percentile)
# Age Height percentile
# 1 0 49.0 P15
# 2 1 50.4 P50
# 3 2 48.8 P15
# 4 2 51.5 P85
# 5 4 52.0 P85
# 6 6 46.8 P3
# 7 6 49.0 P15
Note that overlap join are already supported in dplyr
's development version (as of 2022-09-09) and soon be available on CRAN. You could do:
#devtools::install_github("tidyverse/dplyr")
library(dplyr)
inner_join(df_height, df_per, by = join_by(Age, between(Height, start, end)))
# Age Height percentile
# 1 0 49.0 P15
# 2 1 50.4 P50
# 3 2 48.8 P15
# 4 2 51.5 P85
# 5 4 52.0 P85
# 6 6 46.8 P3
# 7 6 49.0 P15