If two columns in separate dataframes match then pull numbers from another column r-CodePudding

I have two dataframes were I want to match age and height to the percentile they fall within (according to WHO guidelines). So if the ages in df_per and df_height match, find the percentile column in df_per that the height falls in from df_height. The percentiles (columns P3, P15, P50, P85, P97) contain heights (cm). I think I may need to create a min and max column for each percentile so there is a set range for the heights to fall between. So if the age is 0 days and the height 49 cm, the percentile would be P15 as it is > 47.217 and <49.148.

Age <- c(0,1,2,3,4,5,6)
P3 <- c(45.644,45.808,45.971,46.134,46.297,46.461,46.624)
P15 <- c(47.217,47.383,47.549,47.714,47.88,48.046,48.212)
P50 <- c(49.148,49.317,49.485,49.654,49.823,49.992,50.161)
P85 <- c(51.078,51.25,51.422,51.594,51.766,51.938,52.11)
P97 <- c(52.651,52.825,53,53.175,53.349,53.524,53.698)
    
df_per <- data.frame(Age, P3, P15, P50, P85, P97)
df_per
  Age     P3    P15    P50    P85    P97
1   0 45.644 47.217 49.148 51.078 52.651
2   1 45.808 47.383 49.317 51.250 52.825
3   2 45.971 47.549 49.485 51.422 53.000
4   3 46.134 47.714 49.654 51.594 53.175
5   4 46.297 47.880 49.823 51.766 53.349
6   5 46.461 48.046 49.992 51.938 53.524
7   6 46.624 48.212 50.161 52.110 53.698

Age <- c(0,1,2,2,4,6,6)
Height <- c(49,50.4,48.8,51.5,52.0,46.8,49)

df_height <- data.frame(Age,Height)
df_height
  Age Height
1   0   49.0
2   1   50.4
3   2   48.8
4   2   51.5
5   4   52.0
6   6   46.8
7   6   49.0

Output that I'd like to get

  Age Height Percentile
1   0   49.0        P15
2   1   50.4        P50
3   2   48.8        P15
4   2   51.5        P85
5   4   52.0        P85
6   6   46.8         P3
7   6   49.0        P15

Thank you in advance for any help!

CodePudding user response：

After pivoting to long, you can use fuzzyjoin to do an overlap join:

# First, some data manipulation (pivot df_per to long)
library(dplyr)
df_per <- df_per %>%
  pivot_longer(-Age, names_to = "percentile", values_to = "start") %>%
  group_by(Age) %>% 
  mutate(end = lead(start, default = Inf))

# # A tibble: 35 × 4
# # Groups:   Age [7]
#     Age percentile start   end
#   <dbl> <chr>      <dbl> <dbl>
# 1     0 P3          45.6  47.2
# 2     0 P15         47.2  49.1
# 3     0 P50         49.1  51.1
# 4     0 P85         51.1  52.7
# 5     0 P97         52.7 Inf  
# 6     1 P3          45.8  47.4
# 7     1 P15         47.4  49.3
# 8     1 P50         49.3  51.2
# 9     1 P85         51.2  52.8
# # … with 26 more rows

#Then, the actual join
library(fuzzyjoin)
df_per %>% 
  fuzzy_inner_join(df_height, .,
                by = c('Age', 'Height' = 'start', 'Height' = 'end'),
                match_fun = list(`==`, `>=`, `<=`)) %>% 
  select(Age = Age.x, Height, percentile)

#   Age Height percentile
# 1   0   49.0        P15
# 2   1   50.4        P50
# 3   2   48.8        P15
# 4   2   51.5        P85
# 5   4   52.0        P85
# 6   6   46.8         P3
# 7   6   49.0        P15

Note that overlap join are already supported in dplyr's development version (as of 2022-09-09) and soon be available on CRAN. You could do:

#devtools::install_github("tidyverse/dplyr")
library(dplyr)

inner_join(df_height, df_per, by = join_by(Age, between(Height, start, end)))

#   Age Height percentile
# 1   0   49.0        P15
# 2   1   50.4        P50
# 3   2   48.8        P15
# 4   2   51.5        P85
# 5   4   52.0        P85
# 6   6   46.8         P3
# 7   6   49.0        P15