Home > Software design >  How would one merge 2 data frames based on a category and numeric in df1 - matching the category and
How would one merge 2 data frames based on a category and numeric in df1 - matching the category and

Time:12-05

I want to merge df1 and df2, by pulling in only the numerics from df2 into df1? This would be a nested (Lookup(xlookup) in excel, but Im having difficulty working this in r? Any info would be much appreciated.

df1 <- data.frame(Names = c("A","B","C","D","E"),
                  Rank = c("R1","R3","R4","R2","R5"),
                  Time_in_rank = c(2,4.25,3,1.5,5))

df2 <- data.frame(Time_in_rank =c(0,1,2,3,4),
                  R1 =c(20000,25000,30000,35000,40000),
                  R2 =c(45000,50000,55000,60000,65000),
                  R3 =c(70000,75000,80000,85000,90000),
                  R4 =c(95000,96000,97000,98000,100000),
                  R5 =c(105000,107000,109000,111000,112000))

Desired output
Names Time_in_rank  Rank   Salary 
A     2             R1     30000
B     4.25          R3     90000
C     3             R4     98000

Close but no cigar - Merge two data frames considering a range match between key columns

Close but still no cigar - Complex non-equi merge in R

CodePudding user response:

data.table

library(data.table)
setDT(df1)
setDT(df2)
df1[setnames(melt(df2, id.vars = "Time_in_rank"), 1, "tir"
       )[, tir2 := shift(tir, type = "lead", fill = Inf), by = variable],
    Salary := value,
    on = .(Rank == variable, Time_in_rank >= tir, Time_in_rank < tir2)]
df1
#     Names   Rank Time_in_rank Salary
#    <char> <char>        <num>  <num>
# 1:      A     R1         2.00  30000
# 2:      B     R3         4.25  90000
# 3:      C     R4         3.00  98000
# 4:      D     R2         1.50  50000
# 5:      E     R5         5.00 112000

CodePudding user response:

You could use a tidyverse approach:

library(dplyr)
library(tidyr)

df1 %>% 
  mutate(time_floor = floor(Time_in_rank)) %>% 
  left_join(df2 %>% 
              pivot_longer(-Time_in_rank, names_to = "Rank", values_to = "Salary"),
            by = c("Rank", "time_floor" = "Time_in_rank")) %>% 
  select(-time_floor)

This returns

  Names Rank Time_in_rank Salary
1     A   R1         2.00  30000
2     B   R3         4.25  90000
3     C   R4         3.00  98000
4     D   R2         1.50  50000
5     E   R5         5.00     NA

The main ideas are:

  • df1's Time_in_rank rounded down corresponds to Time_in_rank used in df2.
  • df2 is converted into a long format to join by time and rank.
  • Related