Home > Blockchain >  How to match elements between 2 data frames subject to multiple conditions using base R or dplyr?
How to match elements between 2 data frames subject to multiple conditions using base R or dplyr?

Time:09-16

I have two data frames, myDF and index, and I efficiently match between the two as shown below where I create a new column in data frame myDF called alloc_1 which is based on row matches it finds in data frame Index (where row number in data frame index matches eleCnt in data frame myDF):

enter image description here

Preferably using my existing mutate() as a starting point where I especially like index$index[eleCnt] for identifying the index row number, how can I further restrict matching from data frame index to only those rows where cumGrp = 1? Resulting in the below pretend output:

  eleCnt concat_1 alloc_1  explanation
1      1      1.0    10.0
2      2      2.0      NA  since index row 2 cumGrp <> 1, don't match it
3      1      1.0    10.0 
4      2      2.1      NA  since index row 2 cumGrp <> 1, don't match it
5      3      2.2      NA

Below is code for myDF, index, and the dplyr mutate():

myDF <- data.frame(
  eleCnt = c(1,2,1,2,3),
  concat_1 = c(1,2,1,2.1,2.2)
  )

index <- data.frame(
  index = c(10,2.1),
  cumGrp = c(1,2)
  )

library(dplyr)
myDF %>% mutate(alloc_1 = index$index[eleCnt])

CodePudding user response:

myDF %>%
  mutate(alloc_1 = index$index[index$cumGrp == 1][eleCnt])
#   eleCnt concat_1 alloc_1
# 1      1      1.0      10
# 2      2      2.0      NA
# 3      1      1.0      10
# 4      2      2.1      NA
# 5      3      2.2      NA

This is close to a "join" operation, with just a little nudge (adding eleCnt to index). A join operation can be faster, perhaps easier to understand/debug (once you really "grok" join/merge, see How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?), and allows for other methods/join efficiencies as well.

index$eleCnt <- seq_len(nrow(index)) # or whatever they should be
### simple join, not filtered on cumGrp
myDF %>%
  left_join(index, by = "eleCnt")
#   eleCnt concat_1 index cumGrp
# 1      1      1.0  10.0      1
# 2      2      2.0   2.1      2
# 3      1      1.0  10.0      1
# 4      2      2.1   2.1      2
# 5      3      2.2    NA     NA
### improved
index %>%
  filter(cumGrp == 1) %>%
  left_join(myDF, ., by = "eleCnt")
#   eleCnt concat_1 index cumGrp
# 1      1      1.0    10      1
# 2      2      2.0    NA     NA
# 3      1      1.0    10      1
# 4      2      2.1    NA     NA
# 5      3      2.2    NA     NA

In the join/merge case, note that all other columns are brought in as well. This can be good (so that you don't have to repeat index$index[eleCnt] for each column you want to import), and can easily be controlled by filtering or de-selecting before/after the join.

  • Related