Home > Back-end >  Check and Match each row by two variables in R
Check and Match each row by two variables in R

Time:06-15

The data looks like this

   A   B   C   D
1  a   a   1   3
2  b   c   1   2
3  d   d   0   1
4  e   f   2   3

But I would like to match Column A and B (both characters) and shift down the entire row if unmatched, replacing the empty cells with NA. The final data looks like this

   A   B   C   D
1  a   a   1   3
2  b   b  NA  NA
3  c   c   1   2
4  d   d   0   1
5  e   e  NA  NA
6  f   f   2   3

Thanks for your help!

CodePudding user response:

A few data.table solutions:

library(data.table)

dt <- data.table(A = c("a","b","d","e"), B = c("a","c","d","f"), C = c(1,1,0,2), D = c(3,2,1,3))
f1 <- function(dt) dt[, if(A == B) .SD else .(c(A, B), c(NA, C), c(NA, D)), A][, A := B]
f2 <- function(dt) dt[, if (A == B) .SD else rbindlist(.(.(A), .SD), fill = TRUE), A][, A := B]
f3 <- function(dt) setorder(rbindlist(list(copy(dt)[B != A, `:=`(B = A, C = NA, D = NA)], dt[A != B][, A := B])), A)
f4 <- function(dt) data.table(A = unique(c(t(dt[,A:B]))))[, B := A][dt, `:=`(C = C, D = D), on = .(A = B)]
f1(dt)[]
#>    A B  C  D
#> 1: a a  1  3
#> 2: b b NA NA
#> 3: c c  1  2
#> 4: d d  0  1
#> 5: e e NA NA
#> 6: f f  2  3
microbenchmark::microbenchmark(f1 = f1(dt),
                               f2 = f2(dt),
                               f3 = f3(dt),
                               f4 = f4(dt),
                               check = "equal")
#> Unit: microseconds
#>  expr     min       lq     mean    median        uq      max neval
#>    f1 441.401  471.501  502.336  496.7015  520.2015  616.902   100
#>    f2 521.801  566.901  651.711  593.9005  640.1510 3147.901   100
#>    f3 592.901  635.751  736.932  663.2010  688.4510 3327.202   100
#>    f4 955.100 1012.501 1112.545 1053.3005 1116.5510 3996.102   100

The first option is the fastest, but the second option will work regardless the number of columns after A and B.

CodePudding user response:

A tidyverse solution:

library(tidyverse)

df %>%
  complete(A = c(A, B)) %>%
  mutate(B = A, .keep = "used") %>%
  left_join(
    mutate(df, key = pmax(A, B), .keep = "unused"),
    by = c("A" = "key")
  )

# # A tibble: 6 × 4
#   A     B         C     D
#   <chr> <chr> <int> <int>
# 1 a     a         1     3
# 2 b     b        NA    NA
# 3 c     c         1     2
# 4 d     d         0     1
# 5 e     e        NA    NA
# 6 f     f         2     3

If you want to shift "up" the entire row if unmatched, just replace pmax() with pmin().


Data
df <- structure(
  list(A = c("a", "b", "d", "e"), B = c("a", "c", "d", "f"),
       C = c(1L, 1L, 0L, 2L), D = c(3L, 2L, 1L, 3L)),
  class = "data.frame", row.names = c(NA, -4L)
)
  • Related