The data looks like this
A B C D
1 a a 1 3
2 b c 1 2
3 d d 0 1
4 e f 2 3
But I would like to match Column A and B (both characters) and shift down the entire row if unmatched, replacing the empty cells with NA. The final data looks like this
A B C D
1 a a 1 3
2 b b NA NA
3 c c 1 2
4 d d 0 1
5 e e NA NA
6 f f 2 3
Thanks for your help!
CodePudding user response:
A few data.table
solutions:
library(data.table)
dt <- data.table(A = c("a","b","d","e"), B = c("a","c","d","f"), C = c(1,1,0,2), D = c(3,2,1,3))
f1 <- function(dt) dt[, if(A == B) .SD else .(c(A, B), c(NA, C), c(NA, D)), A][, A := B]
f2 <- function(dt) dt[, if (A == B) .SD else rbindlist(.(.(A), .SD), fill = TRUE), A][, A := B]
f3 <- function(dt) setorder(rbindlist(list(copy(dt)[B != A, `:=`(B = A, C = NA, D = NA)], dt[A != B][, A := B])), A)
f4 <- function(dt) data.table(A = unique(c(t(dt[,A:B]))))[, B := A][dt, `:=`(C = C, D = D), on = .(A = B)]
f1(dt)[]
#> A B C D
#> 1: a a 1 3
#> 2: b b NA NA
#> 3: c c 1 2
#> 4: d d 0 1
#> 5: e e NA NA
#> 6: f f 2 3
microbenchmark::microbenchmark(f1 = f1(dt),
f2 = f2(dt),
f3 = f3(dt),
f4 = f4(dt),
check = "equal")
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> f1 441.401 471.501 502.336 496.7015 520.2015 616.902 100
#> f2 521.801 566.901 651.711 593.9005 640.1510 3147.901 100
#> f3 592.901 635.751 736.932 663.2010 688.4510 3327.202 100
#> f4 955.100 1012.501 1112.545 1053.3005 1116.5510 3996.102 100
The first option is the fastest, but the second option will work regardless the number of columns after A
and B
.
CodePudding user response:
A tidyverse
solution:
library(tidyverse)
df %>%
complete(A = c(A, B)) %>%
mutate(B = A, .keep = "used") %>%
left_join(
mutate(df, key = pmax(A, B), .keep = "unused"),
by = c("A" = "key")
)
# # A tibble: 6 × 4
# A B C D
# <chr> <chr> <int> <int>
# 1 a a 1 3
# 2 b b NA NA
# 3 c c 1 2
# 4 d d 0 1
# 5 e e NA NA
# 6 f f 2 3
If you want to shift "up" the entire row if unmatched, just replace pmax()
with pmin()
.
Data
df <- structure(
list(A = c("a", "b", "d", "e"), B = c("a", "c", "d", "f"),
C = c(1L, 1L, 0L, 2L), D = c(3L, 2L, 1L, 3L)),
class = "data.frame", row.names = c(NA, -4L)
)