I have a tibble with NA and "1" values, and I need to add in a "1" for all values in a row that are between two "1" values less than 4 columns apart (i.e. 3 or fewer columns apart). For example, take this example tibble:
# Example Tibble
ex_input <- tibble( "A" = c(1, NA, NA, NA),
"B" = c(NA, NA, 1, 1),
"C" = c(1, 1, NA, NA),
"D" = c(1, NA, NA, NA),
"E" = c(1, NA, NA, NA),
"F" = c(1, NA, NA, NA),
"G" = c(1, 1, NA, NA),
"H" = c(1, NA, NA, 1),
"I" = c(1, NA, NA, NA),
"J" = c(1, NA, 1, 1))
Which looks like:
> print(ex_input)
# A tibble: 4 x 10
A B C D E F G H I J
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 1 1 1 1 1 1 1 1
2 NA NA 1 NA NA NA 1 NA NA NA
3 NA 1 NA NA NA NA NA NA NA 1
4 NA 1 NA NA NA NA NA 1 NA 1
What I need at the end is for an output in which a "1" is added - in this example - to B1, D2, C2, E2, and I2 because those are all in a row between two "1" values less than 4 columns apart. Like so:
> print(output)
# A tibble: 4 x 10
A B C D E F G H I J
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1 1 1 1
2 NA NA 1 1 1 1 1 NA NA NA
3 NA 1 NA NA NA NA NA NA NA 1
4 NA 1 NA NA NA NA NA 1 1 1
Thank you in advance for your help!
CodePudding user response:
We can identify the position of run lengths of NA that are less or equal to 4, not at the beginning or end of a row, and assign a "1" to ex_input
based on those. First I change base rle
a bit to return a data.frame which is a bit easier to work with.
rlen <- function (x) {
if (!is.vector(x) && !is.list(x)) stop("'x' must be a vector of an atomic type")
n <- length(x)
if (n == 0L) return(data.frame(lengths = integer(), values = x))
y <- x[-1L] != x[-n]
i <- c(which(y | is.na(y)), n)
within(
data.frame(
lengths = diff(c(0L, i)),
values = x[i]), {
end = cumsum(lengths)
start = c(1, end)[1:length(end)]
})
}
is.na(ex_input)
converts the data.frame to TRUE
and FALSE
, circumventing some annoyances with rle
. After the apply
step, we have vectors of positions to be replaced, which can be NULL
.
Using imap
, we can access the list index, and insert that in the row slot of [
, returning invisibly because we are after the side effects.
library(tidyverse)
y <- apply(is.na(ex_input), 1, function(x){
ids <- rlen(x) %>%
mutate(rnum = seq_along(lengths)) %>%
filter(rnum != nrow(.) & rnum != 1 & values & end-start <= 4)
if(nrow(ids) != 0) ids$start:ids$end
})
invisible(imap(y, ~ if(!is.null(.x)) ex_input[.y, .x] <<- 1))
ex_input
# A tibble: 4 x 10
A B C D E F G H I J
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1 1 1 1
2 NA NA 1 1 1 1 1 NA NA NA
3 NA 1 NA NA NA NA NA NA NA 1
4 NA 1 NA NA NA NA NA 1 1 1
CodePudding user response:
Here is one possible solution (though not so elegant in places). I first put the data in a long format. Next, I split the data into a list of dataframes by group (i.e., each row), then I use shift
from data.table
to get values from up to 2 rows ahead and 2 rows behind, which then is bound together, then I get the sum of the values. The logic is that if a given cell is NA
, then there should be at least a sum of 2 from both directions to be able to fill that cell in with a 1
. Then, I unlist
all of the sums, then bind back to the long form of the data, df_long
. The first case_when
is used to find any sums
that is greater than 2, if so, then change value
to 1. The second case_when
is to deal with the D2 and F2, as these would not meet your first criteria of having a 1 in each direction within 4 columns. But we can use lag
and lead
to look ahead and behind and if there is a 1 on both sides, then we can change to 1.
library(tidyverse)
library(data.table)
df_long <- ex_input %>%
mutate(row = row_number()) %>%
pivot_longer(-row)
df <- df_long %>%
group_split(row) %>%
map(., function(x) rowSums(do.call(cbind, shift(x$value, -2:2)), na.rm = TRUE)) %>%
unlist() %>%
bind_cols(df_long, sums = .) %>%
group_by(row) %>%
mutate(value = case_when(value == 1 ~ value,
sums >= 2 ~ 1,
TRUE ~ NA_real_),
value = case_when(value == 1 ~ value,
lead(value) == 1 & lag(value) == 1 ~ 1,
TRUE ~ NA_real_)) %>%
ungroup() %>%
select(-sums) %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-row)
Output
A B C D E F G H I J
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1 1 1 1
2 NA NA 1 1 1 1 1 NA NA NA
3 NA 1 NA NA NA NA NA NA NA 1
4 NA 1 NA NA NA NA NA 1 1 1
CodePudding user response:
This works in base-R (not counting everything being a tibble).
for(i in seq(nrow(ex_input))){
r <- ex_input[i,]
for(cl in seq(ncol(r))){
if(cl 4 > ncol(r)){break()}
r2 <- r[cl:c(cl 4)]
if(sum(r2, na.rm = T) >= 2){
colms <- which(colnames(r2) %in% names(r2[which(!is.na(r2))]))
r[seq(min(colms cl-1), max(colms cl-1))] <- 1
ex_input[i,] <- r
}
}
}
ex_input
# A tibble: 4 x 10
A B C D E F G H I J
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1 1 1 1
2 NA NA 1 1 1 1 1 NA NA NA
3 NA 1 NA NA NA NA NA NA NA 1
4 NA 1 NA NA NA NA NA 1 1 1