How to select and remove rows based on position within a specific range in R

Suppose I have two data frames like this:

df1 <- data.frame(a = c(1,2,4,0,0),
             b = c(0,3,5,5,0),
             c = c(0,0,6,7,6))

df2 <- data.frame(a = c(3,6,8,0,0),
             b = c(0,9,10,4,0),
             c = c(0,0,1,4,9))

And then I join them, like this:

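library(dplyr)  # provides full_join()

df3 <- full_join(df1, df2)  # no `by` given, so the join uses all shared columns: a, b, c
print(df3)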

   a  b c
1  1  0 0
2  2  3 0
3  4  5 6
4  0  5 7
5  0  0 6
6  3  0 0
7  6  9 0
8  8 10 1
9  0  4 4
10 0  0 9

Note that the pattern is always the same: zeros in rows 1 and 2, zeros in rows 9 and 10, and zeros in rows 4 to 7. I want to remove only the zeros in rows 4 to 7, so that the remaining values in each column shift up. I can do it like this:

df3[4,1] <- NA
df3[5,1] <- NA
df3[5,2] <- NA
df3[6,2] <- NA
df3[6,3] <- NA
df3[7,3] <- NA

new.df3 <-  as.data.frame(lapply(df3, na.omit))
print(new.df3)

  a  b c
1 1  0 0
2 2  3 0
3 4  5 6
4 3  5 7
5 6  9 6
6 8 10 1
7 0  4 4
8 0  0 9

But this is not elegant, and it is very time-consuming. Any thoughts? I really appreciate it, thanks in advance. Best!

CodePudding user response:

First, find which columns hold a zero in each of rows 4 to 7.

# apply() returns a named list here because the rows contain different numbers of zeros
to_remove <- apply(df3[4:7, ], 1, function(x) which(x == 0))

Then, replace them with NAs.

for (i in seq_along(to_remove)) {
  df3[as.numeric(names(to_remove))[i], to_remove[[i]]] <- NA
}

And, finally, drop them.

new.df3 <-  as.data.frame(lapply(df3, na.omit))
print(new.df3)
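
The loop can also be avoided, because a logical matrix can index the block of rows directly. What follows is a minimal base-R sketch of that idea, not the answerer's code. It assumes `rbind()` as a stand-in for `full_join()` so that no packages are needed; the two give the same ten rows here because no row of df1 matches a row of df2. The names `rows` and `block` are illustrative.

```r
df1 <- data.frame(a = c(1, 2, 4, 0, 0),
                  b = c(0, 3, 5, 5, 0),
                  c = c(0, 0, 6, 7, 6))
df2 <- data.frame(a = c(3, 6, 8, 0, 0),
                  b = c(0, 9, 10, 4, 0),
                  c = c(0, 0, 1, 4, 9))
# Assumption: rbind() stands in for full_join() so the sketch needs no packages
df3 <- rbind(df1, df2)

rows <- 4:7
block <- df3[rows, ]
block[block == 0] <- NA      # mark every zero inside rows 4 to 7
df3[rows, ] <- block         # write the marked block back

# Drop the NAs column by column; the result is only rectangular
# when every column loses the same number of values
new.df3 <- as.data.frame(lapply(df3, function(x) x[!is.na(x)]))
print(new.df3)
```

Using `x[!is.na(x)]` instead of `na.omit()` keeps the columns as plain vectors without the extra `na.action` attribute.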

CodePudding user response:

Here is a tidyverse solution. Note that it will fail unless every variable in the data frame has the same number of zeros to omit in the specified range.

library(tidyverse)

df3 |>
  # Define the range of rows for omission
  mutate(flag = row_number() %in% 4:7) |>
  # Group variable to preserve order, increments when flag changes
  group_by(grp = (flag - lag(flag)) |>
             replace_na(0) |>
             abs() |>
             cumsum()) |>
  summarize(across(a:c,
                   # when flag then omit zeros
                   \(x, flag) if (first(flag)) x[x != 0] else x,
                   flag = flag),
            .groups = "drop"
            ) |>
  select(-grp)

#> # A tibble: 8 × 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     0     0
#> 2     2     3     0
#> 3     4     5     6
#> 4     3     5     7
#> 5     6     9     6
#> 6     8    10     1
#> 7     0     4     4
#> 8     0     0     9

If you don't care about preserving the order of the rows, then it can be shortened:

df3 |>
  # Define the range of rows for omission
  group_by(flag = row_number() %in% 4:7) |>
  summarize(across(a:c,
                   # when flag then omit zeros
                   \(x, flag) if (first(flag)) x[x != 0] else x,
                   flag = flag),
            .groups = "drop") |> 
  select(-flag)
#> # A tibble: 8 × 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     0     0
#> 2     2     3     0
#> 3     4     5     6
#> 4     8    10     1
#> 5     0     4     4
#> 6     0     0     9
#> 7     3     5     7
#> 8     6     9     6

Created on 2022-02-08 by the reprex package (v2.0.1)
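
The caveat at the top of this answer can be checked before running either pipeline. A small base-R sketch, assuming the example data rebuilt with `rbind()` in place of `full_join()`; the names `zero_counts` and `same_count` are illustrative. It counts the zeros each column would lose inside rows 4 to 7:

```r
df3 <- rbind(
  data.frame(a = c(1, 2, 4, 0, 0), b = c(0, 3, 5, 5, 0), c = c(0, 0, 6, 7, 6)),
  data.frame(a = c(3, 6, 8, 0, 0), b = c(0, 9, 10, 4, 0), c = c(0, 0, 1, 4, 9))
)

# Number of zeros each column would lose in rows 4 to 7
zero_counts <- colSums(df3[4:7, ] == 0)
print(zero_counts)

# TRUE only when all columns lose the same number of values
same_count <- length(unique(zero_counts)) == 1
print(same_count)
```

If `same_count` is FALSE, the columns would end up with different lengths and could not be recombined into one data frame without padding.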

CodePudding user response:

Here's a different approach:

mask <- !(seq(nrow(df3)) %in% 4:7 & df3 == 0)
df.lst <- lapply(1:3, function(x) df3[mask[, x], x])
sapply(df.lst, length)
# [1] 8 8 8  # Check to make sure the columns are the same length
names(df.lst) <- colnames(df3)
(new.df3 <- as.data.frame(df.lst))
#   a  b c
# 1 1  0 0
# 2 2  3 0
# 3 4  5 6
# 4 3  5 7
# 5 6  9 6
# 6 8 10 1
# 7 0  4 4
# 8 0  0 9
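
The same mask idea can be wrapped in a function so the row range is a parameter and the length check happens automatically. This is only a sketch: `drop_zeros_in_rows` is a hypothetical helper name, and `rbind()` is assumed as a stand-in for `full_join()` to keep the example free of packages.

```r
# Hypothetical helper: drop zeros from each column, but only within `rows`
drop_zeros_in_rows <- function(df, rows) {
  in_range <- seq_len(nrow(df)) %in% rows
  keep <- !(in_range & df == 0)   # logical matrix, TRUE = keep the cell
  cols <- lapply(seq_along(df), function(j) df[keep[, j], j])
  lens <- lengths(cols)
  # Refuse to continue if the columns can no longer form a data frame
  if (length(unique(lens)) != 1) {
    stop("columns would end up with different lengths: ",
         paste(lens, collapse = ", "))
  }
  names(cols) <- names(df)
  as.data.frame(cols)
}

df3 <- rbind(
  data.frame(a = c(1, 2, 4, 0, 0), b = c(0, 3, 5, 5, 0), c = c(0, 0, 6, 7, 6)),
  data.frame(a = c(3, 6, 8, 0, 0), b = c(0, 9, 10, 4, 0), c = c(0, 0, 1, 4, 9))
)

new.df3 <- drop_zeros_in_rows(df3, 4:7)
print(new.df3)
```

Calling it with a range such as `1:7`, where the columns hold different numbers of zeros, stops with an error instead of silently recycling or truncating values.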

CodePudding user response:

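library(dplyr)
# Flag rows 4 to 7, then keep every value except the zeros in flagged rows.
# dplyr 1.1.0 and later warn about a multi-row summarise() and suggest reframe() instead.
df3 %>%
  mutate(rn = between(row_number(), 4, 7)) %>%
  summarise(across(-rn, ~.x[!(.x == 0 & rn)]))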
  a  b c
1 1  0 0
2 2  3 0
3 4  5 6
4 3  5 7
5 6  9 6
6 8 10 1
7 0  4 4
8 0  0 9