Generate new database with some criteria in R

Time:10-05

I would like to generate a new data frame from df1 that is checked row by row: in each row, the columns that contain zero and have no non-zero value after them should be excluded.

Example: in row 1, DR08, DR09 and DR010 are zero, but DR011 equals 4, so these columns must be kept because a non-zero value follows them. DR012, DR013 and DR014, however, are excluded, since no non-zero value comes after them. In row 2, the data is kept only up to DR07, because everything after it is 0. I hope the idea is clear.
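To make the rule concrete, here is a minimal sketch on a single row, using the DR07 to DR014 values of row 1 from the data below. The names `last_nonzero` and `kept` are only illustrative and do not appear in the original post.

```r
# Values of row 1 from DR07 onwards
x <- c(DR07 = 2, DR08 = 0, DR09 = 0, DR010 = 0,
       DR011 = 4, DR012 = 0, DR013 = 0, DR014 = 0)

# Position of the last non-zero value in the row
last_nonzero <- max(which(x != 0))
names(x)[last_nonzero]
# "DR011"

# Everything up to that position is kept, the trailing zeros are dropped
kept <- x[seq_len(last_nonzero)]
names(kept)
# "DR07" "DR08" "DR09" "DR010" "DR011"
```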

df1 <- structure(
  list(date = c("2021-06-28", "2021-06-28", "2021-06-28", "2021-06-28",
                "2021-06-28", "2021-06-28", "2021-06-28", "2021-06-28"),
       DR01 = c(4,1,4,3,3,4,3,6), DR02 = c(4,2,6,7,3,2,7,4),
       DR03 = c(9,5,4,3,3,2,1,5), DR04 = c(5,4,3,3,6,2,1,9),
       DR05 = c(5,4,5,3,6,2,1,9), DR06 = c(2,4,3,3,5,6,7,8),
       DR07 = c(2,5,4,4,9,4,7,8), DR08 = c(0,0,0,1,2,0,0,0),
       DR09 = c(0,0,0,0,0,0,0,0), DR010 = c(0,0,0,0,0,0,0,0),
       DR011 = c(4,0,0,0,0,0,0,0), DR012 = c(0,0,0,3,0,0,0,5),
       DR013 = c(0,0,1,0,0,0,2,0), DR014 = c(0,0,0,1,0,2,0,0)),
  class = "data.frame", row.names = c(NA, -8L))

        date DR01 DR02 DR03 DR04 DR05 DR06 DR07 DR08 DR09 DR010 DR011 DR012 DR013 DR014
1 2021-06-28    4    4    9    5    5    2    2    0    0     0     4     0     0     0
2 2021-06-28    1    2    5    4    4    4    5    0    0     0     0     0     0     0
3 2021-06-28    4    6    4    3    5    3    4    0    0     0     0     0     1     0
4 2021-06-28    3    7    3    3    3    3    4    1    0     0     0     3     0     1
5 2021-06-28    3    3    3    6    6    5    9    2    0     0     0     0     0     0
6 2021-06-28    4    2    2    2    2    6    4    0    0     0     0     0     0     2
7 2021-06-28    3    7    1    1    1    7    7    0    0     0     0     0     2     0
8 2021-06-28    6    4    5    9    9    8    8    0    0     0     0     5     0     0

CodePudding user response:

I am not entirely sure what the OP wants. If the goal is to replace the trailing zeros of each row, within the selected columns, with NA, we can reshape with pivot_longer, number the runs of zero and non-zero values with data.table::rleid, group by the original row, and replace the values where value == 0 and rleid == last(rleid). pivot_wider then restores the original layout.

library(dplyr)
library(tidyr)
library(data.table)

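df1 %>%
  mutate(index = row_number()) %>%                  # remember the original row
  pivot_longer(starts_with('DR')) %>%
  mutate(rleid = rleid(value == 0)) %>%             # run id of zero / non-zero stretches
  group_by(index) %>%
  # zeros that belong to the last run of a row are trailing zeros
  mutate(value = replace(value, value == 0 & rleid == last(rleid), NA)) %>%
  # index is the grouping variable here, so select() keeps it in the result
  select(-index, -rleid) %>%
  pivot_wider(names_from = name, values_from = value)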
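The run-id idea behind the pipeline above can be sketched on a single row in base R. This is only an illustration: `cumsum()` over `diff()` stands in for data.table::rleid here, and the names `is_zero` and `run_id` are assumptions made for the example.

```r
# DR07 to DR014 of row 1
x <- c(2, 0, 0, 0, 4, 0, 0, 0)

# A new run starts whenever (x == 0) flips between TRUE and FALSE
is_zero <- x == 0
run_id <- cumsum(c(TRUE, diff(is_zero) != 0))
run_id
# 1 2 2 2 3 4 4 4

# Only zeros that sit in the last run of the row are trailing zeros
x[is_zero & run_id == run_id[length(run_id)]] <- NA
x
# 2 0 0 0 4 NA NA NA
```

If the row ends in a non-zero value, the last run contains no zeros and nothing is replaced.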

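We can also use purrr::pmap and purrr::accumulate. The columns are selected with num_range in reverse order (DR014 down to DR01), each row is collected with c(...), and a zero is turned into NA as long as everything visited before it in that reversed order is already NA: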

library(dplyr)
library(purrr)
library(tidyr)

df1 %>%
  mutate(pmap(across(num_range('DR0', 14:1)), ~{vec <- c(...)
    accumulate(vec, \(x, y) ifelse(is.na(x) && y == 0, NA, y), .init = NA)}),
    .keep = 'unused') %>%
  unnest_wider(`pmap(...)`) %>%
  select(date, num_range('DR0', 1:14))

Output (from the first snippet; the grouping column index is retained):

# A tibble: 8 x 16
# Groups:   index [8]
  index date        DR01  DR02  DR03  DR04  DR05  DR06  DR07  DR08  DR09 DR010 DR011 DR012 DR013 DR014
  <int> <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 2021-06-28     4     4     9     5     5     2     2     0     0     0     4    NA    NA    NA
2     2 2021-06-28     1     2     5     4     4     4     5    NA    NA    NA    NA    NA    NA    NA
3     3 2021-06-28     4     6     4     3     5     3     4     0     0     0     0     0     1    NA
4     4 2021-06-28     3     7     3     3     3     3     4     1     0     0     0     3     0     1
5     5 2021-06-28     3     3     3     6     6     5     9     2    NA    NA    NA    NA    NA    NA
6     6 2021-06-28     4     2     2     2     2     6     4     0     0     0     0     0     0     2
7     7 2021-06-28     3     7     1     1     1     7     7     0     0     0     0     0     2    NA
8     8 2021-06-28     6     4     5     9     9     8     8     0     0     0     0     5    NA    NA

CodePudding user response:

If by excluding you mean turning them into NA, you can use this base R method with apply.

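# apply() converts df1 to a character matrix because of the date column,
# so every column of the result is character (hence the <NA> in the output)
df1[] <- t(apply(df1, 1, function(x) {
  # first position after the last non-zero value of the row
  inds <- max(which(x != 0)) + 1
  if(length(inds) && inds <= length(x))
      x[inds:length(x)] <- NA
  x
}))
df1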

#        date DR01 DR02 DR03 DR04 DR05 DR06 DR07 DR08 DR09 DR010 DR011 DR012 DR013 DR014
#1 2021-06-28    4    4    9    5    5    2    2    0    0     0     4  <NA>  <NA>  <NA>
#2 2021-06-28    1    2    5    4    4    4    5 <NA> <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#3 2021-06-28    4    6    4    3    5    3    4    0    0     0     0     0     1  <NA>
#4 2021-06-28    3    7    3    3    3    3    4    1    0     0     0     3     0     1
#5 2021-06-28    3    3    3    6    6    5    9    2 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#6 2021-06-28    4    2    2    2    2    6    4    0    0     0     0     0     0     2
#7 2021-06-28    3    7    1    1    1    7    7    0    0     0     0     0     2  <NA>
#8 2021-06-28    6    4    5    9    9    8    8    0    0     0     0     5  <NA>  <NA>
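If the DR columns should stay numeric, one possible variation is to run the same logic on those columns only. The sketch below uses a small stand-in data frame rather than the full df1, and the helper `trim_trailing_zeros` is a hypothetical name. It also assumes that a row consisting only of zeros should become all NA, which the question does not specify.

```r
# Small stand-in for df1 (hypothetical subset, not the original data)
df <- data.frame(
  date = c("2021-06-28", "2021-06-28", "2021-06-28"),
  DR01 = c(4, 1, 0),
  DR02 = c(0, 0, 0),
  DR03 = c(4, 0, 0),
  DR04 = c(0, 0, 0)
)

# Only the DR columns are scanned, so no coercion to character happens
dr_cols <- grep("^DR", names(df), value = TRUE)

trim_trailing_zeros <- function(x) {
  nz <- which(x != 0)
  # 0 means "no non-zero value in this row" (assumption: blank the whole row)
  last_nonzero <- if (length(nz)) max(nz) else 0L
  if (last_nonzero < length(x)) x[(last_nonzero + 1):length(x)] <- NA
  x
}

# apply() returns one column per row, so the result is transposed back
df[dr_cols] <- t(apply(df[dr_cols], 1, trim_trailing_zeros))
df
```

In this sketch row 1 keeps 4, 0, 4 and loses only DR04, row 2 keeps only DR01, and the all-zero row 3 ends up entirely NA.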