Home > Software engineering >  How to use dplyr to fill cells meeting a criteria with the value of the cell immediately above itera
How to use dplyr to fill cells meeting a criteria with the value of the cell immediately above itera

Time:09-15

I am trying to use base R or dplyr to fill dataframe column cells that meet a specified criteria (value of 0 in this case) with the last value above before the 0. So, suppose we have the below data frame myDF, with columns Element, Group, eleCnt, and reSeq generated by the code immediately beneath. I would like to either add a new column "ADD" or modify the existing column reSeq with the manually input values I show under "ADD". Any recommendations for doing this in Base R or dplyr?

   Element Group eleCnt reSeq
   <chr>   <dbl>  <int> <dbl>  ADD  << ADD Explained
 1 R           0      1     1    1     since reSeq not 0 use current row reSeq value
 2 R           0      2     2    2     same as above
 3 X           0      1     1    1     same as above
 4 X           1      2     2    2     same as above
 5 X           1      3     0    2     since reSeq = 0 use reSeq value from last row prior to the 0
 6 X           0      4     4    4     since reSeq not 0 use current row reSeq value
 7 X           0      5     5    5     same as above
 8 X           0      6     6    6     same as above
 9 B           0      1     1    1     same as above
10 R           0      3     3    3     same as above
11 R           2      4     4    4     same as above
12 R           2      5     0    4     since reSeq = 0 use reSeq value from last row prior to the 0
13 X           3      7     7    7     since reSeq not 0 use current row reSeq value
14 X           3      8     0    7     since reSeq = 0 use last reSeq value <> 0 above
15 X           3      9     0    7     since reSeq = 0 use last reSeq value <> 0 above

myDF %>% 
  group_by(Element) %>%
    mutate(eleCnt = row_number()) %>%
  ungroup()%>%
  mutate(reSeq = ifelse(Group == 0 | Group != lag(Group), eleCnt,0) 
  )

Running dput(myDF):

> dput(myDF)
structure(list(Element = c("R", "R", "X", "X", "X", "X", "X", 
"X", "B", "R", "R", "R", "X", "X", "X"), Group = c(0, 0, 0, 1, 
1, 0, 0, 0, 0, 0, 2, 2, 3, 3, 3)), class = "data.frame", row.names = c(NA, 
-15L))

CodePudding user response:

We may change the 0s to NA and then do a group by fill

library(dplyr)
library(tidyr)
df1 %>%
   mutate(ADD = na_if(reSeq, 0)) %>%
   group_by(Element) %>%
   fill(ADD) %>%
   ungroup

-output

# A tibble: 15 × 5
   Element Group eleCnt reSeq   ADD
   <chr>   <dbl>  <int> <dbl> <dbl>
 1 R           0      1     1     1
 2 R           0      2     2     2
 3 X           0      1     1     1
 4 X           1      2     2     2
 5 X           1      3     0     2
 6 X           0      4     4     4
 7 X           0      5     5     5
 8 X           0      6     6     6
 9 B           0      1     1     1
10 R           0      3     3     3
11 R           2      4     4     4
12 R           2      5     0     4
13 X           3      7     7     7
14 X           3      8     0     7
15 X           3      9     0     7

The steps in the creation of 'ADD' can be simplified from the original data 'myDF'

library(data.table)
myDF %>% 
  mutate(ADD =  rowid(Element) * NA^!(Group == 0 |!duplicated(Group))) %>% 
 group_by(Element) %>% 
 fill(ADD) %>% 
 ungroup

-output

# A tibble: 15 × 3
   Element Group   ADD
   <chr>   <dbl> <dbl>
 1 R           0     1
 2 R           0     2
 3 X           0     1
 4 X           1     2
 5 X           1     2
 6 X           0     4
 7 X           0     5
 8 X           0     6
 9 B           0     1
10 R           0     3
11 R           2     4
12 R           2     4
13 X           3     7
14 X           3     7
15 X           3     7

data

df1 <- structure(list(Element = c("R", "R", "X", "X", "X", "X", "X", 
"X", "B", "R", "R", "R", "X", "X", "X"), Group = c(0, 0, 0, 1, 
1, 0, 0, 0, 0, 0, 2, 2, 3, 3, 3), eleCnt = c(1L, 2L, 1L, 2L, 
3L, 4L, 5L, 6L, 1L, 3L, 4L, 5L, 7L, 8L, 9L), reSeq = c(1, 2, 
1, 2, 0, 4, 5, 6, 1, 3, 4, 0, 7, 0, 0)), row.names = c(NA, -15L
), class = c("tbl_df", "tbl", "data.frame"))
  • Related