Home > OS >  How to use R to replace missing values with the sum of previous 4 values in a column?
How to use R to replace missing values with the sum of previous 4 values in a column?

Time:08-10

I have a dataframe that contains (among other things) three columns that have missing values every 5 rows. These missing values need to be replaced with the sum of the previous 4 values in their respective column.

For example, let's say my dataframe looked like this:

id        category1 category2 category3
123         5        10        10
123         6        11        15
123         6        12        23
123         4        10         6
123        NA        NA        NA
567        24        17        15

Those NAs need to represent a "total" based on the sum of the previous 4 values in their column, and this needs to repeat throughout the entire dataframe because the NAs occur every 5 rows. For instance, the three NAs in the mock example above should be replaced with 21, 43, and 54. 5 rows later, the same process will need to be repeated. How can I achieve this?

CodePudding user response:

Another possible solution:

library(dplyr)

df %>% 
  group_by(id) %>% 
  mutate(across(everything(), ~ if_else(is.na(.x), sum(.x, na.rm = T), .x))) %>% 
  ungroup

#> # A tibble: 6 × 4
#>      id category1 category2 category3
#>   <int>     <int>     <int>     <int>
#> 1   123         5        10        10
#> 2   123         6        11        15
#> 3   123         6        12        23
#> 4   123         4        10         6
#> 5   123        21        43        54
#> 6   567        24        17        15

CodePudding user response:

The following should work if there are no occurrences of NA values within the first 4 rows and I am assuming that the NA values appear in all columns at the same time.

for(i in 1:nrow(data)){
    if(is.na(data[i, 2])){
        data[i, 2] <- sum(data[seq(i-5, i-1), 2])
        data[i, 3] <- sum(data[seq(i-5, i-1), 3])
        data[i, 4] <- sum(data[seq(i-5, i-1), 4])
    }
}

CodePudding user response:

If the NAs appear at the end row for each 'id', we may remove it and do a group by summarise to create a row

library(dplyr)
df1 <- df1 %>%
  na.omit %>% 
  group_by(id) %>%
  summarise(across(everything(), ~ c(.x, sum(.x))), .groups = 'drop')

-output

df1
# A tibble: 7 × 4
     id category1 category2 category3
  <int>     <int>     <int>     <int>
1   123         5        10        10
2   123         6        11        15
3   123         6        12        23
4   123         4        10         6
5   123        21        43        54
6   567        24        17        15
7   567        24        17        15

Or another approach would be to replace the NA with the sum using na.aggregate from zoo

library(zoo)
df1 %>%
   group_by(id) %>%
   mutate(across(everything(), na.aggregate, FUN = sum)) %>% 
   ungroup
# A tibble: 6 × 4
     id category1 category2 category3
  <int>     <int>     <int>     <int>
1   123         5        10        10
2   123         6        11        15
3   123         6        12        23
4   123         4        10         6
5   123        21        43        54
6   567        24        17        15

data

df1 <- structure(list(id = c(123L, 123L, 123L, 123L, 123L, 567L), 
category1 = c(5L, 
6L, 6L, 4L, NA, 24L), category2 = c(10L, 11L, 12L, 10L, NA, 17L
), category3 = c(10L, 15L, 23L, 6L, NA, 15L)),
 class = "data.frame", row.names = c(NA, 
-6L))
  • Related