I have a dataframe that contains (among other things) three columns that have missing values every 5 rows. These missing values need to be replaced with the sum of the previous 4 values in their respective column.
For example, let's say my dataframe looked like this:
id category1 category2 category3
123 5 10 10
123 6 11 15
123 6 12 23
123 4 10 6
123 NA NA NA
567 24 17 15
Those NAs need to represent a "total" based on the sum of the previous 4 values in their column, and this needs to repeat throughout the entire dataframe because the NAs occur every 5 rows. For instance, the three NAs in the mock example above should be replaced with 21, 43, and 54. 5 rows later, the same process will need to be repeated. How can I achieve this?
CodePudding user response:
Another possible solution:
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(everything(), ~ if_else(is.na(.x), sum(.x, na.rm = T), .x))) %>%
ungroup
#> # A tibble: 6 × 4
#> id category1 category2 category3
#> <int> <int> <int> <int>
#> 1 123 5 10 10
#> 2 123 6 11 15
#> 3 123 6 12 23
#> 4 123 4 10 6
#> 5 123 21 43 54
#> 6 567 24 17 15
CodePudding user response:
The following should work if there are no occurrences of NA
values within the first 4 rows and I am assuming that the NA
values appear in all columns at the same time.
for(i in 1:nrow(data)){
if(is.na(data[i, 2])){
data[i, 2] <- sum(data[seq(i-5, i-1), 2])
data[i, 3] <- sum(data[seq(i-5, i-1), 3])
data[i, 4] <- sum(data[seq(i-5, i-1), 4])
}
}
CodePudding user response:
If the NA
s appear at the end row for each 'id', we may remove it and do a group by summarise to create a row
library(dplyr)
df1 <- df1 %>%
na.omit %>%
group_by(id) %>%
summarise(across(everything(), ~ c(.x, sum(.x))), .groups = 'drop')
-output
df1
# A tibble: 7 × 4
id category1 category2 category3
<int> <int> <int> <int>
1 123 5 10 10
2 123 6 11 15
3 123 6 12 23
4 123 4 10 6
5 123 21 43 54
6 567 24 17 15
7 567 24 17 15
Or another approach would be to replace the NA with the sum
using na.aggregate
from zoo
library(zoo)
df1 %>%
group_by(id) %>%
mutate(across(everything(), na.aggregate, FUN = sum)) %>%
ungroup
# A tibble: 6 × 4
id category1 category2 category3
<int> <int> <int> <int>
1 123 5 10 10
2 123 6 11 15
3 123 6 12 23
4 123 4 10 6
5 123 21 43 54
6 567 24 17 15
data
df1 <- structure(list(id = c(123L, 123L, 123L, 123L, 123L, 567L),
category1 = c(5L,
6L, 6L, 4L, NA, 24L), category2 = c(10L, 11L, 12L, 10L, NA, 17L
), category3 = c(10L, 15L, 23L, 6L, NA, 15L)),
class = "data.frame", row.names = c(NA,
-6L))