I have the following df
df<-data.frame(value = c(1,1,1,2,1,1,2,2,1,2),
group = c(5,5,5,6,7,7,8,8,9,10),
no_rows = c(3,3,3,1,2,2,2,2,1,1))
where identical consecutive values form a group, i.e., values in rows 1:3 fall under group 5. Column "no_rows" tells us how many rows/entries each group has, i.e., group 5 has 3 rows/entries.
I am trying to substitute all values, where no_rows < 2, with the value from a previous group. I expect my end df to look like this:
df_end<-data.frame(value = c(1,1,1,1,1,1,2,2,2,2),
group = c(5,5,5,6,7,7,8,8,9,10),
no_rows = c(3,3,3,1,2,2,2,2,1,1))
I came up with this combination of if...else in a for loop, which gives me the desired output, however it is very slow and I am looking for a way to optimise it.
for (i in 2:length(df$group)){
if (df$no_rows[i] < 2){
df$value[i] <- df$value[i-1]
}
}
I have also tried with dplyr::mutate and lag() but it does not give me the desired output (it only removes the first value per group instead of taking the value of a previous group).
df<-df%>%
group_by(group) %>%
mutate(value = ifelse(no_rows < 2, lag(value), value))
I looked for a solution now for a few days but I could not find anything that fit my problem completly. Any ideas?
CodePudding user response:
a data.table approach...
first, get the values of groups with length >=2, then fill in missing values (NA) by last-observation-carried-forward.
library(data.table)
# make it a data.table
setDT(df, key = "group")
# get values for groups of no_rows >= 2
df[no_rows >= 2, new_value := value][]
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 NA
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 NA
#10: 2 10 1 NA
# fill down missing values in new_value
setnafill(df, "locf", cols = c("new_value"))
# value group no_rows new_value
# 1: 1 5 3 1
# 2: 1 5 3 1
# 3: 1 5 3 1
# 4: 2 6 1 1
# 5: 1 7 2 1
# 6: 1 7 2 1
# 7: 2 8 2 2
# 8: 2 8 2 2
# 9: 1 9 1 2
#10: 2 10 1 2