I am trying to classify the effectiveness of a treatment. Each id should contain 4 timeframes.
Dataframe
id | timeframe | distance |
---|---|---|
1 | 1 | 1.1 |
1 | 2 | 1.1 |
1 | 3 | 1.2 |
1 | 4 | 1.1 |
2 | 1 | 1.1 |
2 | 2 | 1.1 |
2 | 4 | 1.1 |
The problem is that some rows are missing: for example, id 2 has no timeframe 3. How can I add a new row for each missing timeframe, with distance set to the average distance value, for every id with this issue?
I am getting the 'not all time is the same length' error when running longitudinal clustering with "longitudinal k-means (KML)".
CodePudding user response:
We can use complete to create the missing combinations and then replace the NA with the mean
library(dplyr)
library(tidyr)
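df1 %>%
  # rn marks the original rows; rows added by complete() get rn = NA
  mutate(rn = row_number()) %>%
  complete(id, timeframe) %>%
  # fill only the newly added rows with the overall mean distance
  mutate(distance = replace(distance, is.na(distance) & is.na(rn),
                            mean(distance, na.rm = TRUE)))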
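One caveat with the snippet above: complete(id, timeframe) only combines timeframe values that already occur somewhere in the data, so a timeframe missing for every id would not be added. A minimal sketch of a variant that spells out the expected range and uses the fill argument of complete is shown here. It assumes the timeframes are always 1 to 4 (taken from the question), and the names overall_mean and filled are just illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (id 2 has no timeframe 3)
df1 <- data.frame(
  id        = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  timeframe = c(1L, 2L, 3L, 4L, 1L, 2L, 4L),
  distance  = c(1.1, 1.1, 1.2, 1.1, 1.1, 1.1, 1.1)
)

# Mean over the observed rows, computed before any rows are added
overall_mean <- mean(df1$distance, na.rm = TRUE)

# Assumption: every id should have timeframes 1 to 4.
# fill supplies the value for distance in the rows that complete() adds;
# by default it also replaces NAs that were already present in distance.
filled <- df1 %>%
  complete(id, timeframe = 1:4, fill = list(distance = overall_mean))
```

Unlike the rn approach, this does not distinguish between rows that were added and rows that already had a missing distance, so it is only equivalent when the original data has no NA in distance.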
If the mean should be calculated within each 'id', then do a group_by before the mutate
df1 %>%
  mutate(rn = row_number()) %>%
  complete(id, timeframe) %>%
  group_by(id) %>%
  mutate(distance = replace(distance, is.na(distance) & is.na(rn),
                            mean(distance, na.rm = TRUE))) %>%
  ungroup()
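To check that the 'not all time is the same length' problem is actually gone, it can help to reshape the filled data to one row per id and confirm there are no gaps. The sketch below is only an illustration: it assumes timeframes 1 to 4, uses coalesce instead of the rn trick (so it would also fill a distance that was already NA), and the names wide and all_complete are made up here. Whether KML wants exactly this wide layout is an assumption, so check the package documentation.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (id 2 has no timeframe 3)
df1 <- data.frame(
  id        = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  timeframe = c(1L, 2L, 3L, 4L, 1L, 2L, 4L),
  distance  = c(1.1, 1.1, 1.2, 1.1, 1.1, 1.1, 1.1)
)

wide <- df1 %>%
  # add the missing id/timeframe rows (distance is NA there)
  complete(id, timeframe = 1:4) %>%
  group_by(id) %>%
  # replace each NA with the mean distance of the same id
  mutate(distance = coalesce(distance, mean(distance, na.rm = TRUE))) %>%
  ungroup() %>%
  # one row per id, one column per timeframe: t1, t2, t3, t4
  pivot_wider(names_from = timeframe, values_from = distance,
              names_prefix = "t")

# TRUE when every id has a value for every timeframe
all_complete <- !anyNA(wide)
```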