I have a dataframe which is like the followng:
dat <- data.frame(participant = c(rep("01", times = 3), rep("02", times = 3)),
target = rep(c("1", "2", "3"), times = 2),
eucDist = c(0.06, 0.16, 0.89, 0.10, 0.11, 0.75),
eucDist2 = c(0.09, 0.04, 0.03, 0.05, 1.45, 0.09))
participant target eucDist eucDist2
1 01 1 0.06 0.09
2 01 2 0.16 0.04
3 01 3 0.89 0.03
4 02 1 0.10 0.05
5 02 2 0.11 1.45
6 02 3 0.75 0.09
I have run some code to identify outliers in the eucDist and eucDist2 columns, which I have saved in separate dataframes. Examples of these can be seen here:
outliers1 <- data.frame(participant = c("01", "02"),
target = c("1", "3"),
eucDist = c(0.06, 0.75),
eucDist2 = c(0.09, 0.09))
participant target eucDist eucDist2
1 01 1 0.06 0.09
2 02 3 0.75 0.09
outliers2 <- data.frame(participant = "01",
target = "1",
eucDist = 0.06,
eucDist2 = 0.09)
participant target eucDist eucDist2
1 01 1 0.06 0.09
The rows shown in Outliers1
indicate outliers in the eucDist column in dat
, and the row shown in Outliers2
indicates an outlier in the eucDist2 column.
I would like to replace the outlier values in the eucDist and eucDist2 columns of dat
with 'NA'. I do not want to remove whole rows because in many cases either the eucDist or eucDist2 values are usable, and removing the entire row would remove both variables.
Here is what I would like:
participant target eucDist eucDist2
1 01 1 NA NA
2 01 2 0.16 0.04
3 01 3 0.89 0.03
4 02 1 0.10 0.05
5 02 2 0.11 1.45
6 02 3 NA 0.09
I have been attempting this using conditional %in% statements, but can't quite get the phrasing correct and would really appreciate some help. Here is my non-working code:
library(naniar)
dat1 <- if (dat$eucDist %in% Outliers1$eucDist) {
replace_with_na_all(dat$eucDist)
}
CodePudding user response:
As you have the data frames in this format, you can set the values of the required data to NA
in the new data frames, and then update the rows of the original data frames with these values using dplyr::rows_update()
. This assumes you have at least dplyr
v1.0.0.
library(dplyr)
outliers1$eucDist <- NA
outliers2$eucDist2 <- NA
dat |>
rows_update(outliers1, by = c("participant", "target")) |>
rows_update(outliers2, by = c("participant", "target"))
# participant target eucDist eucDist2
# 1 01 1 NA 0.09
# 2 01 2 0.16 0.04
# 3 01 3 0.89 0.03
# 4 02 1 0.10 0.05
# 5 02 2 0.11 NA
# 6 02 3 NA 0.09
CodePudding user response:
rbind
the two outlier frames while adding column number of respective eucDist. Then run a short and sweet for
loop.
outliers <- rbind(cbind(outliers1[1:2], j=3), cbind(outliers2[1:2], j=4))
for (i in seq_len(nrow(outliers))) {
dat[dat$participant == outliers$participant[i] & dat$target == outliers$target[i], outliers$j[i]] <- NA
}
# participant target eucDist eucDist2
# 1 01 1 NA 0.09
# 2 01 2 0.16 0.04
# 3 01 3 0.89 0.03
# 4 02 1 0.10 0.05
# 5 02 2 0.11 NA
# 6 02 3 NA 0.09
CodePudding user response:
I would recommend using case_when
function in the dplyr
package.
dat_final <- dat %>%
mutate(eucDist = case_when(
eucDist %in% outliers1$eucDist ~ as.numeric(NA),
T ~ eucDist
),
eucDist2 = case_when(
eucDist2 %in% outliers2$eucDist2 ~ as.numeric(NA),
T ~ eucDist2
))
> str(dat_final)
'data.frame': 6 obs. of 4 variables:
$ participant: chr "01" "01" "01" "02" ...
$ target : chr "1" "2" "3" "1" ...
$ eucDist : num NA 0.16 0.89 0.1 0.11 NA
$ eucDist2 : num 0.09 0.04 0.03 0.05 NA 0.09