Home > Software engineering >  Replace matching column values in multiple dataframes with NA in R
Replace matching column values in multiple dataframes with NA in R

Time:07-11

I have a dataframe which is like the followng:

dat <- data.frame(participant = c(rep("01", times = 3), rep("02", times = 3)),
                  target = rep(c("1", "2", "3"), times = 2),
                  eucDist = c(0.06, 0.16, 0.89, 0.10, 0.11, 0.75),
                  eucDist2 = c(0.09, 0.04, 0.03, 0.05, 1.45, 0.09))

   participant target eucDist eucDist2
1          01      1    0.06     0.09
2          01      2    0.16     0.04
3          01      3    0.89     0.03
4          02      1    0.10     0.05
5          02      2    0.11     1.45
6          02      3    0.75     0.09

I have run some code to identify outliers in the eucDist and eucDist2 columns, which I have saved in separate dataframes. Examples of these can be seen here:

outliers1 <- data.frame(participant = c("01", "02"),
              target = c("1", "3"),
              eucDist = c(0.06, 0.75),
              eucDist2 = c(0.09, 0.09))

   participant target eucDist eucDist2
1          01      1    0.06     0.09
2          02      3    0.75     0.09

outliers2 <- data.frame(participant = "01",
                        target = "1",
                        eucDist = 0.06,
                        eucDist2 = 0.09)

  participant target eucDist eucDist2
1          01      1    0.06     0.09

The rows shown in Outliers1 indicate outliers in the eucDist column in dat, and the row shown in Outliers2 indicates an outlier in the eucDist2 column.

I would like to replace the outlier values in the eucDist and eucDist2 columns of datwith 'NA'. I do not want to remove whole rows because in many cases either the eucDist or eucDist2 values are usable, and removing the entire row would remove both variables.

Here is what I would like:

    participant target eucDist eucDist2
1          01      1    NA       NA
2          01      2    0.16     0.04
3          01      3    0.89     0.03
4          02      1    0.10     0.05
5          02      2    0.11     1.45
6          02      3    NA       0.09

I have been attempting this using conditional %in% statements, but can't quite get the phrasing correct and would really appreciate some help. Here is my non-working code:

library(naniar)
dat1 <- if (dat$eucDist %in% Outliers1$eucDist) {
  replace_with_na_all(dat$eucDist)
}

CodePudding user response:

As you have the data frames in this format, you can set the values of the required data to NA in the new data frames, and then update the rows of the original data frames with these values using dplyr::rows_update(). This assumes you have at least dplyr v1.0.0.

library(dplyr)

outliers1$eucDist  <- NA
outliers2$eucDist2  <- NA

dat  |>
    rows_update(outliers1, by = c("participant", "target"))  |>
    rows_update(outliers2, by = c("participant", "target")) 

#   participant target eucDist eucDist2
# 1          01      1      NA     0.09
# 2          01      2    0.16     0.04
# 3          01      3    0.89     0.03
# 4          02      1    0.10     0.05
# 5          02      2    0.11       NA
# 6          02      3      NA     0.09

CodePudding user response:

rbind the two outlier frames while adding column number of respective eucDist. Then run a short and sweet for loop.

outliers <- rbind(cbind(outliers1[1:2], j=3), cbind(outliers2[1:2], j=4))

for (i in seq_len(nrow(outliers))) {
  dat[dat$participant == outliers$participant[i] & dat$target == outliers$target[i], outliers$j[i]] <- NA
}
#   participant target eucDist eucDist2
# 1          01      1      NA     0.09
# 2          01      2    0.16     0.04
# 3          01      3    0.89     0.03
# 4          02      1    0.10     0.05
# 5          02      2    0.11       NA
# 6          02      3      NA     0.09

CodePudding user response:

I would recommend using case_when function in the dplyr package.

dat_final <- dat %>% 
  mutate(eucDist = case_when(
    eucDist %in% outliers1$eucDist ~ as.numeric(NA),
    T ~ eucDist
  ),
  eucDist2 = case_when(
    eucDist2 %in% outliers2$eucDist2 ~ as.numeric(NA),
    T ~ eucDist2
  ))

   > str(dat_final)
'data.frame':   6 obs. of  4 variables:
 $ participant: chr  "01" "01" "01" "02" ...
 $ target     : chr  "1" "2" "3" "1" ...
 $ eucDist    : num  NA 0.16 0.89 0.1 0.11 NA
 $ eucDist2   : num  0.09 0.04 0.03 0.05 NA 0.09
  • Related