if() statement with paste0() or grep() in R

Time:05-13

I made a minimal reproducible example, but my real data set is very large.


ac_1 <-c(0.1, 0.3, 0.03, 0.03)
ac_2 <-c(0.2, 0.4, 0.1, 0.008)
ac_3 <-c(0.8, 0.043, 0.7, 0.01)
ac_4 <-c(0.2, 0.73, 0.1, 0.1)
c_2<-c(1,2,5,23)
check_1<-c(0.01, 0.902,0.02,0.07)
check_2<-c(0.03, 0.042,0.002,0.00001)
check_3<-c(0.01, 0.02,0.5,0.001)
check_4<-c(0.001, 0.042,0.02,0.2)
id<-1:4


df<-data.frame(id,ac_1, ac_2,ac_3,ac_4,c_2,check_1,check_2,check_3,check_4)

So the data frame looks like this:

> df
  id ac_1  ac_2  ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
1  1 0.10 0.200 0.800 0.20   1   0.010 0.03000   0.010   0.001
2  2 0.30 0.400 0.043 0.73   2   0.902 0.04200   0.020   0.042
3  3 0.03 0.100 0.700 0.10   5   0.020 0.00200   0.500   0.020
4  4 0.03 0.008 0.010 0.10  23   0.070 0.00001   0.001   0.200


What I want to do is this:

If check_1 is 0.02, the corresponding ac_1 value should be set to missing (NA). If check_2 is 0.02, the corresponding ac_2 value should be set to missing. The same rule applies to every pair of "check" and "ac" columns.

For example, in the check_1 column, the person with id 3 has 0.02, so this person's ac_1 score (0.03) should become NA.

In the check_3 column, the person with id 2 has 0.02, so this person's ac_3 score should become NA.

In the check_4 column, the person with id 3 has 0.02, so this person's ac_4 score should become NA.

This is what I tried:



for(i in 1:4){
  
  if(paste0("df$check_",i)==0.02){
    paste0("df$ac_",i)==NA
  }
}

But it did not work.

CodePudding user response:

You're really close, but you're off on a few fundamentals.

  1. You can't (easily) use strings to refer to objects, so "df$check_1" won't work. You can use strings to refer to column names, but not with $; you need [ or [[ instead, so df[["check_1"]] will work.

  2. if isn't vectorized, so it won't work on each value in a column. Use ifelse instead, or even better in this case we can skip the if entirely.

  3. Using == on non-integer numbers is risky due to precision issues. We'll use a tolerance instead.

  4. A minor issue: paste0("df$ac_",i)==NA doesn't assign anything, because == is for checking equality. You need = or <- for assignment on that line.
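
The points above can be seen on a tiny example. This is only an illustrative sketch: the data frame toy and the variable col are made up here and are not part of the question's data.

```r
# Toy data, invented for illustration only
toy <- data.frame(check_1 = c(0.01, 0.02), ac_1 = c(0.5, 0.6))

# Point 1: a string selects a column through [[ ]], not through $
col <- paste0("check_", 1)
toy[[col]]                      # the check_1 column as a numeric vector

# Point 2: if() expects a single TRUE/FALSE, ifelse() works element-wise
# Point 3: compare against 0.02 with a tolerance rather than ==
# Point 4: <- assigns the result back into the column
toy$ac_1 <- ifelse(abs(toy$check_1 - 0.02) < 1e-10, NA, toy$ac_1)

# Why == is risky for non-integer numbers
exact  <- 0.1 + 0.2 == 0.3                    # FALSE
approx <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```

After running this, toy$ac_1 holds 0.5 in the first row and NA in the second.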

Addressing all of these issues:

for(i in 1:4){  
  df[
    ## rows to replace
    abs(df[[paste0("check_", i)]] - 0.02) < 1e-10,
    ## column to replace
    paste0("ac_", i)
  ] <- NA
}

df
#   id ac_1  ac_2 ac_3 ac_4 c_2 check_1 check_2 check_3 check_4
# 1  1 0.10 0.200 0.80 0.20   1   0.010 0.03000   0.010   0.001
# 2  2 0.30 0.400   NA 0.73   2   0.902 0.04200   0.020   0.042
# 3  3   NA 0.100 0.70   NA   5   0.020 0.00200   0.500   0.020
# 4  4 0.03 0.008 0.01 0.10  23   0.070 0.00001   0.001   0.200
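
The same replacement can also be written without a loop. The following is a sketch rather than the only way to do it, and it assumes that every check_i column has a matching ac_i column in the same order. The names n, check_cols, ac_cols and hit are introduced here for illustration, and the data frame is rebuilt from the question so the block runs on its own.

```r
# Rebuild the example data from the question
df <- data.frame(
  id      = 1:4,
  ac_1    = c(0.1, 0.3, 0.03, 0.03),
  ac_2    = c(0.2, 0.4, 0.1, 0.008),
  ac_3    = c(0.8, 0.043, 0.7, 0.01),
  ac_4    = c(0.2, 0.73, 0.1, 0.1),
  c_2     = c(1, 2, 5, 23),
  check_1 = c(0.01, 0.902, 0.02, 0.07),
  check_2 = c(0.03, 0.042, 0.002, 0.00001),
  check_3 = c(0.01, 0.02, 0.5, 0.001),
  check_4 = c(0.001, 0.042, 0.02, 0.2)
)

# Build the paired column names (assumes suffixes 1 to 4, as in the question)
n          <- 1:4
check_cols <- paste0("check_", n)
ac_cols    <- paste0("ac_", n)

# Logical matrix: TRUE where a check value is within tolerance of 0.02
hit <- abs(as.matrix(df[check_cols]) - 0.02) < 1e-10

# The columns of hit line up by position with ac_cols,
# so each TRUE cell marks the ac value to blank out
df[ac_cols][hit] <- NA
```

Because the mask is computed for all check columns at once, this may scale better than the loop when there are many column pairs, though that depends on the size and shape of the real data.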

CodePudding user response:

It's often better to work with long-format data, even if just temporarily. Here is an example of doing so, using dplyr and tidyr:

library(dplyr)
library(tidyr)

pivot_longer(df, -c(id, c_2)) %>%
  separate(name, into = c("type", "pos")) %>%
  pivot_wider(names_from = type, values_from = value) %>%
  mutate(ac = if_else(near(check, 0.02), NA_real_, ac)) %>%
  pivot_wider(names_from = pos, values_from = ac:check)

(Updated with near() thanks to Gregor)

Output:

     id   c_2  ac_1  ac_2  ac_3  ac_4 check_1 check_2 check_3 check_4
  <int> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     1     1  0.1  0.2    0.8   0.2    0.01  0.03      0.01    0.001
2     2     2  0.3  0.4   NA     0.73   0.902 0.042     0.02    0.042
3     3     5 NA    0.1    0.7  NA      0.02  0.002     0.5     0.02 
4     4    23  0.03 0.008  0.01  0.1    0.07  0.00001   0.001   0.2  