conditional sapply to change levels of DF factors-CodePudding

I have a very messy dataset where researchers did not match levels of data across sessions. In one session, a '[digit]: ' or '[digit] : ' was added.

I created a dataframe of session 3, 6, and 10 visits from the SWAN dataset. You can download what I'm working with here: LINK

Here's an example of the levels:

levels(dat$ABLECLM)
 [1] "(1) Always"                              "(2) Almost Always"                      
 [3] "(3) Sometimes"                           "(4) Almost Never"                       
 [5] "(5) Never"                               "(6) No Intercourse In Last 6 Mons"      
 [7] "(1) 1 : Always"                          "(2) 2 : Almost always"                  
 [9] "(3) 3 : Sometimes"                       "(4) 4 : Almost never"                   
[11] "(5) 5 : Never"                           "(6) 6 : No intercourse in last 6 months"
[13] "(2) Almost always"                       "(4) Almost never"                       
[15] "(6) No intercourse in last 6 months"

I wrote this function to use in an apply function:

match_levels <- function(column){
    if(is.factor(column)){
        levels(column) <- sapply(column, tolower)
        levels(column) <- sapply(column,function(x) sub('\\d: |\\d : ', '', levels(x)))
        return(column)
    }else{
        return(column)
    }
}

It works on a single column, but when I try and apply to each column I have:

dat <- read.csv(<data_link>)
df <- data.frame(apply(dat,2, function(x) x=match_levels(x)))

I get this:

levels(as.factor(df$ABLECLM))
 [1] "(1) 1 : Always"  
 [2] "(1) Always"  
 [3] "(2) 2 : Almost always" 
 [4] "(2) Almost always"    
 [5] "(2) Almost Always"     
 [6] "(3) 3 : Sometimes"     
 [7] "(3) Sometimes"         
 [8] "(4) 4 : Almost never" 
 [9] "(4) Almost never"      
 [10] "(4) Almost Never"      
 [11] "(5) 5 : Never"         
 [12] "(5) Never"

CodePudding user response：

You can use your original logic of updating the levels of the factor, rather than the value of the variable, which requires factoring again.

new_levels <- function(vec) {
  if (is.factor(vec)) {
    lvls <- gsub('\\d: |\\d : ', '', tolower(levels(vec)))
    dups <- which(duplicated(substr(lvls,1,3)))
    lvls[dups] <- lvls[dups-1]
    lvls[dups] <- lvls[dups-1]
    levels(vec) <- lvls
    return(vec)
  } else {
    return(vec)
  }
}

df[] <- lapply(df, new_levels)

> microbenchmark::microbenchmark(lapply(df, new_levels),lapply(dat, match_levels))
Unit: milliseconds
                      expr        min        lq       mean     median         uq       max
    lapply(df, new_levels)   4.875398   5.13453   6.654272   5.737605   6.568733  80.17889
 lapply(dat, match_levels) 516.539473 532.93423 539.665595 536.944488 541.752497 615.87178

CodePudding user response：

You're using exactly the right tools. However, it's no fun, to gsub every single factor in detail. Instead we can exploit the neat numbers in front of the factors. First, we split the levels along those numbers using the result of gsub. Next, we choose each second level w=2 which looks cleanest to me. Finally we do the same with the entire vector what we did with the levels, gsub them and create a factor out of it using the labs we chose before as labels=.

dat <- read.csv('~/Downloads/proj3_data_before_cleaning.csv', stringsAsFactors=TRUE)

f <- function(x, w=2) {
  if (is.factor(x)) {
    lv <- levels(x)
    lvs <- split(lv, gsub('^(\\(\\d \\)).*', '\\1', lv))
    labs <- sapply(lvs, `[`, w)
    factor(gsub('^(\\(\\d \\)).*', '\\1', as.character(x), perl=TRUE), labels=labs)
  } else {
    x
  }
}   

dat[] <- lapply(dat[], f)  ## apply

summary(dat)  ## check