I have a very messy dataset where researchers did not match levels of data across sessions. In one session, a '[digit]: ' or '[digit] : ' was added.
I created a dataframe of session 3, 6, and 10 visits from the SWAN dataset. You can download what I'm working with here: LINK
Here's an example of the levels:
levels(dat$ABLECLM)
[1] "(1) Always" "(2) Almost Always"
[3] "(3) Sometimes" "(4) Almost Never"
[5] "(5) Never" "(6) No Intercourse In Last 6 Mons"
[7] "(1) 1 : Always" "(2) 2 : Almost always"
[9] "(3) 3 : Sometimes" "(4) 4 : Almost never"
[11] "(5) 5 : Never" "(6) 6 : No intercourse in last 6 months"
[13] "(2) Almost always" "(4) Almost never"
[15] "(6) No intercourse in last 6 months"
I wrote this function to use in an apply function:
match_levels <- function(column){
if(is.factor(column)){
levels(column) <- sapply(column, tolower)
levels(column) <- sapply(column,function(x) sub('\\d: |\\d : ', '', levels(x)))
return(column)
}else{
return(column)
}
}
It works on a single column, but when I try and apply to each column I have:
dat <- read.csv(<data_link>)
df <- data.frame(apply(dat,2, function(x) x=match_levels(x)))
I get this:
levels(as.factor(df$ABLECLM))
[1] "(1) 1 : Always"
[2] "(1) Always"
[3] "(2) 2 : Almost always"
[4] "(2) Almost always"
[5] "(2) Almost Always"
[6] "(3) 3 : Sometimes"
[7] "(3) Sometimes"
[8] "(4) 4 : Almost never"
[9] "(4) Almost never"
[10] "(4) Almost Never"
[11] "(5) 5 : Never"
[12] "(5) Never"
CodePudding user response:
You can use your original logic of updating the levels of the factor, rather than the value of the variable, which requires factoring again.
new_levels <- function(vec) {
if (is.factor(vec)) {
lvls <- gsub('\\d: |\\d : ', '', tolower(levels(vec)))
dups <- which(duplicated(substr(lvls,1,3)))
lvls[dups] <- lvls[dups-1]
lvls[dups] <- lvls[dups-1]
levels(vec) <- lvls
return(vec)
} else {
return(vec)
}
}
df[] <- lapply(df, new_levels)
> microbenchmark::microbenchmark(lapply(df, new_levels),lapply(dat, match_levels))
Unit: milliseconds
expr min lq mean median uq max
lapply(df, new_levels) 4.875398 5.13453 6.654272 5.737605 6.568733 80.17889
lapply(dat, match_levels) 516.539473 532.93423 539.665595 536.944488 541.752497 615.87178
CodePudding user response:
You're using exactly the right tools. However, it's no fun, to gsub
every single factor in detail. Instead we can exploit the neat numbers in front of the factors. First, we split
the levels along those numbers using the result of gsub
. Next, we choose each second level w=2
which looks cleanest to me. Finally we do the same with the entire vector what we did with the levels, gsub
them and create a factor
out of it using the labs
we chose before as labels=
.
dat <- read.csv('~/Downloads/proj3_data_before_cleaning.csv', stringsAsFactors=TRUE)
f <- function(x, w=2) {
if (is.factor(x)) {
lv <- levels(x)
lvs <- split(lv, gsub('^(\\(\\d \\)).*', '\\1', lv))
labs <- sapply(lvs, `[`, w)
factor(gsub('^(\\(\\d \\)).*', '\\1', as.character(x), perl=TRUE), labels=labs)
} else {
x
}
}
dat[] <- lapply(dat[], f) ## apply
summary(dat) ## check