When I run this code, I get an error:
genes<-colnames(survdata)[-c(1:3)]
univ_formulas<-sapply(genes,function(x)as.formula(paste('Surv(OS,status)~',x)))
Error in str2lang(x) : <text>:1:31: unexpected symbol
1: Surv(OS,status)~ ABC7-42389800N19.1
^
If I remove that element and run the code again, a similar error appears:
univ_formulas<-sapply(genes,function(x)as.formula(paste('Surv(OS,status)~',x)))
Error in str2lang(x) : <text>:1:26: unexpected symbol
1: Surv(OS,status)~ CITF22-1A6.3
^
I don't know what is wrong.
Example of the data:
head(genes,n = 50)
[1] "A1BG" "A1BG-AS1" "A2M"
[4] "A2M-AS1" "A2ML1" "A2MP1"
[7] "A3GALT2" "A4GALT" "AAAS"
[10] "AACS" "AACSP1" "AADAT"
[13] "AAED1" "AAGAB" "AAK1"
[16] "AAMDC" "AAMP" "AANAT"
[19] "AAR2" "AARD" "AARS"
[22] "AARS2" "AARSD1" "AASDH"
[25] "AASDHPPT" "AASS" "AATF"
[28] "AATK" "AATK-AS1" "ABAT"
[31] "ABC7-42389800N19.1" "ABCA1" "ABCA10"
[34] "ABCA11P" "ABCA12" "ABCA13"
[37] "ABCA17P" "ABCA2" "ABCA3"
[40] "ABCA4" "ABCA5" "ABCA6"
[43] "ABCA7" "ABCA8" "ABCA9"
[46] "ABCB1" "ABCB10" "ABCB4"
[49] "ABCB6" "ABCB7"
CodePudding user response:
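This is because some gene names contain "-", which base::str2lang parses as the subtraction operator rather than as part of a name. In "ABC7-42389800N19.1" the parser reads ABC7 minus the number 42389800 and then meets the symbol N19.1, hence "unexpected symbol". We can fix this as follows:
- "Clean" the gene names by converting "-" to "_", and document this somewhere.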
We then have:
genes <- c("ABC7-42389800N19.1", "AATK-AS1")
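# gsub() replaces every hyphen; sub() would only replace the first one in each name
sapply(genes, function(x) as.formula(paste('Surv(OS,status)~',
gsub("-", "_", x, fixed = TRUE))))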
$`ABC7-42389800N19.1`
Surv(OS, status) ~ ABC7_42389800N19.1
<environment: 0x000002ad508b58e8>
$`AATK-AS1`
Surv(OS, status) ~ AATK_AS1
<environment: 0x000002ad508b3c30>
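If renaming the genes is undesirable, an alternative is to quote each name with backticks so the parser treats it as a single symbol. The following is a minimal sketch under that assumption, not part of the original fix; the gene vector is a small hypothetical sample taken from the question, and no data are needed because as.formula() only parses the text.

```r
# Hypothetical sample of gene names from the question.
genes <- c("ABC7-42389800N19.1", "AATK-AS1", "A2M")

# Wrap each name in backticks so the parser reads it as one symbol
# instead of a subtraction.
univ_formulas <- lapply(genes, function(x) {
  as.formula(paste0("Surv(OS, status) ~ `", x, "`"))
})
names(univ_formulas) <- genes

# all.vars() lists the variables a formula refers to; the gene name
# comes back unmodified, hyphen included.
vars_first <- all.vars(univ_formulas[["ABC7-42389800N19.1"]])
print(vars_first)
```

For this to work at the modeling stage, the columns of survdata must still carry the original hyphenated names, which appears to be the case here since genes is taken from colnames(survdata).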
This is an illustration of why that is the case:
A <- 4; B<- 20
str2lang("A-B")
A - B
eval(str2lang("A-B"))
[1] -16
str2lang is essentially similar to the dreaded eval-parse framework. From the docs, this is what it does:
str2expression(s) and str2lang(s) return special versions of parse(text=s, keep.source=FALSE) and can therefore be regarded as transforming character strings s to expressions, calls, etc.
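Rather than removing offending names one at a time, it may help to list all of them up front. This is a rough sketch using a hypothetical sample of names from the question. It relies on make.names(), which rewrites any string that is not a syntactically valid R name, and on try() around str2lang() to see which names fail to parse outright.

```r
# Hypothetical sample of gene names from the question.
genes <- c("A1BG", "A1BG-AS1", "ABC7-42389800N19.1", "CITF22-1A6.3")

# Names that make.names() would change are not syntactically valid,
# so they need cleaning or quoting before going into a formula.
non_syntactic <- genes[make.names(genes) != genes]

# Names that str2lang() cannot parse at all; these raise the error.
parse_fails <- vapply(
  genes,
  function(g) inherits(try(str2lang(g), silent = TRUE), "try-error"),
  logical(1)
)

print(non_syntactic)
print(names(parse_fails)[parse_fails])
```

Note that a name such as "A1BG-AS1" parses without error, but as A1BG minus AS1, so it would silently produce the wrong formula. That is why checking against make.names() is the safer of the two tests.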
NOTE
- Since this is to be used in modeling, it is probably better to perform the substitution at the colnames stage, so that the input data to the model has the names we expect:
# not tested but you get the idea
colnames(survdata)[-c(1:3)] <- gsub("-", "_", colnames(survdata)[-c(1:3)], fixed = TRUE)
- It is important, for biological/research purposes, to document why gene names were cleaned as suggested in this answer.
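One way to document the cleaning is to keep a lookup table from original to cleaned names. The sketch below is only an illustration: the vector original stands in for colnames(survdata)[-c(1:3)], and name_map is a hypothetical object name. It also checks that no two genes collapse to the same cleaned name, which could happen if both "X-1" and "X_1" were present.

```r
# Hypothetical stand-in for colnames(survdata)[-c(1:3)].
original <- c("A1BG-AS1", "ABC7-42389800N19.1", "ABCA1")

# Replace every hyphen, not just the first one in each name.
cleaned <- gsub("-", "_", original, fixed = TRUE)

# Guard against two genes collapsing to the same cleaned name.
stopifnot(anyDuplicated(cleaned) == 0)

# Lookup table recording which names were changed.
name_map <- data.frame(
  original = original,
  cleaned = cleaned,
  changed = original != cleaned,
  stringsAsFactors = FALSE
)
print(name_map)
```

The table can then be saved alongside the results, for example with write.csv(), so that cleaned names in the model output can be traced back to the original gene symbols.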