summary_factorlist: error caused by variables with fewer than two levels

Time:09-05

I have a large data set with 2000 variables, a mix of factors and continuous variables.

For example:

library(finalfit)
library(dplyr)
data(colon_s)
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor"

I use the following call to compare the mean of each continuous variable across the levels of the categorical dependent variable (ANOVA), or the percentage of each categorical variable across those levels (here Fisher's exact test, via p_cat = "fisher"):

summary_factorlist(colon_s, dependent = "perfor.factor", explanatory = explanatory,
                   add_dependent_label = TRUE, p = TRUE, p_cat = "fisher",
                   p_cont_para = "aov", fit_id = TRUE)

But as soon as I run the above code on my own data, I get the following error:

Error in dplyr::summarise():
! Problem while computing ..1 = ...$p.value.
Caused by error in fisher.test():
! 'x' and 'y' must have at least 2 levels

Some variables in the data set do not have at least two levels, or only one of their levels has a non-zero frequency. Is there a loop or function that removes a variable if any of the following conditions holds?

  1. The variable has only one level.
  2. The variable has more than one level, but only one level has a non-zero frequency.
  3. All values of the variable are missing.

CodePudding user response:

Update (partial answer): this code removes factors with only one level and keeps all non-factor variables:

# Keep a column if it has more than one level or is not a factor
x <- colon_s[, sapply(colon_s, nlevels) > 1 | !sapply(colon_s, is.factor)]
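
A possible extension, as a minimal sketch in base R: nlevels() counts declared levels, so a factor whose other levels are empty, or whose values are all missing, still passes the check above. The helper is_usable() and the toy data frame below are hypothetical names used only for illustration; droplevels() discards unused levels before they are counted.

```r
# Toy data standing in for the real data set (hypothetical columns)
toy <- data.frame(
  age       = c(43, 63, 71, 66),                                    # continuous
  one_level = factor(c("a", "a", "a", "a")),                        # condition 1
  one_used  = factor(c("a", "a", "a", "a"), levels = c("a", "b")),  # condition 2
  all_na    = factor(c(NA, NA, NA, NA), levels = c("a", "b")),      # condition 3
  two_used  = factor(c("a", "b", "a", "b"))                         # usable factor
)

# TRUE when a column should be kept
is_usable <- function(x) {
  if (all(is.na(x))) return(FALSE)   # condition 3: everything is missing
  if (!is.factor(x)) return(TRUE)    # non-factor columns are kept as they are
  nlevels(droplevels(x)) > 1         # conditions 1 and 2: count only used levels
}

keep <- vapply(toy, is_usable, logical(1))
toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)  # keeps "age" and "two_used"
```

Using vapply() rather than sapply() guarantees a logical vector even for a data frame with no columns, and drop = FALSE keeps the result a data frame when only one column survives.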

CodePudding user response:

The OP's code does work with the data provided:

library(dplyr)
library(finalfit)
summary_factorlist(colon_s, dependent = "perfor.factor",
                   explanatory = explanatory,
                   add_dependent_label = TRUE, p = TRUE, p_cat = "fisher",
                   p_cont_para = "aov", fit_id = TRUE)
 Dependent: Perforation                      No         Yes     p                fit_id index
            Age (years)   Mean (SD) 59.8 (11.9) 58.4 (13.3) 0.542                   age     1
                    Age   <40 years    68 (7.5)     2 (7.4) 1.000   age.factor<40 years     2
                        40-59 years  334 (37.0)   10 (37.0)       age.factor40-59 years     3
                          60+ years  500 (55.4)   15 (55.6)         age.factor60+ years     4
                    Sex      Female  432 (47.9)   13 (48.1) 1.000      sex.factorFemale     5
                               Male  470 (52.1)   14 (51.9)              sex.factorMale     6
            Obstruction          No  715 (81.2)   17 (63.0) 0.026     obstruct.factorNo     7
                                Yes  166 (18.8)   10 (37.0)          obstruct.factorYes     8

The structure of the data shows that each factor variable has more than one level:

> str(colon_s[c(explanatory, dependent)])
'data.frame':   929 obs. of  5 variables:
 $ age            : num  43 63 71 66 69 57 77 54 46 68 ...
  ..- attr(*, "label")= chr "Age (years)"
 $ age.factor     : Factor w/ 3 levels "<40 years","40-59 years",..: 2 3 3 3 3 2 3 2 2 3 ...
  ..- attr(*, "label")= chr "Age"
 $ sex.factor     : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 2 2 2 1 ...
  ..- attr(*, "label")= chr "Sex"
 $ obstruct.factor: Factor w/ 2 levels "No","Yes": NA 1 1 2 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Obstruction"
 $ perfor.factor  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Perforation"

To select only the factor variables that meet the conditions mentioned, we could use:

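library(dplyr)
# Keep factor columns with more than one level, no empty levels,
# and at least one non-missing value
colon_s_sub <- colon_s %>%
  select(where(~ is.factor(.x) && nlevels(.x) > 1 && all(table(.x) > 0) &&
         sum(complete.cases(.x)) > 0))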
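
Note that this selection keeps factor columns only, and it judges each column on its own. fisher.test() works on the rows where both variables are observed, so a factor can look fine overall and still collapse to one level once rows with a missing dependent value are excluded. A minimal base R sketch of filtering the explanatory vector against the dependent variable follows; the function name testable() and the toy data are hypothetical, and continuous variables are kept whenever at least one complete pair exists.

```r
# Hypothetical helper: return the explanatory variables that can be
# tested against the dependent variable
testable <- function(data, dependent, explanatory) {
  y <- data[[dependent]]
  ok <- vapply(explanatory, function(v) {
    x <- data[[v]]
    cc <- !is.na(x) & !is.na(y)                   # rows where both are observed
    if (!any(cc)) return(FALSE)                   # no usable rows at all
    if (length(unique(y[cc])) < 2) return(FALSE)  # dependent collapses to one level
    if (is.factor(x)) length(unique(x[cc])) > 1 else TRUE
  }, logical(1))
  explanatory[ok]
}

# Toy data (hypothetical): "b" in `rare` occurs only where outcome is missing
toy <- data.frame(
  outcome = factor(c("No", "No", "Yes", "Yes", NA, "No")),
  age     = c(43, 63, 71, 66, 69, 57),
  sex     = factor(c("F", "M", "F", "M", "F", "M")),
  rare    = factor(c("a", "a", "a", "a", "b", "a"))
)

explanatory_ok <- testable(toy, "outcome", c("age", "sex", "rare"))
explanatory_ok  # "rare" is dropped
```

The filtered vector could then be passed as the explanatory argument of summary_factorlist() in place of the full list, which leaves the data frame itself untouched.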