Home > Software engineering >  Turns thousands of dummy variables into multinomial variable
Turns thousands of dummy variables into multinomial variable

Time:02-10

I have a dataframe of the following sort:

    a<-c('q','w')
    b<-c(T,T)
    d<-c(F,F)
    .e<-c(T,F)
    .f<-c(F,F)
    .g<-c(F,T)
    h<-c(F,F)
    i<-c(F,T)
    j<-c(T,T)
    
    df<-data.frame(a,b,d,.e,.f,.g,h,i,j)
      a    b     d    .e    .f    .g     h     i    j
    1 q TRUE FALSE  TRUE FALSE FALSE FALSE FALSE TRUE
    2 w TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE TRUE

I want to turn all variables starting with periods at the start into a single multinomial variable called Index such that the second row would have a value 1 for the Index column, the third row would have a value 2, etc. :

df$Index<-c('e','g')
  a    b     d    .e    .f    .g     h     i    j Index
1 q TRUE FALSE  TRUE FALSE FALSE FALSE FALSE TRUE     e
2 w TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE TRUE     g

Although many rows can have a T for any of period-initial variable, each row can be T for only ONE period-initial variable.

If it were just a few items id do an ifelse statement:

df$Index <- ifelse(df$_10000, '10000',...

But there are 12000 of these. The names for all dummy variables begin with underscores, so I feel like there must be a better way. In pseudocode I would say something like:

for every row:
    for every column beginning with '_':
        if value == T:
            assign the name of the column without '_' to a Column 'Index'

Thanks in advance

CodePudding user response:

Sample data:

df <- cbind(a = letters[1:10], b = LETTERS[1:10],
            data.frame(diag(10) == 1))
names(df)[-(1:2)] <- paste0("_", 1:10)
set.seed(42)
df <- df[sample(nrow(df)),]
head(df,3)
#    a b    _1    _2    _3    _4    _5    _6    _7    _8    _9   _10
# 1  a A  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# 5  e E FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# 10 j J FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Execution:

df$Index <- apply(subset(df, select = grepl("^_", names(df))), 1, 
                  function(z) which(z)[1])
df
#    a b    _1    _2    _3    _4    _5    _6    _7    _8    _9   _10 Index
# 1  a A  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE     1
# 5  e E FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE     5
# 10 j J FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE    10
# 8  h H FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE     8
# 2  b B FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE     2
# 4  d D FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE     4
# 6  f F FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE     6
# 9  i I FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE     9
# 7  g G FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE     7
# 3  c C FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE     3

If there are more than one TRUE in a row of _-columns, the first found will be used, the remainder silently ignored. If there are none, Index will be NA for that row.

  • Related