Home > Net >  Get column names into a new variable based on conditions
Get column names into a new variable based on conditions

Time:11-22

I have a data frame like this and I am doing this on R. My problems can be divided into two steps.

SUBID ABC BCD DEF
192838 4 -3 2
193928 -6 -2 6
205829 4 -5 9
201837 3 4 4

I want to make a new variable that contains a list of the column names that has a negative value for each SUBID. The output should look something like this:

SUBID ABC BCD DEF output
192838 4 -3 2 "BCD"
193928 -6 -2 6 "ABC","BCD"
205829 4 -5 9 "BCD"
201837 3 4 4 " "

And then, in the second step, I would like to collapse the SUBID into a more general ID and get the number of unique strings from the output variable for each ID (I just need the number, the specific strings in the parenthesis are just for illustration).

SUBID output
19 2 ("ABC","BCD")
20 1 ("BCD")

Those are the two steps that I thing should be done, but maybe there is a way that can skip the first step and goes to the second step directly that I don't know. I would appreciate any help since right now I am not sure where to start on this. Thank you!

CodePudding user response:

Another way:

library(dplyr)
library(tidyr)

df <- df %>% pivot_longer(-SUBID)

df1 <- df %>%
  group_by(SUBID) %>%
  summarise(output = paste(name[value < 0L], collapse = ','))

df2 <- df %>%
  group_by(SUBID = substr(SUBID, 1, 2)) %>%
  summarise(output_count = n_distinct(name[value < 0L]),
            output = paste0(output_count, ' (', paste(name[value < 0L], collapse = ','), ')'))

Outputs (two columns are created in the second case, one with just the count and another following your example):

df1

# A tibble: 4 x 2
   SUBID output   
   <int> <chr>    
1 192838 "BCD"    
2 193928 "ABC,BCD"
3 201837 ""       
4 205829 "BCD"    

df2

# A tibble: 2 x 3
  SUBID output_count output         
  <chr>        <int> <chr>          
1 19               2 2 (BCD,ABC,BCD)
2 20               1 1 (BCD)   

CodePudding user response:

This answers the first part of your question, the second one, I didn't understand

 df$output <-apply(df[,-1], 1, function(x) paste(names(df)[-1][x<0], collapse = ","))
 df
   SUBID ABC BCD DEF  output
1 192838   4   3  -2     DEF
2 193928  -6  -2   6 ABC,BCD
3 205829   4  -5   9     BCD
4 201837   3   4   4        

For the second part, try this:

id <- sapply(strsplit(sub("\\W ", "",  df$output), split = ""), function(x){
  sum(!(duplicated(x) | duplicated(x, fromLast = TRUE)))
} )


   data.frame(SUBID = substr(df$SUBID, 1,2), output = id, string = df$output)
      SUBID output  string
    1    19      3     DEF
    2    19      2 ABC,BCD
    3    20      3     BCD
    4    20      0  

I added the variable string for you make sure your count of unique values is ok.

CodePudding user response:

Here is a base R option for 2) using aggregate (includes solution for 1) in dd)

dd <- cbind(df, output = apply(df, 1, function(x) 
  paste(colnames(df)[x < 0], collapse = " ")))

aggregate(output ~ sub, cbind(sub = substr(dd$SUBID, 1, 2), dd), function(x) 
  sum((nchar(x) > 0) * 1))
  sub output
1  19      2
2  20      1

Data

df <- structure(list(SUBID = c(192838L, 193928L, 205829L, 201837L),
    ABC = c(4L, -6L, 4L, 3L), BCD = c(-3L, -2L, -5L, 4L), DEF = c(2L,
    6L, 9L, 4L)), class = "data.frame", row.names = c(NA, -4L
))
  • Related