I have a data frame like this and I am doing this on R. My problems can be divided into two steps.
SUBID | ABC | BCD | DEF |
---|---|---|---|
192838 | 4 | -3 | 2 |
193928 | -6 | -2 | 6 |
205829 | 4 | -5 | 9 |
201837 | 3 | 4 | 4 |
I want to make a new variable that contains a list of the column names that has a negative value for each SUBID. The output should look something like this:
SUBID | ABC | BCD | DEF | output |
---|---|---|---|---|
192838 | 4 | -3 | 2 | "BCD" |
193928 | -6 | -2 | 6 | "ABC","BCD" |
205829 | 4 | -5 | 9 | "BCD" |
201837 | 3 | 4 | 4 | " " |
And then, in the second step, I would like to collapse the SUBID into a more general ID and get the number of unique strings from the output variable for each ID (I just need the number, the specific strings in the parenthesis are just for illustration).
SUBID | output |
---|---|
19 | 2 ("ABC","BCD") |
20 | 1 ("BCD") |
Those are the two steps that I thing should be done, but maybe there is a way that can skip the first step and goes to the second step directly that I don't know. I would appreciate any help since right now I am not sure where to start on this. Thank you!
CodePudding user response:
Another way:
library(dplyr)
library(tidyr)
df <- df %>% pivot_longer(-SUBID)
df1 <- df %>%
group_by(SUBID) %>%
summarise(output = paste(name[value < 0L], collapse = ','))
df2 <- df %>%
group_by(SUBID = substr(SUBID, 1, 2)) %>%
summarise(output_count = n_distinct(name[value < 0L]),
output = paste0(output_count, ' (', paste(name[value < 0L], collapse = ','), ')'))
Outputs (two columns are created in the second case, one with just the count and another following your example):
df1
# A tibble: 4 x 2
SUBID output
<int> <chr>
1 192838 "BCD"
2 193928 "ABC,BCD"
3 201837 ""
4 205829 "BCD"
df2
# A tibble: 2 x 3
SUBID output_count output
<chr> <int> <chr>
1 19 2 2 (BCD,ABC,BCD)
2 20 1 1 (BCD)
CodePudding user response:
This answers the first part of your question, the second one, I didn't understand
df$output <-apply(df[,-1], 1, function(x) paste(names(df)[-1][x<0], collapse = ","))
df
SUBID ABC BCD DEF output
1 192838 4 3 -2 DEF
2 193928 -6 -2 6 ABC,BCD
3 205829 4 -5 9 BCD
4 201837 3 4 4
For the second part, try this:
id <- sapply(strsplit(sub("\\W ", "", df$output), split = ""), function(x){
sum(!(duplicated(x) | duplicated(x, fromLast = TRUE)))
} )
data.frame(SUBID = substr(df$SUBID, 1,2), output = id, string = df$output)
SUBID output string
1 19 3 DEF
2 19 2 ABC,BCD
3 20 3 BCD
4 20 0
I added the variable string
for you make sure your count of unique values is ok.
CodePudding user response:
Here is a base R option for 2) using aggregate
(includes solution for 1) in dd
)
dd <- cbind(df, output = apply(df, 1, function(x)
paste(colnames(df)[x < 0], collapse = " ")))
aggregate(output ~ sub, cbind(sub = substr(dd$SUBID, 1, 2), dd), function(x)
sum((nchar(x) > 0) * 1))
sub output
1 19 2
2 20 1
Data
df <- structure(list(SUBID = c(192838L, 193928L, 205829L, 201837L),
ABC = c(4L, -6L, 4L, 3L), BCD = c(-3L, -2L, -5L, 4L), DEF = c(2L,
6L, 9L, 4L)), class = "data.frame", row.names = c(NA, -4L
))