Home > Software engineering >  Summarize multiple fields in R and suppressing values less than x
Summarize multiple fields in R and suppressing values less than x

Time:10-19

I am working with a dataframe with thousands of responses to questions about interest in a set of resources. I want to summarize how many participants are interested in a given resource by counting the number of positive responses (coded as "1").

As a final step, I would like to suppress any answer where <5 participants responded.

I've created code that works, but its clunky when I'm dealing with dozens of fields. So, I'm looking for suggestions for a more streamlined approach, perhaps using piping or dplyr?

Example Input

ID Resource1 Resource2 Resource3 Resource4
1 1 0 1 1
2 0 0 0 1
3 1 0 0 0
4 0 0 0 0
5 1 1 1 1

Desired output

Interested Not Interested
Resource1 3 2
Resource2 1 4
Resource3 2 3
Resource4 3 2

My (ugly) code

###Select and summarise relevent columns
resource1 <- df %>% drop_na(resource1) %>% group_by(resource1) %>% summarise(n=n()) %>% rename(resp=resource1, r1 =n)
resource2 <- df %>% drop_na(resource2) %>% group_by(resource2) %>% summarise(n=n()) %>% rename(resp=resource2, r2 =n)
resource3 <- df %>% drop_na(resource3) %>% group_by(resource3) %>% summarise(n=n()) %>% rename(resp=resource3, r3 =n)
resource4 <- df %>% drop_na(resource4) %>% group_by(resource4) %>% summarise(n=n()) %>% rename(resp=resource4, r4 =n)

###Merge summarised data
resource_sum <-join_all(list(resource1,resource2,resource3,resource4), by=c("resp"))

###Replace all values less than 5 with NA per suppression rules. 
resource_sum <- apply(resource_sum, function(x) ifelse(x<5, "NA", x))
resource_sum <-as.data.frame(resource_sum)

CodePudding user response:

We may reshape into 'long' format with pivot_longer and then do a group by summarise to get the count of 1s and 0s

library(dplyr)
library(tidyr)
library(tibble)
df %>% 
   pivot_longer(cols = -ID) %>%
   group_by(name) %>%
   summarise(Interested = sum(value), NotInterested = n() - Interested) %>%
   column_to_rownames('name')

-output

            Interested NotInterested
Resource1          3             2
Resource2          1             4
Resource3          2             3
Resource4          3             2

Or using base R

v1 <- colSums(df[-1])
cbind(Interested = v1, NotInterested = nrow(df) - v1)

-output

          Interested NotInterested
Resource1          3             2
Resource2          1             4
Resource3          2             3
Resource4          3             2

data

df <- structure(list(ID = 1:5, Resource1 = c(1L, 0L, 1L, 0L, 1L),
 Resource2 = c(0L, 
0L, 0L, 0L, 1L), Resource3 = c(1L, 0L, 0L, 0L, 1L), Resource4 = c(1L, 
1L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L
))

CodePudding user response:

You can use table to get counts of 0 and 1 value. To apply the function (table) to multiple columns you can use sapply -

t(sapply(df[-1], table))

#          0 1
#Resource1 2 3
#Resource2 4 1
#Resource3 3 2
#Resource4 2 3
  • Related