Is there a way to calculate a percentage through summarize() when some rows are NAs in R that can th-CodePudding

I have a data frame where some columns have entries in every row, and other variables only have entries in some rows, and NAs in the rest. A reproducible example is below:

df <- data.frame(id = 1:5,
                 var1 = c("Agree", "Strongly Agree", "Disagree", "Disagree", "Strongly Agree"),
                 var2 = c("Strongly Agree", NA, "Agree", NA, NA))

I would like to summarize this data frame and calculate the percentage of responses that fall into each category for each column, i.e. the percentage of "Agree" in var1, and the percentage of "Strongly Agree" in var2, among others. I would like the denominator of the percentages, so to speak, not to include the rows with NA, i.e. have different Ns depending on the row. This is what I have tried thus far:

df %>%
  filter(filter(!is.na(var1)) %>% 
  group_by(var1) %>% 
  summarize(frequency = n()/count(filter(df, !is.na(var1))))

This produces a list where the first column is populated with all of the different possible levels, but the second column is a data.frame that in turn includes the frequency values. The problem with this is that I would like to pipe this into ggplot to make a bar graph of the percentages, but since the second column is a data.frame, ggplot will not take the data, throwing the error Error in is.finite(x) : default method not implemented for type 'list'.

Is there a way to make the output of the summarize() function a non-nested data.frame, or alternatively, is there a way to force ggplot to accept the output as shown here? Any assistance would be appreciated!

CodePudding user response：

I think you can achieve what you want by reshaping the data from wide to long format using tidyr::pivot_longer.

library(dplyr)
library(tidyr)
library(ggplot2)

df %>% 
  pivot_longer(-id) %>% 
  filter(!is.na(value)) %>% 
  count(name, value) %>% 
  group_by(name) %>% 
  mutate(frequency = n/sum(n))

Result:

  name  value              n frequency
  <chr> <chr>          <int>     <dbl>
1 var1  Agree              1       0.2
2 var1  Disagree           2       0.4
3 var1  Strongly Agree     2       0.4
4 var2  Agree              1       0.5
5 var2  Strongly Agree     1       0.5

And you can pipe into ggplot.

df %>% 
  pivot_longer(-id) %>% 
  filter(!is.na(value)) %>% 
  count(name, value) %>% 
  group_by(name) %>% 
  mutate(frequency = n/sum(n)) %>% 
  ggplot(aes(name, frequency))   
  geom_col(aes(fill = value))

Result: