Home > database >  How to count values in a list by group and save the result as part of a list
How to count values in a list by group and save the result as part of a list

Time:08-24

I have a large dataset (>15000 cases) with variables of different numbers of values. Here as a much smaller sample data set:

df <- structure(list(year = c("1", "2", "3", "1", "2", "3", "1", "2", 
"3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", 
"1", "2", "3", "1", "2", "3", "1", "2", "3"), var2tab = c("1", 
"3", "1", "2", "2", "3", "3", "1", "3", "3", "3", "2", "2", "1", 
"2", "1", "1", "2", "3", "3", "1", "1", "2", "2", "2", "3", "1", 
"1", "3", "2"), group = c("1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "3", "3", "3", "3", "3", "3", "3", "3")),
class = "data.frame", row.names = c(NA, -30L))

First I save where in my large dataset the columns to be used are located. This helps with indexing.

  var2use <-  which(colnames(df) == "var2tab")
  var2x <-  which(colnames(df) == "group")

In my large dataset are variables with different numbers of values, so I count the number of values and years.

  years_unique <- unique(df$year)

  x_unique <- unique(df[var2x])
  x_unique <- unlist (x_unique)

  n_years <- length (years_unique)
  n_var2x <- length (x_unique)

Now I create a list of dataframes for each group.

  my_list <- NULL
  my_list <- list()
  for (i in x_unique) {
  for (j in years_unique)
  {
  my_list[[i]] <- filter(df, df[var2x] == i)
  }
  }

Up to this point, everything works as desired...

"my_list" contains a data record of the values for the individual years for each group. On this basis, I would like to create new data sets by counting the variable "var2tab" for each year in order to be able to calculate the respective percentage values. However, so far I have only been able to do this for all years in total:

  my_df <- NULL
  my_df <- list()
  for (j in years_unique) {
  my_df[[j]] <-
  count(my_list[[j]][var2use])
  my_df[[j]]$var_rel <-
  my_df[[j]][, 2] / sum(my_df[[j]][, 2])
  my_df[[j]]$group <-
  paste0(j)
  }

Edit: Here is a desired output for this sample input:

df_new <- data.frame(group = c("1", "1", "1", "2", "2", "2", "3", "3", "3"), value = c("1", "2", "3"), abs_year1 =c("1", "1", "1", "2", "2", "1", "1", "0", "1"), rel_year1 = c(".25", ".25", ".333", ".5", ".667", ".25", ".5", "0", ".333"), abs_year2 =c("1", "1", "0", "1", "0", "3", "1", "1", "2"), rel_year2 = c(".25", ".25", "0", ".25", "0", ".75", ".5", ".333", ".667"), abs_year3 =c("2", "2", "2", "1", "1", "0", "0", "2", "0"), rel_year3 = c(".5", ".5", ".667", ".25", ".333", "0", "0", ".667", "0"))

CodePudding user response:

Here is a simple way to count the var2tab values grouped by group and year, and then to calculate their relative frequency within each group and year:

library(dplyr)
df %>%
  count(group, year, var2tab) %>%
  group_by(group, year) %>%
  mutate(proportion = n / sum(n))
# # A tibble: 21 × 5
# # Groups:   group, year [9]
#    group year  var2tab     n   pct
#    <chr> <chr> <chr>   <int> <dbl>
#  1 1     1     1           1 0.25 
#  2 1     1     2           1 0.25 
#  3 1     1     3           2 0.5  
#  4 1     2     1           1 0.25 
#  5 1     2     2           1 0.25 
#  6 1     2     3           2 0.5  
#  7 1     3     1           1 0.333
#  8 1     3     3           2 0.667
#  9 2     1     1           2 0.5  
# 10 2     1     2           1 0.25 
# # … with 11 more rows
# # ℹ Use `print(n = ...)` to see more rows

If you want them in a list separate data frames, you can add ... %>% group_split() to the end.

  • Related