I have a large dataset (>15000 cases) with variables of different numbers of values. Here as a much smaller sample data set:
df <- structure(list(year = c("1", "2", "3", "1", "2", "3", "1", "2",
"3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3",
"1", "2", "3", "1", "2", "3", "1", "2", "3"), var2tab = c("1",
"3", "1", "2", "2", "3", "3", "1", "3", "3", "3", "2", "2", "1",
"2", "1", "1", "2", "3", "3", "1", "1", "2", "2", "2", "3", "1",
"1", "3", "2"), group = c("1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "2", "2", "2",
"2", "2", "3", "3", "3", "3", "3", "3", "3", "3")),
class = "data.frame", row.names = c(NA, -30L))
First I save where in my large dataset the columns to be used are located. This helps with indexing.
var2use <- which(colnames(df) == "var2tab")
var2x <- which(colnames(df) == "group")
In my large dataset are variables with different numbers of values, so I count the number of values and years.
years_unique <- unique(df$year)
x_unique <- unique(df[var2x])
x_unique <- unlist (x_unique)
n_years <- length (years_unique)
n_var2x <- length (x_unique)
Now I create a list of dataframes for each group.
my_list <- NULL
my_list <- list()
for (i in x_unique) {
for (j in years_unique)
{
my_list[[i]] <- filter(df, df[var2x] == i)
}
}
Up to this point, everything works as desired...
"my_list" contains a data record of the values for the individual years for each group. On this basis, I would like to create new data sets by counting the variable "var2tab" for each year in order to be able to calculate the respective percentage values. However, so far I have only been able to do this for all years in total:
my_df <- NULL
my_df <- list()
for (j in years_unique) {
my_df[[j]] <-
count(my_list[[j]][var2use])
my_df[[j]]$var_rel <-
my_df[[j]][, 2] / sum(my_df[[j]][, 2])
my_df[[j]]$group <-
paste0(j)
}
Edit: Here is a desired output for this sample input:
df_new <- data.frame(group = c("1", "1", "1", "2", "2", "2", "3", "3", "3"), value = c("1", "2", "3"), abs_year1 =c("1", "1", "1", "2", "2", "1", "1", "0", "1"), rel_year1 = c(".25", ".25", ".333", ".5", ".667", ".25", ".5", "0", ".333"), abs_year2 =c("1", "1", "0", "1", "0", "3", "1", "1", "2"), rel_year2 = c(".25", ".25", "0", ".25", "0", ".75", ".5", ".333", ".667"), abs_year3 =c("2", "2", "2", "1", "1", "0", "0", "2", "0"), rel_year3 = c(".5", ".5", ".667", ".25", ".333", "0", "0", ".667", "0"))
CodePudding user response:
Here is a simple way to count the var2tab
values grouped by group
and year
, and then to calculate their relative frequency within each group and year:
library(dplyr)
df %>%
count(group, year, var2tab) %>%
group_by(group, year) %>%
mutate(proportion = n / sum(n))
# # A tibble: 21 × 5
# # Groups: group, year [9]
# group year var2tab n pct
# <chr> <chr> <chr> <int> <dbl>
# 1 1 1 1 1 0.25
# 2 1 1 2 1 0.25
# 3 1 1 3 2 0.5
# 4 1 2 1 1 0.25
# 5 1 2 2 1 0.25
# 6 1 2 3 2 0.5
# 7 1 3 1 1 0.333
# 8 1 3 3 2 0.667
# 9 2 1 1 2 0.5
# 10 2 1 2 1 0.25
# # … with 11 more rows
# # ℹ Use `print(n = ...)` to see more rows
If you want them in a list separate data frames, you can add ... %>% group_split()
to the end.