Home > Blockchain >  Create cross-tabulation of most frequent value of string variable and sort by frequency
Create cross-tabulation of most frequent value of string variable and sort by frequency

Time:06-10

I have a sample dataset:

df <- data.frame(category = c("A", "A", "B", "C", "C", "D", "E", "C", "E", "A", "B", "C", "B", "B", "B", "D", "D", "D", "D", "B"), year = c(1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 3, 1, 2, 3, 1))

and would like to create a cross-tabulation of year and category such that only the 3 most frequent categories are in the table and also sorted by total number of occurences:

    1 2 3
  B 4 2 0
  D 1 2 2
  C 2 1 1

Using something like

df %>% 
  add_count(category) %>% 
  filter(n %in% tail(sort(unique(n)),3)) %>% 
  arrange(desc(n)) %>% {table(.$category, .$year)}

will filter for the three most occurring categories but leave the table unsorted

    1 2 3
  B 4 2 0
  C 2 1 1
  D 1 2 2

CodePudding user response:

This should give you what you want.

# Make a table
df.t <- table(df)
# Order by top occurrences (sum over margin 1)
df.t <- df.t[order(apply(df.t, 1, sum), decreasing=TRUE),]
# Keep top 3 results
df.t <- df.t[1:3,]

Output:

        year
category 1 2 3
       B 4 2 0
       D 1 2 2
       C 2 1 1

CodePudding user response:

It's not elegent solution using base R but it works

result <- as.data.frame.matrix(table(df))
result$sum <- rowSums(result)
result <- result[order(-result$sum),]
result <- result[1:3,]
result$sum <- NULL

  1 2 3
B 4 2 0
D 1 2 2
C 2 1 1

CodePudding user response:

You'd want to arrange by the rowsums after creating table. If you want to stay (more) within tidyverse, e.g.:

df |>
  janitor::tabyl(category, year) |>
  arrange(desc(rowSums(across(where(is.numeric))))) |>
  head(3)

Here with janitor::tabyl(), but you could also use dplyr::tally() and tidyr::pivot_longer() directly or do df |> table() |> as.data.frame.matrix() like @Adamm.

  • Related