Home > Software engineering >  count co-occurrences of values in dataframe r
count co-occurrences of values in dataframe r

Time:04-25

I have a sample data frame as follows and I need to explore the Category column.

> df
  Id  Category
  1 [1,2,3,4]
  2   [2,3,5]
  3   [1,4,5]
  4   [1,2,5]
  5       [4]
  6   [2,3,5]
  7     [1,5]

In reality, there are thousands of rows in the data frame with categories between 1 to 25. I want to see how many times these categories co-occurred in that column. The output should be as a matrix or data frame.

Matrix Output:

     [,1] [,2] [,3] [,4] [,5]
[1,]    0    2    1    2    3
[2,]    2    0    3    1    3
[3,]    1    3    0    1    2
[4,]    2    1    1    0    1
[5,]    3    3    2    1    0

Dataframe Output:

     C1 C2     Count
     1  2      2
     1  3      1
     1  4      2
     1  5      3
     2  3      3
     2  4      1
     2  5      3
     3  4      1
     3  5      2
     4  5      1

Can anybody help me in this regard?

CodePudding user response:

We may use crossprod with table after splitting the 'Category' column with strsplit (using only base R functions)

out <- crossprod(table(subset(stack(setNames(lapply(strsplit(df$Category, 
    "[][]|,\\s*"), trimws), df$Id))[2:1], nzchar(values))))
diag(out) <- 0

-output

> out
      values
values 1 2 3 4 5
     1 0 2 1 2 3
     2 2 0 3 1 3
     3 1 3 0 1 2
     4 2 1 1 0 1
     5 3 3 2 1 0

In the above code, we extract the numeric part by splitting on the , and [] in strsplit, set the names of the returning list with Id column, stack the named list to a two column data.frame (stack), use table to get the frequency count and then wrap with crossprod on the table output. The diagonals are then replaced with 0


If we need the long format data.frame, then replace one of the triangular matrix to 0, before converting to table and subseting after wrapping with as.data.frame

out2 <-  subset(as.data.frame.table(replace(out, upper.tri(out), 
     0)), Freq != 0)
row.names(out2) <- NULL
colnames(out2) <- c("C1", "C2", "Count")

-output

> out2
   C1 C2 Count
1   2  1     2
2   3  1     1
3   4  1     2
4   5  1     3
5   3  2     3
6   4  2     1
7   5  2     3
8   4  3     1
9   5  3     2
10  5  4     1

data

df <- structure(list(Id = 1:7, Category = c("[1,2,3,4]", "[2,3,5]", 
"[1,4,5]", "[1,2,5]", "[4]", "[2,3,5]", "[1,5]")), 
class = "data.frame", row.names = c(NA, 
-7L))

CodePudding user response:

another option:

df %>%
  group_by(Id) %>%
  mutate(Category = list(reticulate::py_eval(Category))) %>%
  unnest(Category) %>%
  table()  %>%
  crossprod() %>%
  as.data.frame.table() %>%
  filter(Category!=Category.1)

   Category Category.1 Freq
1         2          1    2
2         3          1    1
3         4          1    2
4         5          1    3
5         1          2    2
6         3          2    3
7         4          2    1
8         5          2    3
9         1          3    1
10        2          3    3
11        4          3    1
12        5          3    2
13        1          4    2
14        2          4    1
15        3          4    1
16        5          4    1
17        1          5    3
18        2          5    3
19        3          5    2
20        4          5    1
  • Related