I have a sample data frame as follows and I need to explore the Category column.
> df
Id Category
1 [1,2,3,4]
2 [2,3,5]
3 [1,4,5]
4 [1,2,5]
5 [4]
6 [2,3,5]
7 [1,5]
In reality, there are thousands of rows in the data frame with categories between 1 to 25. I want to see how many times these categories co-occurred in that column. The output should be as a matrix or data frame.
Matrix Output:
[,1] [,2] [,3] [,4] [,5]
[1,] 0 2 1 2 3
[2,] 2 0 3 1 3
[3,] 1 3 0 1 2
[4,] 2 1 1 0 1
[5,] 3 3 2 1 0
Dataframe Output:
C1 C2 Count
1 2 2
1 3 1
1 4 2
1 5 3
2 3 3
2 4 1
2 5 3
3 4 1
3 5 2
4 5 1
Can anybody help me in this regard?
CodePudding user response:
We may use crossprod
with table
after splitting the 'Category' column with strsplit
(using only base R
functions)
out <- crossprod(table(subset(stack(setNames(lapply(strsplit(df$Category,
"[][]|,\\s*"), trimws), df$Id))[2:1], nzchar(values))))
diag(out) <- 0
-output
> out
values
values 1 2 3 4 5
1 0 2 1 2 3
2 2 0 3 1 3
3 1 3 0 1 2
4 2 1 1 0 1
5 3 3 2 1 0
In the above code, we extract the numeric part by splitting on the ,
and []
in strsplit
, set the names of the returning list
with Id
column, stack
the named
list
to a two column data.frame (stack
), use table
to get the frequency count and then wrap with crossprod
on the table
output. The diag
onals are then replaced with 0
If we need the long format data.frame, then replace
one of the triangular matrix to 0, before converting to table
and subset
ing after wrapping with as.data.frame
out2 <- subset(as.data.frame.table(replace(out, upper.tri(out),
0)), Freq != 0)
row.names(out2) <- NULL
colnames(out2) <- c("C1", "C2", "Count")
-output
> out2
C1 C2 Count
1 2 1 2
2 3 1 1
3 4 1 2
4 5 1 3
5 3 2 3
6 4 2 1
7 5 2 3
8 4 3 1
9 5 3 2
10 5 4 1
data
df <- structure(list(Id = 1:7, Category = c("[1,2,3,4]", "[2,3,5]",
"[1,4,5]", "[1,2,5]", "[4]", "[2,3,5]", "[1,5]")),
class = "data.frame", row.names = c(NA,
-7L))
CodePudding user response:
another option:
df %>%
group_by(Id) %>%
mutate(Category = list(reticulate::py_eval(Category))) %>%
unnest(Category) %>%
table() %>%
crossprod() %>%
as.data.frame.table() %>%
filter(Category!=Category.1)
Category Category.1 Freq
1 2 1 2
2 3 1 1
3 4 1 2
4 5 1 3
5 1 2 2
6 3 2 3
7 4 2 1
8 5 2 3
9 1 3 1
10 2 3 3
11 4 3 1
12 5 3 2
13 1 4 2
14 2 4 1
15 3 4 1
16 5 4 1
17 1 5 3
18 2 5 3
19 3 5 2
20 4 5 1