I have a dataframe that looks like this:
  Cluster.x Cluster.y
1         5         7
2         4         2
3         4         7
4         4         4
5         1         4
6         4         4
dput:
structure(list(Cluster.x = c(5L, 4L, 4L, 4L, 1L, 4L), Cluster.y = c(7L,
2L, 7L, 4L, 4L, 4L)), row.names = c(NA, 6L), class = "data.frame")
I now want to know the pairing probability of each cluster with every other cluster (it doesn't matter whether a cluster appears in column x or y; cluster 4 is always the same cluster 4). To give some context: the clusters are groups of countries and each row represents a conflict. For every cluster, I want to know its probability of conflict with every other cluster. Can anyone point me in the right direction?
EDIT: My desired output would be a dataframe with each combination of clusters and their corresponding count and probability in another column.
CodePudding user response:
Sort your data so the lower-numbered cluster is always in the first column, then count and compute frequencies.
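library(dplyr)

df %>%
  # order each pair so that (4, 7) and (7, 4) count as the same combination
  mutate(Cx = pmin(Cluster.x, Cluster.y),
         Cy = pmax(Cluster.x, Cluster.y)) %>%
  count(Cx, Cy) %>%        # number of conflicts per unordered pair
  mutate(p = n / sum(n))   # share of all conflicts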
#   Cx Cy n         p
# 1  1  4 1 0.1666667
# 2  2  4 1 0.1666667
# 3  4  4 2 0.3333333
# 4  4  7 1 0.1666667
# 5  5  7 1 0.1666667
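Note that p above is the share of all conflicts that fall on each pair. If what you want instead is, for every cluster, how its own conflicts are split across partners, a rough sketch could look like the one below. This is only one possible reading of the question; the names both_ways, per_cluster, cluster and partner are made up for illustration, and counting a within-cluster conflict (4 vs 4) once rather than twice is an assumption you may want to change.

```r
library(dplyr)

df <- data.frame(Cluster.x = c(5L, 4L, 4L, 4L, 1L, 4L),
                 Cluster.y = c(7L, 2L, 7L, 4L, 4L, 4L))

# List every conflict once from each participant's point of view.
# Assumption: a within-cluster conflict (e.g. 4 vs 4) is kept only once.
both_ways <- bind_rows(
  transmute(df, cluster = Cluster.x, partner = Cluster.y),
  df %>%
    filter(Cluster.x != Cluster.y) %>%
    transmute(cluster = Cluster.y, partner = Cluster.x)
)

per_cluster <- both_ways %>%
  count(cluster, partner) %>%
  group_by(cluster) %>%
  mutate(p = n / sum(n)) %>%   # share of this cluster's conflicts per partner
  ungroup()

print(per_cluster)
```

Here p sums to 1 within each cluster, so the row for cluster 4 and partner 7 reads as "of all conflicts involving cluster 4, this fraction was with cluster 7".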