Assess probability of co-occurrence of two variables

Time:11-01

I have a dataframe that looks like this:

  Cluster.x Cluster.y
1         5         7
2         4         2
3         4         7
4         4         4
5         1         4
6         4         4

dput:

structure(list(Cluster.x = c(5L, 4L, 4L, 4L, 1L, 4L), Cluster.y = c(7L, 
2L, 7L, 4L, 4L, 4L)), row.names = c(NA, 6L), class = "data.frame")

I now want to know the pairing probability of each cluster with every other cluster. It doesn't matter whether a cluster appears in column x or y: cluster 4 is always the same cluster 4. For context: the clusters are groups of countries and each row represents a conflict, so for every cluster I want to know its probability of conflict with every other cluster. Can anyone point me in the right direction?

EDIT: My desired output would be a dataframe with each combination of clusters and their corresponding count and probability in another column.

CodePudding user response:

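Within each row, order the pair so the lower-numbered cluster is always in the first column (this makes 4-7 and 7-4 the same pair), then count each pair and divide by the total number of rows to get its relative frequency.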

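library(dplyr)

df %>%
  # put the smaller cluster id first so (4, 7) and (7, 4) are the same pair
  mutate(Cx = pmin(Cluster.x, Cluster.y),
         Cy = pmax(Cluster.x, Cluster.y)) %>%
  # count each unordered pair, then turn the counts into proportions
  count(Cx, Cy) %>%
  mutate(p = n / sum(n))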
#   Cx Cy n         p
# 1  1  4 1 0.1666667
# 2  2  4 1 0.1666667
# 3  4  4 2 0.3333333
# 4  4  7 1 0.1666667
# 5  5  7 1 0.1666667
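
If you would rather avoid dplyr, the same idea can be sketched in base R. This is only a rough sketch, not a definitive answer: the names `pairs`, `pair_counts`, `long` and `per_cluster` are made up for illustration. The second half rests on an assumption about the question, namely that "probability for every cluster" means the share of a cluster's conflicts that involve each partner, with a within-cluster conflict such as 4-4 counted once.

```r
# Example data from the question
df <- data.frame(Cluster.x = c(5L, 4L, 4L, 4L, 1L, 4L),
                 Cluster.y = c(7L, 2L, 7L, 4L, 4L, 4L))

# Order each pair so that (4, 7) and (7, 4) are treated as the same pairing
pairs <- data.frame(Cx = pmin(df$Cluster.x, df$Cluster.y),
                    Cy = pmax(df$Cluster.x, df$Cluster.y))

# Count each unordered pair and convert the counts to overall proportions
pair_counts <- aggregate(n ~ Cx + Cy, data = cbind(pairs, n = 1L), FUN = sum)
pair_counts$p <- pair_counts$n / sum(pair_counts$n)

# ASSUMPTION: per-cluster view. Each conflict is listed once from the side of
# each cluster involved; a self-pair such as (4, 4) is listed only once.
self <- pairs$Cx == pairs$Cy
long <- rbind(
  data.frame(cluster = pairs$Cx, partner = pairs$Cy),
  data.frame(cluster = pairs$Cy[!self], partner = pairs$Cx[!self])
)
long$n <- 1L

# Count conflicts per (cluster, partner), then divide by the cluster's total
per_cluster <- aggregate(n ~ cluster + partner, data = long, FUN = sum)
per_cluster$p <- per_cluster$n /
  ave(per_cluster$n, per_cluster$cluster, FUN = sum)

print(pair_counts)
print(per_cluster[order(per_cluster$cluster, per_cluster$partner), ])
```

`pair_counts` should match the table above. In `per_cluster`, the proportions sum to 1 within each cluster; for example, cluster 4 is involved in five conflicts, two of them within cluster 4, so its row for partner 4 has p = 0.4.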