How to create pairs from a single column counting the occurrence in R?

Time:11-16

I'm building an edges file for a social network analysis based on IMDb data. I've run into a problem that I can't figure out how to fix, as I'm new to R.

Assuming I have the following dataframe:

movieID <- c('A', 'A','A', 'B','B', 'C','C', 'C')
crewID <- c('Z', 'Y', 'X', 'Z','V','V', 'X', 'Y')
rating <- c('7.3','7.3', '7.3', '2.1', '2.1', '9.0','9.0', '9.0')
df <- data.frame(movieID, crewID, rating)
movieID crewID rating
A       Z      7.3
A       Y      7.3
A       X      7.3
B       Z      2.1
B       V      2.1
C       V      9.0
C       X      9.0
C       Y      9.0

I am trying to build unique pairs of CrewIDs within a movie with a weight that equals the occurrence of that pair, meaning how often these two crew members have worked on a movie together. So basically I want a dataframe like the following as a result:

CrewID1 CrewID2 weight  (explanation, not a column)
Z       Y       1       together once in movie A
Z       X       1       together once in movie A
Y       X       2       together twice in movies A and C
Z       V       1       together once in movie B
V       X       1       together once in movie C
V       Y       1       together once in movie C

The pairs (Z,Y) and (Y,Z) are equal to each other as I don't care about direction.
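To make the "direction doesn't matter" part concrete, here is a minimal base R sketch. The `pairs` data frame and its `from`/`to` columns are made up for illustration and are not part of the IMDb data. The idea is to always put the alphabetically smaller ID first, so (Z,Y) and (Y,Z) end up as the same key before counting.

```r
# Hypothetical directed pairs that contain both orderings of the same people
pairs <- data.frame(
  from = c("Z", "Y", "X", "Y"),
  to   = c("Y", "Z", "Y", "X"),
  stringsAsFactors = FALSE
)

# Put the alphabetically smaller ID first so both orderings share one key
pairs$id1 <- pmin(pairs$from, pairs$to)
pairs$id2 <- pmax(pairs$from, pairs$to)

# Count how often each unordered pair occurs
pairs$n <- 1L
undirected <- aggregate(n ~ id1 + id2, data = pairs, FUN = sum)
print(undirected)
```

Sorting the IDs inside each pair is only one possible convention; any rule that maps both orderings to the same key would work as well.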

I found the following StackOverflow thread on a similar issue: How to create pairs from a single column based on order of occurrence in R?

However, in my case this skips the combinations (V,Y) and (X,Z), and the count for (X,Y) is still 1. I can't figure out how to fix it.
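A possible explanation for the skipped pairs, assuming the linked approach pairs each row only with the row that follows it (I have not verified this against that thread): with three crew members per movie, consecutive pairing yields two pairs, while all combinations yield three. A small sketch for movie A:

```r
# Crew of movie A, in the order the rows appear
crew_a <- c("Z", "Y", "X")

# Consecutive pairing (assumed behaviour of the linked approach):
# each member is paired only with the next one, so (Z,X) is never produced
consecutive <- cbind(head(crew_a, -1), tail(crew_a, -1))

# All unordered pairs within the movie
all_pairs <- t(combn(sort(crew_a), 2))

print(consecutive)
print(all_pairs)
```

The same reasoning would explain the missing (V,Y) in movie C.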

CodePudding user response:

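# Movie-by-crew incidence table; crossprod() turns it into crew-by-crew co-occurrence counts
m <- crossprod(table(df[-3]))
# Zero the diagonal and upper triangle so each unordered pair is kept only once
m[upper.tri(m, diag = TRUE)] <- 0
# Reshape to long format and keep pairs that worked together at least once
subset(as.data.frame.table(m), Freq > 0)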

   CrewID CrewID.1 Freq
2       X        V    1
3       Y        V    1
4       Z        V    1
7       Y        X    2
8       Z        X    1
12      Z        Y    1
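To see why this works, it may help to look at the intermediate objects. The sketch below rebuilds the question's data and uses my own variable names (`inc`, `co`). `crossprod(inc)` is the same as `t(inc) %*% inc`, so entry [i, j] counts the movies in which crew members i and j both appear, and the diagonal holds each person's own movie count.

```r
movieID <- c("A", "A", "A", "B", "B", "C", "C", "C")
crewID  <- c("Z", "Y", "X", "Z", "V", "V", "X", "Y")

# Incidence table: one row per movie, one column per crew member (1 = worked on it)
inc <- table(movieID, crewID)

# Co-occurrence matrix: crew by crew, counting shared movies
co <- crossprod(inc)

print(inc)
print(co)

# X and Y share movies A and C
co["X", "Y"]
```

The matrix is symmetric, which is why the answer zeroes out the upper triangle before converting it to a data frame.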

CodePudding user response:

Maybe not the most efficient solution but this would be one way of doing it:

# Load the pipe operator (%>%) used throughout
library(magrittr)

# Define a function that generates pairs of ids
make_pairs <- function(data) {
    # Extract all ids in the movie
    data$crew %>%
        # Organize them alphabetically
        sort() %>%
        # Generate all unique pairs
        combn(2) %>%
        # Prep for map: one column per pair, kept as character
        as.data.frame(stringsAsFactors = FALSE) %>%
        # Generate pairs as single string
        purrr::map_chr(stringr::str_flatten, '_')
}
# Generate the data
tibble::tibble(
    movie = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
    crew = c('Z', 'Y', 'X', 'Z', 'V', 'V', 'X', 'Y')
) %>%
    # Nest the data so all ids in one movie get put together
    tidyr::nest(data = -movie) %>%
    # Generate pairs of interactions
    dplyr::mutate(
        pairs = purrr::map(data, make_pairs)
    ) %>%
    # Expand all pairs
    tidyr::unnest(cols = pairs) %>%
    # Separate them into unique columns
    tidyr::separate(pairs, c('id1', 'id2')) %>%
    # Count the number of times two ids co-occur
    dplyr::count(id1, id2)

# A tibble: 6 x 3
  id1   id2       n
  <chr> <chr> <int>
1 V     X         1
2 V     Y         1
3 V     Z         1
4 X     Y         2
5 X     Z         1
6 Y     Z         1
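If you would rather avoid the extra packages, a base R self-join is another option. This is only a sketch and not taken from either answer; the column name `weight` and the suffixes `1`/`2` are my own choices. The data frame is merged with itself on the movie, each unordered pair is kept once, and the rows are then counted.

```r
df <- data.frame(
  movieID = c("A", "A", "A", "B", "B", "C", "C", "C"),
  crewID  = c("Z", "Y", "X", "Z", "V", "V", "X", "Y"),
  stringsAsFactors = FALSE
)

# Self-join on movie: every crew member paired with every member of the same movie
joined <- merge(df, df, by = "movieID", suffixes = c("1", "2"))

# Keep each unordered pair once (this also drops self-pairs)
joined <- joined[joined$crewID1 < joined$crewID2, ]

# Count the movies per pair
joined$weight <- 1L
edges <- aggregate(weight ~ crewID1 + crewID2, data = joined, FUN = sum)
print(edges)
```

The self-join builds all n^2 combinations per movie before filtering, so for movies with very large crews the `crossprod` approach is likely the lighter one.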