So I'm working on creating an edges file for a social network analysis based on IMDb data. And I've run into a problem and I can't figure out how to fix it as I'm new to R.
Assuming I have the following dataframe:
movieID <- c('A', 'A','A', 'B','B', 'C','C', 'C')
crewID <- c('Z', 'Y', 'X', 'Z','V','V', 'X', 'Y')
rating <- c('7.3','7.3', '7.3', '2.1', '2.1', '9.0','9.0', '9.0')
df <- data.frame(movieID, crewID, rating)
movieID | CrewID | Rating |
---|---|---|
A | Z | 7.3 |
A | Y | 7.3 |
A | X | 7.3 |
B | Z | 2.1 |
B | V | 2.1 |
C | V | 9.0 |
C | X | 9.0 |
C | Y | 9.0 |
I am trying to build unique pairs of CrewIDs within a movie with a weight that equals the occurrence of that pair, meaning how often these two crew members have worked on a movie together. So basically I want a dataframe like the following as a result:
CrewID1 | CrewID2 | weight | (not a col but explanation) |
---|---|---|---|
Z | Y | 1 | together once in movie A |
Z | X | 1 | together once in movie A |
Y | X | 2 | together twice in movies A and C |
Z | V | 1 | together once in movie B |
V | X | 1 | together once in movie C |
V | Y | 1 | together once in movie C |
The pairs (Z,Y) and (Y,Z) are equal to each other as I don't care about direction.
I found the following StackOverflow thread on a similar issue: How to create pairs from a single column based on order of occurrence in R?
However in my case this skips the combination (V,Y) and (X,Z) and the count for (X,Y) is still 1 and I can't figure out how to fix it.
CodePudding user response:
m <- crossprod(table(df[-3]))
m[upper.tri(m, diag = TRUE)] <-0
subset(as.data.frame.table(m), Freq > 0)
CrewID CrewID.1 Freq
2 X V 1
3 Y V 1
4 Z V 1
7 Y X 2
8 Z X 1
12 Z Y 1
CodePudding user response:
Maybe not the most efficient solution but this would be one way of doing it:
# Define a function that generates pairs of ids
make_pairs <- function(data){
# Extract all ids in the movie
data$crew %>%
# Organize them alphabetically
sort() %>%
# Generate all unique pairs
combn(2) %>%
# Prep for map
as.data.frame() %>%
# Generate pairs as single string
purrr::map_chr(str_flatten, '_')
}
# Generate the data
tibble::tibble(
movie = c('A', 'A', 'A', 'B','B', "C", 'C', 'C'),
crew = c('Z', 'Y', 'X', 'Z', 'V', 'V', 'X', 'Y')
) %>%
# Nest the data so all ids in one movie gets put together
tidyr::nest(data = -movie) %>%
# Generate pairs of interactions
dplyr::mutate(
pairs = purrr::map(data, make_pairs)
) %>%
# Expand all pairs
tidyr::unnest(cols = pairs) %>%
# Separate them into unique colums
tidyr::separate(pairs, c('id1', 'id2')) %>%
# Count the number of times two ids co-occure
dplyr::count(id1, id2)
# A tibble: 6 x 3
id1 id2 n
<chr> <chr> <int>
1 V X 1
2 V Y 1
3 V Z 1
4 X Y 2
5 X Z 1
6 Y Z 1