Home > Enterprise >  How do I find all the rows that share column value 2 based on column value 1?
How do I find all the rows that share column value 2 based on column value 1?

Time:02-11

I have a dataframe that looks like this

Movie          Actor      Year 
Ocean's Eleven Brad Pitt  2001
Ocean's Eleven George Clooney 2001
Ocean's Eleven Matt Damon 2001
Pulp Fiction   Samuel L. Jackson 1994
Pulp Fiction   John Travolta 1994

etcc...

It basically is a long list of movies with the main actors in that movie Now my question: how can I find all the movies that share atleast one actor?

For example:

  • Ocean's Eleven and Ocean's Twelve both share Brad Pitt and George Clooney
  • Ocean's Eleven and Fury they both have Brad Pitt
  • Ocean's Eleven and Pulp Fiction do not share any main actors so it should not be found

I know I can use the intersect function to find overlapping elements in vectors/lists however I'm not sure how to do this for all unique movies in my dataframe! I thought of writing a for loop but I figured that there is probably an easier method.

CodePudding user response:

split Movie by Actor then subset for lengths > 1.

split(dat$Movie, dat$Actor) |> {\(.) .[lengths(.) > 1]}()
# $`Brad Pitt`
# [1] "Ocean's Eleven" "Fury"          

Data:

dat <- structure(list(Movie = c("Ocean's Eleven", "Ocean's Eleven", 
"Ocean's Eleven", "Fury", "Fury", "Pulp Fiction", "Pulp Fiction"
), Actor = c("Brad Pitt", "George Clooney", "Matt Damon", "Brad Pitt", 
"Someone Else", "Samuel L. Jackson", "John Travolta"), Year = c(2001L, 
2001L, 2001L, 2222L, 2222L, 1994L, 1994L)), row.names = c(NA, 
-7L), class = "data.frame")

CodePudding user response:

We can use group_by(Actor) to group Actor together, and filter any Actor that appears more than 1 time.

Example input

library(tidyverse)

set.seed(3)
df <- tibble(Actor = sample(LETTERS[1:7], 10, replace  = T), 
             Movie = paste0("Movie", sample(1:5, 10, replace = T))) %>% 
  arrange(Actor)

# A tibble: 10 x 2
   Actor Movie 
   <chr> <chr> 
 1 A     Movie1
 2 B     Movie3
 3 C     Movie3
 4 C     Movie3
 5 C     Movie4
 6 E     Movie5
 7 E     Movie5
 8 E     Movie5
 9 E     Movie2
10 F     Movie3

Code and output

df %>% group_by(Actor) %>% filter(n() > 1)

# A tibble: 7 x 2
# Groups:   Actor [2]
  Actor Movie 
  <chr> <chr> 
1 C     Movie3
2 C     Movie3
3 C     Movie4
4 E     Movie5
5 E     Movie5
6 E     Movie5
7 E     Movie2
  • Related