I am looking for some tips to use the dplyr package in r. I have extracted two movie datasets from a common df, where the variable 'type' tells you when the movie was nomineed and awarded. The original df was not tidy so there are duplicated rows for those movie that were awarded, as they necessarily have to be nominated as well. They look like the followings:
#df1
title | type| year| producer|
filmA nominee 2009 HBO
filmB nominee 2011 NETFLIX
filmC nominee 2011 HBO
filmD nominee 2017 AMAZON
filmE nominee 2017 NETFLIX
filmF nominee 2018 AMAZON
#df2
title | type| year| producer|
filmA awarded 2009 HBO
filmF awarded 2018 AMAZON
filmE awarded 2017 NETFLIX
I manage to bind the two dataset and the result looks like this:
#rbind(df1,df2)
title | type| year| producer|
filmA nominee 2009 HBO
filmB nominee 2011 NETFLIX
filmC nominee 2011 HBO
filmD nominee 2017 AMAZON
filmE nominee 2017 NETFLIX
filmF nominee 2018 AMAZON
filmA awarded 2009 HBO
filmF awarded 2018 AMAZON
filmE awarded 2017 NETFLIX
What I would like to have is a count of nominees and awards for each movie's producer and for each year, something like this:
producer| nr_of_nominee| nr_of_awards| year|
HBO 1 1 2009
NETFLIX 1 0 2011
HBO 1 0 2011
AMAZON 1 0 2017
NETFLIX 2 1 2017
AMAZON 2 1 2018
NETFLIX 1 0 2018
Instead what I get using the count() havee duplicated rows, like this:
producer| nr_of_nominee| nr_of_awards| year|
HBO 1 0 2009
HBO 0 1 2009
NETFLIX 1 0 2011
HBO 1 0 2011
AMAZON 1 0 2017
NETFLIX 2 0 2017
NETFLIX 0 1 2017
AMAZON 2 0 2018
AMAZON 0 1 2018
NETFLIX 1 0 2018
how can I fix? any help would be helpful thanks
CodePudding user response:
Is this what you're looking for?
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
movies <- data.frame(
title=c(paste0("film", LETTERS[1:6]), paste0("film", LETTERS[1:3])),
type=c(rep("nominee", 6), rep("awarded", 3)),
year=c(2009, 2011, 2011, 2017, 2017, 2018, 2009, 2018, 2017),
producer=c("HBO", "Netflix", "HBO", "Netflix", "Netflix", "Amazon", "HBO", "Amazon", "Netflix"))
movies |>
group_by(year, producer, type) |>
summarise(n = n()) |>
pivot_wider(names_from=type, values_from=n) |>
mutate(across(c(awarded, nominee), ~ifelse(is.na(.), 0, .)))
#> `summarise()` has grouped output by 'year', 'producer'. You can override using the `.groups` argument.
#> # A tibble: 5 × 4
#> # Groups: year, producer [5]
#> year producer awarded nominee
#> <dbl> <chr> <dbl> <int>
#> 1 2009 HBO 1 1
#> 2 2011 HBO 0 1
#> 3 2011 Netflix 0 1
#> 4 2017 Netflix 1 2
#> 5 2018 Amazon 1 1
Created on 2021-09-28 by the reprex package (v2.0.1)
CodePudding user response:
You can use count
and join.
library(dplyr)
df1 %>%
count(year, producer, name = 'nr_of_nominee') %>%
full_join(df2 %>%
count(year, producer, name = 'nr_of_awards'), by = c("year", "producer")) %>%
mutate(across(starts_with('nr'), tidyr::replace_na, 0))
# year producer nr_of_nominee nr_of_awards
#1 2009 HBO 1 1
#2 2011 HBO 1 0
#3 2011 NETFLIX 1 0
#4 2017 AMAZON 1 0
#5 2017 NETFLIX 1 1
#6 2018 AMAZON 1 1
data
It is easier to help if you provide data in a reproducible format
df1 <- structure(list(title = c("filmA", "filmB", "filmC", "filmD",
"filmE", "filmF"), type = c("nominee", "nominee", "nominee",
"nominee", "nominee", "nominee"), year = c(2009L, 2011L, 2011L,
2017L, 2017L, 2018L), producer = c("HBO", "NETFLIX", "HBO", "AMAZON",
"NETFLIX", "AMAZON")), row.names = c(NA, -6L), class = "data.frame")
df2 <- structure(list(title = c("filmA", "filmF", "filmE"), type = c("awarded",
"awarded", "awarded"), year = c(2009L, 2018L, 2017L), producer = c("HBO",
"AMAZON", "NETFLIX")), row.names = c(NA, -3L), class = "data.frame")