Home > Enterprise >  rbind on two dataframes produces redundant rows - how to tidy the data on a matching condition
rbind on two dataframes produces redundant rows - how to tidy the data on a matching condition

Time:09-28

I am looking for some tips to use the dplyr package in r. I have extracted two movie datasets from a common df, where the variable 'type' tells you when the movie was nomineed and awarded. The original df was not tidy so there are duplicated rows for those movie that were awarded, as they necessarily have to be nominated as well. They look like the followings:

#df1

title   |   type|  year|  producer|
filmA    nominee    2009    HBO
filmB    nominee    2011    NETFLIX
filmC    nominee    2011    HBO
filmD    nominee    2017    AMAZON
filmE    nominee    2017    NETFLIX 
filmF    nominee    2018    AMAZON

#df2

title   |   type|  year|  producer|
filmA    awarded    2009    HBO
filmF    awarded    2018    AMAZON
filmE    awarded    2017    NETFLIX

I manage to bind the two dataset and the result looks like this:

#rbind(df1,df2)

title   |   type|  year|  producer|
filmA    nominee    2009    HBO
filmB    nominee    2011    NETFLIX
filmC    nominee    2011    HBO
filmD    nominee    2017    AMAZON
filmE    nominee    2017    NETFLIX 
filmF    nominee    2018    AMAZON
filmA    awarded    2009    HBO
filmF    awarded    2018    AMAZON
filmE    awarded    2017    NETFLIX

What I would like to have is a count of nominees and awards for each movie's producer and for each year, something like this:

producer|  nr_of_nominee|  nr_of_awards| year|
HBO                 1           1       2009
NETFLIX             1           0       2011
HBO                 1           0       2011
AMAZON              1           0       2017
NETFLIX             2           1       2017
AMAZON              2           1       2018
NETFLIX             1           0       2018

Instead what I get using the count() havee duplicated rows, like this:

producer|  nr_of_nominee|  nr_of_awards| year|
HBO                 1           0       2009
HBO                 0           1       2009
NETFLIX             1           0       2011
HBO                 1           0       2011
AMAZON              1           0       2017
NETFLIX             2           0       2017
NETFLIX             0           1       2017
AMAZON              2           0       2018
AMAZON              0           1       2018
NETFLIX             1           0       2018

how can I fix? any help would be helpful thanks

CodePudding user response:

Is this what you're looking for?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
movies <- data.frame(
  title=c(paste0("film", LETTERS[1:6]), paste0("film", LETTERS[1:3])),
  type=c(rep("nominee", 6), rep("awarded", 3)),
  year=c(2009, 2011, 2011, 2017, 2017, 2018, 2009, 2018, 2017),
  producer=c("HBO", "Netflix", "HBO", "Netflix", "Netflix", "Amazon", "HBO", "Amazon", "Netflix"))

movies |> 
  group_by(year, producer, type) |> 
  summarise(n = n()) |> 
  pivot_wider(names_from=type, values_from=n) |> 
  mutate(across(c(awarded, nominee), ~ifelse(is.na(.), 0, .)))
#> `summarise()` has grouped output by 'year', 'producer'. You can override using the `.groups` argument.
#> # A tibble: 5 × 4
#> # Groups:   year, producer [5]
#>    year producer awarded nominee
#>   <dbl> <chr>      <dbl>   <int>
#> 1  2009 HBO            1       1
#> 2  2011 HBO            0       1
#> 3  2011 Netflix        0       1
#> 4  2017 Netflix        1       2
#> 5  2018 Amazon         1       1

Created on 2021-09-28 by the reprex package (v2.0.1)

CodePudding user response:

You can use count and join.

library(dplyr)

df1 %>%
  count(year, producer, name = 'nr_of_nominee') %>%
  full_join(df2 %>%
  count(year, producer, name = 'nr_of_awards'), by = c("year", "producer")) %>%
  mutate(across(starts_with('nr'), tidyr::replace_na, 0))

#  year producer nr_of_nominee nr_of_awards
#1 2009      HBO             1            1
#2 2011      HBO             1            0
#3 2011  NETFLIX             1            0
#4 2017   AMAZON             1            0
#5 2017  NETFLIX             1            1
#6 2018   AMAZON             1            1

data

It is easier to help if you provide data in a reproducible format

df1 <- structure(list(title = c("filmA", "filmB", "filmC", "filmD", 
"filmE", "filmF"), type = c("nominee", "nominee", "nominee", 
"nominee", "nominee", "nominee"), year = c(2009L, 2011L, 2011L, 
2017L, 2017L, 2018L), producer = c("HBO", "NETFLIX", "HBO", "AMAZON", 
"NETFLIX", "AMAZON")), row.names = c(NA, -6L), class = "data.frame")

df2 <- structure(list(title = c("filmA", "filmF", "filmE"), type = c("awarded", 
"awarded", "awarded"), year = c(2009L, 2018L, 2017L), producer = c("HBO", 
"AMAZON", "NETFLIX")), row.names = c(NA, -3L), class = "data.frame")
  • Related