I have a data frame of radio programms with ~7 million rows, ~130 radio channels and ~130K musicians or bands (and lots of variables). The df looks like this:
| Channel | Performer|
| --------| -------- |
| Radio1 | Rihanna |
| Radio1 | ACDC |
| Radio2 | Jay-Z |
| Radio3 | ACDC |
| Radio2 | Jay-Z |
| Radio1 | Rihanna |
| Radio2 | ACDC |
| Radio3 | Jay-Z |
| Radio1 | Rihanna |
| Radio1 | ACDC |
| Radio2 | Jay-Z |
| Radio3 | ACDC |
| Radio2 | Rihanna |
| Radio1 | Rihanna |
| Radio2 | ACDC |
| Radio1 | Jay-Z |
I would like to know which are the 3 most famous performers on the radio channels and how many times were played and get a table like this (or a pivot or whatever, just get the information):
|Channel|No1 Performer|No2 Performer|No3 Performer|No1 Plays|No2 Plays|No3 Plays|
|-------|-------------|-------------|-------------|---------|---------|---------|
|Radio1 |Rihanna |ACDC |Jay-Z |4 |2 |1 |
|Radio2 |Jay-Z |ACDC |Rihanna |3 |2 |1 |
|Radio3 |ACDC |Jay-Z |- |2 |1 |0 |
CodePudding user response:
Package dplyr
is helpful for these data manipulations.
count
will summarise the dataframe by collapsing the rows into their countsslice_max
will keep only the rows with the top 3 singers per group.
library(dplyr)
df |>
# Count instances
count(Channel, Performer) |>
group_by(Channel) |>
# Keep only the top 3 per channel
slice_max(order_by = n, n = 3)
If you want to reshape it, pivot_wider
from tidyr
can do that for you.
CodePudding user response:
library(tidyverse)
df %>%
group_by(Channel, Performer) %>%
tally() %>%
slice_max(n, n=3) %>%
mutate(name = rank(-n, ties = 'first')) %>%
pivot_wider(Channel, values_from = c(Performer, n))
Channel Performer_1 Performer_2 Performer_3 n_1 n_2 n_3
<chr> <chr> <chr> <chr> <int> <int> <int>
1 Radio1 Rihanna ACDC Jay-Z 4 2 1
2 Radio2 Jay-Z ACDC Rihanna 3 2 1
3 Radio3 ACDC Jay-Z NA 2 1 NA
CodePudding user response:
Another solution, instead of tally()
you can combine n()
and rowid()
library(tidyverse)
set.seed(4321)
example = data.frame(
Channel = sample(c('Radio1','Radio2','Radio3'),20,replace = TRUE),
Performer = sample(c('Rihanna','ACDC','Jay-Z'),20,replace = TRUE)
)
example
> example
Channel Performer
1 Radio1 Jay-Z
2 Radio2 Jay-Z
3 Radio3 ACDC
4 Radio2 Jay-Z
5 Radio1 Jay-Z
6 Radio1 Rihanna
7 Radio2 ACDC
8 Radio2 ACDC
9 Radio3 Rihanna
10 Radio1 ACDC
11 Radio3 Rihanna
12 Radio1 Rihanna
13 Radio2 Jay-Z
14 Radio2 Jay-Z
15 Radio2 ACDC
16 Radio3 Rihanna
17 Radio1 Jay-Z
18 Radio2 Jay-Z
19 Radio3 Rihanna
20 Radio1 ACDC
Code:
example %>%
group_by(Channel,Performer) %>%
summarise(times = n()) %>%
arrange(desc(times),.by_group=TRUE) %>%
slice_max(times, n=3) %>%
mutate(ranking = data.table::rowid(Channel,prefix = 'No'))
# A tibble: 7 x 4
# Groups: Channel [3]
Channel Performer times ranking
<chr> <chr> <int> <chr>
1 Radio1 Jay-Z 3 No1
2 Radio1 ACDC 2 No2
3 Radio1 Rihanna 2 No3
4 Radio2 Jay-Z 5 No1
5 Radio2 ACDC 3 No2
6 Radio3 Rihanna 4 No1
7 Radio3 ACDC 1 No2
If you want to pivot, add:
pivot_wider(names_from = ranking, values_from = c(Performer, times))
Output:
# A tibble: 3 x 7
# Groups: Channel [3]
Channel Performer_No1 Performer_No2 Performer_No3 times_No1 times_No2 times_No3
<chr> <chr> <chr> <chr> <int> <int> <int>
1 Radio1 Jay-Z ACDC Rihanna 3 2 2
2 Radio2 Jay-Z ACDC NA 5 3 NA
3 Radio3 Rihanna ACDC NA 4 1 NA