Home > Blockchain >  Spreading data within dplyr pipe; dealing with multiple entries by group
Spreading data within dplyr pipe; dealing with multiple entries by group

Time:11-01

I am trying to convert a long dataframe to a wide dataframe using the spread function in tidyr and within a dplyr pipe. My data is in the following format:-

df<-structure(list(Hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
                        1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
               dt = c("05/06/2021", "07/06/2021", "08/06/2021", "10/06/2021", 
                      "12/06/2021", "12/06/2021", "14/06/2021", "14/06/2021", "15/06/2021", 
                      "09/06/2021", "10/06/2021", "10/06/2021", "11/06/2021", "13/06/2021", 
                      "14/06/2021", "05/06/2021", "06/06/2021", "07/06/2021", "08/06/2021", 
                      "08/06/2021", "09/06/2021", "09/06/2021", "10/06/2021", "11/06/2021", 
                      "12/06/2021"), Programme_watched = c("Wrestling", "Drama", 
                                                           "Drama", "Football", "Rugby", "Movie", "Movie", "News", "Music", 
                                                           "Tennis", "Racing", "Racing", "Documentary ", "Movie", "Movie", 
                                                           "Athletics", "News", "Children", "Music", "Movie", "Football", 
                                                           "News", "Football", "Rugby", "Drama")), class = "data.frame", row.names = c(NA, 
                                                                                                                                       -25L))


df

   Hour         dt Programme_watched
1     0 05/06/2021         Wrestling
2     0 07/06/2021             Drama
3     0 08/06/2021             Drama
4     0 10/06/2021          Football
5     0 12/06/2021             Rugby
6     0 12/06/2021             Movie
7     0 14/06/2021             Movie
8     0 14/06/2021              News
9     0 15/06/2021             Music
10    1 09/06/2021            Tennis
11    1 10/06/2021            Racing
12    1 10/06/2021            Racing
13    1 11/06/2021      Documentary 
14    1 13/06/2021             Movie
15    1 14/06/2021             Movie
16    2 05/06/2021         Athletics
17    2 06/06/2021              News
18    2 07/06/2021          Children
19    2 08/06/2021             Music
20    2 08/06/2021             Movie
21    2 09/06/2021          Football
22    2 09/06/2021              News
23    2 10/06/2021          Football
24    2 11/06/2021             Rugby
25    2 12/06/2021             Drama

I want to be able to make this wider, turning the dates into the column headings with the Programme_watched variable as the value. However, this is what I get:-

df%>%
  group_by(Hour,dt)%>%
  summarise(Programmes=Programme_watched)%>%
  spread(dt, Programmes)


`summarise()` has grouped output by 'Hour', 'dt'. You can override using the `.groups` argument.
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 10 rows:
* 19, 20
* 21, 22
* 11, 12
* 5, 6
* 7, 8

Is it because there is more than one value for some of the grouping variables i.e. Hour and dt ? If so, I would like these two values to be pasted into the one cell together, then spread by the dt variable. Can anyone help?

CodePudding user response:

Are you looking for such a solution? spreadis superseded:

Development on spread() is complete, and for new code we recommend switching to pivot_wider(), which is easier to use, more featureful, and still under active development. df %>% spread(key, value) is equivalent to df %>% pivot_wider(names_from = key, values_from = value)

df%>%
  group_by(Hour,dt)%>%
  summarise(Programmes=Programme_watched) %>% 
  ungroup() %>% 
  pivot_wider(
    names_from= dt, 
    values_from = Programmes
  ) %>% 
  unnest()
# A tibble: 6 x 12
   Hour `05/06/2021` `07/06/2021` `08/06/2021` `10/06/2021` `12/06/2021` `14/06/2021` `15/06/2021`
  <int> <chr>        <chr>        <chr>        <chr>        <chr>        <chr>        <chr>       
1     0 Wrestling    Drama        Drama        Football     Rugby        Movie        Music       
2     0 Wrestling    Drama        Drama        Football     Movie        News         Music       
3     1 NA           NA           NA           Racing       NA           Movie        NA          
4     1 NA           NA           NA           Racing       NA           Movie        NA          
5     2 Athletics    Children     Music        Football     Drama        NA           NA          
6     2 Athletics    Children     Movie        Football     Drama        NA           NA          
# ... with 4 more variables: 09/06/2021 <chr>, 11/06/2021 <chr>, 13/06/2021 <chr>, 06/06/2021 <chr>

OR

library(tidyr)
library(dplyr)
df %>%
  pivot_wider(
    names_from= dt, 
    values_from = Programme_watched
  ) %>% 
  unnest() %>% 
  distinct()
   Hour `05/06/2021` `07/06/2021` `08/06/2021` `10/06/2021` `12/06/2021` `14/06/2021` `15/06/2021` `09/06/2021` `11/06/2021`   `13/06/2021` `06/06/2021`
  <int> <chr>        <chr>        <chr>        <chr>        <chr>        <chr>        <chr>        <chr>        <chr>          <chr>        <chr>       
1     0 Wrestling    Drama        Drama        Football     Rugby        Movie        Music        NA            NA            NA           NA          
2     0 Wrestling    Drama        Drama        Football     Movie        News         Music        NA            NA            NA           NA          
3     1 NA           NA           NA           Racing       NA           Movie        NA           Tennis       "Documentary " Movie        NA          
4     2 Athletics    Children     Music        Football     Drama        NA           NA           Football     "Rugby"        NA           News        
5     2 Athletics    Children     Movie        Football     Drama        NA           NA           News         "Rugby"        NA           News   

CodePudding user response:

An option is also to create a sequence by group (rowid) before the pivot_wider to take care of the duplicate elements

library(data.table)
library(tidyr)
library(dplyr)
df%>%
  group_by(Hour,dt)%>%
  summarise(Programmes=Programme_watched, .groups = 'drop') %>% 
  mutate(rn = rowid(Hour, dt)) %>% 
  pivot_wider(names_from = dt, values_from = Programmes) %>%
  select(-rn)

-output

# A tibble: 6 × 12
   Hour `05/06/2021` `07/06/2021` `08/06/2021` `10/06/2021` `12/06/2021` `14/06/2021` `15/06/2021` `09/06/2021` `11/06/2021`   `13/06/2021` `06/06/2021`
  <int> <chr>        <chr>        <chr>        <chr>        <chr>        <chr>        <chr>        <chr>        <chr>          <chr>        <chr>       
1     0 Wrestling    Drama        Drama        Football     Rugby        Movie        Music        <NA>          <NA>          <NA>         <NA>        
2     0 <NA>         <NA>         <NA>         <NA>         Movie        News         <NA>         <NA>          <NA>          <NA>         <NA>        
3     1 <NA>         <NA>         <NA>         Racing       <NA>         Movie        <NA>         Tennis       "Documentary " Movie        <NA>        
4     1 <NA>         <NA>         <NA>         Racing       <NA>         <NA>         <NA>         <NA>          <NA>          <NA>         <NA>        
5     2 Athletics    Children     Music        Football     Drama        <NA>         <NA>         Football     "Rugby"        <NA>         News        
6     2 <NA>         <NA>         Movie        <NA>         <NA>         <NA>         <NA>         News          <NA>          <NA>         <NA>     
  • Related