R: Count overall unique objects in column of lists

Here is my scenario: I have a dataset with a column of lists of words (keyword tags for YouTube videos, where each row holds one video's data).

I want a complete count of every unique item inside these lists, across the entire column. In the end I want a table with two fields: keyword and count.

If I run a simple dplyr query, it treats each whole list as a single unique object. That is interesting too, but it is not what I want.

This is the dplyr query I would like to build on; I am not sure how to count the individual items inside the lists rather than the lists themselves:

vid_tag_freq = df %>%
  count(tags)

To further clarify:

With a dataset like:

     Tags
1    ['Dog', 'Cat', 'Mouse', 'Fish']
2    ['Cat', 'Fish']
3    ['Cat', 'Fish']

I am now getting: 

    Tags                                Count
1   ['Dog', 'Cat', 'Mouse', 'Fish']     1
2   ['Cat', 'Fish']                     2

What I actually want:

    Tags              Count
1   'Cat'             3
2   'Fish'            3
3   'Dog'             1
4   'Mouse'           1

I hope that explains it lol

EDIT: This is what my data looks like. I guess most entries are lists of lists? Maybe I should clean up the [0]s as null?

 [1] "[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]"
 [2] "[0]"
 [3] "[0]"
 [4] "[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]"
 [5] "[0]"
 [6] "[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]"
 [7] "[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]"
 [8] "[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]"
 [9] "[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]"
[10] "[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]"
[11] "[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]"
[12] "[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]"
[13] "[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]"
[14] "[['Blue Jays', 'Ashes', 'friends']]"
[15] "[['Robin Williams (Celebrity)', 'Peter Gzowski']]"

CodePudding user response:

It would help if you could dput() some of the data for a working example. Going off the idea that you have a list column, here are a couple of general solutions you may be able to work with:

# Example data: 10 rows, each holding 5 to 10 state names sampled with replacement
df <- tibble::tibble(
  x = replicate(10, sample(state.name, sample(5:10, 1), TRUE), simplify = FALSE)
)

df
#> # A tibble: 10 × 1
#>    x         
#>    <list>    
#>  1 <chr [7]> 
#>  2 <chr [7]> 
#>  3 <chr [8]> 
#>  4 <chr [6]> 
#>  5 <chr [8]> 
#>  6 <chr [8]> 
#>  7 <chr [8]> 
#>  8 <chr [6]> 
#>  9 <chr [5]> 
#> 10 <chr [10]>

# dplyr/tidyr in a data frame: unnest to one row per value, then count
df |> 
  tidyr::unnest(x) |> 
  dplyr::count(x)
#> # A tibble: 36 × 2
#>    x               n
#>    <chr>       <int>
#>  1 Alabama         1
#>  2 Alaska          1
#>  3 Arkansas        4
#>  4 California      3
#>  5 Colorado        5
#>  6 Connecticut     1
#>  7 Delaware        3
#>  8 Florida         1
#>  9 Georgia         3
#> 10 Hawaii          2
#> # … with 26 more rows

# vctrs: flatten the list column into one vector, then count each value
vctrs::vec_count(unlist(df$x))
#>               key count
#> 1        Colorado     5
#> 2       Louisiana     5
#> 3    North Dakota     4
#> 4     Mississippi     4
#> 5        Arkansas     4
#> 6        Delaware     3
#> 7         Vermont     3
#> 8       Minnesota     3
#> 9            Utah     3
#> 10     California     3
#> 11        Georgia     3
#> 12        Indiana     2
#> 13       Missouri     2
#> 14  New Hampshire     2
#> 15       Maryland     2
#> 16       Nebraska     2
#> 17         Hawaii     2
#> 18     New Jersey     2
#> 19       Oklahoma     2
#> 20  Massachusetts     1
#> 21       Illinois     1
#> 22          Texas     1
#> 23    Connecticut     1
#> 24   Rhode Island     1
#> 25       Michigan     1
#> 26       New York     1
#> 27           Ohio     1
#> 28         Nevada     1
#> 29        Florida     1
#> 30        Montana     1
#> 31      Wisconsin     1
#> 32        Alabama     1
#> 33         Alaska     1
#> 34 North Carolina     1
#> 35     Washington     1
#> 36         Kansas     1

Created on 2022-10-07 with reprex v2.0.2
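
For comparison, the same flatten-then-count idea can be sketched in base R with no extra packages. This is only a minimal sketch: it assumes a list of character vectors, and the toy `tags` list and the names `tag_freq`, `keyword`, and `count` are made up for illustration, not taken from the asker's data.

```r
# Hypothetical toy data: a list of character vectors, one element per row
tags <- list(
  c("Dog", "Cat", "Mouse", "Fish"),
  c("Cat", "Fish"),
  c("Cat", "Fish")
)

# Flatten the list into one character vector, then tabulate each value
counts <- table(unlist(tags))

# Convert to a two-column data frame and sort by descending count
tag_freq <- as.data.frame(counts, stringsAsFactors = FALSE)
names(tag_freq) <- c("keyword", "count")
tag_freq <- tag_freq[order(-tag_freq$count), ]

tag_freq
```

With a real data frame, `unlist(df$x)` would take the place of `unlist(tags)`.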

CodePudding user response:

It looks like you need unnest_longer():

library(dplyr)
library(tidyr)

df <- tibble(
  Tags = list(
    list('Dog', 'Cat', 'Mouse', 'Fish'),
    list('Cat', 'Fish'),
    list('Cat', 'Fish')
  )
)

df %>% 
  tidyr::unnest_longer(Tags) %>% 
  count(Tags) %>% 
  arrange(desc(n))
#> # A tibble: 4 × 2
#>   Tags      n
#>   <chr> <int>
#> 1 Cat       3
#> 2 Fish      3
#> 3 Dog       1
#> 4 Mouse     1
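
The EDIT in the question suggests the column may actually hold character strings such as "[['a', 'b']]" and "[0]" rather than real R lists. If so, the strings need to be parsed before unnesting. The following is a rough sketch under that assumption: the sample `df`, the column name `tags`, and the result name `tag_freq` are hypothetical, and the regular expression assumes every tag is wrapped in single quotes and contains no apostrophe of its own.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample mimicking the strings shown in the question's EDIT
df <- tibble(
  tags = c(
    "[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)']]",
    "[0]",
    "[['CBC Television (TV Network)', 'the fifth estate']]"
  )
)

# Pull out every single-quoted chunk; "[0]" has none and yields character(0)
quoted <- regmatches(df$tags, gregexpr("'[^']*'", df$tags))

tag_freq <- df %>%
  # Strip the surrounding quotes to get a list column of character vectors
  mutate(tags = lapply(quoted, function(x) gsub("^'|'$", "", x))) %>%
  # unnest() drops rows whose list element is empty (keep_empty = FALSE)
  unnest(tags) %>%
  # One row per tag, with the most frequent tags first
  count(tags, name = "count", sort = TRUE)

tag_freq
```

Because the "[0]" rows contain no quoted text, they become empty vectors and drop out when unnested, so they do not need to be converted to null first. Tags that contain an apostrophe would be split incorrectly by this pattern and would need a proper parser.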