Ok, so here is my scenario: I have a dataset with a column composed of lists of words (keyword tags for YT videos, where each row is video data).
What I want to do is do a complete count of all unique object instances within these lists, for the entire column. So basically what I want in the end is a table with two fields: keyword, count.
If I just do a simple dplyr query, then it counts the list itself as a unique object. While this is also interesting, this is not what I want.
So this is the above dplyr query that I want to utilize further, but not sure how to nest unique instances within the unique lists:
vid_tag_freq = df %>%
count(tags)
To further clarify:
With a dataset like:
Tags
1 ['Dog', 'Cat', 'Mouse', 'Fish']
2 ['Cat', 'Fish']
3 ['Cat', 'Fish']
I am now getting:
Tags Count
1 ['Dog', 'Cat', 'Mouse', 'Fish'] 1
2 ['Cat', 'Fish'] 2
What I actually want:
Tags Count
1 'Cat' 3
2 'Fish' 3
3 'Dog' 1
4 'Mouse' 1
I hope that explains it lol
EDIT: This is what my data looks like, guess most are lists of lists? Maybe I should clean up [0]s as null?
[1] "[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]"
[2] "[0]"
[3] "[0]"
[4] "[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]"
[5] "[0]"
[6] "[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]"
[7] "[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]"
[8] "[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]"
[9] "[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]"
[10] "[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]"
[11] "[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]"
[12] "[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]"
[13] "[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]"
[14] "[['Blue Jays', 'Ashes', 'friends']]"
[15] "[['Robin Williams (Celebrity)', 'Peter Gzowski']]"
CodePudding user response:
It would help if you could dput()
some of the data for a working example. Going off the idea that you have a list
column, here are a couple of general solutions you may be able to work with:
df <- tibble::tibble(
x = replicate(10, sample(state.name, sample(5:10, 1), TRUE), simplify = FALSE)
)
df
#> # A tibble: 10 × 1
#> x
#> <list>
#> 1 <chr [7]>
#> 2 <chr [7]>
#> 3 <chr [8]>
#> 4 <chr [6]>
#> 5 <chr [8]>
#> 6 <chr [8]>
#> 7 <chr [8]>
#> 8 <chr [6]>
#> 9 <chr [5]>
#> 10 <chr [10]>
# dplyr in a dataframe
df |>
tidyr::unnest(x) |>
dplyr::count(x)
#> # A tibble: 36 × 2
#> x n
#> <chr> <int>
#> 1 Alabama 1
#> 2 Alaska 1
#> 3 Arkansas 4
#> 4 California 3
#> 5 Colorado 5
#> 6 Connecticut 1
#> 7 Delaware 3
#> 8 Florida 1
#> 9 Georgia 3
#> 10 Hawaii 2
#> # … with 26 more rows
# vctrs
vctrs::vec_count(unlist(df$x))
#> key count
#> 1 Colorado 5
#> 2 Louisiana 5
#> 3 North Dakota 4
#> 4 Mississippi 4
#> 5 Arkansas 4
#> 6 Delaware 3
#> 7 Vermont 3
#> 8 Minnesota 3
#> 9 Utah 3
#> 10 California 3
#> 11 Georgia 3
#> 12 Indiana 2
#> 13 Missouri 2
#> 14 New Hampshire 2
#> 15 Maryland 2
#> 16 Nebraska 2
#> 17 Hawaii 2
#> 18 New Jersey 2
#> 19 Oklahoma 2
#> 20 Massachusetts 1
#> 21 Illinois 1
#> 22 Texas 1
#> 23 Connecticut 1
#> 24 Rhode Island 1
#> 25 Michigan 1
#> 26 New York 1
#> 27 Ohio 1
#> 28 Nevada 1
#> 29 Florida 1
#> 30 Montana 1
#> 31 Wisconsin 1
#> 32 Alabama 1
#> 33 Alaska 1
#> 34 North Carolina 1
#> 35 Washington 1
#> 36 Kansas 1
Created on 2022-10-07 with reprex v2.0.2
CodePudding user response:
It looks like you need unnest_longer()
:
library(dplyr)
library(tidyr)
df <- tibble(
Tags = list(
list('Dog', 'Cat', 'Mouse', 'Fish'),
list('Cat', 'Fish'),
list('Cat', 'Fish')
)
)
df %>%
tidyr::unnest_longer(Tags) %>%
count(Tags) %>%
arrange(desc(n))
#> # A tibble: 4 × 2
#> Tags n
#> <chr> <int>
#> 1 Cat 3
#> 2 Fish 3
#> 3 Dog 1
#> 4 Mouse 1