I'm working with a modified version of the babynames dataset, which can be gotten by installing the babynames packages and calling:
# to install the package
install.packages('babynames')
# to load the package
library(babynames)
# to get the only one dataframe of interest from the package
babynames <- babynames::babynames
# the modified data that I'm working with
babynames_prop_sexes <- babynames %>%
select(-prop, -year) %>%
group_by(name, sex) %>%
mutate(total_occurence = sum(n))
I need to sort out names that have more than 10000 occurrences for both sexes. How can I approach this? (Preferably by using dplyr but any method is welcomed.)
Thanks in advance for any help!
CodePudding user response:
There might be a more elegant solution. But this should get you a list of names that appear with > 10000 entries as both an M and an F.
For the method, I just kept going with dplyr
verbs. After using filter
to get rid of the entries that appear < 10000 times, I could then group_by
the name and use tally()
, knowing that n = 2 when that entry appeared twice, once for M and once for F.
large_total_both_genders_same_name <- babynames %>%
group_by(name, sex) %>%
summarize(total = sum(n)) %>%
filter(total > 10000) %>%
arrange(name) %>%
group_by(name) %>%
tally() %>%
arrange(desc(n)) %>%
filter(n == 2) %>%
dplyr::select(name)
And if you want to filter your original file by that shortlist of names you can use a semi_join on the table we created, to shorten up the list. In this case, it wouldn't be obvious what you are looking at unless you also included the year column, which you removed.
original_babynames_shortened <- babynames_prop_sexes %>%
filter(name %in% large_total_both_genders_same_name$name)
But anyway, this is a common process. Create a summary table of some kind that is saved as its own 'intermediary' table, so to speak, then join that to your original, as a filter. Sometimes this can all be done in one go, but it's often easier, in my opinion to break this into two pieces.