I have a dataframe which was originally a survey where people could answer their dog breeds and their favorit dog breed. Obviously each person can have multiple answers which will be displayed in the same categorie but separated with a comma. However I can't figure out how to count the number of times each breed was mensionned for their respective category.
here is the code:
dogs_owned <-c("labrador, golden", "golden","pitbull, chihuahua")
dogs_fav <- c("beagle", "labrador, shepherd", "chihuahua, pitbull")
test <- data.frame(dogs_owned,dogs_fav)
list <- c("labrador", "golden","pitbull","chihuahua","beagle","shepherd")
list_test <- data.frame(list)
list_test$count_own <- 0
list_test$count_fav <- 0
The goal is to count how many times the name of each breed of dog appear in either dogs_owned and dogs_fav in their respective list count
CodePudding user response:
You can use strsplit
to split strings and then count them using table
> (owned <- as.data.frame(table(trimws(unlist(strsplit(test$dogs_owned, split=","))))))
Var1 Freq
1 chihuahua 1
2 golden 2
3 labrador 1
4 pitbull 1
> (fav <- as.data.frame(table(trimws(unlist(strsplit(test$dogs_fav, split=","))))))
Var1 Freq
1 beagle 1
2 chihuahua 1
3 labrador 1
4 pitbull 1
5 shepherd 1
If you wish them in one data.frame, we can use full_join
from dplyr package as alternative to merge
(see r2evans' answer)
> library(dplyr)
owned %>%
full_join(fav, by="Var1") %>%
rename(Owned = Freq.x,
Fav = Freq.y)
Var1 Owned Fav
1 chihuahua 1 1
2 golden 2 NA
3 labrador 1 1
4 pitbull 1 1
5 beagle NA 1
6 shepherd NA 1
CodePudding user response:
Try this:
out <- Reduce(function(a, b) merge(a, b, by="Var1", all=TRUE),
lapply(test,
function(z) as.data.frame.table(table(unlist(strsplit(z, "[ ,] "))),
stringsAsFactors=FALSE)))
names(out)[-1] <- names(test)
out
# Var1 dogs_owned dogs_fav
# 1 beagle NA 1
# 2 chihuahua 1 1
# 3 golden 2 NA
# 4 labrador 1 1
# 5 pitbull 1 1
# 6 shepherd NA 1
Brief walk-through:
strsplit(z, "[ ,] ")
splits a string by consecutive commas and spaces (this is a trick to remove the need fortrimws
, which also works well); this results instrsplit(test$dogs_fav, "[ ,] ") # [[1]] # [1] "beagle" # [[2]] # [1] "labrador" "shepherd" # [[3]] # [1] "chihuahua" "pitbull"
table(unlist(.))
converts that to a vector and counts themtable(unlist(strsplit(test$dogs_fav, "[ ,] "))) # beagle chihuahua labrador pitbull shepherd # 1 1 1 1 1
as.data.frame.table(.)
converts that to a 2-column frame with namesVar1
andFreq
merge(a, b, by="Var1", ...)
combines the 2-column frames by theVar1
column (dog type), keeping eachFreq
column unique. Usingall=TRUE
ensures dogs found in one and not another are preservedReduce
allows us to do this for a dynamic number of columns, so if we had moredogs_*
columns, this same code works unchanged