Home > Software engineering >  How to count values separated by commas in a dataframe
How to count values separated by commas in a dataframe

Time:11-19

I have a dataframe which was originally a survey where people could answer their dog breeds and their favorit dog breed. Obviously each person can have multiple answers which will be displayed in the same categorie but separated with a comma. However I can't figure out how to count the number of times each breed was mensionned for their respective category.

here is the code:

dogs_owned <-c("labrador, golden", "golden","pitbull, chihuahua")
dogs_fav <- c("beagle", "labrador, shepherd", "chihuahua, pitbull")
test <- data.frame(dogs_owned,dogs_fav)

list <- c("labrador", "golden","pitbull","chihuahua","beagle","shepherd")
list_test <- data.frame(list)
list_test$count_own <- 0
list_test$count_fav <- 0

The goal is to count how many times the name of each breed of dog appear in either dogs_owned and dogs_fav in their respective list count

CodePudding user response:

You can use strsplit to split strings and then count them using table

> (owned <- as.data.frame(table(trimws(unlist(strsplit(test$dogs_owned, split=","))))))
       Var1 Freq
1 chihuahua    1
2    golden    2
3  labrador    1
4   pitbull    1
> (fav <- as.data.frame(table(trimws(unlist(strsplit(test$dogs_fav, split=","))))))
       Var1 Freq
1    beagle    1
2 chihuahua    1
3  labrador    1
4   pitbull    1
5  shepherd    1

If you wish them in one data.frame, we can use full_join from dplyr package as alternative to merge (see r2evans' answer)

> library(dplyr)
 owned %>% 
   full_join(fav, by="Var1") %>% 
   rename(Owned = Freq.x,
          Fav = Freq.y)
       Var1 Owned Fav
1 chihuahua     1   1
2    golden     2  NA
3  labrador     1   1
4   pitbull     1   1
5    beagle    NA   1
6  shepherd    NA   1

CodePudding user response:

Try this:

out <- Reduce(function(a, b) merge(a, b, by="Var1", all=TRUE),
  lapply(test,
    function(z) as.data.frame.table(table(unlist(strsplit(z, "[ ,] "))), 
                                    stringsAsFactors=FALSE)))
names(out)[-1] <- names(test)
out
#        Var1 dogs_owned dogs_fav
# 1    beagle         NA        1
# 2 chihuahua          1        1
# 3    golden          2       NA
# 4  labrador          1        1
# 5   pitbull          1        1
# 6  shepherd         NA        1

Brief walk-through:

  • strsplit(z, "[ ,] ") splits a string by consecutive commas and spaces (this is a trick to remove the need for trimws, which also works well); this results in

    strsplit(test$dogs_fav, "[ ,] ")
    # [[1]]
    # [1] "beagle"
    # [[2]]
    # [1] "labrador" "shepherd"
    # [[3]]
    # [1] "chihuahua" "pitbull"  
    
  • table(unlist(.)) converts that to a vector and counts them

    table(unlist(strsplit(test$dogs_fav, "[ ,] ")))
    #    beagle chihuahua  labrador   pitbull  shepherd 
    #         1         1         1         1         1 
    
  • as.data.frame.table(.) converts that to a 2-column frame with names Var1 and Freq

  • merge(a, b, by="Var1", ...) combines the 2-column frames by the Var1 column (dog type), keeping each Freq column unique. Using all=TRUE ensures dogs found in one and not another are preserved

  • Reduce allows us to do this for a dynamic number of columns, so if we had more dogs_* columns, this same code works unchanged

  •  Tags:  
  • r
  • Related