Need the best way to count occurrences of unique values in a list of list of character strings acros

Time:03-04

I have a character vector in which each element is a comma-separated list of strings. For every string, I need a dictionary that counts how often each other string occurs together with it in the same element, across all elements of the vector.

For example:

> animal_rows
[1] "Elephant, Lion, Cat"  "Dog, Snake, Elephant"
[3] "Lion, Horse, Cow"     "Elephant, Dog, Lion"
[5] "Cat, Pig, Snake"      "Elephant, Lion, Cow"

After running the function, I need the count of every animal that occurs together with the key animal within a row, as in:

> hashdict['Elephant']
Lion: 3    # Lion occurs with Elephant in 3 rows, counted across all rows
Cat: 1
Dog: 2
Snake: 1
Horse: 0
Cow: 1
Pig: 0

I can do this with a hash and nested loops that go through each animal, counting and storing the values, but that doesn't seem optimal or right, and it takes a long time on tens of thousands of rows.

For example:

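# Loop-based version, written out in base R
# (a list of named count vectors stands in for hash())

unique_animal_list <- c("Elephant", "Lion", "Cat", "Dog", "Snake", "Horse", "Cow", "Pig")

# One vector of counters per key animal, all starting at 0
hashdict <- list()
for (k in unique_animal_list) {
  hashdict[[k]] <- setNames(numeric(length(unique_animal_list)), unique_animal_list)
}

for (k in unique_animal_list) {
  for (row in strsplit(animal_rows, ", ")) {
    if (k %in% row) {                 # k occurs in this row
      for (j in setdiff(row, k)) {    # every other animal in the same row
        hashdict[[k]][j] <- hashdict[[k]][j] + 1
      }
    }
  }
}

hashdict[["Elephant"]]   # needs as many count vectors as there are animals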

CodePudding user response:

I'd use two loops and put the counts into a matrix:

animal_rows <- c("Elephant, Lion, Cat",
                 "Dog, Snake, Elephant",
                 "Lion, Horse, Cow" ,
                 "Elephant, Dog, Lion", "Cat, Pig, Snake" ,
                 "Elephant, Lion, Cow")

animals <- sort(unique(unlist(strsplit(animal_rows, " *, *"))))
count <- array(0, dim = c(length(animals), length(animals)))
colnames(count) <- rownames(count) <- animals

for (a1 in animals)
    for (a2 in animals)
        count[a1, a2] <- sum(grepl(a1, animal_rows) &
                             grepl(a2, animal_rows))

count
##          Cat Cow Dog Elephant Horse Lion Pig Snake
## Cat        2   0   0        1     0    1   1     1
## Cow        0   2   0        1     1    2   0     0
## Dog        0   0   2        2     0    1   0     1
## Elephant   1   1   2        4     0    3   0     1
## Horse      0   1   0        0     1    1   0     0
## Lion       1   2   1        3     1    4   0     0
## Pig        1   0   0        0     0    0   1     1
## Snake      1   0   1        1     0    0   1     2

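Note that this does roughly twice the necessary work, because the matrix is symmetric: count[a1, a2] always equals count[a2, a1]. You can stop the inner loop once it reaches the main diagonal and copy each count into its mirrored cell. The diagonal holds the number of rows that contain each animal, and count["Elephant", ] gives the lookup asked for in the question.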
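
The symmetry shortcut can be sketched as follows. This is only one way to do it, under a few assumptions that are not in the answer above: the index variables i and j, the helper list present (one logical vector per animal, computed once so that grepl is not called again inside the loops), and fixed = TRUE to treat the names as literal strings rather than regular expressions. Matching is still by substring, so a name contained in another name (say "Cat" in "Caterpillar") would be over-counted.

```r
animal_rows <- c("Elephant, Lion, Cat", "Dog, Snake, Elephant",
                 "Lion, Horse, Cow", "Elephant, Dog, Lion",
                 "Cat, Pig, Snake", "Elephant, Lion, Cow")

animals <- sort(unique(unlist(strsplit(animal_rows, " *, *"))))
n <- length(animals)
count <- matrix(0, n, n, dimnames = list(animals, animals))

# Hypothetical helper: for each animal, which rows mention it (computed once)
present <- lapply(animals, function(a) grepl(a, animal_rows, fixed = TRUE))

for (i in seq_len(n)) {
  for (j in i:n) {                       # upper triangle, diagonal included
    count[i, j] <- count[j, i] <- sum(present[[i]] & present[[j]])
  }
}

count["Elephant", ]   # co-occurrence counts for one key animal
```

The inner loop now runs n * (n + 1) / 2 times instead of n * n, and each grepl call happens once per animal instead of once per pair.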

CodePudding user response:

You can split all the strings, form every pair (2-combination) of animals within each row, and then cross-tabulate the two columns of pairs with table:

res <- do.call(rbind, lapply(strsplit(animal_rows, ", "), function(x) t(combn(x, 2))))
table(res[,1], res[,2])
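
One thing to watch with this table: combn keeps the order in which the animals appear in a row, so a pair can be split between two cells. In the example data, Dog and Elephant land once in the ("Dog", "Elephant") cell and once in the ("Elephant", "Dog") cell. A minimal sketch of one way to merge them, assuming every row has at least two animals and no animal is repeated within a row: give both columns the same factor levels and add the table to its transpose. The names pairs, tab and co are only illustrative.

```r
animal_rows <- c("Elephant, Lion, Cat", "Dog, Snake, Elephant",
                 "Lion, Horse, Cow", "Elephant, Dog, Lion",
                 "Cat, Pig, Snake", "Elephant, Lion, Cow")

# Two-column matrix with one row per within-row pair of animals
pairs <- do.call(rbind, lapply(strsplit(animal_rows, ", "),
                               function(x) t(combn(x, 2))))

# Shared levels make the table square, with the same order on both axes
animals <- sort(unique(c(pairs)))
tab <- table(factor(pairs[, 1], levels = animals),
             factor(pairs[, 2], levels = animals))

# Add the mirrored cell so each pair is counted regardless of order
co <- tab + t(tab)

co["Elephant", ]   # co-occurrence counts for one key animal
```

Unlike the matrix from the loop-based answer, the diagonal here is zero, because an animal is never paired with itself.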

Using tidyverse you could do something like:

library(tidyverse)   # stringr, purrr, tibble, dplyr, tidyr

animal_rows %>%
  str_split(", ") %>%
  map(combn, m = 2) %>%
  reduce(cbind) %>%
  t() %>%
  .[order(.[,1]),] %>%
  "colnames<-"(c("animal1", "animal")) %>%
  as_tibble() %>%
  count(animal1, animal) %>%
  pivot_wider(names_from = animal1, values_from = n, values_fill = list(n = 0))

Drop the pivot_wider() step if you want the result in long format.
