How can I count unique two-word phrases that are separated by commas within a cell in R?

Time:10-05

I have a dataframe of different locations (Location) along with the species of animals (Spp) found at each location. Each species is coded by its unique Genus species name, and a single cell can hold several such names separated by commas. I would like to know how often each unique Genus species name occurs in the dataframe.

Example Data

df1 <- data.frame(matrix(ncol = 2, nrow = 3))
x <- c("Location","Spp")
colnames(df1) <- x
df1$Location <- seq(1,3,1)
df1[1,2] <- c("Genus1 species1")
df1[2,2] <- c("Genus1 species1, Genus1 species2")
df1[3,2] <- c("Genus1 species1, Genus1 species2, Genus2 species1")

Output should look something like this

            Spp Freq
Genus1 species1    3
Genus1 species2    2
Genus2 species1    1

I have tried to solve this with the tm and corpus packages, but I can only get them to count the unique individual words rather than the unique Genus species phrases.

library(tm)
library(corpus)
library(dplyr)

text <- df1[,2]
docs <- Corpus(VectorSource(text))
docs <- docs %>%
  tm_map(removePunctuation)
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing = TRUE)
words  # only counts the individual Genus and species words; I need the Genus and species kept together as one phrase

User response:

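This is a quick base R solution: split each cell on the comma separator, flatten the result into one vector, and tabulate it: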

df1 <- data.frame(matrix(ncol = 2, nrow = 3))
x <- c("Location","Spp")
colnames(df1) <- x
df1$Location <- seq(1,3,1)
df1[1,2] <- c("Genus1 species1")
df1[2,2] <- c("Genus1 species1, Genus1 species2")
df1[3,2] <- c("Genus1 species1, Genus1 species2, Genus2 species1")

table(unlist(strsplit(df1$Spp,', ')))
#> 
#> Genus1 species1 Genus1 species2 Genus2 species1 
#>               3               2               1

Created on 2021-10-04 by the reprex package (v2.0.1)
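
If the result is needed as a data frame with Spp and Freq columns, as in the desired output, the table can be converted. The following is only a sketch under a few assumptions: the names phrases and freq are made up for illustration, and the separator pattern ",\\s*" plus trimws() are additions meant to tolerate inconsistent spacing around the commas, which the example data does not actually contain.

```r
# Example data from the question, built in one step
df1 <- data.frame(
  Location = 1:3,
  Spp = c("Genus1 species1",
          "Genus1 species1, Genus1 species2",
          "Genus1 species1, Genus1 species2, Genus2 species1"),
  stringsAsFactors = FALSE
)

# Split on a comma followed by optional whitespace (assumed separator),
# flatten to a single character vector, and trim any stray spaces
phrases <- trimws(unlist(strsplit(df1$Spp, ",\\s*")))

# Tabulate; naming the argument makes "Spp" the column name,
# and as.data.frame() adds the counts in a column called "Freq"
freq <- as.data.frame(table(Spp = phrases), stringsAsFactors = FALSE)

# Sort from most to least frequent and reset the row names
freq <- freq[order(-freq$Freq), ]
rownames(freq) <- NULL
freq
```

Sorting by descending Freq is a choice made here to match the order shown in the question; table() on its own returns the names in alphabetical order.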

User response:

We can use separate_rows() from tidyr to split each cell into one row per species, then count() from dplyr to tally the rows:

library(dplyr)
library(tidyr)
df1 %>% 
   separate_rows(Spp, sep = ",\\s+") %>%
   count(Spp, name = 'Freq')
# A tibble: 3 × 2
  Spp              Freq
  <chr>           <int>
1 Genus1 species1     3
2 Genus1 species2     2
3 Genus2 species1     1