R: count multiple occurrences of string within cells of a dataframe

Time:12-30

I have a data frame speech_N_rows that looks like this:

#    channel start.time stop.time    vp id              overlaps
# 1:       A      0.000     9.719     N  1             A:EE, C:N
# 2:       A      9.719    11.735     N  2     A:N, D:other, A:N
# 3:       C      0.264     2.032     N  3                   A:N
# 4:       B     26.514    28.264     N  4            B:CH1, D:N
# 5:       D     82.316    82.702     N  5          C:CH2, B:CH2
# 6:       D     10.354    11.666     N  6         A:EE, A:other
# 7:       C     80.251    82.719   CH2  7         D:self, B:CH2
# 8:       B     27.564    30.819   CH1  8            B:CH1, D:N
# 9:       D     25.621    27.693     N  9          B:CH1, B:CH1
#10:       A     10.354    11.666 other 10         A:EE, D:other
#11:       B     80.251    82.719   CH2 11         D:self, C:CH2
#12:       B     61.564    64.819   CH1 12                   A:N
#13:       A     60.621    62.693     N 13                 B:CH1

In the overlaps column, each cell holds one or more strings, with multiple strings separated by commas.

I'm trying to get counts of specific strings, in this case "A:N". But I haven't figured out how to do that yet.

I can get the number of rows in which "A:N" occurs by making a vector of the 'overlaps' column and taking the length of the grep() result:

testdata <- c(speech_N_rows$overlaps)
length(grep("A:N", testdata))
# [1] 3

However, there are 4 total instances of "A:N", not 3. I can't figure out how to count every occurrence in the column, including repeated occurrences within a single cell (as in row 2 of the 'overlaps' column).

Suggestions would be most appreciated.

CodePudding user response:

To count all the instances of A:N, you could use str_count() from the stringr package in combination with sum():

sum(stringr::str_count(df$overlaps, "A:N"))
# [1] 4

stringr::str_count() counts how many times the given pattern matches in each element:

stringr::str_count(df$overlaps, "A:N")
# [1] 0 2 1 0 0 0 0 0 0 0 0 1 0

sum() then adds these per-element counts up to produce the overall number of instances.
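Note that str_count() treats "A:N" as a regular expression and counts it wherever it appears, so it would also match inside a longer entry such as a hypothetical "A:NN". If only whole comma-separated entries should count, or if stringr is not available, a base R sketch along the following lines may work. The names overlaps, tokens, per_row and total are illustrative only, and the vector is simply the overlaps column of the example data typed out by hand.

```r
# Overlaps column from the example data, written out as a plain character vector
overlaps <- c("A:EE,C:N", "A:N,D:other,A:N", "A:N", "B:CH1,D:N",
              "C:CH2,B:CH2", "A:EE,A:other", "D:self,B:CH2", "B:CH1,D:N",
              "B:CH1,B:CH1", "A:EE,D:other", "D:self,C:CH2", "A:N", "B:CH1")

# Split every cell on commas and strip any whitespace around each entry
tokens <- lapply(strsplit(overlaps, ",", fixed = TRUE), trimws)

# Count exact matches per cell, then add them up for the overall total
per_row <- vapply(tokens, function(tok) sum(tok == "A:N"), integer(1))
total <- sum(per_row)

per_row
total
```

Because each entry is compared with == rather than matched as a pattern, no regex escaping is needed, and the trimws() call means the same code should also handle cells written with ", " as the separator.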

Data

df <- read.table( text = "channel start.time stop.time    vp id              overlaps
A      0.000     9.719     N  1             A:EE,C:N
A      9.719    11.735     N  2     A:N,D:other,A:N
C      0.264     2.032     N  3                   A:N
B     26.514    28.264     N  4            B:CH1,D:N
D     82.316    82.702     N  5          C:CH2,B:CH2
D     10.354    11.666     N  6         A:EE,A:other
C     80.251    82.719   CH2  7         D:self,B:CH2
B     27.564    30.819   CH1  8            B:CH1,D:N
D     25.621    27.693     N  9          B:CH1,B:CH1
A     10.354    11.666 other 10         A:EE,D:other
B     80.251    82.719   CH2 11         D:self,C:CH2
B     61.564    64.819   CH1 12                   A:N
A     60.621    62.693     N 13                 B:CH1", header = TRUE)