Home > Software design >  Counting multiple characters in multiple character strings in a column r
Counting multiple characters in multiple character strings in a column r

Time:03-22

I'm trying to count the number of occurrences of multiple character IDs that occur in multiple rows of one column.

I have a vector of independent IDs

  x2<-sample(101:255,200,replace=TRUE)

And in a separate data frame I have column which consists of multiple character strings with some corresponding IDs, please forgive inability to create a full sample data set but the corresponding column in a data frame looks like this

 sample <- c("105 234 213", "147 211 174", "223 250 243", "189 190 153", "137 200 199")

 as.data.frame(sample)

The finished result would be a new data frame with an output like this

 ID    Occurrences 
 101      3
 102      15
 103      2
 104      26
 105      11
  ... etc

I hope this makes sense and any help would be great

CodePudding user response:

In base R you can do:

df <- as.data.frame(sample)

setNames(as.data.frame(
  table(factor(unlist(strsplit(df$sample, " ")), unique(x2)))), 
  c("ID", "Occurences"))
#>      ID Occurences
#> 1   227          0
#> 2   213          1
#> 3   220          0
#> 4   143          0
#> 5   139          0
#> 6   216          0
#> 7   183          0
#> 8   107          0
#> 9   175          0
#> 10  148          0
#> 11  198          0
#> 12  131          0
#> 13  200          1
#> 14  110          0
#> 15  118          0
#> 16  126          0
#> 17  132          0
#> 18  162          0
#> 19  125          0
#> 20  195          0
#> 21  103          0
#> 22  237          0
#> 23  185          0
#> 24  142          0
#> 25  134          0
#> 26  235          0
#> 27  146          0
#> 28  225          0
#> 29  189          1
#> 30  208          0
#> 31  182          0
#> 32  124          0
#> 33  221          0
#> 34  101          0
#> 35  145          0
#> 36  127          0
#> 37  157          0
#> 38  212          0
#> 39  186          0
#> 40  119          0
#> 41  170          0
#> 42  116          0
#> 43  214          0
#> 44  247          0
#> 45  169          0
#> 46  205          0
#> 47  128          0
#> 48  230          0
#> 49  138          0
#> 50  209          0
#> 51  181          0
#> 52  191          0
#> 53  203          0
#> 54  254          0
#> 55  105          1
#> 56  133          0
#> 57  194          0
#> 58  238          0
#> 59  193          0
#> 60  201          0
#> 61  206          0
#> 62  117          0
#> 63  102          0
#> 64  172          0
#> 65  240          0
#> 66  155          0
#> 67  123          0
#> 68  197          0
#> 69  166          0
#> 70  141          0
#> 71  173          0
#> 72  222          0
#> 73  149          0
#> 74  144          0
#> 75  154          0
#> 76  202          0
#> 77  178          0
#> 78  210          0
#> 79  196          0
#> 80  187          0
#> 81  174          1
#> 82  188          0
#> 83  190          1
#> 84  130          0
#> 85  135          0
#> 86  140          0
#> 87  171          0
#> 88  217          0
#> 89  228          0
#> 90  249          0
#> 91  252          0
#> 92  165          0
#> 93  167          0
#> 94  180          0
#> 95  152          0
#> 96  156          0
#> 97  184          0
#> 98  243          1
#> 99  244          0
#> 100 242          0
#> 101 137          1
#> 102 104          0
#> 103 215          0
#> 104 255          0
#> 105 233          0
#> 106 246          0
#> 107 204          0
#> 108 112          0
#> 109 108          0
#> 110 218          0
#> 111 106          0
#> 112 163          0
#> 113 224          0
#> 114 239          0
#> 115 122          0
#> 116 251          0
#> 117 160          0
#> 118 231          0

Or, in pipe notation to make the logic easier to follow:

df$sample                           |>
  strsplit(" ")                     |>
  unlist()                          |>
  factor(levels = sort(unique(x2))) |>
  table()                           |>
  as.data.frame()                   |>
  setNames(c("ID", "Occurences"))

Created on 2022-03-21 by the reprex package (v2.0.1)

CodePudding user response:

Something like this?

library(dplyr)
library(tidyr)

as.data.frame(sample) %>% 
  separate_rows(sample) %>% 
  bind_rows(as_tibble(x2)) %>%
  type.convert(as.is = TRUE) %>% 
  mutate(ID = coalesce(sample,value)) %>% 
  count(ID, name = "Occurences")
ID Occurences
   <int>      <int>
 1   101          1
 2   102          2
 3   103          3
 4   104          1
 5   105          2
 6   106          3
 7   107          1
 8   109          1
 9   110          1
10   111          1
# ... with 111 more rows
> 
  • Related