I'm trying to count the number of occurrences of multiple character IDs that occur in multiple rows of one column.
I have a vector of independent IDs
x2<-sample(101:255,200,replace=TRUE)
And in a separate data frame I have column which consists of multiple character strings with some corresponding IDs, please forgive inability to create a full sample data set but the corresponding column in a data frame looks like this
sample <- c("105 234 213", "147 211 174", "223 250 243", "189 190 153", "137 200 199")
as.data.frame(sample)
The finished result would be a new data frame with an output like this
ID Occurrences
101 3
102 15
103 2
104 26
105 11
... etc
I hope this makes sense and any help would be great
CodePudding user response:
In base R you can do:
df <- as.data.frame(sample)
setNames(as.data.frame(
table(factor(unlist(strsplit(df$sample, " ")), unique(x2)))),
c("ID", "Occurences"))
#> ID Occurences
#> 1 227 0
#> 2 213 1
#> 3 220 0
#> 4 143 0
#> 5 139 0
#> 6 216 0
#> 7 183 0
#> 8 107 0
#> 9 175 0
#> 10 148 0
#> 11 198 0
#> 12 131 0
#> 13 200 1
#> 14 110 0
#> 15 118 0
#> 16 126 0
#> 17 132 0
#> 18 162 0
#> 19 125 0
#> 20 195 0
#> 21 103 0
#> 22 237 0
#> 23 185 0
#> 24 142 0
#> 25 134 0
#> 26 235 0
#> 27 146 0
#> 28 225 0
#> 29 189 1
#> 30 208 0
#> 31 182 0
#> 32 124 0
#> 33 221 0
#> 34 101 0
#> 35 145 0
#> 36 127 0
#> 37 157 0
#> 38 212 0
#> 39 186 0
#> 40 119 0
#> 41 170 0
#> 42 116 0
#> 43 214 0
#> 44 247 0
#> 45 169 0
#> 46 205 0
#> 47 128 0
#> 48 230 0
#> 49 138 0
#> 50 209 0
#> 51 181 0
#> 52 191 0
#> 53 203 0
#> 54 254 0
#> 55 105 1
#> 56 133 0
#> 57 194 0
#> 58 238 0
#> 59 193 0
#> 60 201 0
#> 61 206 0
#> 62 117 0
#> 63 102 0
#> 64 172 0
#> 65 240 0
#> 66 155 0
#> 67 123 0
#> 68 197 0
#> 69 166 0
#> 70 141 0
#> 71 173 0
#> 72 222 0
#> 73 149 0
#> 74 144 0
#> 75 154 0
#> 76 202 0
#> 77 178 0
#> 78 210 0
#> 79 196 0
#> 80 187 0
#> 81 174 1
#> 82 188 0
#> 83 190 1
#> 84 130 0
#> 85 135 0
#> 86 140 0
#> 87 171 0
#> 88 217 0
#> 89 228 0
#> 90 249 0
#> 91 252 0
#> 92 165 0
#> 93 167 0
#> 94 180 0
#> 95 152 0
#> 96 156 0
#> 97 184 0
#> 98 243 1
#> 99 244 0
#> 100 242 0
#> 101 137 1
#> 102 104 0
#> 103 215 0
#> 104 255 0
#> 105 233 0
#> 106 246 0
#> 107 204 0
#> 108 112 0
#> 109 108 0
#> 110 218 0
#> 111 106 0
#> 112 163 0
#> 113 224 0
#> 114 239 0
#> 115 122 0
#> 116 251 0
#> 117 160 0
#> 118 231 0
Or, in pipe notation to make the logic easier to follow:
df$sample |>
strsplit(" ") |>
unlist() |>
factor(levels = sort(unique(x2))) |>
table() |>
as.data.frame() |>
setNames(c("ID", "Occurences"))
Created on 2022-03-21 by the reprex package (v2.0.1)
CodePudding user response:
Something like this?
library(dplyr)
library(tidyr)
as.data.frame(sample) %>%
separate_rows(sample) %>%
bind_rows(as_tibble(x2)) %>%
type.convert(as.is = TRUE) %>%
mutate(ID = coalesce(sample,value)) %>%
count(ID, name = "Occurences")
ID Occurences
<int> <int>
1 101 1
2 102 2
3 103 3
4 104 1
5 105 2
6 106 3
7 107 1
8 109 1
9 110 1
10 111 1
# ... with 111 more rows
>