How to summarise values in a column with non-exact match in R?-CodePudding

I have a data.table with over ten thousand of rows. I want to count in one column how many times a variable appears, but I want to use non-exact match. The data looks like this:

dt1 <- data.table (place = c("a north", "a south", "b south", "a north", "c west", "b north", "c south", "a west", "b west"))

     place
1: a north
2: a south
3: b south
4: a north
5: c west
6: b north
7: c south
8: a west
9  b west

I just want to count how many times "a", "b" and "c" appears independent from the words that follows. I would like the result to look like this:

   a b c
1: 4 3 2

I tried summarise, charmath and pmatch, but they didn't work. Could anyone help?

CodePudding user response：

All depends on how varied the location could be and other scenario shapes.

You could separate the column into 2 and then group and count

dt1
separate(dt1, place, into = c('letter', 'direction')) %>%
  group_by(letter) %>%
  count() %>%
  pivot_wider(names_from = letter, values_from = n)

CodePudding user response：

You can create a new column that contain only the string that you want with mutate() and substr() and then count the number of occurrences with count() like this.

library("data.table")
library("dplyr")

dt1 <- data.table(place = c("a north", "a south", "b south", "a north", "c west", "b north", "c south", "a west", "b west"))

dt1 |>
  mutate(first_letter = substr(place,1,1)) |>
  count(first_letter)

Output:

   first_letter n
1:            a 4
2:            b 3
3:            c 2

If you want a different match than that, you may need to use regex and case_when in your mutate.

CodePudding user response：

You can try a full data.table solution:

 dt1[,'.'(var = sub(" .*", "",place))
   ][,'.'(cnt = .N), by = var
   ][,data.table::transpose(.SD, make.names= 'var')]

   a b c
1: 4 3 2

CodePudding user response：

One simple full data.table solution:

library(data.table)

dt1[,lapply(.SD, substr,1,1)][,.N, by = place]
#>    place N
#> 1:     a 4
#> 2:     b 3
#> 3:     c 2

^{Created on 2021-10-11 by the reprex package (v2.0.1)}