Ordering of words in R

I want to find the order in which certain words appear in each sentence. Here's a toy example and a failed attempt:

sentences = c('The cat and snake played with the dog',
              'The dog and cat ran from the cow')
animals = c('cat','dog','cow')
str_locate_all(sentences, pattern = animals)
[[1]]
     start end
[1,]     5   7

[[2]]
     start end
[1,]     5   7

[[3]]
     start end

What I really want is something like this:

sentence cat dog cow
1          1   2  NA
2          2   1   3

That is, the order in which each word appears within its sentence.

CodePudding user response:

Extract the matching words into a list, create a tibble with each word and its position in the sequence of matches, and reshape to 'wide' with pivot_wider.

library(dplyr)
library(tidyr)
library(stringr)
library(purrr)  # provides map_dfr
str_extract_all(sentences, str_c(animals, collapse = "|")) %>% 
  setNames(seq_along(.)) %>%
  map_dfr(~ tibble(cat = .x, ind = seq_along(cat)), .id = 'sentence') %>% 
  pivot_wider(names_from = cat, values_from = ind)

-output

# A tibble: 2 × 4
  sentence   cat   dog   cow
  <chr>    <int> <int> <int>
1 1            1     2    NA
2 2            2     1     3
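
The same idea (collect the matches in order of appearance, then look up where each animal falls in that sequence) can also be sketched in base R. This is only an illustrative alternative, not part of the answer above; the names pattern, found and res are made up for the example, and it assumes that only the first occurrence of an animal matters when one appears more than once.

```r
sentences <- c("The cat and snake played with the dog",
               "The dog and cat ran from the cow")
animals <- c("cat", "dog", "cow")

# One alternation pattern: "cat|dog|cow"
pattern <- paste(animals, collapse = "|")

# For each sentence, the matched animals in order of appearance
found <- regmatches(sentences, gregexpr(pattern, sentences))

# match(animals, x) gives the place of each animal in that sequence
# (NA if the animal is absent, first occurrence if it is repeated)
res <- t(sapply(found, function(x) match(animals, x)))
colnames(res) <- animals
res <- data.frame(sentence = seq_along(sentences), res)
res
```

Because match always returns one value per animal, every row has the same length even when a sentence contains none of the words.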

CodePudding user response:

You can use a regexpr function in outer, which already gives the word position of each animal (columns) in each sentence (rows):

fun <- \(x, y) {w <- which(regexpr(x, y) > 0); ifelse(length(w) == 0, NA_integer_, w)}
(o <- t(outer(animals, strsplit(sentences, ' '), Vectorize(fun))) )
#      [,1] [,2] [,3]
# [1,]    2    8   NA
# [2,]    4    2    8

to which you can apply rank over margin 1 (i.e. rows), skipping NAs, and then cbind the sentence numbers.

o |> apply(1, \(x) {x[!is.na(x)] <- rank(x[!is.na(x)]); x}) |> t() |>
  cbind.data.frame(seq_along(sentences)) |> setNames(c("cat", "dog", "cow", "sentence")) |>
  subset(select=c(length(animals) + 1, seq_along(animals)))  ## optional to get column order
#   sentence cat dog cow
# 1        1   1   2  NA
# 2        2   2   1   3
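
A note on the ranking step: the place of each word among the matched ones is what rank() returns, whereas order() returns the permutation that would sort the vector. The two happen to coincide on the toy sentences, but they differ in general. A minimal sketch with made-up positions (the vector pos is hypothetical, not taken from the question):

```r
# Hypothetical word positions of cat, dog and cow in one sentence
pos <- c(cat = 3, dog = 1, cow = 2)

# Indices that would sort pos: 2 3 1
by_order <- order(pos)

# Place of each element among the others: cat 3, dog 1, cow 2
by_rank <- rank(pos)

by_order
by_rank
```

Here dog comes first, cow second and cat third, which is what by_rank reports; by_order instead says which element to pick first, second and third.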

Data:

sentences <- c("The cat and snake played with the dog", "The dog and cat ran from the cow")

animals <- c("cat", "dog", "cow")