I want to find the order of words in sentences. Here's a toy example and a failed command:
sentences = c('The cat and snake played with the dog',
'The dog and cat ran from the cow')
animals = c('cat','dog','cow')
str_locate_all(sentences, pattern = animals)
[[1]]
start end
[1,] 5 7
[[2]]
start end
[1,] 5 7
[[3]]
start end
What I really want is something like this:
sentence cat dog cow
1 1 2 NA
2 2 1 3
That is, the order in which each word appears in the sentence.
CodePudding user response:
Extract the elements into a list, create a tibble with the values and the sequence of values, and reshape to 'wide' with pivot_wider.
library(dplyr)
library(tidyr)
library(stringr)
library(purrr) # provides map_dfr()
str_extract_all(sentences, str_c(animals, collapse = "|")) %>%
setNames(seq_along(.)) %>%
map_dfr(~ tibble(cat = .x, ind = seq_along(cat)), .id = 'sentence') %>%
pivot_wider(names_from = cat, values_from = ind)
-output
# A tibble: 2 × 4
sentence cat dog cow
<chr> <int> <int> <int>
1 1 1 2 NA
2 2 2 1 3
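For comparison, the same extract-then-number idea can be sketched in base R without the reshape step. This is only a sketch under a few assumptions: the `\\b` word-boundary anchors and the names `pattern`, `found` and `word_order` are additions for illustration, and if an animal occurs more than once only its first occurrence is used.

```r
# Toy data from the question
sentences <- c("The cat and snake played with the dog",
               "The dog and cat ran from the cow")
animals <- c("cat", "dog", "cow")

# Whole-word alternation; the \\b anchors are an addition so that
# "cat" does not match inside a longer word such as "catalog"
pattern <- paste0("\\b(", paste(animals, collapse = "|"), ")\\b")

# Matched animals per sentence, in order of appearance
found <- regmatches(sentences, gregexpr(pattern, sentences, perl = TRUE))

# match(animals, hits) is the index of each animal among the hits
# (NA when absent), i.e. its order of appearance in that sentence
word_order <- t(vapply(found, function(hits) match(animals, hits),
                       integer(length(animals))))
colnames(word_order) <- animals

result <- data.frame(sentence = seq_along(sentences), word_order)
result
#   sentence cat dog cow
# 1        1   1   2  NA
# 2        2   2   1   3
```

Because `match()` looks each animal up in the ordered hits, the wide table comes out directly and no pivot is needed.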
CodePudding user response:
You can use a regexpr function in outer, which already gives
fun <- \(x, y) {w <- which(regexpr(x, y) > 0); ifelse(length(w) == 0, NA_integer_, w)}
(o <- t(outer(animals, strsplit(sentences, ' '), Vectorize(fun))) )
# [,1] [,2] [,3]
# [1,] 2 8 NA
# [2,] 4 2 8
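on which you may apply rank (rather than order, which returns the sorting permutation and only coincides with the ranks for this toy data) on margin 1 (i.e. rows) and cbind the sentence numbers.
## rank the non-NA word positions within each row (sentence)
o |> apply(1, \(x) {x[!is.na(x)] <- rank(x[!is.na(x)]); x}) |> t() |>
cbind.data.frame(seq_along(sentences)) |> setNames(c("cat", "dog", "cow", "sentence")) |>
subset(select=c(length(animals) + 1, seq_along(animals))) ## optional to get column order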
# sentence cat dog cow
# 1 1 1 2 NA
# 2 2 2 1 3
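One thing worth checking in the row-wise step is that order() and rank() are not interchangeable in general. A small sketch on made-up word positions (the vector `pos` and the helper name `rank_present` are hypothetical, not part of the answer):

```r
# Hypothetical positions in one sentence: cat is word 8, dog word 2, cow word 4
pos <- c(cat = 8L, dog = 2L, cow = 4L)

order(pos)  # 2 3 1 -> indices that would sort pos (dog, cow, cat)
rank(pos)   # cat 3, dog 1, cow 2 -> place of each animal in the sentence

# NA-safe helper: rank only the animals that actually occur
rank_present <- function(x) {
  x[!is.na(x)] <- rank(x[!is.na(x)])
  x
}

rank_present(c(cat = 4L, dog = NA, cow = 2L))  # cat 2, dog NA, cow 1
```

For the toy sentences both functions happen to return the same numbers, which is why the difference is easy to miss.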
Data:
sentences <- c("The cat and snake played with the dog", "The dog and cat ran from the cow")
animals <- c("cat", "dog", "cow")