I am trying to split my text data into tokens using the unnest_tokens function from the tidytext package. The problem is that some multi-word expressions recur in the text, and I would like to keep each of them as a single token instead of having them split into several tokens.
Normal outcome:
library(dplyr)
library(tidytext)

df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)

df %>%
  unnest_tokens(Word, Text)
Id Word
1 1 a
2 1 first
3 1 nice
4 1 text
5 2 a
6 2 second
7 2 nice
8 2 text
What I would like from the same call (expression = "nice text"):
Id Word
1 1 a
2 1 first
3 1 nice text
4 2 a
5 2 second
6 2 nice text
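(For reference, one common workaround, not used in the answers below, is to protect the expression with a placeholder before tokenizing. The underscore join here is illustrative and assumes the default tokenizer keeps nice_text together as one token.)
library(dplyr)
library(stringr)
library(tidytext)

df %>%
  # glue the expression together so the tokenizer treats it as one word
  mutate(Text = str_replace_all(Text, 'nice text', 'nice_text')) %>%
  unnest_tokens(Word, Text) %>%
  # restore the original spacing
  mutate(Word = str_replace_all(Word, '_', ' '))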
CodePudding user response:
A bit verbose, and there might be an option in unnest_tokens itself to exclude certain phrases, but this does the trick:
library(tidyverse)
library(tidytext)
df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
) %>%
  unnest_tokens('Word', Text)

df %>%
  group_by(Id) %>%
  # merge "text" into "nice text" wherever it directly follows "nice"
  summarize(Word = if_else(lag(Word) == 'nice' & Word == 'text', 'nice text', Word)) %>%
  mutate(temp_id = row_number()) %>%
  # drop the leftover "nice", i.e. the row right before each "nice text"
  filter(temp_id != temp_id[Word == 'nice text'] - 1) %>%
  ungroup() %>%
  select(-temp_id)
which gives:
# A tibble: 6 x 2
Id Word
<dbl> <chr>
1 1 a
2 1 first
3 1 nice text
4 2 a
5 2 second
6 2 nice text
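A side note for newer dplyr: since 1.1.0, summarise() returning more than one row per group is deprecated, and reframe() is the intended replacement. A minimal sketch of the same logic (which, like the original, assumes every Id contains the phrase):
library(dplyr)

df %>%
  # reframe() is the dplyr >= 1.1 replacement for multi-row summarise()
  reframe(Word = if_else(lag(Word) == 'nice' & Word == 'text', 'nice text', Word),
          .by = Id) %>%
  group_by(Id) %>%                # reframe() returns an ungrouped tibble
  mutate(temp_id = row_number()) %>%
  filter(temp_id != temp_id[Word == 'nice text'] - 1) %>%
  ungroup() %>%
  select(-temp_id)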
CodePudding user response:
Here's a concise solution based on lookarounds: a negative lookbehind (?<!...) and a negative lookahead (?!...) disallow separate_rows from separating Text on whitespace \\s when there's nice to the left of the \\s and text to its right (the \\b are word-boundary anchors, in case you have, say, "nice texts", which you do want to separate):
library(tidyr)

df %>%  # df as defined in the question
  separate_rows(Text, sep = "(?<!\\bnice\\b)\\s|\\s(?!\\btext\\b)")
# A tibble: 6 × 2
Id Text
<dbl> <chr>
1 1 A
2 1 first
3 1 nice text
4 2 A
5 2 second
6 2 nice text
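To see the \\b anchors at work, here's a hypothetical extra row with the plural "nice texts": \\btext\\b fails to match "texts", so that whitespace is split as usual.
library(tidyr)

# hypothetical row, just to check the word-boundary behaviour
data.frame(Id = 3, Text = 'A nice texts here') %>%
  separate_rows(Text, sep = "(?<!\\bnice\\b)\\s|\\s(?!\\btext\\b)")
#> Text splits into "A", "nice", "texts", "here" (four rows)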
A more advanced regex uses the backtracking control verbs (*SKIP)(*F):
df %>%
separate_rows(Text, sep = "(\\bnice text\\b)(*SKIP)(*F)|\\s")
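(*SKIP) and (*F) are PCRE backtracking control verbs, so whether they work depends on the regex engine behind the function you call. As a minimal, engine-explicit illustration, base R's perl = TRUE engine is PCRE:
# alternative 1 matches "nice text", then (*F) forces a failure and
# (*SKIP) moves the scan past it, so \s can never split inside the phrase
strsplit('A second nice text',
         '(\\bnice text\\b)(*SKIP)(*F)|\\s',
         perl = TRUE)
#> [[1]]
#> [1] "A"         "second"    "nice text"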
For more info: How do (*SKIP) or (*F) work on regex?