Tidytext - set expressions as a single token

I am trying to split my text data into tokens using the unnest_tokens function from the tidytext package. Some multi-word expressions appear repeatedly, and I would like each of them kept as a single token rather than split into separate words.

Normal outcome:

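library(dplyr)
library(tidytext)

df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)

# Output column first, then the input column (names are case-sensitive)
df %>%
  unnest_tokens(Word, Text)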

  Id   Word
1  1      a
2  1  first
3  1   nice
4  1   text
5  2      a
6  2 second
7  2   nice
8  2   text

What I would like, with the expression "nice text" kept as a single token (same data and call, but with the output below):

df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)

df %>%
  unnest_tokens(Word, Text)

  Id   Word
1  1      a
2  1  first
3  1   nice text
4  2      a
5  2 second
6  2   nice text

CodePudding user response：

A bit verbose, and unnest_tokens may well have an option for keeping certain phrases together, but this does the trick:

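library(tidyverse)
library(tidytext)

# Tokenize first: one row per word, lowercased by default
df <- data.frame(Id = c(1, 2),
                 Text = c('A first nice text', 'A second nice text')) %>%
  unnest_tokens('Word', Text)

df %>%
  group_by(Id) %>%
  # Relabel 'text' as 'nice text' when the previous word in the same Id is 'nice'
  mutate(Word = if_else(lag(Word, default = '') == 'nice' & Word == 'text', 'nice text', Word),
         temp_id = row_number()) %>%
  # Drop the 'nice' row just before 'nice text' (assumes exactly one 'nice text' per Id)
  filter(temp_id != temp_id[Word == 'nice text'] - 1) %>%
  ungroup() %>%
  select(-temp_id)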

which gives:

# A tibble: 6 x 2
     Id Word     
  <dbl> <chr>    
1     1 a        
2     1 first    
3     1 nice text
4     2 a        
5     2 second   
6     2 nice text
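
The filter step above relies on every Id containing the phrase exactly once. As a hedged alternative, here is a minimal base R sketch that merges an adjacent pair of tokens wherever it occurs, including not at all. The helper name merge_bigram and its arguments are hypothetical, introduced only for illustration, and the sketch assumes the two words differ.

```r
# Hypothetical helper (not part of tidytext): merge each adjacent
# pair of tokens (first, second) into the single token "first second".
# Assumes first != second, so matched pairs can never overlap.
merge_bigram <- function(words, first, second) {
  n <- length(words)
  if (n < 2) return(words)
  # TRUE at position i when words[i] == first and words[i + 1] == second
  starts <- c(words[-n] == first & words[-1] == second, FALSE)
  # TRUE at the position right after each start: the token to drop
  ends <- c(FALSE, starts[-n])
  words[starts] <- paste(first, second)
  words[!ends]
}

# Tokens per Id, lowercased as unnest_tokens would return them
tokens_by_id <- list(
  `1` = c("a", "first", "nice", "text"),
  `2` = c("a", "second", "nice", "text")
)
merged <- lapply(tokens_by_id, merge_bigram, first = "nice", second = "text")
```

When the pair does not occur, the helper returns its input unchanged, so no Id needs special handling.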

CodePudding user response：

Here's a concise solution based on negative lookarounds: separate_rows is not allowed to split Text on a whitespace character \\s when nice is directly to its left (negative lookbehind, (?<!...)) and text is directly to its right (negative lookahead, (?!...)). The \\b are word boundary anchors, in case you have, say, "nice texts", which you do want to separate:

library(tidyr)
# df is the original data frame with the Id and Text columns
df %>%
  separate_rows(Text, sep = "(?<!\\bnice)\\s|\\s(?!text\\b)")
# A tibble: 6 × 2
     Id Text     
  <dbl> <chr>    
1     1 A        
2     1 first    
3     1 nice text
4     2 A        
5     2 second   
6     2 nice text
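
To see which spaces a lookaround pattern treats as separators, the split can be reproduced in base R. This is a sketch, not a description of what separate_rows does internally: the helper split_on is hypothetical, and the first pattern pairs a negative lookbehind for nice with a negative lookahead for text, so a space is protected only when both words surround it. The second pattern, with the lookahead alone, is included for contrast.

```r
# Hypothetical helper: return the pieces of x left between matches of pattern.
# gregexpr() searches the whole string, so lookbehinds see the full context.
split_on <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m, invert = TRUE)
}

# Split on whitespace unless "nice" is on the left AND "text" is on the right
both_sides <- "(?<!\\bnice)\\s|\\s(?!text\\b)"
# Lookahead only: protects any space that is followed by the word "text"
right_only <- "\\s(?!text\\b)"

res_phrase <- split_on("A first nice text", both_sides)[[1]]
res_both   <- split_on("A plain text", both_sides)[[1]]
res_right  <- split_on("A plain text", right_only)[[1]]
```

With both conditions, "A plain text" is split into three words; with the lookahead alone, "plain text" stays glued together even though "plain" is not part of the expression.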

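A more advanced regex uses the PCRE verbs (*SKIP)(*F): the phrase nice text is matched first and that match is then discarded, so only the remaining whitespace acts as a separator: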

df %>%
  separate_rows(Text, sep = "(\\bnice text\\b)(*SKIP)(*F)|\\s")
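
The same idea extends to several protected expressions at once. Below is a minimal sketch in base R, assuming the phrases contain only letters and spaces (no regex metacharacters). The helper split_keep_phrases is hypothetical, and strsplit with perl = TRUE stands in for separate_rows here.

```r
# Hypothetical helper: split on whitespace, but keep the listed phrases intact.
# Assumes phrases contain only letters and spaces (nothing is escaped).
split_keep_phrases <- function(text, phrases) {
  # Alternation of protected phrases, e.g. "nice text|A second"
  protected <- paste(phrases, collapse = "|")
  # Match a protected phrase and discard that match; otherwise match whitespace
  pattern <- paste0("\\b(?:", protected, ")\\b(*SKIP)(*F)|\\s+")
  strsplit(text, pattern, perl = TRUE)
}

texts <- c("A first nice text", "A second nice text")
one_phrase  <- split_keep_phrases(texts, "nice text")
two_phrases <- split_keep_phrases(texts, c("nice text", "A second"))
```

Matching is case-sensitive in this sketch, so "A second" protects the capitalized form only.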

For more info, see "How do (*SKIP) or (*F) work on regex?"