Extract words enclosed within asterisks in a column in R

Time:08-04

I have a data frame in which col1 contains text, and some words in that text are enclosed in double asterisks. I want to extract all of these words and put them in another column called col2. If more than one word is enclosed in double asterisks, I would like them to be separated by a comma. col2 in the example shows the desired result.

col1<-c("**sometimes** i code in python",
"I like **walks** in the park",
"I **often** do **exercise**")
col2<-c("**sometimes**","**walks**","**often**,**exercise**")

df<-data.frame(col1, col2, stringsAsFactors = FALSE)

Can anyone suggest a solution?

CodePudding user response:

You can use stringr::str_match_all:

df$col3 <- sapply(stringr::str_match_all(df$col1, '(\\*+.*?\\*+)'),
                  function(x) toString(x[, 2]))
df

#                            col1                   col2                    col3
#1 **sometimes** i code in python          **sometimes**           **sometimes**
#2   I like **walks** in the park              **walks**               **walks**
#3    I **often** do **exercise** **often**,**exercise** **often**, **exercise**

* has a special meaning in regex (it is a quantifier), so to match a literal * we escape it with \\. The pattern \\*+ matches one or more asterisks, so we extract everything that sits between two runs of one or more *, delimiters included.

str_match_all returns a list of matrices, one per input string. We are interested in the capture group (...), which is the second column, hence x[, 2]. Finally, when there is more than one match, toString collapses them into a single comma-separated string.
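To make the list-of-matrices shape concrete, here is a minimal sketch, assuming stringr is installed. The names x, m, with_space and no_space are just illustrative. It also shows that toString joins with ", " (comma plus space), while paste with collapse = "," gives the exact format of col2 in the question.

```r
library(stringr)

x <- c("I like **walks** in the park",
       "I **often** do **exercise**")

# One character matrix per input string
m <- str_match_all(x, "(\\*+.*?\\*+)")

# Column 1 is the full match, column 2 is the first capture group.
# For the second string there are two matches, so the matrix has two rows.
m[[2]]

# toString() joins with ", "
with_space <- vapply(m, function(mat) toString(mat[, 2]), character(1))

# paste(collapse = ",") joins without the space, as in the question's col2
no_space <- vapply(m, function(mat) paste(mat[, 2], collapse = ","), character(1))

no_space
# [1] "**walks**"              "**often**,**exercise**"
```

vapply is used here instead of sapply only so that the result is guaranteed to be a character vector, even if a row has no match.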

CodePudding user response:

You can use str_extract_all:

library(stringr)
library(dplyr)
df %>%
  mutate(col2 = str_extract_all(col1, "\\*\\*[^* ]+\\*\\*"))
                            col1                    col2
1 **sometimes** i code in python           **sometimes**
2   I like **walks** in the park               **walks**
3    I **often** do **exercise** **often**, **exercise**

How the regex works:

  • \\*\\* matches two asterisks
  • [^* ]+ matches one or more characters that are neither a literal * nor a space
  • \\*\\* matches two asterisks
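One thing to be aware of: str_extract_all returns a list, so col2 in the output above is a list column. If a plain character column with comma-separated values is wanted, as in the question, one possible sketch (assuming stringr is installed; hits is a hypothetical helper name) is to collapse each list element yourself:

```r
library(stringr)

col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# List with one character vector of matches per input string
hits <- str_extract_all(col1, "\\*\\*[^* ]+\\*\\*")

# Collapse each vector into a single comma-separated string
col2 <- vapply(hits, paste, character(1), collapse = ",")

col2
# [1] "**sometimes**"          "**walks**"              "**often**,**exercise**"
```

A string without any match ends up as an empty string "" rather than NA.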

If you don't need the asterisks in col2, then this is how you can extract the strings without them:

df %>%
   mutate(col2 = str_extract_all(col1, "(?<=\\*\\*)[^* ]+(?=\\*\\*)"))
                            col1            col2
1 **sometimes** i code in python       sometimes
2   I like **walks** in the park           walks
3    I **often** do **exercise** often, exercise

How this regex works:

  • (?<=\\*\\*): positive lookbehind asserting that there must be two asterisks to the left
  • [^* ]+ matches one or more characters that are neither a literal * nor a space
  • (?=\\*\\*): positive lookahead asserting that there must be two asterisks to the right
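For comparison, the same lookaround pattern can be sketched in base R without stringr. This is an alternative to the approach above, not what the answer itself uses, and the names m, words and col2_plain are illustrative. Note that perl = TRUE is needed, because lookbehind and lookahead are not supported by R's default regex engine.

```r
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# perl = TRUE switches to PCRE, which supports (?<=...) and (?=...)
m <- gregexpr("(?<=\\*\\*)[^* ]+(?=\\*\\*)", col1, perl = TRUE)

# List with one character vector of matched words per input string
words <- regmatches(col1, m)

# Collapse each vector into a single comma-separated string
col2_plain <- vapply(words, paste, character(1), collapse = ",")

col2_plain
# [1] "sometimes"      "walks"          "often,exercise"
```

Because [^* ]+ excludes spaces, this only picks up single words; a phrase such as **two words** would not be matched by this pattern.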