I have a data frame in which col1 contains text, and within that text some words are enclosed by double asterisks. I want to extract all of these words and put them in another column called col2. If more than one word is enclosed in double asterisks, I would like them separated by a comma. col2 in the example shows the desired result.
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")
col2 <- c("**sometimes**", "**walks**", "**often**,**exercise**")
df <- data.frame(col1, col2, stringsAsFactors = FALSE)
Can anyone suggest a solution?
CodePudding user response:
You may use stringr::str_match_all:
df$col3 <- sapply(stringr::str_match_all(df$col1, '(\\*+.*?\\*+)'),
                  function(x) toString(x[, 2]))
df
#                            col1                   col2                    col3
#1 **sometimes** i code in python          **sometimes**           **sometimes**
#2   I like **walks** in the park              **walks**               **walks**
#3    I **often** do **exercise** **often**,**exercise** **often**, **exercise**
The asterisk (*) has a special meaning in regex. Here we want to match a literal *, so we escape it with \\. The pattern extracts every value enclosed by one or more * on each side.
str_match_all returns a list of matrices. We are interested in the capture group between (...), which is the 2nd column, hence x[, 2]. Finally, when there is more than one value, we collapse them into one comma-separated string using toString.
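For readers who prefer to avoid a package dependency, the same idea can be sketched in base R. This is only an illustration, not part of the answer above: extract_starred is a hypothetical helper name, and it collapses with "," (no space) on the assumption that the output should match the desired col2 exactly.

```r
# Hypothetical helper (not from the answer): extract every run enclosed by
# one or more asterisks and collapse the hits into one string per input.
extract_starred <- function(x, sep = ",") {
  # gregexpr() locates every match; regmatches() returns the matched text
  # as a list holding one character vector per input string.
  hits <- regmatches(x, gregexpr("\\*+.*?\\*+", x, perl = TRUE))
  # paste(..., collapse = sep) turns each vector into a single string;
  # an input without matches yields "".
  vapply(hits, paste, character(1), collapse = sep)
}

col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

extract_starred(col1)
# "**sometimes**" "**walks**" "**often**,**exercise**"
```

Passing sep = ", " instead would give the same spacing that toString produces.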
CodePudding user response:
You can use str_extract_all:
library(stringr)
library(dplyr)
df %>%
  mutate(col2 = str_extract_all(col1, "\\*\\*[^* ]+\\*\\*"))
                            col1                    col2
1 **sometimes** i code in python           **sometimes**
2   I like **walks** in the park               **walks**
3    I **often** do **exercise** **often**, **exercise**
How the regex works:
\\*\\* : matches two asterisks
[^* ]+ : matches one or more characters that are neither a literal * nor a space
\\*\\* : matches two asterisks
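One consequence of excluding spaces in the character class is that a phrase of several words between asterisks is not matched as a whole. The sketch below illustrates this with base R's regmatches and gregexpr; the second test string is an assumption made up for the demonstration and does not appear in the question's data.

```r
# Pattern as described: two asterisks, one or more characters that are
# neither "*" nor a space, then two asterisks.
pattern <- "\\*\\*[^* ]+\\*\\*"

x <- c("I **often** do **exercise**",   # from the question
       "**two words** and **one**")     # hypothetical multi-word case

found <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

found[[1]]  # "**often**" "**exercise**"
found[[2]]  # "**one**"  (the two-word phrase is skipped)
```

If multi-word phrases should be captured as well, a lazy pattern such as "\\*\\*.*?\\*\\*" is one possible alternative.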
If you don't need the asterisks in col2, then this is how you can extract the strings without them:
df %>%
  mutate(col2 = str_extract_all(col1, "(?<=\\*\\*)[^* ]+(?=\\*\\*)"))
                            col1            col2
1 **sometimes** i code in python       sometimes
2   I like **walks** in the park           walks
3    I **often** do **exercise** often, exercise
How this regex works:
(?<=\\*\\*) : positive lookbehind asserting that there must be two asterisks to the left
[^* ]+ : matches one or more characters that are neither a literal * nor a space
(?=\\*\\*) : positive lookahead asserting that there must be two asterisks to the right
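The lookaround version can also be tried in base R, as a rough sketch rather than a replacement for the stringr code. Base R's default regex engine does not support lookbehind, so perl = TRUE is needed. The name col2_plain is hypothetical, and the result is collapsed into a plain character vector rather than a list column.

```r
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# Lookbehind and lookahead require the PCRE engine, hence perl = TRUE.
inner <- regmatches(col1,
                    gregexpr("(?<=\\*\\*)[^* ]+(?=\\*\\*)", col1, perl = TRUE))

# toString() joins multiple hits with ", ".
col2_plain <- vapply(inner, toString, character(1))
col2_plain
# "sometimes" "walks" "often, exercise"
```

Because the lookarounds only check for neighbouring asterisks, text sitting directly between two starred words with no spaces (for example the b in "**a**b**c**") would also be returned, so this sketch assumes starred words are separated by spaces as in the example data.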