I have a dataset with a text column. Each text contains a term made of two letters, such as sa, followed by two digits. The letters can be anything from a to z, in either lower or upper case. A snapshot of the data is as follows:
library(dplyr)  # provides %>% and select()

df_new <- data.frame(
  given_info = c('SA12 is given', 'he has his sa12',
                 'she will get Sr15', 'why not having an ra31',
                 'his tA23 is missing', 'pa12 is given'))
df_new %>% select(given_info)
given_info
1 SA12 is given
2 he has his sa12
3 she will get Sr15
4 why not having an ra31
5 his tA23 is missing
6 pa12 is given
I need to replace any such term (sa, or any other combination of two letters, followed by two digits) with the term document. Hence, the outcome of interest is:
given_info
1 document is given
2 he has his document
3 she will get document
4 why not having an document
5 his document is missing
6 document is given
Thank you so much for your help in advance!
CodePudding user response:
We can use gsub() here as follows:
df_new$given_info <- gsub("\\b[A-Za-z]{2}\\d{2}\\b", "document", df_new$given_info)
df_new
given_info
1 document is given
2 he has his document
3 she will get document
4 why not having an document
5 his document is missing
6 document is given
The regex pattern used here says to match:
\b            a word boundary (what precedes is NOT a word character)
[A-Za-z]{2}   any 2 letters, lower or upper case
\d{2}         2 digits
\b            another word boundary (what follows the digits is NOT a word character)
Inside the R string each backslash is doubled, which is why the call uses \\b and \\d.
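To check what the pattern picks out before replacing anything, the matched terms can be extracted with base R's regmatches() and gregexpr(). This is only a sketch: the vector x copies the sample data from the question, the names pattern and found are made up for illustration, and perl = TRUE is my addition (the gsub() call above does not use it).

```r
# Sample texts copied from the question
x <- c('SA12 is given', 'he has his sa12',
       'she will get Sr15', 'why not having an ra31',
       'his tA23 is missing', 'pa12 is given')

# Same pattern as in the gsub() call: boundary, 2 letters, 2 digits, boundary
pattern <- "\\b[A-Za-z]{2}\\d{2}\\b"

# gregexpr() locates every match; regmatches() pulls out the matched text
found <- unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
print(found)
```

Each of the six texts should contribute exactly one term, so found has the same length as x.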
The word boundaries ensure, for example, that abc12 in your text is left untouched. Without them you would also get substring matches (the bc12 inside abc12 would match, giving adocument), which is probably not what you want.
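The effect of the word boundaries can be seen by running the pattern with and without them. This is a rough illustration only: the strings in y are hypothetical examples that are not part of the original data, and perl = TRUE is added here as an assumption (the original call does not use it).

```r
# Hypothetical strings: one standalone term, two terms inside longer tokens
y <- c("SA12 is given", "code abc12 stays", "sa123 is longer")

# With word boundaries: only standalone two-letter, two-digit terms are replaced
with_b <- gsub("\\b[A-Za-z]{2}\\d{2}\\b", "document", y, perl = TRUE)

# Without word boundaries: matching substrings inside longer tokens are replaced too
without_b <- gsub("[A-Za-z]{2}\\d{2}", "document", y, perl = TRUE)

print(with_b)     # "document is given" "code abc12 stays" "sa123 is longer"
print(without_b)  # "document is given" "code adocument stays" "document3 is longer"
```

If substring matches are actually wanted, dropping the two \\b tokens is all that is needed.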