Question
I have a character vector a that mixes Latin letters and Chinese characters. I want to extract the five classes (Class1 to Class5) from it, as in the example below.
I don't know how to do this in R. I tried ex_between() from the qdapRegex package, but it failed.
Reproducible code
a = "Class1:AAA_123_视物。\nClass2:BBB_456,行手术。\nClass3:CCC_789\nClass4:DDD_111\nEEE_222\nFFF_333\nClass5:GGG_444_右:光感? "
library(qdapRegex)
# I tried ex_between() from the qdapRegex package to extract Class1 to Class5, but it failed.
Class1 = ex_between(a, 'Class1:', '\n')
Class2 = ex_between(a, 'Class2:', '\n')
Class3 = ex_between(a, 'Class3:', '\n')
Class4 = ex_between(a, 'Class4:', '\n')
Class5 = ex_between(a, 'Class5:', ' ')
Expected output
Class1 = "AAA_123_视物。"
Class2 = "BBB_456,行手术。"
Class3 = "CCC_789"
Class4 = "DDD_111,EEE_222,FFF_333" # Note that \n has been replaced with comma(,)
Class5 = "GGG_444_右:光感? "
Answer
ex_between() gives only limited control over the regex. That matters for Class4, where the match must not stop at the first newline it encounters, so I'd use str_extract() from the stringr package instead.
Please try:
library(stringr)
a = "Class1:AAA_123_视物。\nClass2:BBB_456,行手术。\nClass3:CCC_789\nClass4:DDD_111\nEEE_222\nFFF_333\nClass5:GGG_444_右:光感? "
Class1 = str_extract(a, regex("(?<=Class1:).*?(?=\nClass)", dotall=TRUE))
Class2 = str_extract(a, regex("(?<=Class2:).*?(?=\nClass)", dotall=TRUE))
Class3 = str_extract(a, regex("(?<=Class3:).*?(?=\nClass)", dotall=TRUE))
Class4 = str_extract(a, regex("(?<=Class4:).*?(?=\nClass)", dotall=TRUE))
Class5 = str_extract(a, regex("(?<=Class5:).*"))
Output:
Class1 = "AAA_123_视物。"
Class2 = "BBB_456,行手术。"
Class3 = "CCC_789"
Class4 = "DDD_111\nEEE_222\nFFF_333"
Class5 = "GGG_444_右:光感? "
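Note that Class4 still contains newlines here, whereas the expected output in the question joins the three items with commas. A minimal sketch of that last step, assuming str_replace_all() from the same stringr package is acceptable (base R's gsub() would work as well):

```r
library(stringr)

a = "Class1:AAA_123_视物。\nClass2:BBB_456,行手术。\nClass3:CCC_789\nClass4:DDD_111\nEEE_222\nFFF_333\nClass5:GGG_444_右:光感? "

# Extract Class4 as above; the result still contains embedded newlines.
Class4 = str_extract(a, regex("(?<=Class4:).*?(?=\nClass)", dotall=TRUE))

# Replace every newline with a comma to match the expected output.
Class4 = str_replace_all(Class4, fixed("\n"), ",")
# Class4 is now "DDD_111,EEE_222,FFF_333"
```

The same replacement is harmless for the other classes, since none of them contain a newline.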
Explanation of the regex:
- (?<=Class1:) is a lookbehind assertion; the matched portion is not included in the result.
- .*? is the shortest match between the lookbehind (above) and the lookahead (below).
- (?=\nClass) is a lookahead which matches the string "Class" preceded by a newline.
- The dotall=TRUE option makes a dot match a newline character.
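To see why dotall=TRUE matters for Class4, here is a small comparison sketch using the same input. The variable names without_dotall and with_dotall are only illustrative. Without the option, the dot cannot cross the newline after DDD_111, so the lookahead never finds "\nClass" and the extraction yields NA.

```r
library(stringr)

a = "Class1:AAA_123_视物。\nClass2:BBB_456,行手术。\nClass3:CCC_789\nClass4:DDD_111\nEEE_222\nFFF_333\nClass5:GGG_444_右:光感? "

# Default behaviour: "." does not match "\n", so the match cannot
# extend past DDD_111 to reach "\nClass5" and the result is NA.
without_dotall = str_extract(a, regex("(?<=Class4:).*?(?=\nClass)"))

# With dotall=TRUE the dot also matches "\n", so the lazy match
# runs across the three lines and stops right before "\nClass5".
with_dotall = str_extract(a, regex("(?<=Class4:).*?(?=\nClass)", dotall=TRUE))

print(without_dotall)
print(with_dotall)
```

For Class1 to Class3 the option makes no difference, because their values sit on a single line.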