How to extract text ending with a newline marker in R using regex

Question

I have a character vector a that mixes letters and Chinese characters. I want to extract the 5 classes, as in the example below.

I don't know how to do it in R.

I have tried ex_between() from the qdapRegex package, but it failed.

Reproducible code

a = "Class1：AAA_123_视物。\nClass2：BBB_456，行手术。\nClass3：CCC_789\nClass4：DDD_111\nEEE_222\nFFF_333\nClass5：GGG_444_右：光感？  "

library(qdapRegex)

# I used ex_between() from the qdapRegex package to extract Class1 to Class5, but it failed.
Class1 = ex_between(a, 'Class1：', '\n')
Class2 = ex_between(a, 'Class2：', '\n')
Class3 = ex_between(a, 'Class3：', '\n')
Class4 = ex_between(a, 'Class4：', '\n')
Class5 = ex_between(a, 'Class5：', ' ')

Expected output

Class1 = "AAA_123_视物。"
Class2 = "BBB_456，行手术。"
Class3 = "CCC_789"
Class4 = "DDD_111,EEE_222,FFF_333" # Note that each \n has been replaced with a comma (,)
Class5 = "GGG_444_右：光感？  "

CodePudding user response：

Because ex_between() offers limited control over the regex, particularly for Class4, where the match must not stop at the first newline, I would use str_extract() from the stringr package instead.
Would you please try:

library(stringr)

a = "Class1：AAA_123_视物。\nClass2：BBB_456，行手术。\nClass3：CCC_789\nClass4：DDD_111\nEEE_222\nFFF_333\nClass5：GGG_444_右：光感？  "

Class1 = str_extract(a, regex("(?<=Class1：).*?(?=\nClass)", dotall=TRUE))
Class2 = str_extract(a, regex("(?<=Class2：).*?(?=\nClass)", dotall=TRUE))
Class3 = str_extract(a, regex("(?<=Class3：).*?(?=\nClass)", dotall=TRUE))
Class4 = str_extract(a, regex("(?<=Class4：).*?(?=\nClass)", dotall=TRUE))
Class5 = str_extract(a, regex("(?<=Class5：).*"))

Output:

Class1 = "AAA_123_视物。"
Class2 = "BBB_456，行手术。"
Class3 = "CCC_789"
Class4 = "DDD_111\nEEE_222\nFFF_333"
Class5 = "GGG_444_右：光感？  "
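The Class4 result above still contains newline characters, whereas the expected output in the question uses commas. A minimal sketch of that last step, assuming a plain base R substitution is acceptable (the variable name Class4_commas is only illustrative):

```r
# Class4 as returned by str_extract() above
Class4 <- "DDD_111\nEEE_222\nFFF_333"

# fixed = TRUE treats "\n" as a literal newline rather than a regex
Class4_commas <- gsub("\n", ",", Class4, fixed = TRUE)

print(Class4_commas)
```

This should give "DDD_111,EEE_222,FFF_333", matching the expected output for Class4.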

Explanation of the regex:

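(?<=Class1：) is a lookbehind assertion: the text it matches must precede the result but is not included in it.
.*? is a lazy (non-greedy) match, so it takes the shortest run of characters between the lookbehind (above) and the lookahead (below).
(?=\nClass) is a lookahead asserting that a newline followed by "Class" comes next; it is not included in the result either.
The dotall=TRUE option makes a dot match a newline character as well, which lets the Class4 match span several lines.
Class5 is the last entry, so no lookahead is needed and .* simply runs to the end of the line.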
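The same lazy-match-plus-lookahead idea can probably be generalized so that all classes are extracted with one pattern instead of one call per class. The sketch below is not part of the original answer. It assumes every entry starts with "Class", a number and a full-width colon, and that this prefix never appears inside an entry's text. The names pattern, m, values and classes are illustrative.

```r
library(stringr)

a <- "Class1：AAA_123_视物。\nClass2：BBB_456，行手术。\nClass3：CCC_789\nClass4：DDD_111\nEEE_222\nFFF_333\nClass5：GGG_444_右：光感？  "

# Capture the class number and its text. The lazy match stops right before
# the next "\nClass<digits>：" or, for the last entry, at the end of the input (\z).
pattern <- regex("Class(\\d+)：(.*?)(?=\\nClass\\d+：|\\z)", dotall = TRUE)

# str_match_all() returns a list with one matrix per input string:
# column 1 is the full match, column 2 the number, column 3 the text.
m <- str_match_all(a, pattern)[[1]]

# Replace the newlines inside an entry with commas, then name each element.
values <- gsub("\n", ",", m[, 3], fixed = TRUE)
classes <- setNames(values, paste0("Class", m[, 2]))

print(classes)
```

Individual values can then be read by name, for example classes[["Class4"]]. A named vector avoids creating five separate variables, which may be more convenient if the number of classes varies.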