How to extract text ending with a newline marker in R using regex

Time:09-07

Question

I have a character vector a that mixes Latin letters and Chinese characters. I want to extract the five classes, as in the example below.

I don't know how to do it in R.

I have tried ex_between() from the qdapRegex package, but it failed.

Reproducible code

a = "Class1:AAA_123_视物。\nClass2:BBB_456,行手术。\nClass3:CCC_789\nClass4:DDD_111\nEEE_222\nFFF_333\nClass5:GGG_444_右:光感?  "

# I used ex_between() from the qdapRegex package to extract Class1 to Class5, but it failed.
library(qdapRegex)

Class1 = ex_between(a, 'Class1:', '\n')
Class2 = ex_between(a, 'Class2:', '\n')
Class3 = ex_between(a, 'Class3:', '\n')
Class4 = ex_between(a, 'Class4:', '\n')
Class5 = ex_between(a, 'Class5:', ' ')

Expected output

Class1 = "AAA_123_视物。"
Class2 = "BBB_456,行手术。"
Class3 = "CCC_789"
Class4 = "DDD_111,EEE_222,FFF_333" # Note that each \n has been replaced with a comma (,)
Class5 = "GGG_444_右:光感?  "

Answer

Because ex_between() offers limited control over the regex, and Class4 in particular must not end its match at the first newline it encounters, I would use str_extract() from the stringr package instead.
Please try:

library(stringr)

a = "Class1:AAA_123_视物。\nClass2:BBB_456,行手术。\nClass3:CCC_789\nClass4:DDD_111\nEEE_222\nFFF_333\nClass5:GGG_444_右:光感?  "

Class1 = str_extract(a, regex("(?<=Class1:).*?(?=\nClass)", dotall=TRUE))
Class2 = str_extract(a, regex("(?<=Class2:).*?(?=\nClass)", dotall=TRUE))
Class3 = str_extract(a, regex("(?<=Class3:).*?(?=\nClass)", dotall=TRUE))
Class4 = str_extract(a, regex("(?<=Class4:).*?(?=\nClass)", dotall=TRUE))
Class5 = str_extract(a, regex("(?<=Class5:).*"))

Output:

Class1 = "AAA_123_视物。"
Class2 = "BBB_456,行手术。"
Class3 = "CCC_789"
Class4 = "DDD_111\nEEE_222\nFFF_333"
Class5 = "GGG_444_右:光感?  "
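The Class4 result above still contains newlines, whereas the expected output in the question joins the three parts with commas. A minimal follow-up sketch, assuming the Class4 value is exactly the string shown in the output above, is to replace each newline afterwards with str_replace_all() from stringr:

```r
library(stringr)

# Value returned by the str_extract() call for Class4 (copied from the output above)
Class4 <- "DDD_111\nEEE_222\nFFF_333"

# Replace every newline character with a comma
Class4 <- str_replace_all(Class4, "\n", ",")

print(Class4)
# [1] "DDD_111,EEE_222,FFF_333"
```

Base R's gsub("\n", ",", Class4, fixed = TRUE) should do the same job if you prefer not to depend on stringr for this step.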

Explanation of the regex:

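  • (?<=Class1:) is a lookbehind assertion: the match must be preceded by "Class1:", but that text is not included in the result.
  • .*? is a lazy (non-greedy) match: it takes the shortest run of characters between the lookbehind (above) and the lookahead (below).
  • (?=\nClass) is a lookahead assertion: the match must be followed by a newline and then "Class", neither of which is included in the result.
  • The dotall=TRUE option makes a dot match a newline character as well, which lets the Class4 match span several lines.
  • Class5 is the last entry, so its pattern has no lookahead and .* simply runs to the end of the line.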
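If writing one call per class becomes repetitive, the same lookahead idea can probably be applied in a single pass. The following is only a sketch, not part of the original answer: it assumes every entry starts with "Class" followed by digits and a colon, and the names pattern, m and classes are made up for illustration. It uses str_match_all() to capture the class number and text together, then swaps the newlines for commas as the question asks.

```r
library(stringr)

a <- "Class1:AAA_123_视物。\nClass2:BBB_456,行手术。\nClass3:CCC_789\nClass4:DDD_111\nEEE_222\nFFF_333\nClass5:GGG_444_右:光感?  "

# Group 1: the class number. Group 2: the text, matched lazily up to
# (but not including) the next "\nClass<digits>:" or the end of the input (\z).
pattern <- regex("Class(\\d+):(.*?)(?=\\nClass\\d+:|\\z)", dotall = TRUE)

# str_match_all() returns a list with one matrix per input string;
# column 1 is the full match, columns 2 and 3 are the capture groups.
m <- str_match_all(a, pattern)[[1]]

# Name each extracted text after its class and turn inner newlines into commas
classes <- setNames(str_replace_all(m[, 3], "\n", ","), paste0("Class", m[, 2]))

print(classes[["Class4"]])
# [1] "DDD_111,EEE_222,FFF_333"
```

The result is a named character vector, so classes[["Class1"]] through classes[["Class5"]] stand in for the five separate variables. The trailing spaces of Class5 are kept, as in the expected output.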