Regex: Extract special strings from text

My problem looks like this:

data_example <-
  c("Creditshelf Aktiengesellschaft / Key word(s): Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft",
    "Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
    "tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares",
    "init innovation in traffic systems SE / Key word(s): Contract/Incoming Orders\n\ninit innovation in traffic systems SEs")
strings_to_extract <-
  c("Key word(s): Word1/Word2",
    "Key word(s): Word1/Word2 Word3",
    "Key word(s): Word1 Word2 Word3",
    "Key word(s): Word1/Word2/Word3",
    "Key word(s): Number Word1/Word2",
    "Key word(s): Number Word1 Word2",
    "Key word(s): Word1 Number Word2")

The words are always separated by whitespace or a "/". My attempt looks like this:

str_extract(data, "Key word[[:punct:]]{1}s[[:punct:]]{2} [[:alpha:]]{1,}|Key word[[:punct:]]{1}s[[:punct:]]{2} [[:alpha:]]{1,}[[:punct:]]{1,}[[:alpha:]]{1,}Key word[[:punct:]]{1}s[[:punct:]]{2} [[:alpha:]]{1,}[[:punct:]]{1,}[[:alpha:]]{1,}[[:punct:]]{1,}[[:alpha:]]{1,}")

I capture a good part of them, but I think it's too complicated. Could somebody give me advice on how to do it better?

Thanks and kind regards

CodePudding user response：

You can use

str_extract(data, "Key word\\(s\\):\\s*\\w+(?:\\W+\\w+){1,2}")

Details:

Key word\(s\): - the literal text Key word(s):
\s* - zero or more whitespaces
\w+ - one or more word chars
(?:\W+\w+){1,2} - one or two sequences of one or more non-word chars followed by one or more word chars.
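
One thing to watch with this pattern: \W+ matches any non-word characters, newlines included, so a keyword list with only two words can pick up the first word after the blank line. Here is a minimal sketch of that effect on two of the sample strings, next to a stricter variant. The stricter pattern and the variable names are my own assumptions, based on the question's statement that the words are separated by whitespace or "/".

```r
library(stringr)

# Two of the sample strings from the question.
x <- c(
  "Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
  "tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares"
)

# Pattern from the answer: \W+ accepts any non-word characters, newlines included.
loose <- "Key word\\(s\\):\\s*\\w+(?:\\W+\\w+){1,2}"

# Hypothetical stricter variant: only a single space or "/" may separate words.
strict <- "Key word\\(s\\):\\s*\\w+(?:[ /]\\w+){1,2}"

res_loose  <- str_extract(x, loose)
res_strict <- str_extract(x, strict)

print(res_loose)
print(res_strict)
```

For the second string, the stricter variant should stop at "Key word(s): Capital Increase", while the original pattern continues across the blank line and also captures "tonies". Both variants still cap the result at three words, so longer keyword lists are cut off.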

CodePudding user response：

Your example data makes a different approach suitable as well, as your keywords always end at \n.

In this case you could just do:

data_example <-
c("Creditshelf Aktiengesellschaft / Key word(s): Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft",
  "Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
  "tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares",
  "init innovation in traffic systems SE / Key word(s): Contract/Incoming Orders\n\ninit innovation in traffic systems SEs")

stringr::str_extract(data_example, "Key word\\(s\\):.+(?=\\n)")
#> [1] "Key word(s): Forecast/Development of Sales"
#> [2] "Key word(s): 9 Month figures"              
#> [3] "Key word(s): Capital Increase"             
#> [4] "Key word(s): Contract/Incoming Orders"

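Key word\\(s\\): matches the literal Key word(s):, and .+(?=\\n) matches one or more characters (.+) that are followed by \n, which the lookahead (?=\\n) checks without including it in the match. Note the double backslashes (\\), which are needed in R string literals.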
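
To make the escaping visible, and in case stringr is not available, here is a rough base R sketch of the same idea. It is a swapped-in equivalent, not the code from the answer above: it uses regexpr() and regmatches() instead of str_extract(), and the variable names are placeholders.

```r
# Two of the sample strings from the question.
x <- c(
  "Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
  "tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares"
)

pattern <- "Key word\\(s\\):.+(?=\\n)"

# writeLines() shows the pattern as the regex engine receives it,
# i.e. with each doubled backslash collapsed to a single one.
writeLines(pattern)

# Base R counterpart of str_extract(); lookaheads require perl = TRUE.
keywords <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(keywords)
```

Note that regmatches() drops elements without a match, whereas str_extract() returns NA for them, so the stringr version is safer when the result has to stay aligned with the input vector.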