How to count all words in a sentence, excluding all forms of whitespace, with regex and Julia

I am trying to match all the words in the sentence:

"That's the password: 'PASSWORD 123'!", cried the Special Agent.\nSo I fled.

I tried:

([A-Za-z\d(^\n$)]+('[A-Za-z]+)?)

but I don't want to match \nSo as a word, only So. In fact, I want to exclude all forms of whitespace, such as \n or \t.

My Julia code is:

sentence = """"That's the password: 'PASSWORD 123'!", cried the Special Agent.\nSo I fled."""
regex = r"([A-Za-z\d(^\n$)]+('[A-Za-z]+)?)"
v = [m.match for m in eachmatch(regex, sentence)]

CodePudding user response:

It turns out that \r, \n and \t are two-character combinations in your text (a backslash followed by a letter), not actual whitespace characters.
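To illustrate the difference, here is a minimal sketch. The variable names and the shortened sample text are made up for the demonstration. It compares a regular string literal, where \n is a single newline character, with a raw string literal, where \n stays as a backslash plus the letter n, and shows what a plain \w+ pattern does with each:

```julia
# In a regular string literal, "\n" is ONE character (a real newline).
real_newline = "Agent.\nSo"

# In a raw string literal, \n stays as TWO characters: '\' and 'n'.
literal_pair = raw"Agent.\nSo"

println(length(real_newline))   # 9
println(length(literal_pair))   # 10

plain = r"\w+"

# A real newline is not a word character, so the words split cleanly.
words_real = [m.match for m in eachmatch(plain, real_newline)]
println(words_real)      # "Agent" and "So"

# With the two-character form, the letter n sticks to the next word.
words_literal = [m.match for m in eachmatch(plain, literal_pair)]
println(words_literal)   # "Agent" and "nSo"
```

The second case is the one that needs special handling in the regex.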

Since Julia uses PCRE, you can use a SKIP-FAIL regex here to easily exclude these combinations from the matches:

\\[rnt](*SKIP)(*F)|\w+(?:['-]\w+)*

See the regex demo. Details:

\\[rnt](*SKIP)(*F) - matches a \ char followed by r, n or t, then discards the matched text and fails the match, so the engine resumes searching from the position right after those two characters
| - or
\w+(?:['-]\w+)* - one or more word chars, followed by zero or more repetitions of ' or - and one or more word chars.

In Julia:

julia> sentence = """"That's the password: 'PASSWORD 123'!", cried the Special Agent.\nSo I fled."""
"\"That's the password: 'PASSWORD 123'!\", cried the Special Agent.\nSo I fled."

julia> regex = r"\\[rnt](*SKIP)(*F)|\w+(?:['-]\w+)*"
r"\\[rnt](*SKIP)(*F)|\w+(?:['-]\w+)*"

julia> v = [m.match for m in eachmatch(regex, sentence)]
12-element Vector{SubString{String}}:
"That's"
"the"
"password"
"PASSWORD"
"123"
"cried"
"the"
"Special"
"Agent"
"So"
"I"
"fled"

See the online Julia demo.
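Since the question asks for a count rather than the list of words, the pattern can be wrapped in a small helper. This is only a sketch: the names WORD_REGEX and count_words are hypothetical, and the short sample strings are made up for the demonstration.

```julia
# Hypothetical helper, not part of the original answer.
# Same pattern as above: skip literal \r, \n, \t pairs, then match words
# that may contain inner apostrophes or hyphens.
const WORD_REGEX = r"\\[rnt](*SKIP)(*F)|\w+(?:['-]\w+)*"

# Count the matches instead of collecting the matched substrings.
count_words(text::AbstractString) = length(collect(eachmatch(WORD_REGEX, text)))

# Literal backslash + n (raw string): the pair is skipped, "So" is a word.
println(count_words(raw"Agent.\nSo I fled."))   # 4

# Real newline character: it is not a word character, so the count is the same.
println(count_words("Agent.\nSo I fled."))      # 4

# Apostrophes and hyphens inside a word do not split it.
println(count_words("don't stop-gap"))          # 2
```

One caveat of this approach: any literal backslash followed by r, n or t is skipped, even where it is not meant as an escape sequence (for example in a Windows path), so it fits best when the input is known to contain such escaped whitespace.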