Use of regex in replace() to join split words in a text

I have these sample sentences in a plain text file (UTF-8):

today is an interest-
ing day

The "-" on the first line is followed only by \n (I have already stripped every \r from the file, to deal uniformly with different sources). I would like to merge the 2 lines into 1, because the "-" means that the preceding word has been split and continues on the next line. To join lines like these, I have tried something along these lines:

text.replace(/[\n-]/g, "")

but it does not seem to work. What is the right way to achieve this?

I would like to be able to deal with both these possible endings (or similar situations you might anticipate):

interest-\n
interest- \n    (possible blanks inserted before \n)

CodePudding user response：

You can use either of these:

text.replace(/\b-\s*\n\b/g, "")
text.replace(/\b-[^\S\r\n]*\n\b/g, "")

Details:

\b - a word boundary
- - a hyphen
\s* - zero or more whitespaces of any kind (including \r and \n), or, in the second pattern, [^\S\r\n]* - zero or more horizontal whitespaces (any whitespace except CR and LF)
\n - a newline char
\b - a word boundary.

See the JavaScript demo:

// Both calls print: today is an interesting day
console.log( "today is an interest- \ning day".replace(/\b-\s*\n\b/g, "") );
console.log( "today is an interest-\ning day".replace(/\b-\s*\n\b/g, "") );
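
The two patterns behave differently when the hyphen is followed by a blank line, because \s* can consume extra newlines while [^\S\r\n]* cannot. A minimal sketch of that difference; the helper names joinLoose and joinStrict are made up for illustration:

```javascript
// Hypothetical helpers wrapping the two patterns from this answer.
const joinLoose = (text) => text.replace(/\b-\s*\n\b/g, "");
const joinStrict = (text) => text.replace(/\b-[^\S\r\n]*\n\b/g, "");

const trailingBlanks = "interest- \ning"; // blanks before the newline
const blankLine = "interest-\n\ning";     // an empty line after the hyphen

// Both patterns join a word split over two adjacent lines.
console.log(JSON.stringify(joinLoose(trailingBlanks)));  // "interesting"
console.log(JSON.stringify(joinStrict(trailingBlanks))); // "interesting"

// Only the \s* version joins across the empty line.
console.log(JSON.stringify(joinLoose(blankLine)));  // "interesting"
console.log(JSON.stringify(joinStrict(blankLine))); // "interest-\n\ning"
```

Which behaviour is preferable depends on whether an empty line after a hyphen should count as a paragraph break in your sources.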

A Unicode-aware pattern that only joins when there are letters on both sides can look like text.replace(/(?<=\p{L}\p{M}*)-[^\S\r\n]*\n(?=\p{L})/gu, ""), where (?<=\p{L}\p{M}*) checks for a letter, optionally followed by diacritics, right before the - and (?=\p{L}) checks for a letter right after the newline.
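
As a rough illustration of why the letter checks can matter: \b in JavaScript only treats ASCII letters, digits and _ as word characters, so it misses a hyphen after an accented letter and also joins digits. The sample strings and the helper names joinAscii and joinLetters below are assumptions made up for this sketch:

```javascript
// Hypothetical helpers: the \b-based pattern and the Unicode-aware pattern.
const joinAscii = (text) => text.replace(/\b-[^\S\r\n]*\n\b/g, "");
const joinLetters = (text) =>
  text.replace(/(?<=\p{L}\p{M}*)-[^\S\r\n]*\n(?=\p{L})/gu, "");

const accented = "caf\u00e9-\nt\u00e9ria"; // hyphen right after "é"
const numbers = "pages 10-\n20";           // a numeric range split over two lines

// "é" is not an ASCII word character, so there is no \b before the hyphen.
console.log(JSON.stringify(joinAscii(accented)));   // "café-\ntéria" (unchanged)
console.log(JSON.stringify(joinLetters(accented))); // "cafétéria"

// Digits are word characters for \b, but they are not letters for \p{L}.
console.log(JSON.stringify(joinAscii(numbers)));   // "pages 1020"
console.log(JSON.stringify(joinLetters(numbers))); // "pages 10-\n20" (unchanged)
```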

CodePudding user response：

There are three things wrong in your regex for this use:

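The newline comes before the - in your pattern, but in the text the hyphen comes first
The [] defines a character class, which matches any single one of the listed characters rather than the sequence, so every hyphen and every newline gets removed individually
You need to add \s* between them to allow for whitespace before the newline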

So try this:

text.replace(/-\s*\n/g, "")
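
To make the difference concrete, here is a small comparison of the character class from the question with the sequence pattern above. The sample string is made up for illustration. Note also that replace() returns a new string and leaves the original untouched, so the result has to be assigned or used:

```javascript
// Made-up sample: an ordinary hyphenated word, a split word with a trailing
// blank, and a normal line break that should be kept.
const text = "a well-known interest- \ning day\nnext line";

// Character class: deletes every "-" and every "\n", one character at a time.
const withClass = text.replace(/[\n-]/g, "");

// Sequence: deletes only a "-" followed by optional whitespace and a newline.
const withSequence = text.replace(/-\s*\n/g, "");

console.log(JSON.stringify(withClass));    // "a wellknown interest ing daynext line"
console.log(JSON.stringify(withSequence)); // "a well-known interesting day\nnext line"
console.log(text === "a well-known interest- \ning day\nnext line"); // true, original unchanged
```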